Making Kinections
Chris Weeden, Andrew Stohr, Balu Swarna, Bhargav Shivkumar

Table of Contents
Introduction and History
Current Solutions
Goals
Quality of Life
Functional Requirements
User Profile / Deployment Environment / Future

Introduction and History

Languages have helped mankind communicate effectively for centuries. Although oral communication constitutes a major portion of communication, body language helps to emphasize and enhance expression, enriching the experience. But what if body language is your only means of communication?

In order for deaf, hard of hearing, or speech-impaired people to communicate by telephone with hearing people, they must use a Video Relay Service (VRS). A VRS relays all communication to and from a deaf person through a sign language interpreter. These services are useful but are also expensive to taxpayers (in the United States) because a certified sign language interpreter must be employed for each simultaneous call. An alternative form of communication between deaf and hearing people is a TTY, or teletypewriter. This is a tool a deaf person can use to type out messages, which an operator reads to the hearing person; the operator then transcribes the hearing person's response into text. This form of communication is not ideal either, as it requires the deaf person to communicate in English, which is not the primary language of most deaf individuals.

Imagine a Kinect-based sign language recognition system that could interpret standard sign language into English text, and perhaps text into speech. This would be a multi-use application for the deaf, able to cover many forms of sign language and gesturing.

There are many types of users for this system. First and foremost are individuals who are deaf, hard of hearing, or otherwise speech impaired. Second, the system also helps anyone who wishes to communicate with such a person without understanding sign language themselves: for instance, a teacher of hearing-impaired students, or a doctor treating a patient. The system would improve users' quality of life by allowing them to communicate with hearing people using sign language, and it allows the other party to communicate back even when they do not know sign language. Arguably the most valuable outcome this system would produce is the preservation of the deaf individual's ability to communicate using ASL. Deaf culture cherishes and takes great pride in ASL, and when deaf individuals must use English to communicate with hearing individuals, they are forced to abandon their primary language in favor of English, a spoken language that is foreign and unnatural to them. Additionally, interactions that would otherwise be cumbersome and require a human translator would become much easier. The next time a hearing-impaired person walks into a doctor's office, the doctor can turn on the system and the patient will be able to communicate freely and smoothly.

Current Solutions

There are already some solutions to this communication problem, and teams actively working on new ones.
In the past, there have been attempts to use electronic gloves to gather data about hand gestures. Although these gloves may provide accurate information, they are very expensive for an ordinary consumer or small business. They are also poor at capturing facial expressions and cannot handle interaction with multiple people at the same time. For these reasons, a computer-vision sign language recognition system is more practical.

At the moment, the best current solution comes from Microsoft. Microsoft, the maker of the Kinect, has taken the problem into its own hands and spent its own money to find a solution using its own device. The Kinect provides an inexpensive and easy way to obtain depth and skeletal information, along with hand gestures, posture, and facial expressions. Microsoft Research Asia has teamed up with researchers at the Institute of Computing Technology at the Chinese Academy of Sciences to pursue this, in an effort called the Kinect Sign Language Translator project. Their solution captures the gestures, while machine learning and pattern recognition software help interpret their meaning. The system can capture a conversation from both sides: a deaf person who is signing and a person who is speaking. Visual signs are converted into written and spoken translation rendered in real time, while spoken words are turned into accurate visual signs.

The system has two modes. The first is translation mode, which translates sign language into text or speech and includes both isolated word recognition and sentence recognition. The raising and putting down of the hands is defined as the "start" and "end" of each sign language word to be recognized. By tracking the position of the user's hands, many signs can be detected almost instantly using the research group's 3D trajectory matching algorithm.
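As a rough illustration of the segmentation and matching ideas described above, the following Python sketch splits a stream of hand positions into candidate "words" whenever the hand rises above a resting height, then matches each segment against stored trajectory templates by resampling and comparing distances. This is only a sketch on simulated data; the resting-height threshold, the template dictionary, and all function names are invented for this example, and the research group's actual 3D trajectory matching algorithm is certainly more sophisticated than the nearest-template comparison shown here.

import numpy as np

REST_HEIGHT = 0.0  # assumed hand height (m) at rest, relative to the waist; invented for this sketch

def segment_words(hand_positions, rest_height=REST_HEIGHT):
    """Split a stream of 3D hand positions into candidate sign 'words'.

    A word starts when the hand rises above rest_height and ends when it
    drops back below it, mirroring the raise/put-down convention above.
    """
    words, current = [], []
    for p in hand_positions:
        if p[1] > rest_height:      # hand raised above resting height: signing
            current.append(p)
        elif current:               # hand put down: close the current word
            words.append(np.array(current))
            current = []
    if current:
        words.append(np.array(current))
    return words

def resample(trajectory, n=20):
    """Resample a trajectory to n points so trajectories of different lengths compare."""
    idx = np.linspace(0, len(trajectory) - 1, n)
    return np.array([trajectory[int(round(i))] for i in idx])

def match_word(trajectory, templates):
    """Return the label of the stored template trajectory closest to this one."""
    t = resample(trajectory)
    best_label, best_dist = None, float("inf")
    for label, template in templates.items():
        d = np.linalg.norm(t - resample(template))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

if __name__ == "__main__":
    # Simulated right-hand positions (x, y, z): rest, an upward arc, rest again.
    stream = [(0, -0.1, 0)] * 3 + [(0.1 * i, 0.2, 0.1 * i) for i in range(6)] + [(0, -0.1, 0)] * 3
    templates = {"hello": np.array([(0.1 * i, 0.2, 0.1 * i) for i in range(6)]),
                 "thanks": np.array([(0, 0.3 - 0.05 * i, 0.2) for i in range(6)])}
    for word in segment_words(stream):
        print(match_word(word, templates))  # expected to print: hello

In a real system the position stream would come from the Kinect's skeletal tracking rather than a hard-coded list, but the segment-then-match structure would be similar.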
The second mode is communication mode, in which a person who is not hearing impaired can communicate with the signer through an avatar. They speak to the system, and the avatar signs what they said; alternatively, they can type their message into the Kinect software.

The downsides are that it currently takes five people to establish the recognition patterns for just one word, and only 300 Chinese Sign Language words out of roughly 4,000 have been added so far. There is still a long road ahead toward mastering this problem. The hope is that, with a system similar to this one, the 360 million people worldwide with hearing problems will see their quality of life improve, along with the quality of life of everyone who communicates with them.

Goals

Some of the goals for this system are:

• To create a UI that is simple and easy to use. The hope for this project is that anyone with a hearing disability will have the chance to overcome the communication barrier through the use of the Kinect. Age and technological skill should not limit use of the program; it should be easy for everyone.

• To recognize not only hand gestures but facial expressions as well. A lot of emphasis is placed on facial expressions, and without them you lose part of the language. It is therefore important to include them in a conversation in order to fully understand what is being said.

• To quickly and efficiently translate languages to and from ASL. It is difficult, and can be frustrating, to hold a conversation when responses lag. The purpose of this project is to improve a user's quality of life; if the program lags, it may instead make the user's life more difficult, even to the point where they won't want to use the system.

Quality of Life

Of course, the system is being created with the hope of improving quality of life. But what exactly will it improve? There are many applications for Kinect-based sign language recognition. Take a school, for example. Children with a hearing disability have a much harder time learning in school. But what if there were something affordable that teachers could use to give these students a more equal opportunity? This would not only help hearing-impaired children learn more effectively, but also give them more confidence inside and outside the classroom. Another situation where the system will help is the doctor's office. If someone with a hearing disability gets sick, it may be difficult for him or her to express their symptoms unless the doctor knows sign language. This may lead to a misdiagnosis, if the patient is even willing to go into the office at all. With the Kinect, it would be much easier for the patient to express how they are feeling. It would also help them feel more comfortable at the doctor's office, making them more likely to visit.

Expected features from the system (Functional Requirements)

This system is being built to make life easier for the hearing and speech impaired as they go about their daily activities. The design of the system must keep its target users in mind and be built accordingly, so that it is more of a boon than a bane. Some of the features expected of the system are as follows.

The system must use a Microsoft Kinect device, or something similar, to create an interactive application that converts American Sign Language to ordinary speech and vice versa. There must also be a provision to convert input text to sign language. This text is entered into the system through a touch pad on the screen or an attached keyboard. Input text will be in plain English, and the emphasis is on easy, fast input so that typing is not a deterrent to using the system.

The system must be able to detect the start of signing, either when the user simply begins to sign or when a particular start sign is used. This start sign must be universal so that people new to the system can pick it up easily.

There are two vantage points on the system:

a. The point of view of the disabled person.
b. The point of view of the unimpaired person who does not know sign language.

It follows that there must be two screens, which relay the working of the application to both parties and consequently give them feedback that the system is working as it should.

The screen pointed at the disabled person, call it SignView, must relay the live camera feed as well as the text being construed from their signing. This feed must be fast enough not to suffer unnecessary lag that would interrupt the conversation. This display is used when the hearing- or speech-impaired person is the "speaker." SignView must also have a listen mode, used when the opposite party is saying something. In this mode, the screen must relay, through animations, the relevant signs for what the opposite party is saying, and it must also display a textual transcript of what is being said for further ease of use. SignView must have an indicator flag or animation that signals to the impaired user that the opposite party has something to say, or that the opposite party has interrupted the flow for some reason.
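To make the SignView requirements above concrete, here is a minimal sketch of the state that screen would need to track: its current mode, the running transcript, the queue of sign animations to play, and the interrupt indicator. The class, field, and method names are invented for illustration only; they are not part of the requirements or of any existing implementation.

from dataclasses import dataclass, field
from enum import Enum

class Mode(Enum):
    SPEAK = "speak"    # the impaired user is signing; show live feed + construed text
    LISTEN = "listen"  # the other party is speaking; show sign animations + transcript

@dataclass
class SignViewState:
    mode: Mode = Mode.SPEAK
    transcript: list = field(default_factory=list)        # text construed from signs or speech
    animation_queue: list = field(default_factory=list)   # sign animations to play in listen mode
    interrupt_flag: bool = False                           # set when the other party wants to interject

    def add_recognized_sign(self, text):
        """Called in speak mode when a sign is recognized from the live feed."""
        if self.mode is Mode.SPEAK:
            self.transcript.append(text)

    def add_spoken_words(self, text, animation):
        """Called in listen mode with the other party's words and a matching sign animation."""
        if self.mode is Mode.LISTEN:
            self.transcript.append(text)
            self.animation_queue.append(animation)

    def signal_interrupt(self):
        """Raise the on-screen indicator that the other party has something to say."""
        self.interrupt_flag = True

    def toggle(self):
        """Switch between speak and listen mode and clear the interrupt indicator."""
        self.mode = Mode.LISTEN if self.mode is Mode.SPEAK else Mode.SPEAK
        self.interrupt_flag = False

if __name__ == "__main__":
    view = SignViewState()
    view.add_recognized_sign("HELLO")      # signer greets the other party
    view.signal_interrupt()                # hearing party wants to respond
    view.toggle()                          # SignView flips to listen mode
    view.add_spoken_words("How are you?", "animation: HOW-YOU")
    print(view.mode, view.transcript, view.animation_queue)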
The screen pointed at the unimpaired user, call it SpeechView, has similar functionality to help that user understand what the impaired person is trying to communicate. A textual script must be automatically typed on the screen as the impaired user signs the relevant portions of the conversation. In addition to the textual display, SpeechView must also speak the converted signs aloud; this audio option must be mutable if the unimpaired user so desires. Sign-to-audio conversion must be fast enough to keep the conversation smooth.

In "speak" mode, SpeechView should listen to what the unimpaired person has to say and convert it into text, which is displayed in SpeechView, as well as into signs, which are displayed in SignView. Toggling between speak mode and listen mode in SpeechView must be based on the voice of the unimpaired user: when the user is silent, SpeechView must be in listen mode, and when the user speaks up, it must automatically toggle to speak mode. This toggling must also be correlated with SignView: when SpeechView is in listen mode, SignView must be in speak mode, and vice versa. SpeechView must also provide an interface to control this toggling, as well as text input, in case the unimpaired user prefers manual control of the operation.
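As a rough sketch of the voice-driven toggling described above, the two views could be kept in complementary modes by a small controller that flips both whenever voice activity starts or stops. The class name, the per-frame voice-activity flag, and the simple silence threshold are assumptions made for this illustration, not a prescribed design.

from enum import Enum

class Mode(Enum):
    SPEAK = "speak"
    LISTEN = "listen"

class ConversationController:
    """Keeps SpeechView and SignView in complementary modes.

    When the unimpaired user is silent, SpeechView listens (and SignView speaks);
    when voice activity is detected, SpeechView switches to speak mode and
    SignView switches to listen mode.
    """

    def __init__(self, silence_frames=15):
        self.speech_view = Mode.LISTEN        # hearing user's screen starts out listening
        self.sign_view = Mode.SPEAK           # impaired user's screen starts out speaking
        self.silence_frames = silence_frames  # frames of silence before flipping back
        self._quiet = 0

    def on_audio_frame(self, voice_detected):
        """Feed one audio frame's voice-activity result; returns the current modes."""
        if voice_detected:
            self._quiet = 0
            self._set(speech=Mode.SPEAK, sign=Mode.LISTEN)
        else:
            self._quiet += 1
            if self._quiet >= self.silence_frames:
                self._set(speech=Mode.LISTEN, sign=Mode.SPEAK)
        return self.speech_view, self.sign_view

    def _set(self, speech, sign):
        self.speech_view, self.sign_view = speech, sign

if __name__ == "__main__":
    ctrl = ConversationController(silence_frames=3)
    # Simulated voice-activity flags: quiet, then the hearing user talks, then quiet again.
    for frame, voice in enumerate([False, False, True, True, False, False, False, False]):
        speech_mode, sign_mode = ctrl.on_audio_frame(voice)
        print(frame, "SpeechView:", speech_mode.value, "SignView:", sign_mode.value)

The manual override mentioned above (the on-screen toggle and text-input controls) would simply set both modes directly, bypassing the voice-activity rule.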
The UI design of both interfaces must bear in mind the simplicity of operating the system. Text and signs must be large enough to see clearly and presented in a pleasing manner. Bright, jarring colors must not be used except to highlight key areas or focus points when the situation calls for it. The Kinect device must be able to capture the hand signs as well as the facial expressions of the impaired user, since facial expressions are an integral part of any sign language. Interactive screen displays will make the impaired user much more comfortable with the system, and effort must be made to make them feel that the system is there to help them; this includes pointing out errors on their part in a gentle way. There are no budgetary constraints for the system, subject to complete user satisfaction and ease of use.

This summarizes the key requirements of the Kinect-based sign language recognition system. The designers are free to contribute suggestions or additions based on their personal interactions with hearing- and speech-impaired people; valid and useful ideas can be incorporated into the system with customer approval.

User Profile

The main end users of the system are people with hearing impairment. The age and technical skill of users should not be design criteria, because the two main applications of the system, the classroom and the doctor's office, can involve users from many age groups. In the classroom environment, this Kinect-based system is used to teach students with hearing impairment; in this case there are users at both ends of the system. From a broader perspective, the system always acts as a bridge between a person who understands sign language and a person who does not.

Deployment Environment

For now, the target work area of this project is confined to a room (a classroom or a doctor's office). The system requires plenty of space in front of the Kinect device so that the hearing-impaired person has room to sign. The lighting of the room also has to be taken into consideration: if the room is poorly lit, the Kinect will not be able to register the signer's movements, posture, and facial expressions. To make the system easier to move, it should be designed to run on an ordinary desktop computer; given the wide variety of desktop computers available today, compatibility should not be an issue.

Future Plans and Expandability

The future plan is to transform this stationary, room-bound system into a portable device. This opens the door to applying the project in many new scenarios, including ticket counters, billing desks at supermarkets, banks, and so on.