Conference Session: C7
Paper # 2338

ADVANCEMENTS IN 3-D SENSING TECHNOLOGY IMPLEMENTED BY THE KINECT

Thomas Forsythe (tpf9@pitt.edu), Marvin Green (meg89@pitt.edi)

University of Pittsburgh Swanson School of Engineering, March 1st, 2012

Abstract— On November 4th, 2010, Microsoft released a brand new toy for its Xbox 360 system called the “Kinect.” To the average consumer, the Kinect was nothing more than a special camera that could detect your motions and reflect them on the television in the form of a game. In reality, the Kinect is a module that houses a color camera, an infrared light camera, an infrared light projector, and an electronic motor. What really makes the Kinect flourish, however, is the software written for it, which enables it to capture and interpret the 3-D environment around it. This paper will describe and evaluate the Kinect by Microsoft. It will describe how the use of 3-D sensing technology in the Kinect offers new ways to interact in settings ranging from the hospital to the living room. The Kinect, besides being used to interact with one’s Xbox 360 console, has been used by surgeons to map bone structures on patients, has allowed doctors to flick through patient pictures without touching an unsterile screen, and has been used to monitor elderly patients to detect when they are in pain in their homes. The purpose of this paper is to expand upon those subjects so the reader may get a grasp of the different hardware and software components of the Kinect, and of who will use them.

Key Words— Xbox Kinect, PrimeSense, OpenNI, medical uses, Motion sensing technology, Microsoft

INTRODUCTION

The Kinect itself is a structured light camera with a normal camera attached. A structured light camera works by emitting infrared light through the infrared projector; the reflected light is then interpreted by the infrared camera, through mathematical formulas that will be expanded upon later, as a three-dimensional room. When a subject moves, the infrared camera sees the infrared light move and translates that as motion. While the Kinect is a complex camera, it is the software that enables it to see a room as a three-dimensional environment as a human does. Microsoft worked with the company PrimeSense to develop the code for the Kinect.

OpenNI

Primarily, PrimeSense’s “NITE” software is what allows the Kinect to view its environment as three-dimensional, but PrimeSense’s “OpenNI” software is important too, as it allows developers to write code so they can improve the Kinect on their own [1]. It is very important that OpenNI allows developers to write their own code for the Kinect because there have been innovative uses of the Kinect so far. The Kinect, besides being used to interact with the Xbox 360 console, has been used by surgeons to map bone structures on patients, has allowed doctors to flick through patient pictures without touching an unsterile screen, and is being researched for use with elderly patients to detect when they are in pain in their homes [2]. The purpose of this paper is to expand upon those subjects so the reader may get a grasp of the different hardware and software components of the Kinect, and of who will use them.

HARDWARE WITH SERIOUS POWER

Prior to the Kinect’s release, any device with comparable hardware was considered highly specialized equipment. Since the release of the Kinect, this often-costly equipment is now simply an add-on to the Xbox 360. The Xbox Kinect uses a powerful combination of color and infrared (IR) cameras that creates a new approach to 3D sensing technology. In addition to the cameras, the Kinect uses a stereo microphone to capture the user’s voice with a high level of clarity. Working in conjunction, these few elements form the foundation of the Kinect. From afar this may not seem impressive at all, but each element represents an impressive advancement in household technology.

Figure 1 [7]

The Kinect’s ability to capture the sounds of the environment is impressive: it has a “wide-field, conic audio capture” [3]. This accuracy is due to the pair of microphones that lie on the body of the Kinect, capturing a large range of sounds throughout the room. Using two microphones together in stereo is not a new technology, but it is very new “living-room technology”. The microphones on the Kinect accurately distinguish the ambient sounds throughout the entire room [3]. This is important because otherwise the Kinect would only be able to pick up the sounds of the television. In addition, the microphones’ precision allows the device to distinguish between multiple human voices simultaneously. This requires very specialized noise cancellation, which is not found in most devices. After capturing voices, the microphones convert this information to electrical signals that are then passed on to the PS1080 chip. At this point, the data from the cameras and microphones is synchronized as it is processed in the PS1080.

From a hardware standpoint, it may seem like the cameras and microphones do the hard work, but it is the PS1080 chip that actually does all the heavy lifting [4]. These devices essentially just capture data and transmit it to the chip, where it is then processed by the software components of the Kinect. After transferring the data to the chip, the hardware has completed its job. The PS1080 is still technically hardware in the Kinect, but it should certainly be evaluated on its processing power.

Building a 3D Environment

Behind the sleek black design of the Kinect lie three separate optical devices: a color camera, an infrared camera, and an infrared light projector.
The color camera has a resolution of 640x480 and is finished with an IR lens filter [3].

Figure 2 [3]

Both the color and IR cameras rely on CMOS (complementary metal oxide semiconductor) image sensors to create information that is then passed to computer chips in the Kinect. CMOS sensors detect and capture light, then convert it into electrical signals [3]. These electrical signals allow the cameras to communicate with the PS1080 chip in the device. The PS1080 is the chip powering the firmware on the Kinect and is discussed further below.

The IR camera operates in conjunction with the infrared light projector in a way that was previously used only in very specialized 3D equipment. The IR light projector is mounted on the metal frame of the Kinect, closest to the edge of the device [3]. When the Kinect is in use, the projector casts a dense array of infrared dots throughout the environment in front of the device. The infrared dot arrangement essentially creates a mesh grid that “constantly changes based on the objects that reflect light” [4]. These dots will “change size and position based on how far the objects are away” [4]. The changes are detected by the CMOS sensor of the IR camera and sent to the PS1080 chip. With this information, the chip “builds a basic shape of the room it sees through the camera” [4], called a depth map, and then begins processing this information. The color camera also passes data to the PS1080, where it is processed and used to construct a 3D model of the environment. The cameras on the Kinect are very capable and pass valuable information on to be processed by the software.

THE SOFTWARE SIDE

The process used by the Kinect to generate a depth map is not only different from, but also more accurate than, that of most 3D detecting devices.
In the past, 3D detection commonly relied on the time-of-flight method: “infrared light (or the equivalent: invisible frequencies of light) were sent out into a 3D space, and then the time and wavelengths of light that returned to the specially-tuned cameras would be able to figure out what the space looked like from that” [3]. The Kinect instead uses the pattern of infrared light cast by the IR projector to create a 3D model of the environment. This method analyzes the curves and changes in the map, which is much more accurate than the time-based calculations used in other devices. However, this method requires significantly more computing power and therefore a powerful processor. Thus the PS1080 was developed, an essential part of the Xbox Kinect.

Audio Capturing

Part of the Xbox Kinect’s novelty comes from allowing the user to interact with the Xbox through voice commands.

THE PS1080 ON-BOARD PROCESSOR

The PS1080 is an on-board processor that sits right inside the Kinect [4]. Algorithms process the data from the CMOS sensors and start deciphering the image. Although the infrared camera picks up changes in the infrared dot map, it is the on-board chip that actually turns this raw data into usable information. The PS1080 processes this data and is then able to determine changes in the size and location of the projected IR dots. From this the chip creates a depth map, which can then determine the “location and position of an object with respect to the sensors” [4]. The CMOS sensors of the visible and infrared light cameras sit next to each other, which allows the chip to easily merge a depth image and a color image [4]. To ensure these two separate images are properly stitched together, the chip performs a registration process that joins the color image (RGB) and depth map (D) to produce RGBD information.
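As a rough illustration of the two steps described above, the following Python sketch first triangulates depth from the shift of a projected dot and then joins each color pixel with its depth value to form RGBD data. It is not the PS1080's actual algorithm. The focal length, the projector-to-camera baseline, the simple pinhole relation (depth = focal length × baseline / shift), and the assumption that the color and depth images are already aligned pixel for pixel are all illustrative assumptions, and the names `depth_from_shift` and `merge_rgbd` are hypothetical.

```python
from typing import List, Optional, Tuple

# Illustrative constants only; these are assumptions, not Kinect specifications.
FOCAL_LENGTH_PX = 580.0  # assumed focal length of the IR camera, in pixels
BASELINE_M = 0.075       # assumed projector-to-camera distance, in meters

Color = Tuple[int, int, int]
RGBD = Tuple[int, int, int, Optional[float]]


def depth_from_shift(shift_px: float) -> Optional[float]:
    """Triangulate depth (meters) from a dot's horizontal shift (pixels).

    Uses the simplified pinhole relation depth = f * b / shift.
    Returns None when no usable shift was measured.
    """
    if shift_px <= 0:
        return None
    return FOCAL_LENGTH_PX * BASELINE_M / shift_px


def merge_rgbd(rgb: List[List[Color]],
               depth: List[List[Optional[float]]]) -> List[List[RGBD]]:
    """Join a color image and a depth map into one RGBD image.

    Assumes both images are already aligned; real registration would
    first reproject one camera's view onto the other.
    """
    same_rows = len(rgb) == len(depth)
    if not same_rows or any(len(c) != len(d) for c, d in zip(rgb, depth)):
        raise ValueError("color and depth images must have the same shape")
    return [
        [(r, g, b, d) for (r, g, b), d in zip(color_row, depth_row)]
        for color_row, depth_row in zip(rgb, depth)
    ]
```

For example, `merge_rgbd([[(255, 0, 0)]], [[depth_from_shift(43.5)]])` pairs a single red pixel with a depth of about one meter under the assumed constants. A larger dot shift yields a smaller depth, which matches the idea that nearby objects displace the pattern more.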
The PS1080 chip simultaneously and separately handles information from four external digital audio sources, which is then further processed by the “host”. The host refers to the firmware running on the PS1080 that “handles all higher-level object and action recognition” [4]. OpenNI is the host for the Kinect and is responsible for turning advanced 3D detection into the interactive gaming experience.

After reaching the chip, data from the IR camera is used to determine the location and depth of objects, and is then stitched together to form visible video. At this point all the information collected by the PS1080 is evaluated by OpenNI. Natural Interaction comprises various middleware components that work in conjunction to translate real-life actions into events or actions in a game. With Natural Interaction, the Kinect is able to track hand gestures, interpret body motions, and receive vocal commands. Middleware is a component of the firmware that is used to accurately distinguish “human body parts and joints as well as distinguish individual human faces from one another” [1]. From the RGBD information provided by the cameras, higher-end middleware can track the motion of the body, or even the movement of an object. Natural Interaction can therefore precisely track movements in each part of the body while “knowing” exactly who is using the device [7].

NITE

NITE middleware, an open source framework for programmers, is “what allows computers or digital devices to perceive the world in 3D” [7]. NITE is the foundation of what is known as Natural Interaction; it is the middleware that powers OpenNI. Most importantly, NITE is an open source framework that can be used and modified by any programmer [7]. Any programmer working with 3D technology can now work with and improve on the most up-to-date motion tracking middleware. This will certainly encourage people to find new ways to develop and use this technology.
For this same reason, many people have already begun modifying the Kinect to perform different tasks or actions based on the user’s “natural interaction” with the device.

Figure 3 [5]

Natural Interaction

Thus far, the Kinect would have sensed light, converted it to electrical signals, and passed this raw data to the PS1080 chip.

KINECT AND THE ELDERLY

With advances in the medical field, the average life expectancy of an American is now around 80 years. This means that there are more and more elderly people in America who need to be supervised in case they fall in their homes. One way this can be done is by having elderly people live in a nursing home where professionals can watch over them at all times. The problem with this is that abuse can occur in nursing homes without the elderly patient’s family knowing, and it is a solution that requires several medical professionals to work at the nursing home. Another solution is to have elderly people live in their own homes but wear a device around their neck or wrist that they can press in case of an emergency. When they press this device, an ambulance is dispatched to their house so they can get help. This is great, but it is not helpful if the elderly person becomes unconscious as they fall, a stroke renders them paralyzed, or they forget to press the button [Rexit].

These are exactly the reasons why the Kinect is the next generation in health security for elderly people. The Kinect, when placed in the homes of these elderly people, will allow for constant, automatic monitoring of the patient. When the Kinect system detects that a person is in trouble, it will alert emergency staff and provide the person’s location and details of the emergency [Rexit]. For now, this Kinect system is just a concept that has only begun to be researched. When it was researched, it was tested with only one Kinect camera in one room.
Possible future tests involving multiple Kinects in multiple rooms could further prove the Kinect’s worth. The fact that it is more reliable than the necklaces and wristbands, and that it can lighten the workload of medical staff, already shows that its potential is great [Rexit].

Natural Gestures

When in an emergency situation, like a heart attack or a fall, people generally tend to perform specific gestures. These are called “Natural Gestures,” and they are the basis for the automated Kinect home monitoring system. For example, a natural gesture would be a person clutching their heart during a heart attack, or a person stumbling on the ground during a fall. The person does not need to be trained to perform a natural gesture, unlike pressing a button, for a natural gesture is encoded in the human brain from birth. With the Kinect able to recognize the severity and location of pain, emergency services can be automatically dispatched with better information about how to help the patient before they even leave the hospital [Rexit].

PAIN DETECTION

The purpose of this section is to explain how exactly the Kinect interprets a video feed of a person into natural gestures, and how it determines the location and severity of a patient’s pain as described above. This is essential for emergency services to be able to monitor several patients at the same time with only one medical staffer [Dapkus].

The Problem

The process of monitoring several patients is going to involve visualization, which is a way to interpret patient information at the medical staffer’s workstation. Visualization is the process of interpreting the patient’s body in three dimensions as well as determining the severity and location of the patient’s pain. The hard part is presenting this information to the medical staffer in a way that will allow him or her to keep tabs on dozens of patients at once [Dapkus].
The System Explained

The way the system performs the above is simple in concept, but complex in design. The Kinect system works by taking many images of the person every second and checking their position against its database of pre-defined natural gestures. When the Kinect system sees that a person has been in a harmful natural gesture for a pre-determined amount of time, it alerts the authorities with information concerning the severity and location of the pain. Some natural gestures do not take as long as others to trigger a response. For example, if a person falls to the ground and is clutching their chest, the Kinect system will alert the medical team immediately. If a person has been clutching their arm for a long time, the Kinect might send a message to the authorities so they can call the patient to check up on them [Rexit].

The process through which the Kinect monitors pain is also very important. Several times a second, the Kinect sends information about the person’s body position and location, and whether any parts of their body are in pain, to an outside monitoring system. This is useful because it allows one person to monitor dozens of patients from a single workstation, which solves the problem of not enough professional staff being available to monitor patients [Rexit]. The Kinect’s different responses to situations, and the amount of time before it performs a response, depend on how the medical professionals tweak the Kinect code. This is important because it allows the Kinect to be personalized to a specific patient who may have a unique condition [Rexit].

The Visualization Solution

There are different types of three-dimensional human body models used to display a specific patient. However, the free “.vtk” format has been selected for this system because it supports a wide variety of visualization algorithms, and the fact that it is free means programmers won’t have to waste time getting permission from a company.
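The timing rule described under “The System Explained” can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the researchers' implementation: the gesture names, the dwell times, the response labels, and the `GestureMonitor` class are all hypothetical, and the step that recognizes a gesture from the camera frames is assumed to happen elsewhere.

```python
from typing import Dict, Optional, Tuple

# Hypothetical rule table: gesture name -> (seconds the gesture must
# persist, response to trigger). Staff could tune these per patient.
GESTURE_RULES: Dict[str, Tuple[float, str]] = {
    "fall_clutching_chest": (0.0, "dispatch_emergency"),
    "clutching_arm": (60.0, "request_checkup_call"),
}


class GestureMonitor:
    """Tracks how long the current gesture has persisted and decides
    when a response is due, firing at most once per continuous gesture."""

    def __init__(self, rules: Dict[str, Tuple[float, str]]) -> None:
        self.rules = rules
        self.current: Optional[str] = None
        self.since = 0.0
        self.alerted = False

    def update(self, gesture: Optional[str], timestamp: float) -> Optional[str]:
        """Feed one observation (gesture name or None, time in seconds).

        Returns a response label when the dwell time is reached,
        otherwise None.
        """
        if gesture != self.current:
            # The gesture changed, so restart the dwell timer.
            self.current = gesture
            self.since = timestamp
            self.alerted = False
        if gesture is None or gesture not in self.rules or self.alerted:
            return None
        dwell_seconds, response = self.rules[gesture]
        if timestamp - self.since >= dwell_seconds:
            self.alerted = True
            return response
        return None
```

Calling `update` once per processed frame is enough: a fall with chest clutching triggers a response on the first frame, while arm clutching only triggers once it has persisted for the configured time. Keeping the rules in a plain table reflects the point above that medical professionals can adjust responses and delays for a specific patient.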
The Kinect system visualizes pain location and intensity by using spheres. These “pain spheres” are very efficient because they represent different pain levels by changing the size and color of the sphere, making it easy for a medical staffer to notice when a patient is in trouble. The medical staffer is also able to determine the location of the pain, since the Kinect system places the sphere at the corresponding location on the patient’s three-dimensional body representation. The graphs below show how the .vtk format visualizes body location tags and pain levels [Dapkus].

Figure 4 [Dapkus]

Figure 5 [Dapkus]

Patient Interaction

When the Kinect system has determined that a patient is in pain, a unique message is sent over the Internet to the medical staffer’s workstation. The workstation screen will then update with the pain location and severity shown on the patient’s three-dimensional body. The .vtk format is very convenient because it allows the staffer to explore the patient’s three-dimensional body and use the Kinect’s camera to see the patient’s surroundings and determine what may have caused the pain. From there, the medical staffer has the option of communicating with the patient through the Kinect’s built-in microphone and speaker, or calling emergency services with specific information about the patient’s injury [Dapkus].

ETHICS OF THE KINECT

So far this paper has done a good job of making the Kinect look like a miracle in a box. With every new engineering feat, however, the pros and cons must be carefully taken into account before it is released to the public. This section, therefore, will explain the ethical concern of invading the privacy of the patients the system is designed to protect. The Kinect, as described before, could potentially be set up in every room of a patient’s house. Since the Kinect has a three-dimensional camera, a regular camera, and a microphone on it, the information it collects cannot fall into the wrong hands.

First, this means that the medical staffer who is monitoring the patients should be trusted and should pass a background check before being allowed such a job. A possible alternative is to allow the medical staffer to access only the three-dimensional model of the person’s body. This would prevent the medical staffer from observing the patient’s house and keep him or her focused on what is important: the patient. Second, this means that if three-dimensional camera, regular camera, and microphone data is sent to a medical staffer’s workstation, it needs secure encryption as it travels over the Internet. If not, this could give rise to a new social problem in which hackers and thieves could peer into a person’s home and personal life. All in all, if these precautions are taken into account, the Kinect has the potential to revolutionize personal healthcare [Rexit].

KINECT, THE FUTURE IN 3D

The Kinect was initially released as simply an addition to the Xbox 360. Since then, the engineering community has taken this living-room luxury and turned it into a device that can potentially save lives. The hardware in the Kinect is no breakthrough in technology, but when these components work in conjunction with the OpenNI software, the Kinect becomes a highly innovative device. In the past, using IR light to detect motion was limited to highly specialized equipment. Today, science has reached the point where this specialized equipment can be offered and tailored for any unique medical condition. It is often overlooked how much technology has developed in the last few years, but innovations such as the Kinect serve as a reminder of science’s great success.

REFERENCES

[1] (2011). “OpenNI.” Primesense. [Online]. Available: http://www.primesense.com/en/openni
[2] L. Gallo, A.P. Placitelli, and M. Ciampi. (2011, August 30). “Controller-free Exploration of Medical Image Data: Experiencing the Kinect.” Computer-Based Medical Systems, 2011 24th International Symposium. [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5999138&tag=1
[3] Gil. (2010, November 16). “How does the Kinect really work?” [Online]. Available: http://gilotopia.blogspot.com/2010/11/how-doeskinect-really-work.html
[4] (2011, April 7). “How Microsoft’s Primesense Based Kinect Really Works.” Electronic Design. [Online]. Available: http://web.ebscohost.com/ehost/pdfviewer/pdfviewer?sid=2d4a3ee0-8ebc48e8-bea8-080e3e2bfcd6%40sessionmgr110&vid=4&hid=110 p. 28-30
[5] H. Fairhead. (2011, March 30). “Kinect’s AI breakthrough explained.” I Programmer. [Online]. Available: http://www.iprogrammer.info/news/105-artificial-intelligence/2176-kinects-aibreakthrough-explained.html
[6] J. Huang. (2011, October 24). “Kinerehab: A Kinect-based System for Physical Rehabilitation – A Pilot Study for Young Adults with Motor Disabilities.” ACM ASSETS’11. [Online]. Available: http://delivery.acm.org/10.1145/2050000/2049627/p319huang.pdf?ip=130.49.97.207&acc=ACTIVE%20SERVICE&CFID=63879703&CFTOKEN=75141256&__acm__=1327719365_5a0b35cf291d995e27b5f765e7140a2f
[7] (2012). “Introducing Kinect for Xbox 360.” Xbox 360 + Kinect. [Online]. Available: http://www.xbox.com/en-US/kinect
[8] E.E. Stone and M. Skubic. (2011, May 23). “Evaluation of an inexpensive depth camera for passive in-depth home fall risk assessment.” PervasiveHealth 2011. [Online]. Available: http://www.engineeringvillage2.org/controller/servlet/Controller?SEARCHID=1eb566613510d760912f31prod3data1&CID=quickSearchAbstractFormat&DOCINDEX=3&database=7&format=quickSearchAbstractFormat

ADDITIONAL RESOURCES

R. Rexit. (2011, December 15). “Visualization of Posture (for Kinect Pain Recognizer).” [Online]. Available: https://docs.google.com/viewer?url=http%3A%2F%2Fwww.cs.pitt.edu%2F~chang%2F231%2Fy11%2Fproj11%2Ffinalruh.pdf
M. Dapkus. (2011, December 15). “Natural Gesture Recognition using the Microsoft Kinect System.” [Online]. Available: https://docs.google.com/viewer?url=http%3A%2F%2Fveryoldwww.cs.pitt.edu%2F~chang%2F231%2Fy11%2Fproj11%2Ffinalmyko.pdf

ACKNOWLEDGEMENTS

We want to thank Bill Neiczpiel for helping us come up with a topic we enjoyed writing about. We really had no idea the Kinect could be used in so many different ways and truly appreciate your help along the way.