Enhancing User Immersion and Natural Interaction in HMD-Based Virtual Environments with Real-Time Visual Body Feedback Using Multiple Microsoft Kinects
Srivishnu Satyavolu∗ and Pete Willemsen†
Dept. of Computer Science, Univ. of Minnesota Duluth
∗e-mail: saty0010@d.umn.edu    †e-mail: willemsn@d.umn.edu
ABSTRACT
This paper presents an augmented VR system that provides real-time visual body feedback with the help of multiple Microsoft Kinects. The advent of the Microsoft Kinect provided the research community with an inexpensive yet extremely valuable piece of equipment for obtaining real-time 3D information about a scene. We use this information to let users see their own self-representations in virtual environments. We also introduce an IR-based position tracking system implemented using only Kinects, and we present an analysis of the jitter and range of such a tracking system, as well as the effects of interference on its performance, for a common VR lab setup.
1 INTRODUCTION
The level of presence a user feels in a virtual environment can be enhanced by real-time visual body feedback. Although different kinds of visual body feedback [2], such as computer-generated self-avatars or self-body feedback based on simple skin segmentation, are possible, these are not true representations of the users. For an accurate user representation, we need real-time 3D user information so that we can reconstruct the user within the virtual environment. This type of self-representation is also essential in VR applications that involve multiple users, because only true representations allow users to identify one another by their real identities. Traditional depth sensors have been too expensive for this purpose, but the introduction of the Kinect sensor instigated much interest across the VR community, primarily due to the rich array of sensors it offers for capturing 3D scene information at a very low cost of approximately U.S. $150.
We use the color and depth data from multiple Kinects to reconstruct a 3D representation of the user in the virtual environment. We also plan to use the skeletal joint data provided by the Microsoft Kinect SDK or OpenNI to let users interact naturally with gestures across large 3D lab spaces. The problem with these APIs, though, is that their tracking range is roughly 1-3 meters, which is very small compared to VR lab spaces that are typically on the order of 10-12 meters. We plan to overcome this problem using multiple Kinects arranged across the lab, not only to get a 360° view of the scene for a 3D representation of it, but also to get 3D skeletal tracking of the user across the entire space.
Another potential problem with the Microsoft Kinect is that, when used in conjunction with optical position tracking systems like WorldViz PPTH, the tracking system will not be able to distinguish the IR light from the Kinect sensor from the actual IR marker, meaning that we won't be able to use Kinects with existing tracking systems for VR applications. In an attempt to solve this problem, we analyzed the ability of the Kinect to act as a tracking system on its own.
Figure 1: Multiple Kinects arranged in a circular manner. The circled Kinect is the primary tracker; the Kinects circled with dotted lines provide IR interference.
The Kinect tracking system we propose works by tracking an IR marker attached to a target in the tracked lab space.
In this paper, we present an analysis of the jitter and tracking range of such a tracking system, as well as the effects of IR interference from other Kinects in the scene. Section 2 covers related work. Section 3 discusses the implementation of the augmented Kinect-based VR system, starting with the various issues involved and the approaches used to resolve them. Section 4 discusses the IR-based position tracking system and the experimental setup for its analysis. Section 5 discusses the results and limitations of the system. Section 6 concludes with a discussion of the future scope of the current system.
2 RELATED WORK
The Kinect has seen widespread adoption well outside the traditional game console market. Since its launch in late 2010, the Kinect has been used in numerous projects integrating with either the Microsoft-provided Kinect SDK or the open-source Linux driver that provides access to the RGB camera and depth-related information. Specifically, a Kinect sensor is a motion-sensing input device that comes with the Microsoft Xbox 360 console. The sensor contains an RGB camera, a structured infrared (IR) light projector, an infrared camera, and microphones. Some background information on the device is available from both Microsoft and other sources [7]. Kinects need to be accurately calibrated to achieve robust depth estimates [5, 3] and correct RGBD mapping. Microsoft's KinectFusion system demonstrated robust acquisition of depth data for 3D reconstruction [4].
3 VISUAL BODY FEEDBACK
The objective is to provide the user with real-time visual body feedback in HMD-based virtual environments. There are multiple issues involved in doing so; we discuss each issue and the approach followed to resolve it.
3.1 RGBD Mapping and 3D Point Cloud
For a given scene, the Kinect's RGB and depth sensors provide separate streams (640x480 pixels) that, by default, are not mapped together. This is because the RGB and IR cameras are not calibrated together, and the images they produce are at a small offset that varies from one Kinect to another. Also, the RGB camera has a slightly larger field of view than the IR camera, resulting in images at different scales from one another. To get an accurate RGBD mapping, we used RGBDemo v0.4, Nicholas Burrus' implementation of OpenCV chessboard recognition techniques for stereo camera calibration. Yet another problem is that the values reported by the depth camera are raw depth disparity values and are not real-world depths. Raw depth values typically range over [0, 2047], with 2047 representing invalid depths. So, we used Stéphane Magnenat's depth conversion function, which converts these raw depth disparity values into meters. This allows us to represent the entire scene as a colored 3D point cloud.
The following equations, obtained from Nicholas Burrus' and Stéphane Magnenat's posts, help us get the point cloud:
z_world = 0.1236 * tan(d_image / 2842.5 + 1.1863)
x_world = (x_image - cx_d) * z_world / fx_d
y_world = (y_image - cy_d) * z_world / fy_d

where fx_d, fy_d, cx_d, and cy_d are the intrinsics of the depth camera. The reprojected 3D point on the color image, P'_world, is then:

P'_world = R * P_world + T
x_rgb = (x'_world * fx_rgb / z'_world) + cx_rgb
y_rgb = (y'_world * fy_rgb / z'_world) + cy_rgb

where R and T are the rotation and translation parameters estimated during the stereo calibration.
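To make the mapping concrete, the following is a minimal Python/NumPy sketch of the conversion from one raw Kinect frame to a colored point cloud. The intrinsic and extrinsic values shown are placeholders only; the real numbers come from the stereo calibration described above.

```python
import numpy as np

# Placeholder depth-camera intrinsics and depth-to-RGB extrinsics;
# the actual values come from the stereo calibration step (RGBDemo).
fx_d, fy_d, cx_d, cy_d = 594.2, 591.0, 339.5, 242.7
fx_rgb, fy_rgb, cx_rgb, cy_rgb = 529.2, 525.6, 328.9, 267.5
R = np.eye(3)                      # rotation from depth to RGB camera
T = np.array([0.025, 0.0, 0.0])    # translation in meters (roughly the sensor baseline)

def raw_depth_to_meters(d_raw):
    """Raw-disparity-to-meters approximation from Stephane Magnenat's post."""
    return 0.1236 * np.tan(d_raw / 2842.5 + 1.1863)

def point_cloud(depth_raw, rgb):
    """Build an N x 6 array of (x, y, z, r, g, b) points from one Kinect frame."""
    h, w = depth_raw.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth_raw < 2047                      # 2047 marks invalid depth
    z = raw_depth_to_meters(depth_raw[valid])
    x = (xs[valid] - cx_d) * z / fx_d
    y = (ys[valid] - cy_d) * z / fy_d
    pts = np.stack([x, y, z], axis=1)

    # Reproject each 3D point into the color image to fetch its color.
    pts_rgb = pts @ R.T + T
    u = np.clip((pts_rgb[:, 0] * fx_rgb / pts_rgb[:, 2] + cx_rgb).astype(int), 0, w - 1)
    v = np.clip((pts_rgb[:, 1] * fy_rgb / pts_rgb[:, 2] + cy_rgb).astype(int), 0, h - 1)
    colors = rgb[v, u]
    return np.hstack([pts, colors])
```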
3.2 User Representation and Segmentation Using Multiple Kinects
From the 3D point cloud of the scene, we aim to extract the user(s) from the background. Color-based segmentation won't work well here because, ideally, we want to extract the user information under any lighting conditions. Hence, we employed depth-based background subtraction techniques instead. This includes applying a series of OpenCV image morphological operations, such as erode and close, to remove any noise that might be present in the source and destination depth images. The resulting depth image contains depths of only the foreground objects, which in our case are the user(s). This image is then used as a mask to obtain just those 3D points that correspond to the user(s) in the scene. These 3D points are used to build triangles that finally represent the user in the scene.
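The following is a rough Python/OpenCV sketch of this kind of depth-based background subtraction; the difference threshold and kernel size are illustrative assumptions, not the exact values used in our system.

```python
import cv2
import numpy as np

def user_mask(background_depth, current_depth, min_diff=30, kernel_size=5):
    """Return a binary mask of foreground (user) pixels from two raw depth images."""
    # Ignore pixels that are invalid (raw value 2047) in either image.
    valid = (background_depth < 2047) & (current_depth < 2047)

    # Foreground = pixels noticeably closer than the stored background.
    diff = background_depth.astype(np.int32) - current_depth.astype(np.int32)
    mask = ((diff > min_diff) & valid).astype(np.uint8) * 255

    # Morphological open/close to remove speckle noise and fill small holes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask

# The resulting mask selects which depth pixels feed the point-cloud
# conversion of Section 3.1, leaving only points belonging to the user(s).
```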
However, the user representation obtained above corresponds to just a single view. To obtain a 360° view of the user, the point clouds from multiple Kinects must be merged into the same virtual environment. To do this, multiple Kinects can be calibrated with one another using any of various calibration programs, such as Nicholas Burrus' RGBDemo v0.6, Oliver Kreylos' KinectViewer, or manual calibration.
Oliver Kreylos' KinectViewer is semi-automated, in the sense that it requires manual identification of points that are then given to a point alignment program. We used our Kinect-based position tracking system to capture the points for us and then aligned those points using Oliver Kreylos' point alignment program. Figure 2 shows two different views of the segmented user using two Kinects.
Figure 2: Merged User Representation based on two Kinects.
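For reference, the point alignment step amounts to estimating a rigid transform between corresponding points captured in two Kinects' coordinate frames. The sketch below shows a standard SVD-based (Kabsch) solution to that problem; it is an illustration, not Kreylos' actual implementation.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation R and translation t such that R @ src_i + t best matches dst_i.

    src, dst: N x 3 arrays of corresponding points captured in two Kinects' frames.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t
```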
4 TRACKING POSITIONS WITH KINECTS
We analyze the Kinect's ability to robustly track an IR marker across the space of a 7 m x 10 m lab, focusing on the jitter of position data acquired over distance and in the presence of multiple Kinect sensors that can potentially cause severe IR interference. The overall objective of this work is to understand the efficacy of using multiple Microsoft Kinects as tracking devices for monitoring a user's position and skeletal system in a VR application. While a single Kinect is capable of tracking an IR light within the lab, multiple Kinects afford different views of users within the tracked space that are not possible using a single Kinect. Combining multiple views can strengthen the skeletal tracking mechanisms provided by APIs such as Microsoft's Kinect SDK or the OpenNI framework. However, these APIs' effective ranges are limited to about 2 to 3 meters, which could limit usable mobility in VR applications, where tracked spaces more typically range from 10 to 12 meters in size. Understanding the tracking range of the Kinect is therefore an important factor to consider when using Kinects for position tracking.
The operational range of skeletal tracking APIs is limited, likely because the depth resolution of the Kinect decreases with distance from the sensor. Hence, it becomes difficult to estimate skeletal joint positions as users get farther away. The tracking system described in this paper exploits the fact that, even though it is difficult to estimate the entire set of skeletal positions beyond a certain range, it is still possible to get depth values reliable enough to estimate the user's position across a large lab space. To get reliable position tracking, we use a single IR marker attached to the user's head that can be tracked over a large space. While this section focuses on IR tracking, we specifically test the sensitivity of the tracking to IR interference from multiple Kinects. The reason interference from multiple Kinects may be an issue involves using multiple Kinects to improve skeletal tracking: with multiple Kinects, the IR projectors will potentially project structured IR light into the scene and possibly into other Kinects' IR sensors. In theory, this interference could cause problems with the depth values from the Kinect. Some research has been done to reduce interference by using a Kinect-shuttering approach [6, 1]. Our experiments specifically focus on testing the Kinect-based IR tracking system against interference from multiple Kinects.
4.1 Tracking the IR Marker
There are a couple of issues that need to be resolved in order to obtain the marker's world position. First, the IR marker creates its own small circle of IR interference that makes it difficult to precisely identify it in the IR image. Moreover, interference from the other Kinects does cause problems for detecting the IR marker; in these situations, our software can confuse the actual IR marker with the bright IR interference from the additional Kinects. To overcome these issues, we apply locality-of-neighborhood and static background separation techniques.
Background depth information of the entire scene is captured by
merging depth images of the same scene obtained over time. The
acquisition of the background depth image is done once upon starting the tracker system. The background image is used to extract
foreground images in real time by a simple background separation technique using OpenCV.

Figure 3: Top left image is from the RGB camera; top right is the depth information collected for the static background; bottom left is the current depth image with the tracked object; and bottom right is the foreground depth information after subtracting the static background.

A series of OpenCV image morphological operations are performed on the extracted foreground depth and
IR images to remove any external noise. These images can then be used to extract the position and depth information of the IR marker. Each IR image from the Kinect is analyzed in real time to find the brightest pixel in the foreground (in this case, our IR marker). Figure 3 illustrates the segmented depth information during this process. Pixel depth is calculated using a locality-of-neighborhood principle, which estimates the IR marker's depth value by examining the surrounding pixels outside the marker's small circle of interference in the depth image. Using a 15 by 15 pixel neighborhood window to calculate the mean of all the valid depths worked well for this implementation.
Real-world IR marker positions are then computed from the extracted raw values using the equations listed in Section 3.1.
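A condensed Python sketch of the marker-detection step described above follows. The inputs are assumed to be the cleaned foreground IR and raw depth images from the background separation step, and the 15 by 15 window follows the text; the exact handling of image borders is an assumption.

```python
import numpy as np

def marker_pixel_and_depth(ir_fg, depth_fg, win=15):
    """Locate the brightest foreground IR pixel and estimate its depth.

    ir_fg, depth_fg: foreground IR and raw depth images after background
    subtraction and morphological cleanup.
    """
    # The brightest foreground pixel in the IR image is taken to be the marker.
    v, u = np.unravel_index(np.argmax(ir_fg), ir_fg.shape)

    # The marker saturates its own small circle of pixels (often invalid depth),
    # so the depth is taken as the mean of valid depths in the neighborhood
    # rather than from the marker pixel itself.
    half = win // 2
    patch = depth_fg[max(v - half, 0):v + half + 1, max(u - half, 0):u + half + 1]
    valid = patch[(patch > 0) & (patch < 2047)]
    depth_raw = valid.mean() if valid.size else None
    return u, v, depth_raw
```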
4.2 Experimental Setup
Our experimental setup consists of 5 Kinects mounted on chairs of the same height, with one being the primary Kinect (Figure 1) while the remaining Kinects introduce interference. They are all arranged to provide tracking centered on a 7 m by 7 m space. We used a WorldViz IR marker mounted on a cart as the tracked point. The primary Kinect is calibrated as described in Section 3.1, and the tracked space consists of 30 manually placed points that are roughly a meter apart from each other. At each point, 1000 samples of the IR marker position are collected and then averaged to get the final estimate of the tracked position. This is done both with and without interference from the other Kinects, and the entire step is repeated for all 30 points. The jitter in these IR marker positions is then analyzed over all 1000 samples, with and without interference.
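As a simple illustration of this analysis, the per-point position estimate and jitter can be summarized as below; this is a hypothetical helper, not our actual analysis script.

```python
import numpy as np

def summarize_point(samples):
    """samples: 1000 x 3 array of tracked (x, y, z) marker positions for one grid point.

    Returns the averaged position estimate and the per-axis jitter (standard deviation).
    """
    return samples.mean(axis=0), samples.std(axis=0)

# Example: compare jitter for one of the 30 points with and without interference.
# mean_clear, sd_clear = summarize_point(samples_without_interference)
# mean_intf,  sd_intf  = summarize_point(samples_with_interference)
```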
5 RESULTS AND DISCUSSION
Figure 4 compares the tracked positions with and without interference. The green circles represent the estimates of the points measured without interference, while the red crosses represent the estimates of the points measured with interference from other Kinects. Standard deviations of reported pixel positions were nearly zero in all cases except for the few points shown in the figure where the interference is maximal. For a single point with maximal interference, the mean raw depth value was 1016.0 (SD = 89.6723); without interference at this location, the mean depth value was 1039.6 (SD = 1.0484). Additionally, when compared with our WorldViz PPTH system, the Kinect was off by roughly 3 cm on average.
These results suggest that the effect of interference is almost negligible. The few points in Figure 4 where the interference was maximal are those points that are very close to the Kinects in Figure 1. Also, the jitter of the positions reported by the Kinect is uniform across all depths.
There are a few limitations with our current system, though. For instance, it is very difficult to track more than one IR marker and thereby multiple users. One possible way of overcoming this limitation to some extent would be to use skeletal joints to compensate for the need for multiple markers, and to use spatio-temporal information of those joints/markers to distinguish between IR markers on different users. Another limitation is that, since the current Kinect hardware supports only one of the RGB or IR streams at a time, it is not possible to provide visual body feedback and tracking with the same Kinect. Therefore, some Kinects have to be used solely for tracking while the others provide visual body feedback.

Figure 4: Comparison of the Kinect tracking data aligned with data acquired from a WorldViz PPTH tracking system.
6 CONCLUSION
We observed that interference is not a major concern for IR tracking with multiple Kinects. Thus, based on our initial experiments, we believe the Kinect can serve as a low-cost tracking system. Also, with the help of the Microsoft SDK or OpenNI skeletal tracking capability, multiple Kinects can be used to implement a more functional tracking system that allows multiple users to not only see themselves in real time but also interact with one another using natural gestures across VR lab spaces. In addition, it should even be possible to obtain orientation data for the user based on the skeletal joint information.
REFERENCES
[1] K. Berger, K. Ruhl, C. Brümmer, Y. Schröder, A. Scholz, and M. Magnor. Markerless motion capture using multiple color-depth sensors. In Proceedings of Vision, Modeling and Visualization (VMV), pages 317-324, 2011.
[2] G. Bruder, F. Steinicke, K. Rothaus, and K. Hinrichs. Enhancing presence in head-mounted display environments by visual body feedback using head-mounted cameras.
[3] C. D. Herrera, J. Kannala, and J. Heikkilä. Accurate and practical calibration of a depth and color camera pair. In Proceedings of the International Conference on Computer Analysis of Images and Patterns (CAIP), pages 437-445, 2011.
[4] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST), pages 559-568, 2011.
[5] K. Khoshelham. Accuracy analysis of Kinect depth data. In ISPRS Workshop on Laser Scanning 2011, 2011.
[6] Y. Schröder, A. Scholz, K. Berger, K. Ruhl, S. Guthe, and M. Magnor. Multiple Kinect studies. Technical Report 09-15, ICG, TU Braunschweig, 2011.
[7] J. Smisek, M. Jancosek, and T. Pajdla. 3D with Kinect. Technical report, CTU Prague, 2011.