>>: Welcome everyone. Today we are honored to have Dr. Zhengyou Zhang come here.
Zhengyou has great experience and is well known. He is a research manager here, of
multimedia interaction and experience. He has published 200 papers. He is an IEEE Fellow
and an ACM Fellow. He is the Founding Editor-in-Chief of the IEEE Transactions on
Autonomous Mental Development, and has served on the editorial boards of IEEE TPAMI,
IEEE TCSVT, IEEE TMM and many, many conferences.
He has also served as a program chair, a general chair, and a program committee member of
numerous international conferences in the areas of computer vision, audio and speech signal
processing, multimedia and human-computer interaction. He was a General Chair of the
International Conference on Multimodal Interaction (ICMI) 2015, and a General Chair of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017. He received the
IEEE Helmholtz Test of Time Award for a paper published in 1999. So let's welcome the
distinguished Zhengyou Zhang.
[Applause]
>> Zhengyou Zhang: Thank you Sheri for the introduction, and good evening. Thank you for
coming to my talk. This is part of the IEEE [indiscernible] Tech Talk series. Today I will talk
about how to combine Surface Hub and Kinect to achieve a more natural interaction with a big
display, as well as immersive collaboration between people who are distributed geographically.
Surface Hub is a large display; I will talk about it a little bit later. It is like a giant iPad, but it's
better. Kinect is an RGB-D sensor; it directly captures depth information.
So if we take a look at the office space, you see a lot of whiteboards. The whiteboard is a great
tool for collaboration because it is big, so everyone can focus their attention on the shared space,
and they can write freely and erase if they want. Most importantly, when they collaborate, when
they brainstorm, they can see each other's facial expressions, hand gestures, gaze, intentions, etc.
So this is an important tool, and that's why in the modern office space you see quite nice
whiteboards.
At the same time, more and more electronic displays are being installed in offices because they
are getting cheaper and cheaper. Those displays can be large and can also be touch enabled. So
how to leverage this trend is an interesting question for collaboration and interaction. Microsoft
announced its new line of products called Surface Hub, which will be released on January 1 next
year. Surface Hub is, as I mentioned earlier, like a giant iPad, but it's better. It provides the best
pen and touch experience on a large display.
There will be two versions: one is 55 inches and the other is 85 inches. The 85-inch Surface Hub
has a 4K display, so the display quality will be very good. And as I also mentioned, we will use
the Kinect sensor to enhance the Surface Hub. The Kinect sensor directly captures 3D
information. This means we can now see the surrounding environment not just from the
viewpoint of the sensor; we can move away from the viewpoint of the sensor to see the
environment. This is a video where the person tries to see it from different angles.
[Video]
>> Zhengyou Zhang: So traditionally in computer vision we use an RGB camera, a color
camera, to do computer vision tasks. A depth sensor offers some advantages. For example, the
Kinect sensor can work in low light or in a completely dark room because it emits infrared
light, whereas RGB will not work in low-light conditions. The Kinect sensor captures the depth
information directly, so the person in front will automatically pop out from the background;
with RGB, if the background is cluttered then it is very hard to segment the foreground from the
background. Also, because we directly get the depth information, the scale is known with the
depth sensor; with RGB, the 3D [indiscernible] is projected onto the 2D sensor, so the depth is
lost and we don't have the scale. But the two are really complementary, and that's why the
Kinect also has an RGB camera; that's why the Kinect is an RGB-D sensor. Yes?
>>: Is the sensing mode of a depth sensor ultrasonic?
>> Zhengyou Zhang: It is infrared. There are two versions: the original Kinect projects random
dots onto the [indiscernible], and the second version uses time of flight with different modulation
frequencies.
For us, actually, here is a diagram of what we think could be a new Surface Hub, integrating
depth sensors into the touch screen. So we can have a depth sensor on each wing: from this one
we can see the whole environment on this side, and from that one we can see the left side. The
depth sensor on top can see the whole room at the back.
>>: [inaudible].
>> Zhengyou Zhang: Yes, I will tell you.
>>: [inaudible].
>> Zhengyou Zhang: So yeah, I will tell you later. In our experimental setup we don't have
the [indiscernible] device yet, so we just put the Kinect sensor next to the large display. You
can see there is a little bit of a gap because the Kinect sensor has a minimum sensing distance,
so we need to put them a little further apart. By combining them we get a more natural and
immersive interaction with the touch board.
ViiBoard stands for Vision-enhanced Immersive Interaction with Touch Board. The system
consists of two subsystems. The first one is called VTouch and is about human-computer
interaction. The second one, called ImmerseBoard, is about human-human collaboration across
distance. I will show you more details.
In VTouch we leverage cues about the user from the Kinect sensor, for example: how far the
user is from the Surface Hub, where he is standing, who the person is, whether the person is
using the left hand or the right hand, and what the gesture is, in order to have a better interaction.
For ImmerseBoard, because the depth sensor can [indiscernible], we can manipulate the person
in [indiscernible] and render it appropriately. So we can achieve a real-time immersive
experience: we can see the reference point of the remote person, share the feeling that we are
standing in the same space, and be aware of the gaze of the remote person, predicting the
intention. So we really try to achieve the experience that the two people are standing side by
side in front of the same whiteboard.
So now let's take a look at VTouch, the first subsystem, the vision-enhanced touch experience.
As I mentioned earlier, we really want to leverage the Kinect sensor to extract cues about the
user: position, proximity, person ID, hand ID, gesture ID and intention. Here is a system
diagram. From the Kinect sensor we apply a number of computer vision algorithms and extract
the cues. Then we use the cues about the user to design interactive applications that work with a
large display. The four key vision technologies are sensor-display calibration, human skeletal
tracking, hand gesture recognition and person recognition. I will explain them one by one
shortly.
The first technology is sensor-display calibration. We use the Kinect to sense the user, but the
sensing result is [indiscernible] in the coordinate system of the Kinect, while the interaction is
done in the coordinate system of the display. So the two need to be brought together, and that is
the calibration process: we need to determine the rotation matrix and the translation vector
between the two. This can be done easily by touching a few points on the display. By touching
the display we get the touch point, which is in the coordinate system of the display, and at the
same time the fingertip position x, y, z is obtained from the Kinect sensor. From these
correspondences we can compute the rotation matrix and the translation, so the calibration can
be done fairly easily.
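To make the step concrete, here is a minimal sketch (Python, not the ViiBoard code) of how the
rotation matrix and translation vector could be estimated from a few touched points, assuming
we have paired fingertip positions from the Kinect and touch positions expressed as 3D points
on the display plane (z = 0):

    # A minimal sketch of estimating the rotation matrix R and translation t
    # that map Kinect coordinates to display coordinates from a few touched
    # points, using the standard SVD-based (Kabsch / Procrustes) solution.
    import numpy as np

    def calibrate_sensor_to_display(kinect_pts, display_pts):
        """kinect_pts: Nx3 fingertip positions from the Kinect (meters).
        display_pts: Nx3 touch positions in display coordinates (z = 0 on the
        screen plane).  Returns (R, t) such that display ~= R @ kinect + t."""
        P = np.asarray(kinect_pts, dtype=float)
        Q = np.asarray(display_pts, dtype=float)
        p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
        # Cross-covariance of the centered point sets.
        H = (P - p_mean).T @ (Q - q_mean)
        U, _, Vt = np.linalg.svd(H)
        # Guard against a reflection solution.
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        t = q_mean - R @ p_mean
        return R, t

    # Usage: touch four (or more) known points on the display while the Kinect
    # tracks the fingertip; then map any tracked joint into display coordinates.
    # R, t = calibrate_sensor_to_display(fingertips_kinect, touches_display)
    # joint_on_display = R @ joint_in_kinect + t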
The second technology is human skeletal tracking. A human in our case is represented by a set
of joints, like the head, neck, shoulder, etc. Each joint is represented by its x, y, z coordinates,
and from the x, y, z we can compute angles, etc. The Kinect sensor tracks 20 body joints in
real time. Here is the pipeline of how it works. From the depth image, each depth pixel is
classified into a body part. For example, if we take this pixel, we say, "Okay, this is probably
the right shoulder, and this pixel probably belongs to the right hand," etc. So that is a per-pixel
inference for each depth pixel.
But this can be very noisy because it is determined per pixel. So we then aggregate the
information in the neighborhood, and this generates hypotheses for the body joints. This is still
too noisy, so at the last stage we apply the kinematic constraints as well as the [indiscernible],
and this gives us the tracking result. Here I will show you an example. This is an input depth
image, and that is the inferred body parts. This is a different view of the joints.
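As a rough illustration of that pipeline, the sketch below (illustrative only) treats the per-pixel
body-part classifier as a black box standing in for the Kinect's decision forest, and uses a
confidence-weighted mean plus simple temporal smoothing to stand in for the hypothesis
aggregation and kinematic-constraint stages; the part count and thresholds are assumptions:

    # Simplified pipeline: per-pixel body-part labels -> joint hypotheses ->
    # smoothed joints.  back_project and pixel_classifier are assumed helpers.
    import numpy as np

    NUM_PARTS = 31            # number of body-part labels (assumption)

    def joints_from_depth(depth_image, pixel_classifier, back_project,
                          prev_joints=None, alpha=0.7):
        """depth_image: HxW depth map in meters.
        pixel_classifier: returns an HxWxNUM_PARTS array of per-pixel part
        probabilities (stand-in for the decision forest).
        back_project: maps (row, col, depth) arrays to Nx3 camera-space points."""
        probs = pixel_classifier(depth_image)              # per-pixel inference
        H, W = depth_image.shape
        rows, cols = np.mgrid[0:H, 0:W]
        joints = {}
        for part in range(NUM_PARTS):
            w = probs[:, :, part]
            mask = w > 0.3                                  # keep confident pixels
            if mask.sum() < 20:                             # too little evidence
                continue
            pts = back_project(rows[mask], cols[mask], depth_image[mask])
            # Aggregate noisy per-pixel evidence into one joint hypothesis
            # (a confidence-weighted mean; the real system uses mean shift).
            joints[part] = np.average(pts, axis=0, weights=w[mask])
        if prev_joints:
            # Crude temporal smoothing standing in for the kinematic constraints.
            for part, p in joints.items():
                if part in prev_joints:
                    joints[part] = alpha * p + (1 - alpha) * prev_joints[part]
        return joints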
For human-computer interaction, hand gestures are very important, so here I will talk about
how to do hand gesture recognition. In our current system we need only a few gestures. There
are actually two sets of gestures: near mode, when the person is very close to the board, which
includes the color palette gesture to choose different colors, typing and pointing; and far mode,
when the person is pretty far from the board, where pointing, clicking, etc. are needed.
And what do we do? We first segment the hand from the depth image and cut out the hand
region. Then we build a local occupancy pattern around the hand: we divide the region into a
number of cells and count the number of points in each cell. Then we use a multi-class support
vector machine to distinguish the near-mode left hand, the near-mode right hand and the
different gestures. The training data we have now is only from four people; we are working with
more people now, and each person is asked to work in front of the large display with each
gesture. At run time, based on the position of the person, we determine whether we need to
apply the near-mode or the far-mode gestures, and then we apply the corresponding classifier.
That's for the hand gesture.
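A hedged sketch of those two steps, the local occupancy pattern feature and the multi-class SVM
with near/far mode selection, might look like the following; the grid size, distance threshold and
classifier settings are assumptions rather than the trained ViiBoard models:

    # Local occupancy pattern + multi-class SVM gesture classification sketch.
    import numpy as np
    from sklearn.svm import SVC

    GRID = (6, 6, 6)          # cells of the local occupancy pattern (assumption)

    def local_occupancy_pattern(hand_points, hand_center, cube_size=0.24):
        """hand_points: Nx3 points (meters) segmented around the hand joint.
        Counts points in each cell of a cube centered on the hand."""
        rel = (hand_points - hand_center) / cube_size + 0.5       # -> [0, 1)
        rel = rel[np.all((rel >= 0) & (rel < 1), axis=1)]          # keep inside cube
        idx = (rel * GRID).astype(int)
        hist = np.zeros(GRID)
        np.add.at(hist, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
        return hist.ravel() / max(len(hand_points), 1)             # normalized counts

    # Two classifiers, one per mode, trained on the recorded gesture data.
    near_clf = SVC(kernel="rbf")   # e.g. palm-open, typing, pointing, left/right hand
    far_clf = SVC(kernel="rbf")    # e.g. pointing, clicking
    # near_clf.fit(near_features, near_labels); far_clf.fit(far_features, far_labels)

    def classify_gesture(hand_points, hand_center, user_to_display_distance):
        feat = local_occupancy_pattern(hand_points, hand_center).reshape(1, -1)
        # Pick near or far mode from the user's distance to the board
        # (threshold is an assumption).
        clf = near_clf if user_to_display_distance < 0.6 else far_clf
        return clf.predict(feat)[0]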
For person recognition, I'm sure everyone knows we do face recognition, but we actually do a
little bit more [indiscernible]: we combine the face with the body appearance, like clothing. The
reason is that when you work in front of the Surface Hub, sometimes you are not facing the
Kinect sensor, so you don't get the front view of the person; but the body appearance, the
clothing, you can see from everywhere, and it is usually stable, at least for the whole day: you
don't change clothes during the day. So even if you don't see the face, the clothing will help.
And if sometimes the appearance is ambiguous between different people, once you see the face
they can be easily [indiscernible].
So we really try to combine the advantages of the two modalities, the two parts. The face gives
you accurate recognition for the front view, but it is very bad for side views and occlusions.
Body appearance can be confused when people are wearing similar clothing, but it is robust for
side views. For the face recognition part, usually you extract features and then you have a
classifier. In our system we use multi-scale LBP, local binary pattern, descriptors. We get a lot
of features, but we use PCA to reduce them to 2,000 dimensions, and then we use a Joint
Bayesian method to determine whether this is the user or not.
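For illustration, a simplified version of that face pipeline (multi-scale LBP descriptors plus PCA)
could be sketched as below; the block layout and radii are assumptions, and a cosine-similarity
threshold stands in for the Joint Bayesian verification, which is a separate published method:

    # Multi-scale LBP face descriptor + PCA sketch (illustrative only).
    import numpy as np
    from skimage.feature import local_binary_pattern
    from sklearn.decomposition import PCA

    SCALES = [1, 2, 3, 4]     # LBP radii (assumption)
    BLOCKS = 16               # face divided into BLOCKS x BLOCKS cells (assumption)

    def multiscale_lbp(face_gray):
        """face_gray: cropped, aligned face as a 2D grayscale array.
        Returns a high-dimensional raw descriptor (later reduced by PCA)."""
        feats = []
        h, w = face_gray.shape
        for r in SCALES:
            lbp = local_binary_pattern(face_gray, P=8, R=r, method="uniform")
            for by in range(BLOCKS):
                for bx in range(BLOCKS):
                    cell = lbp[by * h // BLOCKS:(by + 1) * h // BLOCKS,
                               bx * w // BLOCKS:(bx + 1) * w // BLOCKS]
                    hist, _ = np.histogram(cell, bins=10, range=(0, 10), density=True)
                    feats.append(hist)
        return np.concatenate(feats)

    # As in the talk, PCA reduces the raw descriptor to roughly 2,000 dimensions:
    # pca = PCA(n_components=2000).fit(training_descriptors)

    def same_person(face_a, face_b, pca, threshold=0.8):
        fa = pca.transform(multiscale_lbp(face_a).reshape(1, -1))[0]
        fb = pca.transform(multiscale_lbp(face_b).reshape(1, -1))[0]
        # Cosine similarity standing in for the Joint Bayesian log-likelihood ratio.
        score = fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-9)
        return score > threshold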
>>: [indiscernible].
>> Zhengyou Zhang: For the body appearance, we are tracking the person, so we have the upper
body, lower body, hands, etc. We just use a few parts of the body, and for each body part we
also have different angles based on the pose orientation with respect to the Kinect sensor. Then
for each part and each orientation we have a color histogram, and based on the color histograms
we can construct a model of the appearance. Here I will show you some results. The person
comes in and is initially unknown, but as soon as the front view is seen the person is recognized.
He is unknown and then quickly recognized, and it is robust to occlusion. Now the person is
trying to defeat the system by changing clothes. He comes in with different clothes and initially
is unknown, but as soon as the person is seen from the front view, the name and face will be
associated with the new appearance model.
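A minimal sketch of such an appearance model, keeping one color histogram per body part and
per coarse body orientation and matching by histogram intersection, is shown below; the part
list, orientation quantization and bin counts are assumptions:

    # Per-part, per-orientation color-histogram appearance model (sketch).
    import numpy as np

    PARTS = ["upper_body", "lower_body", "left_hand", "right_hand"]
    ORIENTATION_BINS = 8                      # body yaw quantized into 8 sectors
    HIST_BINS = (8, 8, 8)                     # RGB histogram resolution

    def color_histogram(pixels_rgb):
        """pixels_rgb: Nx3 uint8 pixels belonging to one body part."""
        hist, _ = np.histogramdd(pixels_rgb, bins=HIST_BINS, range=[(0, 256)] * 3)
        return hist.ravel() / max(len(pixels_rgb), 1)

    class AppearanceModel:
        def __init__(self):
            # model[(part, orientation_bin)] -> running-average histogram
            self.model = {}

        def update(self, part, yaw, pixels_rgb, rate=0.1):
            key = (part, int(yaw // (360 / ORIENTATION_BINS)) % ORIENTATION_BINS)
            h = color_histogram(pixels_rgb)
            old = self.model.get(key)
            self.model[key] = h if old is None else (1 - rate) * old + rate * h

        def similarity(self, part, yaw, pixels_rgb):
            key = (part, int(yaw // (360 / ORIENTATION_BINS)) % ORIENTATION_BINS)
            if key not in self.model:
                return 0.0
            h = color_histogram(pixels_rgb)
            # Histogram intersection: 1.0 means identical clothing appearance.
            return np.minimum(self.model[key], h).sum()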
>>: So what’s the size of population of the faces that you want to distinguish?
>> Zhengyou Zhang: So in our system usually it is a team member, so 10 people.
>>: Oh, so small.
>>: How frequently are –?
>> Zhengyou Zhang: Pretty robust, yeah.
>>: So you have 100,000 dense sample points for facial recognition using PCA to reduce the
rank of that problem to 2,000. That’s a great big matrix.
>> Zhengyou Zhang: It’s a big matrix.
>>: How often do you have to invert that? Is it once per appearance of a new face or is it
regularly done or periodically done?
>> Zhengyou Zhang: So that’s pretty fast.
>>: 100,000 to 100,000 is big.
>> Zhengyou Zhang: No, no, that's fast. But in our case actually we don't run it every frame;
it's about every 5 or 6 frames, because in the meantime we can track, right. It runs in a
multi-threaded execution.
>>: [inaudible].
>> Zhengyou Zhang: So we enroll everyone. It will be in the database.
>>: [inaudible].
>> Zhengyou Zhang: [indiscernible].
>>: No the main problem for this biometric is that you want to identify whether this guy
[inaudible].
>> Zhengyou Zhang: So if the database is infinite then yeah, there is a lot of confusion.
>>: [inaudible].
>> Zhengyou Zhang: So when you have a smaller set in the database that you want to recognize,
it's much easier.
>>: So you probably don’t make any mistakes.
>> Zhengyou Zhang: Still, from a side view it is harder. So now I will show you how VTouch
works. Here are a few applications. For example, it can bring up a menu without touch and
display the menu wherever you are. It can augment touch with HandID, PersonID and
GestureID, plus hover, which I will explain, pointing, and auto-lock of the display. When you
walk away from the display it can automatically lock the screen for the privacy of the content,
and when you come back, if you are part of the [indiscernible], it can unlock for you to continue
the session. Here are a few screenshots; I will show you the video.
For example, showing the palm open will automatically bring up the color palette menu, so you
don't need to go to the start menu, drawing, color, etc.; you just show the palm open. The same
thing for typing. We also recognize different people, so we can determine who wrote what: in
this case this is user one and this is user two. And here, when two people play a game, we
automatically know who is who, so you don't need to change the color; it will automatically
assign different colors to different people. Here is the screen locking, etc. And when the person
is pretty far away we can have pointing, and we distinguish whether you are using the left hand
or the right hand to point. We also distinguish whether you are writing with the left hand, the
right hand or with a pen.
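The auto lock/unlock behavior can be pictured with a small sketch like the following, which is
assumed logic rather than the shipped implementation: lock when the tracked user walks beyond
a distance threshold, and unlock only for a recognized, authorized user:

    # Proximity- and identity-driven auto lock/unlock (assumed logic).
    import time

    LOCK_DISTANCE_M = 2.5      # walk-away threshold (assumption)
    LOCK_DELAY_S = 5.0         # grace period before locking (assumption)

    class AutoLock:
        def __init__(self, authorized_users, lock_fn, unlock_fn):
            self.authorized = set(authorized_users)
            self.lock_fn, self.unlock_fn = lock_fn, unlock_fn
            self.locked = False
            self.away_since = None

        def update(self, user_id, distance_to_display):
            """Call once per tracking frame.  user_id is None when nobody is seen."""
            away = user_id is None or distance_to_display > LOCK_DISTANCE_M
            now = time.monotonic()
            if away:
                if self.away_since is None:
                    self.away_since = now
                if not self.locked and now - self.away_since > LOCK_DELAY_S:
                    self.lock_fn()
                    self.locked = True
            else:
                self.away_since = None
                if self.locked and user_id in self.authorized:
                    self.unlock_fn()      # resume the session for a known user
                    self.locked = False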
Another thing which is usually missing on a touch display is hover. With a mouse, when you
hover in front of an object, a context menu automatically comes out; with touch you usually
have to touch in order to do anything. With ViiBoard we know whether your hand is close to an
object on the screen. In this case, as we move close to Chicago, the context menu comes out,
and over the other parts we know there is no context menu. Then you can touch the context
menu, etc. Also, the menu can follow you, so wherever you go the menu is close at hand; you
don't need to walk back to the start area to get the menu, especially when you have an 85-inch
display.
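Hover can be pictured as follows: once the fingertip is expressed in display coordinates via the
calibration above, its z value is its distance to the screen plane, and a hover band sits just above
the touch threshold. The sketch below is illustrative, with assumed thresholds and a simple
nearest-object test:

    # Hover detection from the calibrated fingertip position (sketch).
    import numpy as np

    TOUCH_Z_M = 0.01        # closer than this counts as touch
    HOVER_Z_M = 0.10        # within this band (but not touching) counts as hover
    HOVER_RADIUS_M = 0.08   # how close in x/y the fingertip must be to an object

    def hover_target(fingertip_kinect, R, t, objects):
        """fingertip_kinect: 3-vector from the Kinect.  R, t: calibration result.
        objects: list of (object_id, x, y) positions on the display plane (meters).
        Returns the id of the hovered object, or None."""
        x, y, z = R @ np.asarray(fingertip_kinect) + t      # display coordinates
        if not (TOUCH_Z_M < z < HOVER_Z_M):
            return None                                      # touching or too far
        best, best_d = None, HOVER_RADIUS_M
        for obj_id, ox, oy in objects:
            d = np.hypot(x - ox, y - oy)
            if d < best_d:
                best, best_d = obj_id, d
        return best                                          # e.g. show its context menu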
So now let me play a short video. Phil here is part of the project, and he is narrating the video.
[Video]
>>: [inaudible]. Thanks to a Kinect pointed across the display, VTouch can understand where
the user is, who the user is and what the user is doing, even before the user touches the display.
Gestures such as an open palm may bring up a color palette from which the left hand may select
erase, the pen may select green and the right hand may select red. These selections are
remembered: erase with the left hand, write green with the pen and write red with the right hand.
Another gesture, a pair of high fives, opens a keyboard, making it easy to enter text anywhere on
the screen. Hovering over the text brings up a menu of attributes which may be selected to
change the color or size of the text.
Different users can also be distinguished using vision. Another user may select his own color
and stroke width. When the first user returns his settings are remembered. The strokes and texts
are tagged as belonging to user 1 or user 2. VTouch knows the user's location. Not only can it
bring up a menu near the user, but it can keep the menu near the user so he does not have to hike
back to the far end of a large display. Pointing for interaction when the user is away from the
board is also possible. Pointer gesture detection is robust against other gestures. The right hand
is distinguished from the left. Selection is also possible.
VTouch requires the Kinect to be calibrated with the display, but calibration is easy, simply by
touching 4 points on the board. VTouch uses vision to enhance touch by detecting gestures
whether far or near to the screen, distinguishing between hands when they touch the screen,
restoring hover to touch screens, distinguishing between users and tracking their positions.
VTouch, vision enhanced interaction for large touch displays.
[End Video]
>> Zhengyou Zhang: So we have conducted a user study, and most people really like it. Just
one person in a few says it is distracting, and one person feels that left hand/right hand is too
difficult to remember; they just prefer to always touch with one hand rather than left/right. But
other than that, people feel it is very good.
So let's move on to the second subsystem, called ImmerseBoard, vision-enhanced immersive
remote collaboration. Before we talk about remote collaboration, let's look at co-located
collaboration. For example, with two people standing in front of a physical whiteboard, you see
the other person in front of you: the facial expression, eye gaze, gestures, etc. So you have the
full bandwidth between people, and you also see the content creation immediately because it is
in the same space. Now let's look at the current situation for remote collaboration.
So what do you usually have? You have a window of the remote person, the talking head of the
remote person, and you have a shared space to collaborate with the remote person, but the two
are disjoint; they are not connected with each other. So what do you have? Between people you
only see the face, so the bandwidth is [indiscernible]. And for the content, you don't know how
the content was actually created; you just see the final result pop up on the screen, because you
don't see the gestures, pointing, etc.
So what we want to do is really try to bring the two spaces together and create a much richer
bandwidth between the people and between the content creation. Here is one example: the
person points here, and now the person can move out of the box and [indiscernible] can see. I
will show you this later. So again, just to repeat, the Kinect captures 3D information, so we can
manipulate the 3D information of the remote person and render it in a proper way on screen.
Then we can see the reference point of the remote person, feel that we are sharing the same
space, be aware of the gaze, and predict intention.
And we have implemented two metaphors. The first one is writing on a whiteboard side by side;
you will see how this is done. The second one is about two people writing on a mirror, so rather
than looking to the side you can look at the other person through the mirror, without turning
your head. This is a little bit different: if you have seen the glass-wall videos, there you have to
write backward, but in our case you don't need to write backward because you write on the same
side; you just look at the person on the other side, as if [indiscernible] in the mirror. We have
actually implemented three: two people writing side by side, writing in front of a mirror, and one
that is exactly the same as the current system, with a talking head and a shared space, but now
we manipulate the hand to touch the reference point.
Now let’s play the video.
[Video]
>>: ImmerseBoard is a system for remote collaboration through an electronic whiteboard with
immersive forms of video. Thanks to a Kinect camera mounted on the side of a large touch
display, we augment shared writing with important shared social cues, almost as if participants
were in the same room. This 2.5D hybrid between standard 2D video and full 3D immersion
shows the remote participant's 2D video on the side, but his hand is able to extend out of the
video frame in a non-physical way to write or to point to a place on the shared writing surface.
This visualization shows the remote participant in full 3D, as if the participants were standing
shoulder to shoulder in front of a physical whiteboard. Eye gaze direction is improved, as well
as gesture direction, body proximity and the overall sense of presence.
>> Zhengyou Zhang: So what we do here is: the Kinect captures the [indiscernible] shape of the
person, plus the color, etc. Then we have this whiteboard in [indiscernible] space, essentially the
Surface Hub surface, and then the person. Then we move [indiscernible] so the whole thing is
inside, and project it onto the screen. That's why we get this [indiscernible], so you can now see
the eye gaze of the remote person.
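The rendering idea can be sketched as follows (illustrative only, not the ImmerseBoard renderer):
place the remote person's colored point cloud into the shared virtual space next to the board,
then project it with a virtual camera onto the local screen; the transforms, camera parameters
and point splatting here are assumptions:

    # Render the remote person's point cloud from a virtual viewpoint (sketch).
    import numpy as np

    def render_remote_person(points_3d, colors, place_R, place_t,
                             cam_K, cam_R, cam_t, image_size):
        """points_3d: Nx3 point cloud of the remote person (remote Kinect frame).
        colors: Nx3 colors.  place_R, place_t: where the person stands in the
        shared virtual space.  cam_K, cam_R, cam_t: virtual camera intrinsics and
        pose.  Returns an HxWx3 image (nearest-point splatting)."""
        H, W = image_size
        # 1) Put the remote person into the shared virtual space beside the board.
        world = points_3d @ place_R.T + place_t
        # 2) View the scene from the chosen virtual viewpoint.
        cam = world @ cam_R.T + cam_t
        in_front = cam[:, 2] > 0.1
        cam, colors = cam[in_front], colors[in_front]
        # 3) Pinhole projection onto the screen image.
        uv = cam @ cam_K.T
        uv = uv[:, :2] / uv[:, 2:3]
        image = np.zeros((H, W, 3), dtype=np.uint8)
        depth = np.full((H, W), np.inf)
        for (u, v), z, c in zip(uv.astype(int), cam[:, 2], colors):
            if 0 <= v < H and 0 <= u < W and z < depth[v, u]:
                depth[v, u] = z          # simple z-buffer per pixel
                image[v, u] = c
        return image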
>>: The participants are able to look at each other to check for understanding. Finally this
visualization shows the remote participant in 3D as if reflected in a mirror on which the
participants are writing. Eye contact and gaze direction are further improved as well as overall
sense of presence. ImmerseBoard makes remote collaboration almost as good as being there,
natural, remote collaboration.
[End video]
>> Zhengyou Zhang: For the final one, the visual quality is a little bit worse because you observe
the person from the side but render really from the front, so you see only half of the person. But
the eye gaze, etc., and the interaction are more fluent. We are now trying to improve the result
by adding another Kinect. And this one, people really feel good about; it's pretty cool. And this
one has the highest visual quality because the video is exactly the same as the original, but we
distort the hand. People actually feel very awkward initially to see this long stretch of the arm,
but after a while they feel comfortable. Also, you see the long stretched arm from your point of
view, but for the person standing here, because of the viewing angle it's not very long actually,
so it is still acceptable. When you work in front for a little while you forget the stretched arm,
but we still need to make the hand more stable, etc.
To summarize, I just presented a system called ViiBoard, which combines the Surface Hub, a
big touch board, with the Kinect, an RGB-D sensor, for vision-enhanced immersive interaction
with the touch board. By leveraging the cues about the user from the Kinect, we can now
interact with the large display beyond touch: even before you touch, we know a lot of things
about the user, and using gestures or person ID is very natural. Also, because the Kinect
captures 3D information, by rendering it appropriately we can really make people feel they are
working together as if they were standing side by side.
Future research will try to unify the experience from the large display, to the Surface, to the
phone, maybe. This is joint work with Yinpeng, Phil Chou, who is here, and Zicheng. Okay,
thank you.
[Applause]
>> Zhengyou Zhang: Questions? Yes?
>>: You said something about using multiple Kinects? How are you going to do that without
them interfering with each other?
>> Zhengyou Zhang: So the interference: actually, if you can synchronize them you can avoid
the interference, but the Kinect sensor is not really designed for that purpose, right, so there is
no synchronization to make the [indiscernible]. Actually the interference was pretty bad for the
older Kinect, which projects random dots: when you had another Kinect the random dots would
get confused. With the new one it's actually better, though there is still some interference.
Actually, you tried the randomization, right? When you were at UNC; well, you can explain.
>>: [inaudible].
>> Zhengyou Zhang: That’s very cute.
>>: [inaudible].
>>: How much bandwidth? Oh, sorry.
>>: How far did you collaborate, from here to China or how far?
>> Zhengyou Zhang: No, this is –. We wanted to see the other person's experience, so we really
put them side by side and separated them by a curtain.
>>: Oh, separated by a curtain. How many points –.
>> Zhengyou Zhang: But the idea is really that the system could work from China to here,
right.
>>: How many points can you do on the body?
>>: You said 20 joints, right?
>> Zhengyou Zhang: No, for the second subsystem you need to render. So it’s about 500 by
400 and then you have a smaller portion of the person. So I don’t know exactly.
>>: So your camera resolution is what?
>> Zhengyou Zhang: So the depth sensor, the second version, is 500 by 400 something.
>>: So just the RGB.
>> Zhengyou Zhang: No, I’m talking about the depth sensor.
>>: Yeah.
>> Zhengyou Zhang: So smaller. So the RGB is HD.
>>: All right.
>> Zhengyou Zhang: HD quality.
>>: Yeah, because you need a lot of pixels on a face in order to –.
>> Zhengyou Zhang: Yeah, yeah, you want to have a good view. For the geometry part you can
still use triangle patches, and the color will be mapped onto the triangles. You have a question?
>>: So I am not clear on how the Kinect depth sensor works. Are you projecting fiducials,
infrared fiducials?
>> Zhengyou Zhang: No, that's the first Kinect generation. The first generation projects random
fiducial points, and that pattern is known to the Kinect. So then you capture the projected pattern
with the camera, match the observed pattern with the known one, and do the triangulation in
[indiscernible] space. So the second one is –.
>>: The camera and the projector are offset by [inaudible].
>>: The [indiscernible]. But that would be in a plane, right? So getting the third dimension
requires some out-of-plane sensing?
>> Zhengyou Zhang: What?
>>: No, no, they are offset. So here's the projector over here projecting out that way, and here's
the camera over here. So it's like a stereo camera.
>>: Sure, but that's only in the plane, right? How do you get the out-of-plane resolution when
you're interferometric in a plane? There is a whole locus of ambiguity.
>> Zhengyou Zhang: No, the pattern is really random, so it is very unique.
>>: Okay.
>> Zhengyou Zhang: And for the second one maybe Phil you can explain the depth sensor.
>>: The second Kinect version is a time of flight camera.
>>: Okay, it’s pulsed.
>>: And there are different kinds of time-of-flight cameras: one is pulsed, pulse-gated. This one
is actually a different paradigm.
>>: Okay, oh yeah.
>>: Modulated sinusoidal, and you measure the phase difference [inaudible].
>>: [inaudible].
>>: But still infrared right?
>> Zhengyou Zhang: Still infrared, yeah.
>>: Okay.
>>: Then my question was: How much bandwidth do you need with this thing? I mean how good
of a Kinect –?
>> Zhengyou Zhang: So it depends on how you implement it. For flexibility you want to send
the 3D information to the other side and render there; in that case you need the depth
information plus the color, so you need additional bandwidth. Another way to do it is to render
on this side: if you know the [indiscernible] is fixed for the remote side, you can render from
this side and then send 2D images; it's almost the same. This is mostly for two people, and
currently the system works only for two people actually. But if you think about extending to
more people, then the rendering will be different for different people, because each one will
have their own location, their own eye gaze, etc. In that case it's better to broadcast the 3D
information to the remote sides, and then it will be rendered in real time according to each
user's location and pose. So the bandwidth is not much higher in that case, but it would be a
little more.
>>: Yeah, if your depth sensor is like 500 by 500 and your color is HD, it's not going to make
that much difference.
>> Zhengyou Zhang: Actually the most bandwidth-demanding part is probably the noise in the
depth sensor, so we need to [indiscernible].
>>: Okay, thank you.
>> Zhengyou Zhang: Thank you.
[Applause]