>>: Welcome everyone. Today we are honored to have Dr. Zhengyou Zhang here. Zhengyou is very accomplished, has great experience and is well known. He is a research manager here, of multimedia interaction and experience, and he has published 200 papers. He is an IEEE Fellow and an ACM Fellow. He is the Founding Editor-in-Chief of the IEEE Transactions on Autonomous Mental Development, and has served on the editorial boards of IEEE TPAMI, IEEE TCSVT, IEEE TMM and many, many conferences. He has also served as a program chair, a general chair, and a program committee member of numerous international conferences in the areas of computer vision, audio and speech signal processing, multimedia and human-computer interaction. He was a General Chair of the International Conference on Multimodal Interaction (ICMI) 2015, and is a General Chair of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017. He received the IEEE Helmholtz Test of Time Award for his paper published in 1999. So let's welcome the distinguished Zhengyou Zhang. [Applause]

>> Zhengyou Zhang: Thank you, Sheri, for the introduction, and good evening. Thank you for coming to my talk. This is part of the IEEE [indiscernible] Tech Talk series. Today I will talk about how to combine the Surface Hub and the Kinect to achieve more natural interaction with a big display, as well as immersive collaboration between people who are geographically distributed. The Surface Hub is a large display; I will talk about it a little bit later. It is like a giant iPad, but it's better. The Kinect is an RGB-D sensor: it directly captures depth information.

If we take a look at the office space, you see a lot of whiteboards. The whiteboard is a great tool for collaboration because it is big, so everyone can focus their attention on the shared space, and they can write freely and erase if they want. But most importantly, when they collaborate, when they brainstorm, they can see each other's facial expressions, hand gestures, gaze, intention, etc. So this is an important tool, and that's why in the modern office space you see quite nice whiteboards. At the same time, more and more electronic displays are being installed in the office because they are getting cheaper and cheaper. Those displays can be large and can also be touch-enabled. So how to leverage this trend is an interesting question for collaboration and interaction.

Microsoft announced its new line of products called Surface Hub. It will be released on January 1 next year. The Surface Hub is, as I mentioned earlier, like a giant iPad, but it's better. It provides the best pen and touch experience on a large display. There will be two versions: one is 55 inches and the other is 85 inches. The 85 inch Surface Hub has a 4K display, so the display quality will be very good. And as I also mentioned, we will use the Kinect Sensor to enhance the Surface Hub. The Kinect Sensor directly captures 3D information. This means we can see the surrounding environment not just from the viewpoint of the sensor; we can move out of the viewpoint of the sensor to see the environment. This is a video where the person tries to view the scene from different angles. [Video]

>> Zhengyou Zhang: So traditionally in computer vision we use RGB cameras, that is, color cameras, to do the computer vision tasks.
The depth sensor offers some advantages. For example, the Kinect Sensor can work in low light or in a completely dark room because it emits infrared light, while RGB will not work in low-light conditions. The Kinect Sensor captures the depth information directly, so the person in front will automatically pop out from the background. If we use RGB, and the background is colored, then it is very hard to segment the foreground from the background. Also, because we directly get the depth information, the scale is known with the depth sensor; with RGB, because the 3D [indiscernible] is projected onto the 2D sensor, the depth is lost, so we don't have the scale. But the two are really complementary. That's why in the Kinect Sensor we also include an RGB camera, and that's why the Kinect is an RGB-D sensor. Yes?

>>: Is the sensing mode of the depth sensor ultrasonic?

>> Zhengyou Zhang: It is infrared. There are two versions: the original Kinect projects random dots onto the [indiscernible], and the second version uses time of flight with modulation at different frequencies. For us, actually, here is a diagram of what we think could be a new Surface Hub, integrating depth sensors into the touch screen. So we can have a depth sensor on each side: from this one we can see all the environment on this side, and from this one we can see from the left side. The depth sensor on top can see the whole room at the back.

>>: [inaudible].

>> Zhengyou Zhang: Yes, I will tell you.

>>: [inaudible].

>> Zhengyou Zhang: So yeah, I will tell you later. In our experimental setup we don't have the [indiscernible] device yet, so we just put the Kinect Sensor next to the large display, and you can see there is a little bit of a gap because the Kinect Sensor has a minimum sensing distance, so we need to put it a little further away. By combining them we get a more natural and immersive interaction with the touch board. ViiBoard stands for Vision-enhanced Immersive Interaction with Touch Board. The system consists of two subsystems. The first one is called VTouch and it is about human-computer interaction. The second one, called ImmerseBoard, is about human-human collaboration across distance. I will show you more details.

In VTouch we leverage cues about the user from the Kinect Sensor, for example: how far the user is from the Surface Hub, where he is standing, who the person is, whether he is using the left hand or the right hand, and what the gesture is, in order to have a better interaction. For ImmerseBoard, because the depth sensor can [indiscernible], we can manipulate the person in [indiscernible] and render it appropriately. So we can achieve a real-time immersive experience: we can see the reference point of the remote person, share the feeling that we are standing in the same space, be aware of the gaze of the remote person and predict the intention. We really try to achieve the experience of two people standing side by side in front of the same whiteboard.

So now let's take a look at VTouch, the first subsystem, the vision-enhanced touch experience. As I mentioned earlier, we really want to leverage the Kinect Sensor to extract cues about the user: position, proximity, person ID, hand ID, gesture ID and intention. Here is a system diagram. From the Kinect Sensor we apply a number of computer-vision-based algorithms and extract the cues. Then we use the cues about the user to design interactive applications that work with the large display.
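A rough sketch of the depth-based foreground separation mentioned above (the person "popping out" of the background): the snippet simply thresholds a depth frame. The frame source, resolution and thresholds are hypothetical placeholders, not the actual Kinect pipeline.

```python
import numpy as np

def segment_foreground(depth_mm, max_person_depth_mm=2500, min_valid_mm=500):
    """Return a boolean mask of pixels likely belonging to the user in front
    of the display, using only the depth image.

    depth_mm: HxW array of depth values in millimeters (0 = no reading).
    Pixels with a valid reading closer than max_person_depth_mm are treated
    as foreground; everything farther is treated as background.
    """
    valid = depth_mm >= min_valid_mm        # drop invalid / too-close readings
    near = depth_mm <= max_person_depth_mm  # keep points near the sensor
    return valid & near

# Hypothetical usage with one synthetic depth frame (424x512, roughly Kinect v2 size).
depth = np.random.randint(0, 8000, size=(424, 512)).astype(np.uint16)
mask = segment_foreground(depth)
print("foreground pixels:", int(mask.sum()))
```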
The four key vision technologies are sensor-display calibration, human skeletal tracking, hand gesture recognition and person recognition. I will explain them one by one shortly.

The first technology is sensor-display calibration. We use the Kinect to sense the user, but the sensing result is [indiscernible] in the coordinate system of the Kinect [indiscernible] system, while the interaction is done in the coordinate system of the display. So the two need to be brought together, and that is the calibration process. We need to determine the rotation matrix and the translation vector between the two. This can be done easily by tapping a few points on the display: by tapping on the display we get the touch point, which is in the coordinate system of the display, and the fingertip position at tap time, x, y, z, is obtained from the Kinect Sensor. Then we can compute the rotation matrix and translation. So the calibration can be done fairly easily.

The second technology is human skeletal tracking. A human in our case is represented by a few joints, like the head, neck, shoulders, etc. Each joint is represented by its x, y, z coordinates, and from the x, y, z we can compute the angles, etc. The Kinect Sensor tracks 20 body joints in real time. Here is the pipeline of how it works. From the depth image, each depth pixel is classified into a body part. For example, if we take this pixel, we say, "Okay, this probably belongs to the right shoulder, and this pixel probably belongs to the right hand," etc. That is a per-pixel inference for each depth pixel. But this can be very noisy because it is determined per pixel, so we then aggregate the information in the neighborhood, and this generates hypotheses of the body joints. This is still too noisy, so at the last stage we apply kinematic constraints as well as the [indiscernible], and this gives us the tracking result. Here I will show you an example. This is an input depth image and these are the inferred body parts. This is a different view of the joints.

For human-computer interaction, hand gesture is very important, so here I will talk about how to do hand gesture recognition. In our current system we need only a few gestures. There are two modes actually: the near mode, when the person is very close to the board, which includes the color palette to choose different colors, typing and pointing; and the far mode, when the person is pretty far from the board, where pointing, clicking, etc. are needed. And what do we do? First, from the depth image we segment the hand, cutting out the hand region. Then we build a local occupancy pattern around the hand: we divide the region into a number of cells and count the number of points in each cell. Then we use a multiclass support vector machine to determine near-mode left hand, near-mode right hand and the different gestures. The training data we have now is only from four people; we are working with more people now, and each person is asked to perform each gesture in front of the large display. At run time, based on the position of the person, we determine whether we need to apply the near-mode or far-mode gestures, and then we apply the classifier. That's for hand gestures.
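A minimal sketch of the hand-gesture pipeline just described: a local occupancy pattern is built by dividing a box around the segmented hand into cells and counting the 3D points in each cell, and a multiclass SVM classifies the result. The grid size, box size, class labels and synthetic training data are hypothetical placeholders; the real system is trained on recordings of users performing gestures in front of the display.

```python
import numpy as np
from sklearn.svm import SVC

def local_occupancy_pattern(hand_points, grid=(4, 4, 4), box_size_mm=240.0):
    """Feature for one hand: divide a cube centered on the hand into grid
    cells and count the 3D points (Nx3 array, in mm) falling in each cell."""
    center = hand_points.mean(axis=0)
    rel = (hand_points - center) / box_size_mm + 0.5            # roughly in [0, 1]
    idx = np.clip((rel * np.array(grid)).astype(int), 0, np.array(grid) - 1)
    feat = np.zeros(grid)
    for i, j, k in idx:
        feat[i, j, k] += 1
    return feat.ravel() / max(len(hand_points), 1)              # normalized counts

# Hypothetical training: labeled examples of near-mode classes, e.g.
# 0 = left-hand palm, 1 = right-hand palm, 2 = pointing.
rng = np.random.default_rng(0)
X = np.stack([local_occupancy_pattern(rng.normal(scale=60, size=(200, 3)) + c * 20)
              for c in range(3) for _ in range(10)])
y = np.repeat([0, 1, 2], 10)
clf = SVC(kernel="rbf").fit(X, y)        # multiclass SVM (one-vs-one under the hood)
print(clf.predict(X[:3]))
```

At run time, the person's distance to the display would select the near-mode or far-mode classifier before calling predict, as described above.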
For person recognition, I'm sure everyone knows we do face recognition, but we actually do a little bit more [indiscernible]: we combine the face with the body appearance, like clothing. The reason is that when you work in front of the Surface Hub, sometimes you are not facing the Kinect Sensor, so you don't get the front view of the person. But the body appearance, the clothing, you can see from everywhere, and it is usually stable, at least for the whole day; you know, you don't change clothes during the day. So even if you don't see the face, the clothing will help you. And if sometimes there is ambiguity between different people, once you see the face they can be easily [indiscernible]. So we really try to combine the advantages of the two modalities. The face gives you accurate recognition for the front view, but it is very bad for side views and occlusions. Body appearance can be confused when people are wearing similar clothing, but it is robust for side views. And for the face recognition part we use –. So usually you extract the features and then you have the classifier. In our system we use multi-scale LBP, local binary pattern, descriptors. We extract a lot of features, but we use PCA to reduce them to 2,000 dimensions, and then we use a Joint Bayesian method to determine whether this is the user or not.

>>: [indiscernible].

>> Zhengyou Zhang: For the body appearance, we are tracking the person, so we have the upper body, lower body, hands, etc. We just use a few parts of the body, and for each body part we also have different angles based on the body orientation with respect to the Kinect Sensor. Then for each part and each orientation we have a color histogram, and based on the color histograms we can construct a model of the appearance. Here I will show you some results. The person comes in and is initially unknown, but as soon as the front view is seen, the person is recognized. They are unknown and then quickly recognized, and it is robust to occlusion. Now the person is trying to defeat the system by changing clothes. He comes in with different clothes and is initially unknown, but as soon as the person is seen from the front view, the name and face will be associated with the new appearance model.

>>: So what's the size of the population of faces that you want to distinguish?

>> Zhengyou Zhang: So in our system usually it is a team, so about 10 people.

>>: Oh, so small.

>>: How frequently are –?

>> Zhengyou Zhang: Pretty robust, yeah.

>>: So you have 100,000 dense sample points for facial recognition, using PCA to reduce the rank of that problem to 2,000. That's a great big matrix.

>> Zhengyou Zhang: It's a big matrix.

>>: How often do you have to invert that? Is it once per appearance of a new face, or is it regularly or periodically done?

>> Zhengyou Zhang: So that's pretty fast.

>>: 100,000 by 100,000 is big.

>> Zhengyou Zhang: No, no, that's fast. But in our case actually we don't run it every frame. It's about every 5 or 6 frames, because in the meantime we can track, right. So it runs in a multi-threaded execution.

>>: [inaudible].

>> Zhengyou Zhang: So we enroll everyone. It will be in the database.

>>: [inaudible].

>> Zhengyou Zhang: [indiscernible].

>>: No, the main problem for this biometric is that you want to identify whether this guy [inaudible].

>> Zhengyou Zhang: So if the database is infinite, then yeah, there is a lot of confusion.

>>: [inaudible].

>> Zhengyou Zhang: When you have a smaller database that you want to recognize from, it's much easier.

>>: So you probably don't make any mistakes.

>> Zhengyou Zhang: Still, with only a side view it is harder. So now I will show you how VTouch works.
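A minimal sketch of the body-appearance side of the person recognition described above: per-body-part color histograms (hue only here, and ignoring the per-orientation indexing mentioned in the talk, for brevity) compared by histogram intersection, fused with a face score by a simple weighted sum. Part names, bin counts and fusion weights are hypothetical placeholders, not the actual system.

```python
import numpy as np

def hue_histogram(pixels_hsv, bins=32):
    """Normalized hue histogram for one body part (Nx3 HSV pixels, hue in [0, 180))."""
    hist, _ = np.histogram(pixels_hsv[:, 0], bins=bins, range=(0, 180))
    return hist / max(hist.sum(), 1)

def appearance_score(model_hists, observed_hists):
    """Average histogram intersection over the body parts present in both."""
    scores = [np.minimum(model_hists[p], observed_hists[p]).sum()
              for p in model_hists if p in observed_hists]
    return float(np.mean(scores)) if scores else 0.0

def fuse(face_score, body_score, w_face=0.7):
    """Weighted fusion; lean on the face when a frontal view is available."""
    return w_face * face_score + (1 - w_face) * body_score

# Hypothetical usage: one enrolled appearance model vs. one observation.
rng = np.random.default_rng(0)
model = {"torso": hue_histogram(rng.uniform(0, 180, (500, 3))),
         "legs":  hue_histogram(rng.uniform(0, 180, (500, 3)))}
obs = {"torso": hue_histogram(rng.uniform(0, 180, (400, 3)))}
print(fuse(face_score=0.9, body_score=appearance_score(model, obs)))
```

In the system described in the talk, the histograms are additionally indexed by body orientation relative to the Kinect, and the face score would come from the LBP + PCA + Joint Bayesian pipeline mentioned above.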
Here are a few applications. For example, it can bring up a menu without touch. It can display the menu wherever you are. It can augment touch with HandID, PersonID and GestureID, plus hover, which I will explain here, pointing, and auto-lock of the display. So when you walk away from the display it can automatically lock the screen for the privacy of the content, and when you come back, if you are part of the [indiscernible], it can unlock for you to continue the session. Here are a few screenshots; I will show you the video. For example, showing the open palm will automatically bring up the color palette menu, so you don't need to go to the start menu, drawing, color, etc. You just show the open palm. The same thing for typing. And we recognize different people, so we can determine who wrote what; in this case this is user one and this is user two. And here, when two people play a game, we automatically know who is who, so you don't need to change the color: it will automatically assign different colors to different people. And here is the screen locking, etc. When the person is pretty far away we can have pointing, and we distinguish whether you are using the left hand or the right hand to point. We also distinguish whether you are writing with the left hand, the right hand or with a pen.

Another thing which is usually missing on a touch display is hover. With hover, when you move the mouse over an object, automatically some context menu will come out. But with touch you usually have to touch in order to do anything. With ViiBoard we know when you are close to an object on the screen. In this case, as the hand moves close to Chicago, the context menu comes out, and in the other parts we know there is no context menu. Then you can touch the context menu, etc. Also, the menu can follow you, so wherever you go the menu will be close at hand. You don't need to walk back to the start menu, especially when you have an 85 inch display. So now let me play a short video. Phil here is part of the project and he is narrating the video. [Video]

>>: [inaudible]. Thanks to a Kinect pointed across the display, VTouch can understand where the user is, who the user is and what the user is doing, even before the user touches the display. Gestures such as an open palm may bring up a color palette, from which the left hand may select erase, the pen may select green and the right hand may select red. These selections are remembered: erase with the left hand, write green with the pen and write red with the right hand. Another gesture, a pair of high fives, opens a keyboard, making it easy to enter text anywhere on the screen. Hovering over the text brings up a menu of attributes which may be selected to change the color or size of the text. Different users can also be distinguished using vision. Another user may select his own color and stroke width. When the first user returns, his settings are remembered. The strokes and texts are tagged as belonging to user 1 or user 2. VTouch knows the user's location. Not only can it bring up a menu near the user, but it can keep the menu near the user so he does not have to hike back to the far end of a large display. Pointing for interaction when the user is away from the board is also possible. Pointer gesture detection is robust against other gestures. The right hand is distinguished from the left. Selection is also possible.
VTouch requires the Kinect to be calibrated with the display, but calibration is easy, simply by touching 4 points on the board. VTouch uses vision to enhance touch by detecting gestures whether far from or near to the screen, distinguishing between hands when they touch the screen, restoring hover to touch screens, distinguishing between users and tracking their positions. VTouch: vision-enhanced interaction for large touch displays. [End Video]

>> Zhengyou Zhang: So we have conducted a user study and people really like it, most people; just one person said it is distracting, and one person felt that left hand/right hand is too difficult to remember. They just prefer to always touch with one hand rather than left/right. But other than that, people feel it is very good.

So let's move on to the second subsystem, called ImmerseBoard: vision-enhanced immersive remote collaboration. Before we talk about remote collaboration, let's look at co-located collaboration. For example, with two people standing in front of a physical whiteboard, you see the other person right in front of you. You see the facial expression, eye gaze, gestures, etc., so you have the full bandwidth between people, and you also see the content creation immediately because it is in the same space. Now let's look at the current situation with remote collaboration. What do you usually have? You have a window with the remote person, the talking head of the remote person, and you have a shared space to collaborate with the remote people, but the two are disjoint; they are not connected with each other. So what do you have? Between people you only see the face, so the bandwidth [indiscernible]. And for the content, you don't actually know how the content was created. You just see the final result pop up on the screen, because you don't see the gestures, pointing, etc. So what we want to do is really try to bring the two spaces together and create much richer bandwidth between the faces and between the content creation. Here is one example: the person points here, and now the person can move out of the box and [indiscernible] can see. I will show you this later.

So again, just to repeat, the Kinect gives us 3D information, so we can manipulate the 3D information of the remote person to render it in a proper way on screen. Then we can see the reference point of the remote person, feel that we are sharing the same space, be aware of the gaze and predict the intention. We have implemented two metaphors: the first one is a whiteboard; you will see how this is done. The second one is two people writing on a mirror, so rather than looking to the side, you can look at the other person through the mirror, without turning your head. This is a little bit different: if you have seen those videos of people writing on a glass wall, there you have to write backward, but here in our case we don't need to write backward because you write on the same side. You just look at the person on the other side; it is [indiscernible] in the mirror. We have actually implemented three: that is two people writing side by side, that is writing in front of the mirror, and here is exactly the same thing as current systems, a talking head and a shared space, but now we manipulate the hand to touch the reference point. Now let's play the video. [Video]

>>: ImmerseBoard is a system for remote collaboration through an electronic whiteboard with immersive forms of video.
Thanks to a Kinect camera mounted on the side of a large touch display, we augment shared writing with important shared social cues, almost as if the participants were in the same room. This 2.5D hybrid between standard 2D video and full 3D immersion shows the remote participant's 2D video on the side, but his hand is able to extend out of the video frame in a non-physical way to write or to point to a place on the shared writing surface. This visualization shows the remote participant in full 3D, as if the participants were standing shoulder to shoulder in front of a physical whiteboard. Eye gaze direction is improved, as well as gesture direction, body proximity and overall sense of presence.

>> Zhengyou Zhang: So what we do here is the Kinect captures the [indiscernible] shape of the person, plus the color, etc. Then we have this whiteboard in [indiscernible] space, essentially the Surface Hub surface, and then the person. Then we move [indiscernible] the whole thing inside and project on the screen. That's why we get this [indiscernible]. So you can now see the eye gaze of the remote person.

>>: The participants are able to look at each other to check for understanding. Finally, this visualization shows the remote participant in 3D as if reflected in a mirror on which the participants are writing. Eye contact and gaze direction are further improved, as well as overall sense of presence. ImmerseBoard makes remote collaboration almost as good as being there: natural remote collaboration. [End video]

>> Zhengyou Zhang: So for the final one, the visual quality is a little bit bad because you observe the person from the side but render from the front, so you see only half of the person. But the eye gaze, etc., the interaction will be more fluid. We are now trying to improve the result by adding another Kinect. And this one, people really feel good about; it's pretty cool, and it has the highest visual quality because the video is exactly the same as the original one, but we distort the hand. People actually feel very awkward at first seeing this long stretch of the arm, but after a while people feel comfortable. Also, you see the long stretched arm from your point of view, but for the person standing here, because of the view, it's not very long actually, so it is still acceptable. When you work in front of it for a little while you forget the stretched arm, but we still need to make the hand more stable, etc.

So to summarize, I just presented a system called ViiBoard, which combines the Surface Hub, a big touch board, with the Kinect, an RGB-D sensor, for vision-enhanced immersive interaction with the touch board. By leveraging the cues about the user from the Kinect, we can now interact with the large display beyond touch: even before you touch, we know a lot of things about the user, and using gestures or person ID it is very natural to interact. Also, because the Kinect captures 3D information, by rendering it appropriately we can really feel that remote people are working together as if they are standing side by side. Future research will try to unify the experience from the large display, to the Surface, to maybe the phone. This is joint work with Yinpeng, Phil Chou, who is here, and Zicheng. Okay, thank you. [Applause]

>> Zhengyou Zhang: Questions? Yes?

>>: You said something about using multiple Kinects? How are you going to do that without them interfering with each other?

>> Zhengyou Zhang: So the interference: actually, if you can synchronize, you can avoid the interference.
And the Kinect Sensor is not really made for this purpose, right, so there is no synchronization to make the [indiscernible]. Actually the interference was pretty bad for the older Kinect, because it projects random dots, so when you had another Kinect the random dots would get confused. With the new one it's actually better; there is still some interference. Actually, you tried the randomization approach for that, right? So when he was at UNC — well, you can explain.

>>: [inaudible].

>> Zhengyou Zhang: That's very cute.

>>: [inaudible].

>>: How much bandwidth? Oh, sorry.

>>: How far apart did you collaborate, from here to China, or how far?

>> Zhengyou Zhang: No, this is –. We want to see the other person's experience, so we really put them side by side and separated them by a curtain.

>>: Oh, separated by a curtain. How many points –.

>> Zhengyou Zhang: But the idea is to really have the system work from China to here, right.

>>: How many points can you do on the body?

>>: You said 20 joints, right?

>> Zhengyou Zhang: No, for the second subsystem you need to render. So it's about 500 by 400, and then you have a smaller portion of the person. So I don't know exactly.

>>: So your camera resolution is what?

>> Zhengyou Zhang: So the depth sensor, the second one, I believe, is 500 by 400 something.

>>: So just the RGB.

>> Zhengyou Zhang: No, I'm talking about the depth sensor.

>>: Yeah.

>> Zhengyou Zhang: So smaller. The RGB is HD.

>>: All right.

>> Zhengyou Zhang: HD quality.

>>: Yeah, because you need a lot of pixels on a face in order to –.

>> Zhengyou Zhang: Yeah, yeah, you want them to have a good view. For the geometry part you can still use triangle patches, and the color will be mapped onto the triangles. You have a question?

>>: So I am not clear on how the Kinect depth sensor works. Are you projecting fiducials, infrared fiducials?

>> Zhengyou Zhang: No, that's for the first Kinect generation. The first generation projects random fiducial points, and that pattern is known to the Kinect. So you capture the projected pattern with the camera, and then you need to match the observed pattern with the known one. Then you do the triangulation in [indiscernible] space. So the second one is –.

>>: The camera and the projector are offset by [inaudible].

>>: The [indiscernible]. But that would be in a plane, right? So getting the third dimension requires some out-of-plane sensing?

>> Zhengyou Zhang: What?

>>: No, no, they are offset. So here's the projector over here projecting out that way, and here's the camera over here. So it's like a stereo camera.

>>: Sure, but that's only in the plane, right. How do you get the out-of-plane resolution when your interferometry is in a plane? There is a whole locus of ambiguity.

>> Zhengyou Zhang: No, the pattern is really random, so it is very unique.

>>: Okay.

>> Zhengyou Zhang: And for the second one, maybe, Phil, you can explain the depth sensor.

>>: The second Kinect version is a time-of-flight camera.

>>: Okay, it's pulsed.

>>: There are different kinds of time-of-flight cameras; one is pulsed, pulse-gated. This one is actually a different paradigm.

>>: Okay, oh yeah.

>>: It is a modulated sinusoid and you measure the phase difference [inaudible].

>>: [inaudible].

>>: But still infrared, right?

>> Zhengyou Zhang: Still infrared, yeah.

>>: Okay.

>>: Then my question was: how much bandwidth do you need with this thing? I mean, how good of a Kinect –?

>> Zhengyou Zhang: So it depends on how you implement it. For flexibility you want to send the 3D information to the other side and render there.
In that case you need the depth information plus the color, so you need additional bandwidth. But another way to do it is to render on this side: if you know the [indiscernible] is fixed on the remote side, you can render from this side and then send the 2D images; it's almost the same. This is mostly for two people, and currently the system works only for two people actually. But if you think about extending to more people, then the rendering will be different for different people, because each one will have their own location, their own eye gaze, etc. In that case it's better to broadcast the 3D information to the remote side, and then it will be rendered in real time according to each user's location and pose. So the bandwidth is not much higher in that case, but it would be a little more.

>>: Yeah, if your depth sensor is like 500 by 500 and your color is HD, it's not going to make that much difference.

>> Zhengyou Zhang: Actually, the most bandwidth-demanding part is probably the noise in the depth sensor. So we need to [indiscernible].

>>: Okay, thank you.

>> Zhengyou Zhang: Thank you. [Applause]
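A rough back-of-envelope estimate related to the bandwidth question in this closing Q&A: the sketch below computes raw, uncompressed bit rates for a depth stream and an HD color stream. The 512 by 424, 16-bit depth format is the standard Kinect v2 figure; the 30 fps rate and 24-bit 1080p color are assumptions, and a real system would compress both streams heavily.

```python
# Back-of-envelope, uncompressed bandwidth for depth + color streaming.
DEPTH_W, DEPTH_H, DEPTH_BYTES = 512, 424, 2     # Kinect v2 depth: 16 bits/pixel
COLOR_W, COLOR_H, COLOR_BYTES = 1920, 1080, 3   # assumed HD color, 24 bits/pixel
FPS = 30                                        # assumed frame rate

depth_mbps = DEPTH_W * DEPTH_H * DEPTH_BYTES * FPS * 8 / 1e6
color_mbps = COLOR_W * COLOR_H * COLOR_BYTES * FPS * 8 / 1e6
print(f"depth ~{depth_mbps:.0f} Mbit/s raw, color ~{color_mbps:.0f} Mbit/s raw")
# Raw depth adds roughly 100 Mbit/s versus ~1.5 Gbit/s for raw HD color, so the
# incremental cost of also sending depth is comparatively small; as noted in the
# talk, the noisy depth data is what compresses poorly in practice.
```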