>> Phil Chou: So it's a great pleasure for me to introduce Wonwoo Lee. He is from the Ubiquitous Virtual Reality Lab at the Gwangju Institute of Science and Technology in Gwangju, Korea. He will be talking about his work in mobile augmented reality. Wonwoo received his bachelor's degree in mechanical engineering from Hanyang University in Seoul and then his master's degree in information and communication from GIST. He is currently about to get his PhD at GIST on this topic, and he is here visiting as a gratis visitor for three days, yesterday, today and tomorrow, along with his advisor Professor Woo, who is a consulting researcher here for this week and next week. So if you like what you're hearing today and want to talk with either of them, say this afternoon or tomorrow afternoon or some other time, please see me in person right after this talk, or you can send an e-mail to Phil Chou to set up a meeting time with them. Wonwoo's current interests are in silhouette segmentation for 3-D reconstruction, real-time object detection for augmented reality, and GPGPUs for high-performance computing, and we will let him talk about his work in mobile augmented reality. >> Wonwoo Lee: Thank you, Phil. Good afternoon. I think you may be a little bit sleepy after lunch, but please follow my presentation. Today I'm going to talk about video-based in-situ tagging for mobile augmented reality. I have worked on this during my PhD studies. The outline of this talk is shown on this slide: I will give a short overview of my research, then some introduction to augmented reality, then I will go through the details of my work, and I will conclude the presentation with some future work. One of my ongoing research projects is textureless object detection using depth information. As you know, depth information lets you do many things, and it can help us detect complex 3-D objects which do not have much texture. We combine RGB and depth information for target detection and then estimate the pose using the [inaudible] shape of the target object. This video shows some detection examples, overlaying the name of the target object on top of it. This part shows the advantage of using depth information: when we use the RGB image only, detection is badly affected by lighting conditions, in this case sunlight, but when we combine it with depth we can do robust detection. A similar detection problem happens when there are strong shadows on the object like this, and again we can handle it by combining RGB and depth information. So this is one of my ongoing research projects, apart from the topic I present today. The interesting thing is that using depth information we know the size of the target object, so even if there are two objects which have the same texture but different scales, we can identify each one from the other; this is the interesting part of using depth information. During my PhD course I focused on computer vision techniques for the reconstruction of 3-D objects and for using the reconstructed 3-D content to augment real scenes in augmented reality applications.
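For reference, a minimal sketch of the size-from-depth idea under a pinhole camera model; the focal lengths, bounding box and depth values below are illustrative assumptions, not numbers from the talk:

```python
import numpy as np

def metric_size_from_depth(bbox_px, depth_m, fx, fy):
    """Estimate the physical width/height (in meters) of a detected object
    from its pixel bounding box and depth, using the pinhole relation
    size = pixel_extent * depth / focal_length."""
    x0, y0, x1, y1 = bbox_px
    width_m = (x1 - x0) * depth_m / fx
    height_m = (y1 - y0) * depth_m / fy
    return width_m, height_m

# Two detections with identical texture but different distances: the recovered
# metric sizes differ, so the two objects can be told apart. (Hypothetical values.)
print(metric_size_from_depth((100, 100, 300, 260), depth_m=0.8, fx=525.0, fy=525.0))
print(metric_size_from_depth((100, 100, 300, 260), depth_m=1.6, fx=525.0, fy=525.0))
```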
So basically I assume that users take photos with their mobile phones, and we collect the photos from the users and reconstruct 3-D objects from those photos; from the 3-D model and the texture we can create realistic 3-D content, and that realistic 3-D content is overlaid [inaudible] on the real scene through augmented reality techniques, so we get more realistic results from the AR application. From this viewpoint the system looks like this. The user captures photos with a mobile phone, and the photos are transmitted to the modeling server. In the modeling server we do the 3-D reconstruction, but before that we have to identify the target object in a set of photos taken from multiple viewpoints. So we first do multiview silhouette segmentation, and here we are using color and spatial consistency measured in 3-D to identify the foreground object in the images. This video shows how the method works. Here there are photos taken from multiple viewpoints; we first initialize the silhouette of the target object from the [inaudible] viewing volumes of the cameras, and then we iteratively optimize the target object's silhouette in all of the views simultaneously. After some iterations, you can see that the foreground object's regions are refined, and finally we get the silhouettes of the foreground object. After finding the silhouettes of the target object we build a 3-D model from the silhouettes and color images, so the next part is surface reconstruction, and here I worked on building a smooth [inaudible] from silhouettes and color images. In this case we have images and silhouettes and we construct the visual hull, but the visual hull is quite a simple reconstruction method, and we aim to build smooth surfaces from the silhouettes for use in augmented reality applications. So we iteratively refine the visual hull model to get smooth surfaces, and then we add some textures. Now we have a 3-D model of the object in the photos, and that 3-D reconstruction is transmitted back to the user's mobile phone and used for augmenting various things through the phone. This shows how we do it. The user takes a shot of a target object, we do some automatic learning on the mobile phone, and then the target is instantly detected in video sequences taken by the mobile phone's camera; the reconstructed 3-D model is then overlaid [inaudible] with the proper [inaudible] [inaudible]. So this was my research introduction, and today I will talk about the remaining part of my work shown on the previous slide. Before I get into the details, let me first mention the concept of ubiquitous virtual reality, which aims for [inaudible] everywhere. In the ubiquitous virtual reality concept, the real and virtual worlds are modeled in an augmented reality space where real entities are mirrored to virtual entities. The connection of the real and virtual entities is important in this concept for making interaction between the virtual and real worlds possible. To connect the real and virtual entities we have to recognize the 3-D object, the target object, from image sequences, and my work is about that. This figure shows one kind of mobile augmented reality, which can be regarded as a subset of the [inaudible] concept. In this figure the real object is connected to [inaudible] entities.
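A minimal voxel-carving sketch of the visual hull criterion mentioned above (a point is kept only if it projects inside the silhouette in every view); the silhouette masks, 3x4 projection matrices and voxel grid are assumed inputs, and the actual system refines this coarse hull into a smooth, textured surface:

```python
import numpy as np

def visual_hull(silhouettes, projections, grid_points):
    """Carve a point/voxel grid: keep a 3-D point only if it projects inside
    the silhouette in every view (the classic visual hull criterion).

    silhouettes : list of HxW binary masks
    projections : list of 3x4 camera projection matrices (one per view)
    grid_points : Nx3 array of candidate 3-D points
    """
    inside = np.ones(len(grid_points), dtype=bool)
    pts_h = np.hstack([grid_points, np.ones((len(grid_points), 1))])  # homogeneous
    for sil, P in zip(silhouettes, projections):
        uvw = pts_h @ P.T
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        h, w = sil.shape
        valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(grid_points), dtype=bool)
        hit[valid] = sil[v[valid], u[valid]] > 0
        inside &= hit                                # must be inside in every view
    return grid_points[inside]
```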
Usually we call these AR annotations. The [inaudible] information is overlaid on the real scene through the camera, and the [inaudible] entity is geometrically registered with the real object. But a problem arises when we do these things in outdoor scenes, where the scenes are unprepared. Unprepared scenes means we do not have much information about the scene, such as the 3-D geometry of the scene. In this case, even if we want to put annotations on a specific object, we usually have no way to do it with computer vision techniques, because we don't know the target object at that moment. So what we need is online learning and detection of target objects in situ, so that a user can interact with a target object without prior knowledge. We propose a novel augmentation method with minimal user interaction, a very simple point-and-shoot approach. This video shows an overview of my method. The user takes a shot with the mobile phone camera; as you see, the target object can be detected without any difficult interaction, and then the user can add [inaudible] augmentations on the target object. The advantage of this method is that it is very simple and it doesn't require any complex 3-D reconstruction of the scene. Also, users can detect the target object from viewpoints different from the one the input was captured from. My method has some assumptions. The input to the algorithm is an image of the target object; I assume here that the target object is planar, and the output is template patch data and associated camera poses used to retrieve the six-degree-of-freedom pose. I assume we have known camera parameters such as the focal length and principal point, and I also assume that the target object is either horizontal or vertical, which is very common in the real world. From these assumptions the target learning procedure works as shown in this slide. From the input image my approach computes a fronto-parallel view, which is the view from the normal direction, and then imagery of the target object as seen from numerous viewpoints; I will explain that in detail later. The target object is learned from the fronto-parallel view by warping the input patches, applying some blurring, and doing some post-processing. So the fronto-parallel view generation step warps the source image to a fronto-parallel view, and the template learning step builds template data from the input images. The first step of learning the target is the fronto-parallel view generation, and some of you may wonder what the fronto-parallel view means. The fronto-parallel view is a picture like this, taken from the normal viewpoint: the object has no orientation change relative to the camera. In computer-vision-based object detection methods, a fronto-parallel view is usually required to learn planar surfaces, which means the user's camera is at the same height as the target object. However, in the real world this does not always happen; this situation is more common: the target object is lower than the user's viewpoint, or higher than the user's viewpoint. In this case the images acquired by the camera have perspective distortion because of the camera's characteristics. Frontal views are not always available in practical situations. Especially for horizontal surfaces, you may not want to take a picture of an object from this viewpoint; we usually view the target object at some angle.
From these photos it is not possible to retrieve the correct template data needed for detection and pose estimation. So the objective of fronto-parallel view generation is to warp the source image so that it looks as if it were seen from the frontal view. Our approach exploits the mobile phone's built-in sensors, especially the accelerometer, which provides the direction of gravity. We combine some computer vision techniques with the phone's sensors and can do this very easily. Let me talk a little bit more about the accelerometer. It provides the direction of gravity in the phone's local coordinate system, and gravity is normal to horizontal surfaces and parallel to vertical surfaces. So the direction of gravity provides very strong information about the horizontal and vertical surfaces I want to interact with. Let me talk first about the horizontal surface case. For horizontal surfaces we can assume that there is only one degree of freedom in the orientation between the camera and the target surface. In general there is more than one degree of freedom, but I am making the assumption that there is only pitch rotation. From the known camera matrix we can set up a frontal-view camera, which is at the identity pose, and define the capturing camera as having some rotation and translation relative to it. We can compute the rotation from the accelerometer [inaudible], and the translation [inaudible] can also be computed from the rotation and some distance which is predefined. From the known rotation and translation parameters we can compute the homography H that warps the input image to the fronto-parallel view. By simply applying it to the input image we can rectify the image to the fronto-parallel view, as shown in this slide. You can see that the rectification is quite good even though we don't do any image processing here. Horizontal surfaces are simple, but how about vertical surfaces? We can also make a one-degree-of-freedom rotation assumption for vertical surfaces, as in the image shown at the top, but in general more complex cases happen in the real world, as shown here. The user may hold the phone with some arbitrary orientation in his hand, and the vertical surface can also have an orientation relative to the user. So what about these cases? The sensors alone cannot solve this problem, so now we add some computer vision techniques to make the problem easier. For vertical surfaces our approach uses the vanishing point, which is a very straightforward way to find the orientation of the vertical surface relative to the camera. Here the accelerometer again helps the vanishing point estimation. By estimating the vanishing points from line segments, the orientation is retrieved from them, and then the [inaudible] is done the same way as in the horizontal surface case. Let me explain how the accelerometer helps this procedure.
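A minimal sketch of the horizontal-surface rectification idea, assuming known intrinsics K and the accelerometer's gravity vector expressed in camera coordinates; it keeps only the rotation part (a virtual camera looking straight down along gravity), whereas the talk also uses a translation at a predefined distance to keep the target nicely framed:

```python
import numpy as np
import cv2

def rectify_horizontal_surface(img, K, gravity_cam):
    """Warp an image of a horizontal surface toward a fronto-parallel view by
    virtually rotating the camera so its optical axis points along gravity.

    K           : 3x3 camera intrinsics
    gravity_cam : gravity direction from the accelerometer, in camera coords
    """
    g = gravity_cam / np.linalg.norm(gravity_cam)
    z = np.array([0.0, 0.0, 1.0])            # frontal camera's optical axis

    # Rotation R with R @ z = g, built with the standard "align a to b"
    # Rodrigues construction (undefined in the degenerate case g == -z).
    v = np.cross(z, g)
    c = float(np.dot(z, g))
    vx = np.array([[0, -v[2], v[1]],
                   [v[2], 0, -v[0]],
                   [-v[1], v[0], 0]])
    R = np.eye(3) + vx + vx @ vx / (1.0 + c)

    # Rectifying homography: captured image -> virtual frontal (top-down) view.
    H = K @ R.T @ np.linalg.inv(K)
    return cv2.warpPerspective(img, H, (img.shape[1], img.shape[0]))
```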
Here the vanishing point in the vertical direction can be expressed as the projection of a point at infinity onto the camera. The projection is just multiplying the [inaudible] parameters and [inaudible] parameters; applying the rotation and translation to the point at infinity gives us the point in the camera coordinate system. That point at infinity, expressed in the camera coordinate system, is the same as the gravity direction measured by the phone's accelerometer, because the vertical direction is the gravity direction measured in the phone's local coordinate system. So simply projecting the accelerometer's gravity values into the camera's coordinate system gives us a rough estimate of the vertical vanishing point. This helps estimating the vertical vanishing point because this rough estimate is actually quite good. So we extract line segments from the image, and we also have the rough estimate of the vertical vanishing point obtained by projecting the accelerometer values into the camera; using that rough estimate, we do some refinement with a RANSAC-style optimization approach. We identify vertical lines using a distance function from the vertical vanishing point, because every vertical line should pass through the vanishing point in the image coordinate system, and after iterative refinement we get a good estimate of the vertical vanishing point. The other vanishing point I need is the one in the horizontal direction. The vanishing point in the horizontal direction can be found using the vertical vanishing point we previously estimated. There is an orthogonality constraint between vanishing points, as this equation shows; we generate hypotheses using this orthogonality constraint and then cluster lines using the Jaccard distance, which is defined like this, but I will not explain it in detail; you can refer to the reference here. Then we do some clustering, merge the cluster sets, and do some iterative estimation, and we get the horizontal vanishing point from the best cluster. So now I have the two vanishing points, horizontal and vertical, and from these the orientation of the planar surface can be retrieved; this is not the difficult part. The advantages of using the accelerometer here are speed and robustness. Regarding speed, if we do vanishing point estimation in the conventional way, which mostly relies on line clustering, it takes a long time on mobile phones; mobile phones still have less computational power than PCs, so it becomes very slow. However, using the accelerometer as I explained, you can directly estimate the vertical vanishing point with quite good accuracy, which makes the problem easier. As you can see in these figures, with the accelerometer the vanishing point estimation becomes very fast. The other advantage is robustness. When the scene is very complex, vanishing point estimation using line clustering sometimes fails, because sometimes there are not many horizontal or vertical lines. However, since we know approximately where the vertical vanishing point is, we can find the vanishing point even in very complex cases.
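A minimal sketch of the gravity-assisted vertical vanishing point estimation, assuming known intrinsics K, the camera-frame gravity vector, and extracted line segments; the simple least-squares refinement below is a stand-in for the RANSAC-style refinement described in the talk:

```python
import numpy as np

def vertical_vp_from_gravity(K, gravity_cam):
    """Rough vertical vanishing point: the image of the point at infinity in
    the gravity direction, v ~ K * g (homogeneous image coordinates)."""
    return K @ (gravity_cam / np.linalg.norm(gravity_cam))

def point_line_distance(vp_h, seg):
    """Distance from the vanishing point to the infinite line through a
    segment (x1, y1, x2, y2); vertical lines should pass close to the VP."""
    p1 = np.array([seg[0], seg[1], 1.0])
    p2 = np.array([seg[2], seg[3], 1.0])
    line = np.cross(p1, p2)                  # homogeneous line through the segment
    vp = vp_h / vp_h[2]                      # assumes a finite VP (tilted camera)
    return abs(line @ vp) / np.hypot(line[0], line[1])

def refine_vertical_vp(segments, vp_h, thresh_px=30.0):
    """Keep segments consistent with the rough VP and re-estimate the VP as
    the least-squares intersection of their lines."""
    lines = []
    for s in segments:
        if point_line_distance(vp_h, s) < thresh_px:
            p1 = np.array([s[0], s[1], 1.0])
            p2 = np.array([s[2], s[3], 1.0])
            l = np.cross(p1, p2)
            lines.append(l / np.hypot(l[0], l[1]))
    if len(lines) < 2:
        return vp_h                          # fall back to the sensor estimate
    A = np.stack(lines)
    _, _, Vt = np.linalg.svd(A)              # refined VP = null vector of stacked lines
    return Vt[-1]
```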
In this case there are not many horizontal and vertical lines, but we can still do the job very well using the accelerometer. Until now I have explained how we get the fronto-parallel view of the target object using the accelerometer; now we are ready to acquire template data from the target for detection. So the next part is template-based learning using blurred patches. The objective of this template-based learning is to acquire data from the textures of the fronto-parallel view we made in the previous step. Here we adopt the patch learning approach proposed at CVPR 2009, which learns patches by linearizing the warping procedure and uses a mean patch as the patch descriptor. However, the problem when applying this method to a mobile phone is its memory requirement: the original method requires about 90 MB of precomputed data for fast learning, and there was also a performance problem on mobile phone CPUs at the time. So instead of using the mean patch as such, we try to mimic the original algorithm by applying some blurring. The mean patch is computed like this: the input patch is warped to several different viewpoints, and these warped patches are averaged, giving the mean patch. Our method instead just applies some blurring to the original patches to get similar resulting patches, and I call the result the blurred patch. Let's see how we do it. Applying the set of blurs to the image takes some time on mobile phone CPUs, so we exploit the mobile phone's GPU to make it faster. Our blurred patches are computed through a multipass rendering scheme, shown in this figure. Let me explain this in detail. The first pass takes the input patch, which was warped to the fronto-parallel view, and warps it to another viewpoint so that detection covers varying viewpoints. This warping is replaced by rendering a textured plane [inaudible] on the GPU, because it is much faster than warping on the CPU. Then come the blurring passes. In the second pass we apply radial blurring to the warped patch; this radial blurring allows the blurred patch to cover a range of poses close to the exact pose. The original mean patch algorithm warps the patch several times and then averages, but we skip that warping procedure and replace it with radial blurring. In the third pass we apply a Gaussian blur to make the blurred patch robust to image noise. In the fourth pass we accumulate the blurred patches into a texture; the reason is that reading a whole set of blurred patches from the GPU at once reduces the number of read-backs and the time required for copying data from the GPU to the CPU. Finally, we do some post-processing like downsampling and normalization, and we get a set of blurred patches, each associated with a six-degree-of-freedom pose. Now we are ready to detect the target object. So far we have some template data, and we want to detect the target object in the incoming video stream from the mobile phone camera, and again we use the gravity information for template matching here. The template-based detection method is very good for object detection.
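A CPU-side sketch of building one blurred-patch template, assuming a grayscale fronto-parallel patch and a sampled viewpoint homography; the rotational averaging below is a crude stand-in for the GPU radial-blur pass, and all sizes and blur parameters are illustrative:

```python
import numpy as np
import cv2

def blurred_patch(frontal_patch, H_view, out_size=64, n_rot=7, max_deg=4.0, sigma=1.5):
    """Build one blurred-patch template (CPU version; the talk performs the
    equivalent steps as GPU render passes).

    1. Warp the fronto-parallel patch to a sampled viewpoint (homography H_view).
    2. Approximate the radial blur by averaging a few slightly rotated copies,
       so the template covers poses close to the sampled one.
    3. Gaussian-blur for robustness to image noise.
    4. Downsample and normalize (zero mean, unit norm).
    """
    h, w = frontal_patch.shape[:2]
    warped = cv2.warpPerspective(frontal_patch, H_view, (w, h)).astype(np.float32)

    acc = np.zeros_like(warped)
    for ang in np.linspace(-max_deg, max_deg, n_rot):
        R = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), ang, 1.0)
        acc += cv2.warpAffine(warped, R, (w, h))
    blurred = acc / n_rot

    blurred = cv2.GaussianBlur(blurred, (5, 5), sigma)
    small = cv2.resize(blurred, (out_size, out_size), interpolation=cv2.INTER_AREA)
    small -= small.mean()
    norm = np.linalg.norm(small)
    return small / norm if norm > 0 else small
```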
It can deal with a wide range of textures and shapes. However, if we use more templates, we can detect the target object from a [inaudible] set of different viewpoints, but more templates also make detection slower, because we have to compare more and more templates with the input image sequence. If we have too many templates, the performance on smartphones becomes very bad. To address this problem we again use the gravity information. Let me explain how gravity works here. We assume that real-world objects are aligned with the gravity direction. For example, for the horizontal and vertical surfaces I mentioned, the gravity direction is either normal or parallel to those surfaces, and for 3-D objects we can assume that the upright direction of the object is usually parallel to the gravity direction. So here I want to introduce the gravity-aligned image, in which the vertical vanishing point is either (0, 1, 0) or (0, -1, 0); that means it points either straight up or straight down. In other words, we make the upright direction of the target object in the image parallel to gravity, as shown here. Let me explain more about this. In the original image, taken from the normal viewpoint, the gravity direction and the upright direction are parallel. When the user changes the orientation of the camera, the target object's upright direction in the captured image is no longer parallel to the gravity direction. The gravity-aligned image means we warp the captured image so that its upright direction is again parallel to the gravity direction. The advantage of the gravity-aligned image for template detection is that it reduces the number of orientations to consider when building templates. As shown in this figure, we can build just a single template and still detect the object in different orientations like this. If we did not use the gravity-aligned image, we would have to build templates for all of these cases, which increases the number of templates. Let me explain how the gravity-aligned image is computed. It is quite easy. Assume that the image is captured by a camera here, and this is the gravity direction; what we want to do is make the blue and red arrows parallel, like this, and that can be done by a simple rotation by the angle theta, and the transformation can be computed. The problem is how we know the angle theta. The key fact is that the blue line is the line connecting the vertical vanishing point and the center of the image. So if we know the vertical vanishing point, we can easily compute the angle theta and warp the original image to the gravity-aligned image. As I mentioned, conventional vanishing point estimation methods are very slow; in this case we have to do template matching in real time, so even if the vanishing point estimation takes only a few hundred milliseconds, it is too slow for the template matching process. Our approach is to use the accelerometer here. As I explained before, the accelerometer gives us a good estimate of the vertical vanishing point, so the rotation R(theta) can be obtained directly without any image processing, because we know the position of the vanishing point from the accelerometer directly. Applying this rotation transformation is very simple, and we also do it on the mobile phone's GPU for a fast warping process.
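A minimal sketch of the gravity-aligned warp, assuming intrinsics K, the camera-frame gravity vector, and a camera tilted enough that the vertical vanishing point is finite; the rotation sign depends on the image-axis conventions and may need flipping on a particular platform:

```python
import numpy as np
import cv2

def gravity_align(img, K, gravity_cam):
    """In-plane rotate a camera frame so the target's upright direction stays
    aligned with gravity (the gravity-aligned image).

    theta is the angle between the image's vertical axis and the line from the
    image center (principal point) to the vertical vanishing point, which is
    obtained here directly from the accelerometer with no image processing.
    """
    vp = K @ (gravity_cam / np.linalg.norm(gravity_cam))  # vertical VP, homogeneous
    cx, cy = K[0, 2], K[1, 2]
    dx = vp[0] / vp[2] - cx
    dy = vp[1] / vp[2] - cy
    theta = np.degrees(np.arctan2(dx, dy))   # 0 when the VP lies straight below center

    h, w = img.shape[:2]
    R = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), -theta, 1.0)
    return cv2.warpAffine(img, R, (w, h)), theta
```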
So this video gives you a very clear idea of how the gravity-aligned image works. You can see that in the original image the target object's upright direction changes as the user rotates the camera, but in the gravity-aligned image it always keeps the aligned direction. Another video gives an even clearer idea: you can see that the target is always kept in its upright direction in the warped image. After detecting the target by template matching, we do some tracking using the ESM-Blur algorithm, which will be introduced by another colleague who is coming next week, and we retrieve the six-degree-of-freedom pose of the detected target surface. Here we use NEON instructions, which are SIMD instructions like SSE on Intel CPUs, but for mobile CPUs. So I have explained the theory behind my work; now let me show you some experimental results. Here are some parameters. Our method requires only about 900 kB of data per target object, which is very small compared to the original algorithm. >>: [inaudible] image has 225 views? >> Wonwoo Lee: Yes. >>: So when you take a picture of that many views, how do you know which object to learn? Do you actually [inaudible]? >> Wonwoo Lee: Ah, these 225 views are generated from the input fronto-parallel view. >>: At certain times? >> Wonwoo Lee: Yeah. This video shows how it works with a horizontal target. It is similar to what you saw at the start. Learning the reference templates takes just a few seconds, and then the user can start detection of the target object; the target's pose is augmented on the image of the object, and the user can select an object related to the contents of the box and render a [inaudible] object. This one is interacting with a vertical surface. Again the user takes a photo of a vertical surface from an arbitrary viewpoint, it is rectified using the vanishing point, and template data is then generated on the mobile phone on the spot. Then the user can start detection from different viewpoints. This one shows another experimental result, detection from different viewpoints for vertical and horizontal targets, and this one is for different scales. Because we build the templates considering scale, we can detect the target object at different scales. This one shows targets whose frontal views are unavailable, typically vertical surfaces on a building. And this one is a horizontal surface whose features are very far from the user, so a fronto-parallel view is definitely not available in this situation. By estimating the six-degree-of-freedom pose, the [inaudible] content is rendered in the right orientation. >>: What would you say about scales; how do you learn the scale? >> Wonwoo Lee: The scale is the distance between the target object and the camera. The input image is just a single scale, and we warp the input image to different distances. >>: I mean you don't, you can't recover, you don't really know, given an input image you don't really know… >> Wonwoo Lee: Yeah, the real distance is not known. >>: So… >> Wonwoo Lee: It is a relative distance. >>: Okay. But then if you want to put a sign on a building to match the size of the doors, let's say… >> Wonwoo Lee: The size of the content should be [inaudible] should have been known beforehand, or the user has to give some input to determine the scale of the content.
>>: So once you've labeled, once you've annotated the size, the right size on the door, and somebody comes up later and looks at the same door, maybe from a much different distance, will they see the sign? >> Wonwoo Lee: Will it be the right size? Yeah, because it is just a matter of the rendered size of the rectangle, I mean in the sign case. >>: Yeah, or for that person who is standing on the ground, obviously it was very big in some sense, but if you wanted to… >> Wonwoo Lee: Ah, you mean this case? >>: Yeah. >> Wonwoo Lee: Okay. If I see this region, say this is the roof of a building, and if I see this region maybe from the second story, it becomes very big, because the scale is maintained between the two applications. And this is a target on a building which is much higher than the user's viewpoint. >>: Sometimes I get confused about which one you use to learn and which one you use [inaudible] potential building and the building [inaudible] because the pictures… >> Wonwoo Lee: Yeah. >>: So which, I mean… >> Wonwoo Lee: From your side, the leftmost image is the input image captured by a user, and the other two just show detection results in the slide. >>: So it has been trained by some other, the target has been learned from some other [inaudible], right? >> Wonwoo Lee: No. The leftmost image is used as the input, the templates are built from that image, and these are other screenshots of detection. >>: These are different images from the first one? >> Wonwoo Lee: Yes. They look similar, but they are different. >>: Why do you correct for mostly rotation of the camera? That's not very common. Why would they hold their cell phone like this? >> Wonwoo Lee: Yeah, but it happens. Usually we use a mobile phone like this and do not view contents like this, but when you use an AR application, there are always some orientation changes, because you want to see the contents from different directions, and that causes orientation changes of the phone. It really depends on… >>: I guess I am asking why do you [inaudible] I assume that really [inaudible] testing [inaudible] user takes a [inaudible] trying to detect the picture. To get a detection, of course, there is no prior knowledge of what the picture is. >> Wonwoo Lee: Yeah, right. >>: So do you have to, like, scan everything? >> Wonwoo Lee: Yeah. Our approach is doing some kind of scanning… >>: You scan the real world? >> Wonwoo Lee: Yeah, there are two possibilities: using some [inaudible] detection to, let me say, find the target object where there is a corner… >>: But you don't do corners [inaudible] detection [inaudible]. >> Wonwoo Lee: We tried corner detection as well as scanning approaches, and in the case of corner detection it is affected by the textures. If the corner is not detected, the template is not comparable to the original even though the object is there, so we adopted a scanning approach. >>: Okay. So when you are scanning, how do you know what size to scan? >> Wonwoo Lee: Yeah, it is predetermined. >>: Okay. >> Wonwoo Lee: There is a predetermined window size, like a sliding window that moves over the image, something like this. >>: [inaudible] so you said the size of the window is predetermined, right? >> Wonwoo Lee: Yeah. I mean… >>: So how do you know? >> Wonwoo Lee: The patch is this rectangle at the center of the first image, and only that part is used for learning. We don't use the entire image.
>>: I understand. Suppose through the learning [inaudible] suppose the patch size, the size of the pictures [inaudible] but in the image in the middle, let's say the picture's size is [inaudible]. >> Wonwoo Lee: You mean the image used for detection, not the input image? In that image the target can be smaller or larger, so we build templates by applying some scale factors and warping the images, synthesizing the images used for template building. >>: I will ask you later. >> Wonwoo Lee: So the thing is, if the target image is here, then you take the fronto-parallel view, and the camera is here. If we move the camera to the front, the image of the object becomes larger, and we build templates from the original image, and they give us templates for… >>: [inaudible] cover, the templates cover [inaudible] sizes, but [inaudible] window, each window you recover a portion of that image, so that even though your template has different scales, it doesn't help in this case. >> Wonwoo Lee: Yeah. In this case too much scale change causes a problem in detection. Okay. I have some more examples captured in the real world [inaudible]. What we can do with this method is in-situ tagging, similar to the previous video, and this one is interacting with the ceiling. Another video is [inaudible] with the shaded region; I made this video for fun. And here comes a [inaudible] here. Then this one shows the sharing of augmentations between two mobile phones. Here the target is acquired by phone A and transmitted to phone B through a [inaudible] connection, and then phone B starts detection of the target object, showing the proper messages on the target. It was very hard to control two phones with one hand at the time. So this is the [inaudible] connection, and the data is transmitted over to phone B. Now let me give you some results about performance and timing. The timing was measured on iPhones; this shows the learning speed, how much time each step takes. The time increases as the number of views we consider increases, because we have more warping and blurring for more viewpoints. On PCs it is very fast, just a few milliseconds. On the iPhone or iPhone 4 it takes a few seconds, and too many viewpoints make it slower. The most time-consuming step is the radial blur. It is slow because it accesses textures on the GPU neither horizontally nor vertically, and on mobile phone GPUs that kind of access is very slow, so the radial blur takes much of the time; but it could be made faster if I optimized my code. >>: When you do the [inaudible] and then you [inaudible]? >> Wonwoo Lee: Yes, we do that. >>: Did you try doing it the other way around, for example, [inaudible] how much worse? >> Wonwoo Lee: The thing is, what we want is the gray value of a pixel at the same position as in the original patches. [inaudible] the [inaudible] and then we blur radially. The blurred pixels would be different in that case. That is not what I wanted, so I didn't try it here. Okay. This slide shows the comparison with the original algorithm in detection performance. You can see that ours has slightly lower performance, but it still shows comparably good detection performance, except for these last three cases.
These images have very limited texture, like grass or a board or something [inaudible], with many similar repeated textures, so if you apply our learning to them it shows poor performance, and you can see that the original algorithm also shows somewhat lower detection performance. This is a comparison of memory usage. As I told you, the original algorithm requires a large amount of precomputed data, but we did not want that, so we took another approach, and you can see our memory consumption is much less than the original one. We lose some detection performance, but we save a large amount of memory. So let me conclude my presentation here. We proposed computer-vision-based approaches for in-situ AR tagging in real-world environments, and we exploited the sensor information that is widely available on modern smartphones for computer vision tasks such as finding orientations and vanishing points and for template matching. Using our approaches, you can do in-situ detection in the real world by learning and detecting the target object. What I explained so far was about personal interaction with the [inaudible], but augmented reality is not limited to personal interaction: two users can interact using augmented reality applications by sharing their environments, by building augmented reality spaces at their own locations. So AR can be used for interaction between users as well as [inaudible] in the real world. This is my concept for future work. Okay, this is what I prepared for this presentation. Thank you for listening to my work. >> Phil Chou: Thank you. [applause]. >>: Regarding the scenario where [inaudible] learning, it seems like the scenario is that you take a picture and then you [inaudible] it and then you're going to recognize the same picture again, right? So it doesn't seem to be very useful, since you already just took a picture; why would you want to recognize the same picture in the same place again? >> Wonwoo Lee: Well, for personal use alone this is not that meaningful, but consider social applications. If I make an annotation about an object, say a picture in a museum, recording my impression of the picture, then others who visit the museum for the first time may want to know what other people think about that picture, and they can see it through their mobile phones. So this is not just for personal use; the application is for sharing these augmentations with other people, as a feature on smartphones. Yes? >>: The accumulation step, I just want clarification on that. What exactly was that? When you say accumulation, it sounds like you are computing something like a mean image, but the graphic you had looked like [inaudible]. >> Wonwoo Lee: The term accumulation is a little bit misleading; my intention was just putting the [inaudible] into a texture, let me say packing it. >>: So it's a texture atlas, is what you're saying? >> Wonwoo Lee: Yes. >> Phil Chou: Any more questions? Let's thank the speaker again. [applause]. >> Wonwoo Lee: Thank you very much.