>> Phil Chou: So it's a great pleasure for me to introduce Wonwoo Lee. He is from the
Ubiquitous Virtual Reality Lab at the Gwangju Institute of Science and Technology in
Gwangju, Korea. He will be talking about his work in mobile augmented reality.
Wonwoo received his bachelor's degree in mechanical engineering from Hanyang
University in Seoul and then his master's degree in information and communication from
GIST. He is currently about to get his PhD at GIST on this topic and he is here visiting
as a gratis visitor for three days, yesterday, today and tomorrow along with his advisor
Professor Woo, who is a consulting researcher here for this week and next week, so if
you like what you're hearing today and want to talk with either of them say this afternoon
or tomorrow afternoon or some other time, please see me in person right after this talk or
you can send me an e-mail to Phil Chou to set up a meeting time with them. So
Wonwoo's current interests are in silhouette segmentation for 3-D reconstruction, real-time object detection for augmented reality, and GPGPUs for high-performance computing, and we will let him talk about his work in mobile augmented reality.
>> Wonwoo Lee: Thank you Phil. Good afternoon. I think you may be a little bit sleepy after lunch, but please follow my presentation. Today I'm going to talk about Video-Based In-Situ Tagging for Mobile Augmented Reality, which I have worked on during my PhD studies. The outline of this talk is shown on this slide: I will give an overview of the research I have done, give you some introduction to augmented reality, go through some details of my work, and conclude the presentation with some future work.
One of my ongoing research projects is textureless object detection using depth information. As you know, you can do many things with depth information, and it can help us detect complex 3-D objects which do not have much texture. We combine the RGB and depth information for target detection and then estimate the pose using the [inaudible] shape of the target object. This video shows some detection examples, overlaying the name of the target object on top of each one. This part shows the advantage of using depth information: when we use the RGB image only, detection is badly affected by lighting conditions, in this case sunlight, but when we combine it with depth information we can do robust detection. A similar detection problem happens when there are strong shadows on the object like this, and again we could handle it by combining RGB and depth information.
So this is one of my ongoing research projects and it is related to the topic I present today. The interesting thing is that using depth information we can know the size of the target object, so even if there are two objects which have the same textures but different scales, using depth information we can distinguish one from the other; this is the interesting part of using depth information. During my PhD course I focused on computer vision techniques for the reconstruction of 3-D objects and for using the reconstructed 3-D content to augment real scenes in augmented reality applications. Basically I assume that users take photos with their mobile phones, we collect the photos from the users, and we reconstruct 3-D objects from those photos. From the 3-D model and the textures we can create realistic 3-D content, and that realistic 3-D content is overlaid or [inaudible] onto the real scene through augmented reality techniques, so we get more realistic results from the AR application.
From this viewpoint the system looks like this. The user captures photos with a mobile phone and the photos are transmitted to the modeling server. In the modeling server we have to do 3-D reconstruction, but before that we have to identify the target object in a set of photos taken from multiple viewpoints. So we first do multiview silhouette segmentation, and here we use color and spatial consistency measured in 3-D to identify the foreground object in the images. This video shows how the method works. Here there are photos taken from multiple viewpoints. We first initialize the silhouettes of the target object by [inaudible] the viewing volumes of each camera, and then we iteratively optimize the target object silhouettes in all of the views simultaneously. After some iterations, you can see that the foreground object's regions are refined, and finally we get the silhouettes of the foreground object.
After finding the silhouettes of the target object we build a 3-D model from the silhouettes and color images, so the next part is surface reconstruction, and here I worked on building a smooth [inaudible] from silhouettes and color images. In this case we have images and silhouettes and we construct a visual hull, but the visual hull is quite a simple reconstruction method; we aim to build smooth surfaces from the intersected silhouettes to use in augmented reality applications. So we iteratively refine the visual hull model to get smooth surfaces and then apply some textures.
So now we have a 3-D model of the object in the photos, and that 3-D reconstruction is transmitted back to the user's mobile phone, where it is used for augmenting various things through the phone. This shows how we do it. Usually we take a shot of a target object, we do some automatic learning on the mobile phone, and then the target is instantly detected in video sequences taken from the mobile phone's camera. We reconstructed this 3-D model from the photos, and you can see that the reconstructed model is then overlaid [inaudible] with a proper [inaudible] [inaudible]. So this was my research introduction, and today I will talk about the remaining part of my work that I showed in the previous slide.
Before I get into the details, let me first mention the concept of ubiquitous virtual reality, which aims for [inaudible] everywhere. In this concept the real and virtual worlds are modeled in an augmented reality space where real entities are mirrored to virtual entities. The connection of the real and virtual entities is important in this concept for making interaction between the virtual and real world possible. To connect the real and virtual entities we have to recognize the 3-D object, or the target object, from image sequences, and my work is about that.
This figure shows one typical scenario of mobile augmented reality, which can be regarded as a subset of the [inaudible] concept. In this figure the real object is connected to virtual entities, which we usually call AR annotations, so the [inaudible] information is overlaid onto the real scene through the camera, and the [inaudible] entity is geometrically registered with the real object. But a problem arises when we do these things in outdoor scenes where the scenes are unprepared. Unprepared means we do not have much information about the scene, such as its 3-D geometry. In this case, even if we want to add annotations on a specific object, we usually have no way to do it using computer vision techniques because we do not know the target object at that moment. So what we need is online learning and detection of target objects in situ, to enable a user to interact with a target object without prior knowledge.
So we propose a novel augmentation method with minimal user interaction, a very simple point-and-shoot approach. This video shows an overview of my method. The user takes a shot with the mobile phone camera; as you see, the target object can be detected without any difficult interaction, and then users can add [inaudible] augmentations on the target object. The advantage of this method is that it is very simple and does not do any complex 3-D reconstruction of the scene. Also, users can detect the target object from viewpoints that differ from the input viewpoint.
My method has some assumptions. The input to the algorithm is an image of the target object; I assume here that the target object is planar. The output is patch data and its associated camera poses, used to retrieve the six-degree-of-freedom pose. The assumptions are that we have known camera parameters, like the focal length and principal point, and also that the target object is either horizontal or vertical, which is very common in the real world.
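As a minimal sketch of the intrinsic parameters assumed known here (focal length and principal point), written in Python; the numeric values are placeholders for illustration only, not numbers from the talk, and later sketches in this transcript reuse this K:

    import numpy as np

    # Hypothetical intrinsics for illustration only: focal length in pixels,
    # principal point at the image center, square pixels, zero skew.
    f, cx, cy = 1500.0, 320.0, 240.0
    K = np.array([[f,   0.0, cx],
                  [0.0, f,   cy],
                  [0.0, 0.0, 1.0]])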
From these assumptions the target learning procedure works as shown in this slide. From the input image my approach computes a fronto-parallel view, which is the image of the target object as seen from the normal viewpoint; I will explain that in detail later. The target object is then learned from the fronto-parallel view by warping the input patches, applying some blurring, and doing some post-processing. So the fronto-parallel view generation step warps the source image to a fronto-parallel view, and the template learning step builds template data from the input images.
The first step of learning the target is fronto-parallel view generation, and some of you may wonder what a fronto-parallel view means. A fronto-parallel view is a picture like this: it is taken from the normal viewpoint, and the object has no orientation change relative to the camera. Usually, computer vision based object detection methods require a fronto-parallel view to learn planar surfaces, so the assumed situation is that the user's camera is at the same height as the target object. However, in the real world this does not always happen, and this situation is more common: the target object is lower or higher than the user's viewpoint. In this case the images acquired by the camera have a perspective distortion because of the camera's characteristics, so frontal views are not always available in practical situations. Especially for horizontal surfaces you may not get, and may not want to get, a picture of an object from that viewpoint; we usually view it at some angle relative to the target object.
From these photos it is not possible to retrieve correct template data that can be used for detection and pose estimation. So the objective of fronto-parallel view generation is warping the source image to make it look like a scene from the frontal view. The approach here is to exploit the mobile phone's built-in sensors, especially the accelerometer, which provides the direction of gravity. By combining some computer vision techniques and the phone's sensors we can do this very easily in practice.
Let me talk a little bit more about the accelerometer sensor. It provides the direction of gravity in the phone's local coordinates, and gravity is normal to horizontal surfaces and parallel to vertical surfaces. So the direction of gravity provides very strong, useful information about the horizontal and vertical surfaces I want to interact with. Let me talk first about the horizontal surface case. In the horizontal surface case we can assume that there is only one degree of freedom in the orientation between the camera and the target surface. Generally there is more than one degree of freedom, but I am making the assumption that there is only pitch rotation. So from the known camera matrix we can set up the frontal-view camera, whose pose we set to identity, and define the capture camera as having some rotation and translation.
We can compute the rotation from the accelerometer [inaudible], and then the translation [inaudible] can also be computed from the rotation and some distance which is predefined as a proper length. From the known rotation and translation parameters we can compute the homography to warp the input image to the fronto-parallel view; H is the homography here. By simply warping the input image we can rectify it so that it has a correct fronto-parallel view, as shown in this slide. You can see that the rectification is quite good even though we do not do any image processing here. So horizontal surfaces are simple, but how about vertical surfaces?
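A minimal Python sketch of this horizontal-surface rectification under simplifying assumptions: the talk computes a homography from a rotation, a translation and a predefined distance, whereas the sketch below keeps only the rotation part (a pure-rotation homography), which already makes the plane fronto-parallel up to scale and translation. The function names, the gravity sign convention, and the use of OpenCV are assumptions, not details from the talk.

    import numpy as np
    import cv2

    def rotation_aligning(a, b):
        # Rotation matrix taking unit vector a onto unit vector b (Rodrigues form).
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        v = np.cross(a, b)
        s, c = np.linalg.norm(v), float(np.dot(a, b))
        if s < 1e-8:
            return np.eye(3)  # parallel; antiparallel case not handled in this sketch
        vx = np.array([[0.0, -v[2], v[1]],
                       [v[2], 0.0, -v[0]],
                       [-v[1], v[0], 0.0]])
        return np.eye(3) + vx + vx @ vx * ((1.0 - c) / (s * s))

    def fronto_parallel_warp(image, K, gravity_cam):
        # For a horizontal surface the plane normal is (anti)parallel to gravity,
        # so rotating the camera so gravity maps onto the optical axis makes the
        # surface fronto-parallel (up to scale).  The sign of gravity_cam depends
        # on the device's accelerometer convention and may need flipping.
        n = gravity_cam / np.linalg.norm(gravity_cam)
        R = rotation_aligning(n, np.array([0.0, 0.0, 1.0]))
        H = K @ R @ np.linalg.inv(K)   # pure-rotation homography
        h, w = image.shape[:2]
        return cv2.warpPerspective(image, H, (w, h))

The predefined distance mentioned in the talk would only fix the scale and placement of the rectified view, which is why the rotation alone is enough for this sketch.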
Okay. We can also make the one-degree-of-freedom rotation assumption for vertical surfaces, like the image shown at the top, but in general more complex cases happen in the real world, as shown here. The user may hold the phone with some orientation in his hand, and the vertical surface itself can also have an orientation relative to the user. So what about these cases? The sensors alone cannot solve this problem, so now we add some computer vision techniques to make the problem easier. In the vertical surface case our approach is to use the vanishing point, which is a very straightforward way to find the orientation of the vertical surface relative to the camera. Here the accelerometer again helps the vanishing point estimation. By estimating the vanishing point from line segments, the orientation is retrieved, and then the [inaudible] is done the same way as in the horizontal surface case.
Let me explain how the accelerometer can help this procedure. The vanishing point in the vertical direction can be expressed as the projection of a point at infinity into the camera, and the projection procedure is multiplying the [inaudible] parameters [inaudible] parameters and then applying the rotation and translation. Transforming the original point at infinity into the camera coordinate system gives a direction that is the same as the gravity direction measured by the phone's accelerometer, because the vertical direction and the gravity direction coincide in the local coordinate system. So just projecting the gravity values into the camera's coordinate system gives us a rough estimation of the vanishing point in the vertical direction.
This helps estimating the vertical vanishing point because this rough estimation is actually quite good. So we have some line segments in the image, and we also have the rough estimation of the vertical vanishing point obtained by projecting the accelerometer values into the camera. Using that rough estimation we do some refinement with a RANSAC-style optimization approach. We identify vertical lines using a distance function from the vertical vanishing point, because every vertical line should pass through the vanishing point in the image coordinate system; we do the refinement iteratively, and then we get a good estimation of the vertical vanishing point.
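A sketch of this two-stage idea in Python, assuming line segments have already been extracted (for example with a line segment detector): the rough vertical vanishing point is the gravity direction projected through K, and a simplified inlier / re-estimation loop stands in for the RANSAC refinement described in the talk. Function names, the threshold and the iteration count are illustrative, not values from the talk.

    import numpy as np

    def rough_vertical_vp(K, gravity_cam):
        # The vertical vanishing point is the image of the point at infinity in
        # the gravity direction: v ~ K * g (homogeneous image coordinates).
        return K @ (gravity_cam / np.linalg.norm(gravity_cam))

    def line_through(p, q):
        # Homogeneous line through two image points p = (x1, y1), q = (x2, y2).
        return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

    def point_line_distance(v, line):
        # Distance from a finite homogeneous point v to a homogeneous line
        # (assumes the vanishing point is finite, i.e. the camera is not upright).
        x, y = v[0] / v[2], v[1] / v[2]
        return abs(line[0] * x + line[1] * y + line[2]) / np.hypot(line[0], line[1])

    def refine_vertical_vp(v0, segments, thresh=5.0, iters=3):
        # Keep the segments whose lines pass near the current estimate, then
        # re-estimate the vanishing point as their least-squares intersection.
        lines = [line_through(p, q) for p, q in segments]
        v = np.asarray(v0, dtype=float)
        for _ in range(iters):
            inliers = [l for l in lines if point_line_distance(v, l) < thresh]
            if len(inliers) < 2:
                break
            A = np.array([l / np.hypot(l[0], l[1]) for l in inliers])
            v = np.linalg.svd(A)[2][-1]  # singular vector of the smallest value
        return v  # homogeneous 3-vector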
The other vanishing point I need is the vanishing point in the horizontal direction. It can be found using the vertical vanishing point we previously estimated. There is an orthogonality constraint between the vanishing points, as this equation shows, and we generate hypotheses using this orthogonality constraint and then do line clustering using the Jaccard distance, which is defined as something like this, but I will not explain it in detail; you can refer to the reference here. Then, by doing some line clustering, merging the cluster sets, and doing some iterative estimation, we get the horizontal vanishing point from the best cluster. So now I have the two vanishing points, in the horizontal and vertical directions, and from these the orientation of the planar surface can be retrieved; this is not a difficult part.
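The orthogonality constraint mentioned here is the standard one through the image of the absolute conic; below is a small sketch of how candidate horizontal vanishing points (for example, one per line cluster) could be scored with it. The J-linkage-style clustering with the Jaccard distance from the cited reference is not reproduced, and the function names are assumptions.

    import numpy as np

    def orthogonality_residual(v_vert, v_horiz, K):
        # Vanishing points of perpendicular 3-D directions satisfy
        # v_vert^T * omega * v_horiz = 0, where omega = (K K^T)^(-1) is the
        # image of the absolute conic.  Inputs are homogeneous 3-vectors.
        omega = np.linalg.inv(K @ K.T)
        a = v_vert / np.linalg.norm(v_vert)
        b = v_horiz / np.linalg.norm(v_horiz)
        return float(a @ omega @ b)

    def best_horizontal_vp(v_vert, candidates, K):
        # Pick the candidate that best satisfies the orthogonality constraint.
        return min(candidates, key=lambda v: abs(orthogonality_residual(v_vert, v, K)))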
The advantage of using the accelerometer here is speed and robustness. In the case of speed, if we do the vanishing point estimation in the conventional way, which mostly relies on line clustering, it takes too much time on mobile phones; as you know, mobile phones still have less computational power than PCs, so it becomes very slow. However, using the accelerometer as I explained, you can directly estimate the vertical vanishing point with quite good accuracy, and that makes the problem easier. As you can see in these figures, with the accelerometer the vanishing point estimation becomes very fast. The other advantage is robustness. When the scene is very complex, vanishing point estimation using line clustering sometimes fails because there are not many horizontal or vertical lines. However, since we know where the vertical vanishing point is with good accuracy, we can find the vanishing point even in very complex cases. In this case there are not many horizontal and vertical lines, but we can still do the job very well using the accelerometer.
Until now I explained how we get the fronto-parallel view of the target object using the accelerometer sensor; now we are ready to acquire template data from the target for detection. So the next part is template-based learning using blurred patches. The objective of this template-based learning is to acquire data from the textures of the fronto-parallel view we made in the previous step. Here we adopt the patch learning approach proposed at CVPR 2009, which linearizes the warping procedure and uses a mean patch as the patch descriptor. However, the problem with this method when applying it to a mobile phone is its memory requirement: the original method requires about 90 MB of precomputed data to be loaded for fast learning, which was a serious problem on mobile phone CPUs at the time. So instead of using the mean patch as such, we try to mimic the original algorithm by applying a blurring method. The mean patch is computed as follows: the input patch is warped to several different viewpoints, and these warped patches are averaged, which gives the mean patch.
Our method instead applies some blurring to the original patch to get a similar resulting patch, which I call the blurred patch. Let's see how it is done. Applying a set of blurs to the image takes time on mobile phone CPUs, so we exploit the mobile phone's GPU to make it faster. Our blurred patches are computed through a multi-pass rendering scheme, shown in this figure. Let me explain this in detail. The first pass takes the input patch, which has been warped to the fronto-parallel view, and warps it to another viewpoint so that detection works from varying viewpoints. This warping is replaced by rendering a plane [inaudible] on the GPU, because it is much faster than warping on the CPU. Then the blurring passes come. In the second pass we apply radial blurring to the warped patch; this radial blurring allows the blurred patch to cover a range of poses close to the exact pose. The original mean patch algorithm warps the patch several times and then averages, but we skip that repeated warping and replace it with radial blurring. In the third pass we apply a Gaussian blur to make the blurred patch robust to image noise. In the fourth pass we accumulate the blurred patches into a texture unit; the reason is that reading a whole set of blurred patches from the GPU at once reduces the number of read-backs and the time required for copying data from GPU to CPU. Finally, we do some post-processing like downsampling and normalization, and we get a set of blurred patches, each with its associated six-degree-of-freedom pose.
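A CPU-side sketch of the blurred-patch idea in Python with OpenCV, standing in for the multi-pass GPU rendering just described: one warp per sampled viewpoint, a radial blur approximated by averaging small in-plane rotations, a Gaussian blur for noise, then downsampling and normalization. The patch sizes, blur parameters and function names are illustrative assumptions, not values from the talk, and the input patch is assumed to be grayscale.

    import numpy as np
    import cv2

    def blurred_patch(frontal, H_view, patch_size=64, out_size=16,
                      radial_deg=4.0, radial_steps=5, sigma=1.5):
        # frontal: grayscale fronto-parallel patch; H_view: 3x3 homography to
        # one sampled viewpoint.  Returns a normalized low-resolution template.
        # Pass 1: warp the fronto-parallel patch to the sampled viewpoint.
        warped = cv2.warpPerspective(frontal, H_view, (patch_size, patch_size))
        # Pass 2: radial blur, approximated by averaging small in-plane rotations,
        # so the template covers poses close to the sampled one.
        center = (patch_size / 2.0, patch_size / 2.0)
        acc = np.zeros((patch_size, patch_size), np.float32)
        angles = np.linspace(-radial_deg, radial_deg, radial_steps)
        for a in angles:
            M = cv2.getRotationMatrix2D(center, float(a), 1.0)
            acc += cv2.warpAffine(warped, M, (patch_size, patch_size)).astype(np.float32)
        radial = acc / len(angles)
        # Pass 3: Gaussian blur for robustness to image noise.
        smooth = cv2.GaussianBlur(radial, (0, 0), sigma)
        # Post-processing: downsample and normalize to zero mean, unit norm.
        small = cv2.resize(smooth, (out_size, out_size), interpolation=cv2.INTER_AREA)
        small -= small.mean()
        norm = np.linalg.norm(small)
        return small / norm if norm > 0 else small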
Now we are ready to detect the target object. We have template data and we want to detect the target object in the incoming video stream from the mobile phone camera, and again we use the gravity information, this time for template matching. Template-based detection is very good for object detection across a wide variety of textures and shapes. However, while using more templates lets us detect the target object from a wider set of viewpoints, it also makes detection slower, because we have to compare more and more templates against the input image sequence. So if we have too many templates the performance on smart phones becomes very bad. To address this problem we again use the gravity information.
Let me explain how gravity works here. We assume that real-world objects are aligned with the gravity direction. For example, for the horizontal and vertical surfaces I mentioned, the gravity direction is either normal or parallel to those surfaces, and for a 3-D object we can assume that the upright direction of the object is usually parallel to the gravity direction. So here I want to introduce the gravity-aligned image, which is an image where the vertical vanishing point is either (0, 1, 0) or (0, -1, 0); that means it is in either the up or the down direction. This means we make the upright direction of the target object in the image parallel to gravity, as shown here. Let me explain more about this. The original image here is taken from the normal viewpoint, where gravity and the upright direction are parallel. When the user introduces some orientation change while pointing the camera, the target object's upright direction in the captured image is no longer parallel to the gravity direction. The gravity-aligned image means we warp the captured image so that its upright direction becomes parallel to the original gravity direction. The advantage of the gravity-aligned image in template detection is that you can reduce the number of orientations to consider when building templates. As shown in this figure, we can use a single template to detect the target object in different orientations like this. If we did not use the gravity-aligned image, we would have to build templates for all of these cases, which increases the number of templates.
Let me explain how the gravity-aligned image is computed. It is quite easy. Let's assume the image is captured by a camera here, and this is the gravity direction. What we want to do is make the blue and red arrows parallel like this, and this can be done by a simple rotation transformation by the angle theta, which can be computed. The problem is how we know the angle theta. The key fact is that the blue line is the line connecting the vertical vanishing point and the center of the image. From this, if we know the vertical vanishing point we can easily compute the angle theta and warp the original image into the gravity-aligned image.
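A small Python sketch of this warping step, assuming the vertical vanishing point is taken directly from the accelerometer as described next: the roll angle is measured between the image's vertical axis and the line from the image center to the vanishing point, and the frame is rotated back by that angle about the center. The sign convention depends on the device's accelerometer axes and may need flipping; function names are assumptions.

    import numpy as np
    import cv2

    def gravity_aligned_image(image, K, gravity_cam):
        # Rough vertical vanishing point straight from the accelerometer,
        # with no image processing (as in the talk).
        v = K @ (gravity_cam / np.linalg.norm(gravity_cam))
        vx, vy = v[0] / v[2], v[1] / v[2]
        h, w = image.shape[:2]
        cx, cy = w / 2.0, h / 2.0
        # Angle (degrees) between the center-to-vanishing-point line and the
        # image's vertical axis; rotating by its negative aligns the object's
        # upright direction with the image vertical.
        theta = np.degrees(np.arctan2(vx - cx, vy - cy))
        M = cv2.getRotationMatrix2D((cx, cy), -theta, 1.0)
        return cv2.warpAffine(image, M, (w, h))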
As I mentioned, conventional vanishing point estimation methods are very slow, and in this case we have to do template matching in real time, so even if the vanishing point estimation takes only a few hundred milliseconds it is still too slow for the template matching process. So our approach is to use the accelerometer here. I explained before that the accelerometer gives us a good estimation of the vertical vanishing point, so the rotation R-theta can be obtained directly without any image processing, because we know the position of the vanishing point from the accelerometer directly. Applying this rotation transformation is very simple, and we also do it on the mobile phone's GPU for a fast warping process. This video gives you a clear idea of how the gravity-aligned image works. You can see that in the original image the target object's upright direction changes as the user rotates the camera, but the gravity-aligned image always keeps the aligned direction. Another video gives you an even clearer idea: the target is always kept in its upright direction in the warped image. After detecting the target by template matching, we do tracking using the ESM-Blur algorithm, which will be introduced by another colleague who will come next week, and from that we retrieve the six-degree-of-freedom pose of the detected target surface. Here we use NEON instructions, which are SIMD instructions like SSE on Intel CPUs, but for mobile CPUs.
So I have explained the theory behind my work, and now I will give you some experimental results. Here are some parameters, here and here. Our method requires only about 900 kB of data for the target object, which is very small for target detection compared to the original algorithm.
>>: [inaudible] image has 225 views?
>> Wonwoo Lee: Yes.
>>: So when you take a picture of that many views, how do you know which object to
learn? Do you actually [inaudible]?
>> Wonwoo Lee: Ah, these 225 views are generated from the input fronto-parallel view.
>>: At certain times?
>> Wonwoo Lee: Yeah. This video shows how the method works with a horizontal target. It is similar to what you saw at the start. Learning the reference templates just takes a few seconds, and the user can start detection of the target object without selecting anything; the target object's pose is augmented on the image of the object, and then the user can select an object which is related to the contents of the box and render a [inaudible] object. This one is interacting with vertical surfaces. Again the user takes a photo of a vertical surface from an arbitrary viewpoint, it is rectified using the vanishing point, and the template data is generated on the mobile phone at that moment. Then the user can start detection from different viewpoints. This one shows another experimental result, detection from different viewpoints for vertical and horizontal targets, and this one is for different scales. Because we build the templates considering scale, we can detect the target object at different scales.
This one shows targets whose frontal views are unavailable, such as vertical surfaces on a building. And this one is a horizontal surface whose features are very far from the user, and obviously its fronto-parallel view is not available to the user in this situation. By estimating the six-degree-of-freedom pose, the [inaudible] content is overlaid in the right orientation.
>>: What would you say about scales? How do you learn the scale?
>> Wonwoo Lee: The scale corresponds to the distance between the target object and the camera. The input image is just a single scale, and we warp the input image to simulate different distances.
>>: I mean you don't, you can't recover, you don't really know, given an input image you
don't really know…
>> Wonwoo Lee: Yeah, the real distance is not known.
>>: So…
>> Wonwoo Lee: This is a relative distance.
>>: Okay. But then if you want to put a sign on a building to match the size of the
doors, let's say…
>> Wonwoo Lee: The size of the content should be [inaudible] known beforehand, or the user has to give some input to determine the scale of the content.
>>: So once you've labeled, once you've annotated the size, the right size on the door and
somebody comes up later and looks at the same door or maybe from a much different
distance, will they see the sign?
>> Wonwoo Lee: Will it be the right size? Yeah, because it is just a matter of rendering the right size of rectangle, I mean in the sign case.
>>: Yeah, or that, for that person who is standing on the ground, obviously he was very
big in some sense, but if you wanted to…
>> Wonwoo Lee: Ah, you mean this case?
>>: Yeah.
>> Wonwoo Lee: Okay. If I see this region, say this is the roof of a building, and if I see this region, maybe like the second story, it becomes very big, because the scale is maintained between the two applications. And then this is a target on a building which is much higher than the user's viewpoint.
>>: Sometimes I get confused about which one you use to learn and which one you use [inaudible] potential building and the building [inaudible] because the pictures…
>> Wonwoo Lee: Yeah.
>>: So which, I mean…
>> Wonwoo Lee: So from your side the leftmost image is the input image captured by a user, and the other two are just showing detection results in the slide.
>>: So it has been trained by some other, the target has been learned by some other
[inaudible], right?
>> Wonwoo Lee: No. The leftmost image is used as the input and the templates are built from that image, and these are other screenshots of detection.
>>: These are different images of the first one?
>> Wonwoo Lee: Yes. They look similar, but they are different.
>>: Why do you correct mostly for rotating the camera? That's not very common. Why would they hold their cell phone like this?
>> Wonwoo Lee: Yeah, but it happens. Usually we hold the mobile phone like this and do not view content like this, but when you use an AR application there are always some orientation changes, because you want to see the content from different directions, and that causes orientation changes on the phone. It really depends on…
>>: I guess I am asking why do you [inaudible] I assume that really [inaudible] testing
[inaudible] user takes a [inaudible] trying to detect the picture. To get a detection, of
course, there is no prior knowledge of what the picture is.
>> Wonwoo Lee: Yeah, right.
>>: So do you have like a scan everything?
>> Wonwoo Lee: Yeah. Our approach is doing some kind of scanning…
>>: You scan the real world?
>> Wonwoo Lee: Yeah, there are two possibilities: using some [inaudible] detection, let me say, when the target object has a corner…
>>: But you don't do corners [inaudible] detection [inaudible].
>> Wonwoo Lee: We tried corner detection and also scanning approaches, and in the case of corner detection it is affected by the textures. If the corner is not detected, the template cannot be compared to the original even though the object is there, so we adopted a scanning approach.
>>: Okay. So when you are scanning, how do you know what size to scan?
>> Wonwoo Lee: Yeah. It was predetermined.
>>: Okay.
>> Wonwoo Lee: Yeah. There is a predetermined window size, and like a sliding window, the window moves over the image, something like this.
>>: [inaudible] so you said the size of the window is predetermined, right?
>> Wonwoo Lee: Yeah. I mean…
>>: So how do you know?
>> Wonwoo Lee: The patches are taken along this rectangle in the center of the first image, and only that part is used for learning. We don't use the entire image.
>>: I understand. Suppose through the learning [inaudible] the patch size, the size of the picture is [inaudible], but in the image in the middle, let's say, the picture's size is [inaudible].
>> Wonwoo Lee: You mean the input image, the one not used for detection. In that image the target can appear smaller or larger, so we build templates by applying some scale factors and warping the image, synthesizing the images used for template building.
>>: I will ask you later.
>> Wonwoo Lee: So the thing is, if the target object is here and you take the fronto-parallel view, the camera is here. If we move the camera closer to the front, the image of the object becomes larger, and then we build templates from the original image, and they give us templates for…
>>: [inaudible] the templates cover [inaudible] sizes, but with the [inaudible] window, in each window you only cover a portion of that image, so even though your templates have different scales, it doesn't help in this case.
>> Wonwoo Lee: Yeah. In this case too much scale variation causes a problem in detection. Okay. I have some more examples captured in the real world [inaudible]. What we can do with this method is shown here, similar to the previous video, and this one has something interesting on the ceiling. Another video is [inaudible] with the shaded region; I made this video for fun. And here comes a [inaudible]. Then this one shows the sharing of augmentations between two mobile phones. Here the target is acquired by phone A and transmitted to phone B through a [inaudible] connection, and then phone B starts detection of the target object, showing the proper messages on the target. It was very hard to control two phones at the same time. So this is the [inaudible] connection, and the data is transmitted over to phone B.
And now I will give you some results about the performance and timing. The timing was measured on iPhones. This shows the learning speed, how much time each step takes. The time increases as the number of views we consider increases, because we have more warping and blurring for more viewpoints. On PCs it is very fast, just a few milliseconds; on the iPhone and iPhone 4 it takes a few seconds, and too many viewpoints make it slower. The most time-consuming step is the radial blur step. It is slow because it accesses the textures on the GPU neither horizontally nor vertically but in an irregular pattern, and on mobile phone GPUs that kind of access is very slow, so the radial blur takes much time, but it could be implemented faster if I optimized my code.
>>: When you do the [inaudible] and then you [inaudible]?
>> Wonwoo Lee: Yes, we do that.
>>: Did you try doing it the other way around, for example, [inaudible] how much
worse?
>> Wonwoo Lee: The thing is, what we want is the gray value of a pixel at the same position as in the original patches. [inaudible] the [inaudible] and then we blur radially. The blurred pixels would be different in that case, and that is not what I wanted, so I didn't try it. Okay. This slide shows the comparison with the original algorithm in detection performance, and you can see that ours has a little bit lower performance, but it still shows comparably good detection performance except in these last three cases. Those images have very limited textures, like grass or a board or something [inaudible], with many similar repeated textures, so applying the learning to them gives poor performance, and you can see that the original algorithm also shows somewhat lower detection performance there. This is the comparison in memory usage. As I told you, the original algorithm requires a large amount of precomputed data, but we don't want that, so we take another approach, and you can see our memory consumption is much less than the original one. So we lose some detection performance, but we save a large amount of memory.
So let me conclude my presentation here. We proposed some computer vision based approaches for in-situ AR tagging in real-world environments, and we exploited the sensor information that is very common on modern smart phones for doing computer vision work: finding orientations and vanishing points and doing template matching. Using our approach, you can do in-situ detection in the real world by learning and detecting the target object. What I explained until now was about personal interaction with the [inaudible], but augmented reality is not limited to personal interaction: two users can interact using augmented reality applications by sharing their environments and by building augmented reality spaces at their own locations. So AR can be used for interaction between users as well as with [inaudible] in the real world. This is my future work concept. Okay, this is what I prepared for this presentation. Thank you for listening to my work.
>> Phil Chou: Thank you.
[applause].
>>: Regarding the scenario where [inaudible] learning, it seems like the scenario is that you take a picture, then you [inaudible] it, and then you're going to recognize the same picture again, right? So it doesn't seem to be very useful: since you have already just taken a picture, why would you want to recognize the same picture in the same place again?
>> Wonwoo Lee: Well, personally this is not that meaningful, but consider some kind of social application: if I make an annotation about an object, say a picture in a museum, recording my impression of the picture, then others who visit the museum for the first time may want to know how other people feel about that picture, and they can see the annotation through their mobile phones. So this is not just for personal use; the application is for sharing these augmentations with other people, like social features on smart phones. Yes?
>>: The accumulation step, I just want clarification on that. What exactly was that? When you say accumulation, it sounds like you are computing like a mean image or something, but the graphic you had looked like [inaudible].
>> Wonwoo Lee: The term accumulation is a little bit misleading; my intention was just putting the [inaudible] into textures, okay, let me say packing them.
>>: So it's a texture atlas, is what you're saying?
>> Wonwoo Lee: Yes.
>> Phil Chou: Any more questions? Let's thank the speaker again.
[applause].
>> Wonwoo Lee: Thank you very much.