>> Zhengyou Zhang: Okay. Let's get started. Good morning, everyone. It's my
pleasure to introduce Qi Tian. Qi is an associate professor at the University of
Texas at San Antonio.
Three years ago he spent a one-year sabbatical at MSR Asia, and he has been
one of the experts in multimedia information retrieval. He has done a lot of
research, from low-level feature extraction to indexing and retrieval.
Today he will talk about his recent research, entitled Spatial Coding for
Large-scale Partial-duplicate Image Retrieval. So, Qi.
>> Qi Tian: Yes, thanks for a nice introduction.
So this is recent -- actually last year's work for this large-scale partial-duplicate
image retrieval.
So during my talk, if you have any questions, just feel free to interrupt me any
time.
So first of all, this is joint work with my co-supervised student, Wengang Zhou.
Wengang is here now; he is my current post-doc at UTSA. We also
collaborated with Professor Houqiang Li from USTC and Lucy Lu from
Texas State University.
So here is our talk outline. First, an introduction to the problem. Then I'll talk
about the motivation for geometric verification for partial-duplicate
image retrieval. Then I propose the spatial coding scheme and
how to construct the spatial coding map. We also have four enhancements beyond
this spatial coding map. And finally I show the experiments and two demos.
One is an online demo on my laptop; the other is a video demo, okay, on the
10 million image database.
What's the problem? So our goal is to search for images with partial-duplicated
patches in a large-scale image database. So what are partial-duplicate images?
In our definition, these partial-duplicate images are created by editing the
original images with some changes, for example in scale, okay, cropping, or
partial occlusion. Okay.
There is a website called TinEye which does a very similar thing to our
work, and it claims it has indexed over one billion images. Some of our example
results compare to their cases.
Partial-duplicate image retrieval is different from general image-based
object retrieval or general object recognition, because the latter, okay, is more
challenging, okay, and has more variations due to 3D viewpoint change and
object-class variability. So partial-duplicate image retrieval is more mature.
These slides list some of the potential applications. First, of course, de-duplication
to save storage space, and copy violation detection. And maybe it
can be used on some mobile devices, okay, to search for landmarks, artwork, logos
and product search. And we list these applications because it
depends on the features we use for this work. We used SIFT-based features,
so that works well -- works better on rigid objects than on non-rigid objects.
>>: Is there any [inaudible] what is [inaudible].
>> Qi Tian: Quantity --
>>: [inaudible] duplicate, what [inaudible], I mean, there must be some kind of
criteria where [inaudible].
>> Qi Tian: There are some examples in my slides later. And so for partial-duplicate
images we assume some local patches, spatial patches, are similar.
>>: Similar meaning exact same?
>> Qi Tian: Maybe there's some like due to various like transformation or some
can be [inaudible] transformation.
>>: Okay.
>> Qi Tian: Some can be the change in the viewpoint or something like that.
>>: Okay.
>> Qi Tian: Okay. So this slide shows the general pipeline for this kind of image
retrieval. First you collect a large-scale image database. The next
step is to perform feature extraction. In earlier days people extracted global features:
color, texture, shape. But in the last 10 years local features have become more popular. So
these are some of the local features: SIFT features, SURF features, MSER-based
features.
And the second step: out of the local features extracted from the image
database -- on average one image yields several hundred to a few
thousand local features -- we build a discriminative, descriptive visual codebook.
In the last two years we also have two papers. One builds descriptive visual words
and descriptive visual phrases, and another work is contextual visual vocabulary
construction, published in [inaudible] Multimedia 2009 and 2010.
And the third step is how do you index, how do you build an efficient indexing structure.
For this work, we use an inverted index file.
And finally, okay, to the image retrieval. So that's the general key steps for the
image retrieval. Okay?
Let's talk about -- briefly talk about each step.
So for the local features, here are some of the desired properties of local
features. For example, they should have high repeatability; therefore, we make
them invariant to illumination, rotation, or scale change. And they should be unique:
each feature has a distinctive description. They should also be compact
and efficient to compute, and preserve some locality property, for example
occupy a relatively small patch of the image and be robust to occlusion or clutter.
To extract local features there are two steps. The first step is interest point
detection, using an interest point detector. And the second step is the interest
point descriptor, okay? Here are just a few [inaudible] popularly used interest point
detectors: Harris and DoG based, Harris-Affine, Hessian-Affine, and
MSER.
And these are some popular interest point descriptors. The most well known are
SIFT and PCA-SIFT, okay, or shape context based.
So in this work we used SIFT as the descriptor for our local features. These
slides give a brief introduction; I know many of you may already
be very familiar with this content.
For the SIFT detector, it is either difference of Gaussian based or MSER based. To
describe the SIFT feature, okay, first, centered on the interest point location, we
take a 16 by 16 patch. For each pixel in this 16 by 16 patch we compute its
orientation. Then we find the dominant orientation over all the pixels in
this local patch.
The next step is we rotate this local patch to the dominant orientation found in
this local patch. Then we further cut this 16 by 16 patch into 4 by 4
sub-patches. Then for each sub-patch we use an 8-bin histogram of the
orientations. So each sub-patch gives 8 dimensions, and
we have 4 by 4, 16 sub-patches. 16 times 8 is 128 dimensions for one local feature.
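To make that 4 by 4 by 8 layout concrete, here is a minimal Python sketch of such a descriptor (my illustration only, not the real SIFT implementation, which also applies Gaussian weighting, trilinear interpolation, and a clip-and-renormalize step):

    import numpy as np

    def sift_like_descriptor(grad_mag, grad_ori):
        # grad_mag, grad_ori: 16x16 arrays of gradient magnitude and orientation
        # (radians), already rotated to the patch's dominant orientation.
        desc = np.zeros((4, 4, 8))
        for y in range(16):
            for x in range(16):
                cell_y, cell_x = y // 4, x // 4                 # which of the 4x4 sub-patches
                b = int(grad_ori[y, x] / (2 * np.pi) * 8) % 8   # 8-bin orientation histogram
                desc[cell_y, cell_x, b] += grad_mag[y, x]
        v = desc.ravel()                                        # 4 * 4 * 8 = 128 dimensions
        return v / (np.linalg.norm(v) + 1e-12)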
The second step: as I said, one image can result in hundreds of or even a
few thousand local features, and each feature is a high-dimensional
vector, 128 dimensions.
Now, consider a large-scale database, for example millions of images or even
billions of images. So this is a very, very large space of local features. At
this step people usually perform feature clustering: K-means,
hierarchical K-means, or other clustering like affinity propagation.
And they use these cluster centers, okay, as the visual code words, the so-called
visual words, okay. So this is the construction of the visual codebook.
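As a rough sketch of this codebook construction step (an assumption-level illustration using scikit-learn's MiniBatchKMeans rather than the hierarchical K-means an actual large-scale system would use):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def build_codebook(descriptors, num_words=100_000, seed=0):
        # Cluster the SIFT descriptors (an N x 128 array pooled from the whole
        # database) and keep the cluster centers as the visual codebook.
        km = MiniBatchKMeans(n_clusters=num_words, random_state=seed, batch_size=10_000)
        km.fit(descriptors)
        return km.cluster_centers_          # num_words x 128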
So the third step is: once the visual codebook is constructed, we need to do the
feature assignment, feature quantization, which means mapping a high-dimensional
local feature vector to a visual word in the codebook. Okay. And usually
nearest-neighbor assignment is used for this purpose.
And the next step is to build the inverted index file. So this is a list of the visual
words in the visual codebook. For an image in our database, first we extract
local features; through feature quantization we map each local feature to one of the
visual words in this list; then we link this image to that visual word.
For example, this image has two local features, so they map to two visual
words, and we link this image at those two entries along the list here.
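A minimal sketch of the quantization and inverted-file construction just described, assuming a flat nearest-neighbor lookup against the codebook (the system in the talk uses a hierarchical vocabulary tree for this lookup):

    import numpy as np
    from collections import defaultdict

    def quantize(descriptor, codebook):
        # Map one 128-D descriptor to the ID of its nearest visual word.
        return int(np.argmin(np.linalg.norm(codebook - descriptor, axis=1)))

    def add_image(inverted_index, image_id, descriptors, codebook):
        # Append this image to the posting list of every visual word it contains.
        for d in descriptors:
            inverted_index[quantize(d, codebook)].append(image_id)

    inverted_index = defaultdict(list)      # visual word ID -> list of image IDs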
>>: [inaudible].
>> Qi Tian: Yes, good question. So, yes, we expect it to have more
semantic meaning, but from what we have got, it's still a kind of low-level
description. So many works in the past have been devoted to how to model the
semantic meaning, how to model the spatial context in the visual codebook. So
there is a lot of work in this direction.
So it cannot really be compared to the text domain, like keywords, okay?
It's just not that high level.
So for the second image, okay, we do the same thing, and link the image through
its features to the visual words it contains. Okay. So after we have done this for all the
images in the database, okay, we build this so-called inverted index file
structure.
So what can we do with this inverted index file? Simply, we can use it to do
the retrieval. For example, given a query image, okay, we extract local features
and map them to visual words. And we can simply vote for each image, okay,
containing those visual words, like voting by term frequency. And we
can return images ranked by votes, okay, to achieve the image retrieval. Okay.
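Continuing the same sketch, query-time retrieval by term-frequency voting over that inverted file could look like the following (no IDF weighting or spatial verification yet; quantize and inverted_index come from the sketch above):

    from collections import Counter

    def search(query_descriptors, codebook, inverted_index, top_k=10):
        # Vote for every database image that shares a visual word with the query,
        # then rank images by their vote count (term frequency).
        votes = Counter()
        for d in query_descriptors:
            for image_id in inverted_index[quantize(d, codebook)]:
                votes[image_id] += 1
        return votes.most_common(top_k)     # [(image_id, score), ...]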
So the visual word is a concept that was proposed in Video Google, ICCV 2003. And
there's another popular work by David Nister, the so-called visual
vocabulary tree, in CVPR 2006. And after that there has been a lot of work in this direction.
I just list a few related works: how to construct higher-order visual words, okay;
how to construct collocating patterns, the bundled features by [inaudible] from MSRA --
I'll talk about it later in the slides; how to build visual synsets, okay; and
our work, how to construct descriptive visual words and descriptive visual phrases,
and how to build a contextual visual vocabulary.
Okay. So in the next slides I'll talk about the motivation --
>>: [inaudible] of the image and visual word.
>> Qi Tian: You mean how to construct the visual --
>>: [inaudible] because you build the visual words and you associate images with
visual words. I think essentially this is a bi-clustering problem. One dimension is
the number of the images. Another dimension is the features. Could [inaudible]
the bi-clustering try to do something like this rather than just clustering over
the features?
>> Qi Tian: We haven't tried that. So nowadays, okay, because of its simplicity,
[inaudible], especially hierarchical [inaudible], is adopted to generate these so-called
visual words for the images. Okay? And certainly there can be different ways to
generate the visual codebook. Even in our own work, okay, instead of
considering each single local feature, we consider features in pairs. And we
consider the co-occurrence context, the spatial context, to build visual word pairs
as the basic description element. So I just feel like maybe there are lots of ways
to do that, yeah. Okay. So now let's talk about why geometric verification
is important in this scenario, partial-duplicate image retrieval.
So our contribution for this work lies here, okay. After these three steps we
construct a spatial coding map to do the spatial verification. There are two kinds of spatial
verification: one is local spatial verification and the other is global
verification. Our method is a global spatial geometric verification in this
scenario. And finally we perform the image retrieval.
So the goal is: originally, when two images are matched to each other, there are a
lot of falsely matched SIFT features, okay. Our contribution is to efficiently remove
the falsely matched SIFT features. And of course "false match" here is in terms of
geometric consistency. Okay. We preserve the truly matched SIFT features by
checking their geometric consistency. Okay.
>>: [inaudible] is done in the indexing stage or in the retrieval stage?
>> Qi Tian: Retrieval stage.
>>: Retrieval stage.
>> Qi Tian: So in the indexing stage -- I will talk about this later in the
slides -- we are going to save some information for the indexed features. So, for
example, the image ID: this feature comes from which image? For each SIFT
feature, what's its scale (one number), what's its orientation, and what's its
XY location in the image?
So that's the information we store in the indexing stage. And during the retrieval
stage we're going to use this information to build a spatial coding map online,
and to do the spatial verification.
So this is an example, okay. This shows the SIFT point matches between
these two relevant, so-called partial-duplicate images.
And you can see these matches, okay, consist of two parts, the true
matches and the false matches. And the false matches are mainly because they're
geometrically inconsistent. For example, some hair point here matches to here.
Okay. So our goal is to design a filtering method to preserve these true matches
and filter out the falsely matched features. Okay.
So what causes this problem? Okay. There are a few factors, okay. First, the
local features: they do not preserve enough spatial information, and
they are not stable under affine transformation; sometimes they are corrupted by noise.
In the feature quantization, because it goes from a high-dimensional feature vector to a
single ID, it introduces a large quantization error.
And the widely used bag-of-visual-words model is orderless; there is no spatial
information.
And in the indexing, in the past the scale and orientation of the local features
were not used. We found the orientation and scale of local features are very
important for partial-duplicate image retrieval, and we use them in our work.
>>: [inaudible].
>> Qi Tian: Uh-huh.
>>: [inaudible].
>> Qi Tian: Yeah. From the images. And also I have to point out that when we
download the images from the Internet, we reduce the image size to 400 by 400.
So on average there are about 300 local features per image.
>>: [inaudible].
>> Qi Tian: Uh-huh. So far we haven't. Yes. We can discuss it offline later,
because we haven't done anything in the compressed
domain. So we use local features from the image. Okay. So this is the geometric
verification. The goal is, first, to be effective at removing the falsely matched
features. Again, "false" means they are geometrically inconsistent, okay.
And secondly, it has to be fast, efficient in implementation. Considering, like, if
you do retrieval over billions of images, 10 billion images, it has to be
a real-time response. Okay. So it has to be very fast.
So there are two kinds of verification. One is local verification; there is
plenty of work, for example the bundled features from CVPR 2009 and the locally nearest
neighbor approach. For global verification, there is the [inaudible] RANSAC and our
spatial coding map. So this work is based on our Multimedia
2010 publication plus some extensions.
So for the local geometric verification, this is Video Google, ICCV 2003. The idea
is simple. Considering two points matched between image 1 and image 2, A
matched to B: in order to accept it as a true match, we also consider the
neighborhood of each feature, okay? If the neighbors are also mapped to the
same visual words, we consider this pair to have local region support. If no
neighbors match each other, it has low region support and we're going to
reject it. That's the idea.
However, the drawback of this approach is that it's kind of sensitive to cluttered
background. Okay.
So the second approach is bundled features. This approach, instead of considering
each individual feature alone, considers features in groups; the groups are made
from regions. And so this shows two bundled features. This bundle has four
local features, and this one has five. Four of them matched to the
same visual words, okay.
So this is another example -- there are two bundles; they also have four features
mapped to the same visual words. The next step is they use the spatial consistency of
the bundled features. This information is used to weight the visual words. So
besides the traditional TF-IDF weighting for each visual word, they
add this term, okay, and this spatial consistency consists of two parts.
The first part is how many visual words are shared between these two bundles.
So for example, in both of these two cases they share the same four visual
words, okay.
But the second part is the spatial consistency between these two. So in the first
example, they check spatial consistency in two directions. One is the X direction,
horizontal; one is the vertical, Y direction. For example, in the X, horizontal direction,
the increasing order is, okay, circle, triangle, cross, and square, okay, for this
bundle.
And also for the second bundle, in increasing order in the X direction, it is also the circle,
the triangle, the cross, and the square, which means the order of the bundled features in
the X direction is consistent here. Okay. It is also consistent
in the Y direction, okay. But for the [inaudible] case, in the X direction, okay, this
is the circle, this is the triangle, the cross, and this one. But it has an incorrect matching
order: the matching order is wrong for the first one and the third one. So
you can see it has a degree of inconsistency of two. Okay.
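A rough sketch of that ordering check, under my own reading of it (counting how many pairs of shared features appear in a different left-to-right order in the two bundles; the actual method folds this into a weighting term rather than a hard filter):

    def order_inconsistency(xs_query, xs_db):
        # xs_query[i] and xs_db[i] are the x-coordinates of the i-th shared feature
        # in the query bundle and the database bundle. Count pairs whose
        # left-to-right order disagrees between the two bundles.
        n = len(xs_query)
        bad = 0
        for i in range(n):
            for j in range(i + 1, n):
                if (xs_query[i] < xs_query[j]) != (xs_db[i] < xs_db[j]):
                    bad += 1
        return bad      # 0 means the X ordering is fully consistent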
And they showed good performance for partial-duplicate image detection over
a one million image database. However, this method has a drawback: it becomes
infeasible if the bundle is rotated, okay, because the X and Y ordering will be
changed. Okay.
So for global verification, okay, the most [inaudible] is the RANSAC algorithm.
RANSAC stands for Random Sample Consensus. It is an iterative procedure:
it iteratively performs outlier/inlier classification. An inlier is defined as a truly
matched feature; an outlier is a falsely matched feature.
So we start by randomly sampling some features, consider them as inliers, and
estimate an affine transformation model based on these matched features.
Then we use this model to test against all the other features in the image, and
classify them as inliers and outliers.
And then, based on the increased inlier data set, we do the sampling again and
estimate the model again, and so on.
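A bare-bones sketch of that RANSAC loop; fit_affine and apply_affine are hypothetical helpers standing in for the affine model estimation, and the iteration count and pixel tolerance are arbitrary:

    import random

    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def ransac_affine(matches, fit_affine, apply_affine, iters=500, tol=5.0):
        # matches: list of ((x1, y1), (x2, y2)) putative correspondences.
        # Repeatedly fit an affine model to a minimal random sample (3 pairs) and
        # keep the model that explains the most correspondences within tol pixels.
        best_inliers = []
        for _ in range(iters):
            sample = random.sample(matches, 3)          # minimal set for an affine model
            model = fit_affine(sample)
            inliers = [(p, q) for p, q in matches
                       if dist(apply_affine(model, p), q) < tol]
            if len(inliers) > len(best_inliers):
                best_inliers = inliers
        return best_inliers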
And the drawback of this one is that it is computationally expensive, therefore it's not
scalable. Usually in image retrieval, when we have the initial retrieved image list, it
takes the top 300 or 500 to do the RANSAC check. Okay. So basically it only re-ranks
images within the top 300 or 500 images, okay.
In our case, we basically check all the images returned by this
inverted file indexing.
So next is how to construct the spatial coding map. The key idea is we're going to
construct a spatial coding map which can record the relative spatial positions between
matched features.
In the first work in this direction we construct two maps. One is called the Xmap and
the second is the Ymap, to record the relative spatial positions in the X and Y directions.
For example, okay, this image has four matched features, okay. Now, say in
this Xmap we use feature i as the reference point. We consider the relative
position of j with respect to i. If j is to the right of i we record it as 0, otherwise we record
1, okay. And similarly, considering reference point i, if j is above i we record
0, otherwise we record 1.
So let me show you this toy example. So first we start with reference point 1,
okay. Because 2 and 3 are to the right of 1, in the first row the
positions of 2 and 3 are recorded as 0, okay. The rest are recorded as 1, okay.
And then consider the Ymap, okay: for point 1, because all the points are below it
here, okay, including itself, it's recorded as all 1s in the Ymap, okay. So this
is done for the Xmap and Ymap for the first feature.
Then we move to the second feature, reference point 2. Point 3
is on the right side of 2, so it's 0 in the Xmap, okay; and in the Y direction, since point 1 is
above point 2, okay, at position 1 it's recorded as 0, and the rest are 1s.
That's done for the second one. So we just continue until we finish this coding
for all the matched features in the image, and construct this so-called
Xmap and Ymap for each image, okay.
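A minimal sketch of this basic Xmap/Ymap construction (the single-division case, before the quadrant subdivision discussed later); the "above" test here assumes a y-axis that grows upward, and only needs to be applied with the same convention to both images:

    import numpy as np

    def spatial_maps(points):
        # points: list of (x, y) positions of the matched features in one image.
        # Xmap[i, j] = 0 if feature j is strictly to the right of feature i, else 1.
        # Ymap[i, j] = 0 if feature j is strictly above feature i, else 1.
        n = len(points)
        xmap = np.ones((n, n), dtype=np.uint8)
        ymap = np.ones((n, n), dtype=np.uint8)
        for i, (xi, yi) in enumerate(points):
            for j, (xj, yj) in enumerate(points):
                if xj > xi:
                    xmap[i, j] = 0
                if yj > yi:
                    ymap[i, j] = 0
        return xmap, ymap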
Okay. This is for the last one. So what can we do with this Xmap and the
Ymap? This is an illustrative, very simple example. This is a query image, and this
is the matched image in the database. You can see there are five pairs of
matched features. It can easily be verified that four of them are geometrically
consistent, except point 5, okay?
And these are the Xmap and Ymap of the query image, and these are the Xmap
and Ymap of the matched image. The first four are geometrically
consistent, so you can see they have the same Xmap and Ymap entries for them,
okay.
So if we take the exclusive-OR operation between these two maps, it will be all 0s,
okay, in those locations, and 1s where there is an inconsistency. Then we take the
summation over each row, and we get the overall inconsistency degree for each
feature. Either in the X direction or the Y direction, point 5 has the
largest inconsistency degree. So we can identify the largest one and remove that
feature -- remove that column and row -- and continue this process for the
rest of the features until the inconsistency degrees are all 0. Okay. So
this is a very strict condition.
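And here is a sketch of that verification loop: XOR the query's maps against the matched image's maps, sum each row to get a per-feature inconsistency degree, drop the worst feature, and repeat until everything agrees. The exact tie-breaking and stopping details are my assumption:

    import numpy as np

    def verify(xmap_q, ymap_q, xmap_m, ymap_m):
        # Return the indices of matched features that survive spatial verification.
        keep = list(range(xmap_q.shape[0]))
        while len(keep) > 1:            # a single remaining pair carries no relative information
            idx = np.ix_(keep, keep)
            vx = (xmap_q[idx] ^ xmap_m[idx]).sum(axis=1)    # X-direction inconsistency per feature
            vy = (ymap_q[idx] ^ ymap_m[idx]).sum(axis=1)    # Y-direction inconsistency per feature
            v = np.maximum(vx, vy)
            worst = int(np.argmax(v))
            if v[worst] == 0:           # all remaining matches are geometrically consistent
                break
            keep.pop(worst)
        return keep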
Okay. But the previous case is a very simple one: each quadrant is a single
part. Now consider each quadrant uniformly divided into two
parts, because a point located in this left corner is still different from one
located here, okay. So we further divide into two parts.
Now, this can be considered as the combination of two divisions, the first one and
the second one. Okay. Here we can construct the X and Y maps for the first one. But
how do we construct the X and Y maps for the second one? Because its dividing lines
are not horizontal and vertical. Now, we can rotate it, okay, 45 degrees
counterclockwise; this takes the features with it, okay. Now we can construct the
Xmap and Ymap for this one, okay.
In the general case, if we divide each quadrant uniformly into R parts, this
layout can be considered as the combination of R different divisions. And if
we rotate each division by a different angle, okay, so that its dividing lines line up with this
position, we can construct an Xmap and Ymap for each of them. Putting them
together, we have the generalized Xmap and Ymap, okay.
Then with the generalized Xmap and Ymap we can do the spatial verification. Here
Q is the query image and M is the matched image. Okay. i and j are the indices of the local
features, okay, from 1 to N. k indexes the parts each quadrant is divided into, from
0 to R minus 1. N is the total number of matched features.
So for each entry we again take the exclusive-OR: if they are consistent the two maps
will be the same; if they are inconsistent the result will be 1. Finally we sum up the
inconsistency degree for each feature in the X direction and Y direction, and we identify
the maximum, okay, either in the Xmap or the Ymap, identify it, remove it, and iterate until
this gets to 0.
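Written out in formulas (my notation, just restating the verbal description above), the per-feature inconsistency degrees are

    V_x(i) = \sum_{k=0}^{R-1} \sum_{j=1}^{N} \left( X_Q^{(k)}(i,j) \oplus X_M^{(k)}(i,j) \right),
    \qquad
    V_y(i) = \sum_{k=0}^{R-1} \sum_{j=1}^{N} \left( Y_Q^{(k)}(i,j) \oplus Y_M^{(k)}(i,j) \right)

where \oplus is exclusive-OR, Q and M denote the query and matched image, N is the number of matched pairs, and R is the quadrant division factor. The matched pair with the largest V_x or V_y is removed and the sums are recomputed until every V_x(i) and V_y(i) is zero.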
Here are some spatial verification examples, okay. This is the original match
before verification. These are the identified false matches, okay -- the ones that
failed the spatial coding verification. And this is what remains,
okay, which means these SIFT matches passed the spatial verification.
This shows the SIFT matches, okay, for two irrelevant images. One is a Chinese
document, one is an English document. And you can see there are still SIFT
matches on them.
These are the identified false matches, okay? And still, okay, you see there are three
pairs that passed the spatial verification. At least one pair will always pass verification,
because when only one pair is left, it cannot do the verification; there's no relative
position information. Okay.
So we filter out most of the false matches. Okay.
>>: [inaudible]. The idea of [inaudible] is to prove the [inaudible] problem, right
[inaudible] originally which are rotated a certain way.
>> Qi Tian: Here --
>>: Here you are not doing that actually, you are just matching, you know, with
the same keys that you [inaudible] for the rotation. Do you see what I mean?
>> Qi Tian: Right now we haven't talked about the rotation [inaudible] here.
Okay. So we haven't talked about this, okay. It's in later slides.
So, okay, this is our indexing structure. So, again, this is a list of visual words,
and these are the indexed features, okay? For each indexed feature we store this
information. First, the image ID -- right now we use 4 bytes, which roughly indexes up to 1
billion image IDs, okay. Then the feature orientation, one byte; the scale, one byte; and the
X and Y location, one byte each. So in total we have 8 bytes for each indexed
feature. And we reduce all the images to 400 by 400, okay, for this
demo. On average there are 300 features per image. So, therefore, one
image's index size is about 2.4 kilobytes, and a one million image index is
2.4 gigabytes. That's the demo I'm going to show.
And a 10 million image index is 24 gigabytes. That's why I cannot show it on my
laptop; I have a video.
Of course this is a very rough estimate, not optimized. We can maybe further
reduce the storage for each indexed feature.
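A sketch of how one such 8-byte entry could be packed, based only on the sizes just given (a 4-byte image ID plus one byte each for quantized orientation, quantized scale, x, and y); halving the coordinates so a 400-pixel range fits in a byte is my assumption, not the talk's stated scheme:

    import struct

    def pack_entry(image_id, orientation_bin, scale_bin, x, y):
        # 4-byte image ID + 1 byte orientation + 1 byte scale + 1 byte x + 1 byte y = 8 bytes.
        return struct.pack("<IBBBB", image_id, orientation_bin, scale_bin, x // 2, y // 2)

    def unpack_entry(blob):
        image_id, ori, scale, x_half, y_half = struct.unpack("<IBBBB", blob)
        return image_id, ori, scale, x_half * 2, y_half * 2

    # 300 features per image * 8 bytes is about 2.4 KB per image,
    # so roughly 2.4 GB for a one million image index.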
Now, beyond the spatial coding we have considered four
enhancements. The first one is how to handle rotation invariance. The second is how to
further recover some falsely rejected matched features: the filtered
features have two parts, false negatives and false positives, and we have a way to
re-estimate the model and recover some of the false negatives. And then how to do query
expansion and bi-space quantization.
So, you know, traditionally, in the traditional visual vocabulary done by
[inaudible], the clustering is in the SIFT space, which means in the
descriptor space. We then further use the orientation information to filter some of the
matches -- to do this quantization.
Okay. So there's a slide -- actually I have it hidden. So when we finish the
talk, I can come back to talk about some of the details. Okay.
Now, experiments. So we first construct a 1 million image database downloaded
from the web, and construct three smaller data sets, 50K, 200K, and 500K, by randomly
sampling it.
For the ground truth data set we obtain 1100 partial-duplicate web images in 23
groups, and use 100 of them as representative queries, okay.
We compare to three algorithms. So the baseline is the David Nister CVPR 2006
paper. This is a well cited paper; it uses the visual vocabulary tree.
The second one is Jegou, ECCV 2008, Hamming Embedding plus the weak geometric
constraint.
And there is also full geometric verification using [inaudible].
For performance evaluation, the first measure is mean average precision and the
second one is time cost per query. We didn't consider memory cost in this
work. Okay.
The first experiment uses 16 GB memory and a 2.0 gigahertz CPU.
So there are a few parameters to tune. The first one is, okay, the codebook
size, okay. In David Nister's paper, indexing one million images, they found the
[inaudible] is around a one million size visual codebook. We got a similar
observation, okay.
The second one is the quadrant division factor: each quadrant is divided into how
many parts, okay? For our case, R. And the next one is the orientation quantization
size, okay?
And this is considered a trade-off between precision and time cost. So
this is the orientation quantization size from 0 to 20, which means, for a
circle, 360 degrees, how finely you cut -- how you quantize the orientation
space. And this is the mean average precision. So we found, okay,
that when the quantization size is 11, it actually gives the best performance in terms of
precision and time cost. Okay.
>>: This data set [inaudible] I suppose.
>> Qi Tian: Because we tested like over a million image data sets, yeah. So
that's the result from that one. Okay.
>>: When is the [inaudible] 360 degrees quantization or --
>> Qi Tian: Let me get -- let me [inaudible]. So when two features match to
each other, they may match based on the 128-dimensional descriptor.
>>: Right.
>> Qi Tian: But their orientations might be totally different. So each feature has a
dominant orientation. So we only allow a certain angle, okay? We use a
certain angle to take the orientation consistency into consideration as well.
We found that this information actually is quite useful for partial-duplicate image
search.
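A small sketch of that orientation-consistency idea: quantize each feature's dominant orientation into q bins (q = 11 in the experiments discussed below) and accept a match only when the two orientations fall into nearby bins. The one-bin tolerance here is an assumption, not the talk's exact rule:

    import math

    def orientation_bin(theta, q=11):
        # Quantize an angle in radians into one of q bins over [0, 2*pi).
        return int((theta % (2 * math.pi)) / (2 * math.pi) * q) % q

    def orientation_consistent(theta_query, theta_db, q=11, tol_bins=1):
        # Accept the match only if the two dominant orientations fall within
        # tol_bins quantization bins of each other (circularly).
        d = abs(orientation_bin(theta_query, q) - orientation_bin(theta_db, q))
        return min(d, q - d) <= tol_bins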
>>: [inaudible].
>> Qi Tian: Yes, it's in the SIFT, SIFT match. Okay.
>>: That's all interesting if you were -- originally if you were [inaudible]
orientation dependent, right, orientation dependent [inaudible] feature. Well
actually -- but at the back side -- the downside of that is it actually causes some
mismatches; that's why [inaudible] don't do that [inaudible] SIFT [inaudible]
orientation is not -- it's not too far away --
>> Qi Tian: Yeah, that orientation, I say, has to be [inaudible].
>>: Yeah.
>> Qi Tian: Degree. Which means that if this is [inaudible] size label [inaudible]
almost circuit degrees.
Also, the second thing is the visual codebook size. We construct different visual
codebook sizes: 12K, 130K, 250K and 500K. And actually this
observation is different from what was observed in David Nister's paper, because in
his work he found that when you increase the codebook size the performance
gets better and then it gets [inaudible], okay.
In our case it's actually different, okay. So when you -- you can see
here: when we have a smaller codebook we actually get better performance.
Why is that? Because in the feature space, when you have a smaller
codebook, each bucket is bigger, so more
features fall into each bucket, which means we get a lot of the falsely
matched features, right? But our algorithm [inaudible] very efficiently removes
them. Okay.
>>: So what happens when you increase the codebook?
>> Qi Tian: When you increase the codebook.
>>: [inaudible].
>> Qi Tian: It's much -- contains much fewer points.
>>: So that means you [inaudible] mismatches.
>> Qi Tian: So -- yes. It does not have as many mismatches.
>>: Okay.
>> Qi Tian: So we found the problem is that if, at the beginning, you do
not detect enough matched features, it doesn't work. So you want lots of features:
even if there are lots of false matches, we can remove them.
So therefore we find that, for one million images, 130K is a good trade-off
between precision and time cost.
The last one is the quadrant division factor. So again, like I said, this is on this
data set, okay. We found that with a quadrant factor of R equal to 3, which
means each quadrant is divided into three parts, okay, we get the best
performance so far. Okay.
Okay. This is the summary. The codebook size is for the one million image database.
If you have, say, a 100 million image database, my thought is that it will
be increased.
So this is a performance comparison on different sizes of image database: 50K,
200K, 500K and one million. And we compare with the baseline, which is the
David Nister paper; Hamming Embedding; the [inaudible] full re-ranking
method; our spatial coding algorithm; and spatial coding plus query
expansion.
Now, on the one million dataset, the precision compared to the baseline
goes from .48 to .73, so nearly a 52 percent improvement in precision.
And compared to using full geometric verification with RANSAC, it's improved
from .61 to .73, roughly 20 percent. And if we add query expansion on top of
the spatial coding, it's further improved from .73 to .88, about 20 percent. And
that's the precision; this is the time cost.
So this is the time cost per query on this one million dataset, okay, on our
[inaudible] configuration. This is the baseline; this is the spatial coding.
Actually our spatial coding is faster than the baseline. Okay. I'll explain later. Okay.
And when you introduce the query expansion, it introduces an additional .6
seconds per query. And this is Hamming Embedding. This is re-ranking using
RANSAC for the top 300 images; it's about three seconds for each one.
So comparing to the baseline, okay, we have about a 46 percent time reduction, okay.
Okay. These are some of our sample results. These are some of the queries, and these
are the so-called partial-duplicate images found in the database. Our demo can show you
this. For example we have various -- some occlusion here for the head. And this
is [inaudible], this is Starbucks -- this is Starbucks in English, in Chinese, and
another local cafe. And this is a scale change. This is a change of viewpoint. And this
-- I have some -- I can show the demo.
So further, we tested the scalability on a 10 million image database, also
downloaded from the web. Now, we increased the computing server to
2.4 gigahertz and 32 gigabytes of memory. And this is the performance of this
approach. In terms of the codebook size, when we
have a smaller codebook we have better performance. But with a small codebook you
have longer image lists, so the time cost will be higher. So, therefore, trading
between these two, we use this one, considered a good trade-off
for the 10 million images. I have a video demo for some of the
examples.
Okay. Before conclusion let me show you the examples.
>>: [inaudible] could you provide us some insights into why the [inaudible]
images [inaudible] than your spatial --
>> Qi Tian: Because when you do the match, we check through the inverted index
[inaudible]; for one million images you might get 10K images
back. And the [inaudible] verification only takes the top 300 images to do that,
not all the images. So RANSAC is [inaudible] returned images.
>>: Top 300.
>> Qi Tian: 300 or 500, yes.
>>: [inaudible].
>> Qi Tian: Actually we check all the returned images.
>>: How much? How many?
>> Qi Tian: It depends; for the one million database it varies. So let's say on
average it returns a hundred -- 10K --
>>: Eight --
>> Qi Tian: 8,000 images, yes.
>>: So if you do the --
>> Qi Tian: [inaudible].
>>: Not considering the time.
>> Qi Tian: Right.
>>: If you do the full geometric --
>> Qi Tian: Yes.
>>: [inaudible] better. Would you agree with me?
[brief talking over].
>>: [inaudible] so many facts compared to [inaudible]. Because you both
consider the order --
>> Qi Tian: It's a big difference. Big difference. Because they only use the spatial
consistency to weight the -- to weight the visual words. Then so [inaudible].
>>: [inaudible] verify the matching. [inaudible] use that to verify the matching.
>>: [inaudible].
>>: But just use it as a weight to --
>>: [inaudible].
>>: And if for instance [inaudible].
>> Qi Tian: For location.
>>: [inaudible] for location, yeah. I mean, I got a spatial order [inaudible] is the
same so [inaudible].
>>: He uses [inaudible].
>>: Yeah, that part is different. I understand.
>>: And [inaudible] question because currently you are [inaudible]. So you have
to compare the vertical one with [inaudible].
>> Qi Tian: I have -- I'll show a demo. I open the slides [inaudible].
>>: [inaudible].
>> Qi Tian: We have -- so this is the first work. Then we propose to handle rotation
invariance; we also have a demo for that. This is just the first work in this
direction. So actually we have different ways to model this [inaudible] context
now. It's not only cutting this way and that way; actually we
have a later so-called geometric ring coding. So instead of considering which
quadrant a feature is in, we consider, for example, that each point has a scale, okay. Then
the relative position with respect to other points is whether it's inside the scale or
outside the scale, and we can divide the image space into
concentric rings with increments of the scale and get a different spatial
coding map. Among them we have geometric ring coding, which further improves
on geometric coding, [inaudible] coding and geometric square coding. And I
have some additional slides, okay, because I want to show the basic idea, okay,
and then we can talk about what I'm not discussing. Okay.
>>: So this is a [inaudible] and let me ask [inaudible].
>> Qi Tian: Okay.
>>: So I mean you talk about the Xmap and the Ymap.
>> Qi Tian: Okay.
>>: I mean, different quadrant to different directions, right? When you talk about
inverted index you also talk about [inaudible] features or the Xmap and Ymap to
[inaudible] and store it in image or basically taking [inaudible].
>> Qi Tian: On the spot. On the fly. So actually, as I said, we use the XY and
scale and orientation to generate the spatial coding. So even if on average one
image has 300 features, it does not mean we're going to construct a 300 by 300 map,
because the number of matched features is much less -- say a hundred, okay, a hundred
by a hundred. So this is going to be constructed online.
And because most of the operations are binary exclusive-OR operations, it's
fast, very fast, okay.
>>: I have a question before you go to the demo.
>> Qi Tian: Okay.
>>: Can I ask why rank 4 is not rank 1 on that query image?
>> Qi Tian: Oh, okay. Good question. Actually this can be easily fixed.
Because -- okay. Let me show you how they're matched.
>>: You have a very sharp eye. [laughter].
>> Qi Tian: Actually --
>>: I'm in the back.
>> Qi Tian: So if you check here -- okay. It's in the back. These are just the two
matches, two SIFT matches, after the spatial verification. And this says they have
20 matches that passed the spatial verification. And this is the first one.
The second one also has 20 matches. Right now they are
displayed by the image rank. So that's why you see this -- this one
actually should be ranked first. Basically this rank
is determined by the number of matches between these two images.
>>: So it is the number of matches, not [inaudible].
>> Qi Tian: So they have the same number of matches.
>>: No, no. I'm asking a question. So is your [inaudible] by the number of
matches not by the similarity within the matches?
>> Qi Tian: No, it's not similarity. Because originally, okay, they could
have a hundred SIFT matches, and maybe 20 pass verification. So we
use that number as our similarity measurement, because this way we can
avoid the further time cost of calculating the image similarity.
>>: Okay.
>>: [inaudible] use the similarity or they match it, right?
>> Qi Tian: Right. Actually the first SIFT matching is done by looking up the hierarchical
visual vocabulary tree.
>> Qi Tian: Okay. This is -- this shows the top -- okay. So there are six matches. The
database here is 1.2 million images on this disk. And the
time cost, this is 125 milliseconds; this does not include the feature
extraction time for each image. Okay. This is everything besides the feature
extraction, okay. And for the codebook size we use 1.39 million as the codebook size.
And when we display, [inaudible] we return images with at least three matches
that passed the spatial verification. So if an image has fewer than three matches
we don't display it. Okay.
So the rank is determined by the number of the true matches after verification.
So as you can see -- so if we query another one, the
matches are on this right side. So the red lines mean the correct matches. Okay. The
blue lines mean the incorrect matches.
>>: Yes. On all these images are [inaudible].
>> Qi Tian: You mean --
>>: From the perspective point of view. In your earlier slides you have a Mona
Lisa, which [inaudible].
>> Qi Tian: Yeah. I can show you. I can show you.
>>: Yeah. So I'm just saying how well --
>> Qi Tian: Okay.
>>: [inaudible].
>> Qi Tian: Yes.
>>: [inaudible].
>> Qi Tian: Because right now we found, of course, it works best on flat
images, flat structures. And if I have a curved surface, or if I have a very
large affine transformation, the SIFT will miss the
detection in the first place. So that means we cannot get good results even
with the spatial verification.
In order to improve that, first you have to use something like affine SIFT to capture
-- to detect the features first, and then perform the spatial verification.
Now, how good, or how tolerant, is it for those very curved features? We don't
know. I can -- I have some examples to show you. Let me -- so this is a
Starbucks logo. It takes 7 milliseconds and returns 50 in total with at least three
matches. So these are some of the results -- this one -- there is some real viewpoint
change here, and this is in the same image. And this one -- for example, even this star
matched to here, and we do not consider that a correct match because it is geometrically
inconsistent. And here -- so even if it's [inaudible], we found the matches here. Now,
okay, now, if I use this one to do the search, okay --
>>: [inaudible] how does the older version of the [inaudible] version match?
>> Qi Tian: Okay. Before you go -- so this is also -- let's search with this
one. Let's do the search. Okay. Now we found this one. Actually this looks like
Starbucks but it's not exactly.
>>: Is that considered a [inaudible], I mean, just -- [laughter].
>>: Suppose you give it to copyright lawyers, is this considered [inaudible].
>> Qi Tian: Actually I don't know. Actually I don't know. But it should be
considered [inaudible] whether they match or not. So it's the same [inaudible] except --
so even if they matched, like the C matched to C and K matched to K, all right, the O
matched to OFF, okay, our method is still not a semantic match, okay, it's
still a [inaudible] feature match.
>>: So I was thinking that whenever I see a duplicate search like this, I always
wonder what the real -- how do you judge -- I mean, the definition is not that
clear, right?
>> Qi Tian: There are different [inaudible]. I got some requests from companies.
For example, they have -- I'll give a demo -- they say, I have a document. It may
contain a logo [inaudible]; they may contain a logo, like a company logo, for
example FedEx, or other logos like 7-11. So can you automatically find it?
Okay.
>>: Of course you could find it --
>> Qi Tian: No, automatically find it.
>>: You mean find it from --
>> Qi Tian: You have an image document here. So this --
>>: You mean they have their own documents?
>> Qi Tian: Yes, a scanned document; it contains the logo, okay. But you have
lots of documents. Given this document, can you scan -- can you identify any
logos or trademarks from this one?
>>: But that's maybe a -- okay. So -- but I -- [inaudible].
>>: Question.
>> Qi Tian: Uh-huh.
>>: Do you have the SIFT feature for the [inaudible] do you have the [inaudible]
show what are the SIFT features extracted from that --
>> Qi Tian: Oh, to come and show the features. Actually you mean -- no, no,
no. Here I show the matched features.
>>: Well [inaudible] you show the matching features.
>> Qi Tian: Right. The original features.
>>: Whether you have the original 300 features, I mean, [inaudible].
>> Qi Tian: We could have, just -- it's not in this one here, okay. Now, okay, this is just --
given this picture, okay, this one, we found that the second
image actually contains four logos. That's why, when we do the spatial
coding query expansion, we cannot take the whole image to do the query expansion,
because it doesn't know which one it is looking for. So it has to be localized.
So the way we do the query expansion is we find a
minimum bounding box for the matches, and use the information here, the
matched points, [inaudible] against the original data set for the query expansion. So this
is similar to an idea used before in 2007. Now, if I, say, use this one to do the search, to
do the query -- sorry, I pressed the wrong button.
So if I just do the search, it takes 31 milliseconds, okay. And of course
it also finds the Dunkin' Donuts and finds the 7-11. And there
are some McDonald's here. This McDonald's logo is really smooth, so we didn't get
many points detected on it.
Now, this should match on Starbucks. Now if I use this one to do the search,
[inaudible] for example we found -- there is an IBM here matched to the second
image, and MSN. So the original feature -- this is [inaudible] -- there's a FedEx image.
Now, okay, this is Starbucks. And there's more to search. Of course we
[inaudible] Google; there's Google inside. This one, you see Google. This one I
think will show [inaudible].
>>: [inaudible] again to cluster results based on the feature point location in your
query image.
>> Qi Tian: What --
>>: Because the query image includes much more objects.
>> Qi Tian: Right.
>>: And you can do the clustering on the result -- on the search result.
>> Qi Tian: Okay.
>>: And based on that feature point on the original image.
>> Qi Tian: Okay.
>>: Because here it does all of the features on the right bottom corner.
>> Qi Tian: No, actually there are lots of points, original points, detected here. But in
any -- for example these two, they found the correct matches here and also
passed the spatial verification. And -- okay. Now, I will show you a final -- one more --
okay. This is an old couple. Okay. This is the top 20 -- in the top -- okay, 44. So for
example you see a GM here. Why is that? Because there is a GM here. Okay.
And so this is -- so maybe there's some rotation, not much. This is 4 -- so for
example, this is a good example. So actually it showed lots of matches here, but
because the positions are all wrong, we consider it inconsistent. So that is an old man
and a house. This one -- I mean, this would have been matched before, but now it's
filtered. Okay. This is Starbucks. Actually when we search with this one, we do
not know which one we are searching for.
Now, if I use any of them to do the search, okay, now, I can show you the
precision for this one. So this shows using just a random one to do the search. And
we have a very high precision for this one in the top returns, but it's just less
than .6. And now I'll show how we can improve it: we can add a
query expansion. So we take the top 5 images to do a query expansion. Of
course the cost is time. I show the precision here.
>>: You're not [inaudible].
>> Qi Tian: [inaudible] okay. So this originally took 47 milliseconds, but this shows
an additional cost of 125 milliseconds for the query expansion. And this [inaudible].
Okay. A little bit higher, okay, 2.6, 2. -- but the precision is still very high,
okay.
Okay. Due to the time, let me finally show you this video on the 10 million
image database. This video is just a direct copy of the screen.
[music played].
>> Qi Tian: So this shows the 10 million image database. This is the time cost, not
including the feature extraction: 453 milliseconds. And this is the codebook size we
used here.
[music played].
>> Qi Tian: And even reverse some more here.
[music played].
>> Qi Tian: These are actually different categories: a logo, like a product, and
some artwork.
[music played].
>> Qi Tian: And this here.
[music played].
>> Qi Tian: This means it actually matched to the text. So again, it's not a
semantic search, okay. It does not know what they are; it's a feature match.
>>: So how many of [inaudible].
>> Qi Tian: On which scale?
>>: You can match big and small images.
>> Qi Tian: Okay. This depends on the image -- okay. One thing is the image
resolution: if the image is too small it may not get enough points. So that
means we would have to come back and fix the detection part, okay. It's --
>>: [inaudible].
>> Qi Tian: This is 200 by 200, yeah.
>>: Yeah.
>> Qi Tian: Okay.
>>: [inaudible].
>> Qi Tian: The average -- so if it's larger than 400 by 400 [inaudible].
>>: [inaudible].
>> Qi Tian: Yeah, is original.
>>: [inaudible].
>> Qi Tian: No. Because some [inaudible] images are too large, like 2,000 by
2,000, and it takes [inaudible].
>>: The reason you are able to match [inaudible] is due to the SIFT feature?
>> Qi Tian: Yes.
>>: [inaudible].
>> Qi Tian: Because at least it provides a lot of matches in the first place, mixed with
the good ones.
>>: So if you flip the image --
>> Qi Tian: Very good question. Yes. [laughter]. We fixed it actually.
>>: Okay. So [inaudible].
>> Qi Tian: Because if we flip it, the descriptor changes. But there is an easy way to
fix it.
>>: But [inaudible] cutting it in half?
>> Qi Tian: Yeah.
>>: [inaudible].
>> Qi Tian: Yeah.
>>: So that's [inaudible] [laughter].
>> Qi Tian: So I also talked to Google: do you care about flipping, like a mirrored
image, and in how many occasions in search do people use or care about it? They don't
care about it. So for search, okay, people do not ask, okay, to do the flipping -- or can
you rotate by any random degree.
>>: [inaudible].
>>: [inaudible].
>> Qi Tian: Depends on the --
>>: And the user. I just want to look for similar images.
>> Qi Tian: Right.
>>: And that's fine. But [inaudible] really pleased I [inaudible].
>> Qi Tian: Yeah.
>>: [inaudible] copyright.
>> Qi Tian: Yeah.
>>: [inaudible] I would want to find a [inaudible] easy ones I really don't care.
>> Qi Tian: Yeah.
>>: [inaudible] look for the sneaky ones.
>> Qi Tian: So we have a version of this one to handle that. Let me tell you the basic
idea of how to handle rotation invariance. So there are two ways. The first one is
we can pre-rotate the images by [inaudible] different angles. Okay. And
then we construct spatial coding maps for the rotated pictures. Because we know the
rotation, we can construct the coding map from the existing one. That's very
easy. And then we do this check.
Then -- and the cost is more time, but we can handle rotation -- to a
certain degree, not totally, okay?
Another way is -- let me show this slide, okay. The basic idea -- I have hidden a lot of
slides. Another way is we construct the spatial map --
we construct this Rmap. So for example, okay, this is a feature, and this is the
scale of this feature. Okay. Considering the positions of the other features, whether
they're inside or outside -- if they are inside it's 1,
outside it's 0, okay? Now, you can construct this so-called region map. But this is
for one scale.
And later we can -- you consider this as one ring, which just very roughly divides the image
into inside and outside. Then we construct rings. For example we used --
>>: [inaudible].
>> Qi Tian: The rotation, yes. The rotation. Because when features fall on the
same ring, no matter where they are, it's rotation invariant. Later -- but we also found
some -- a drawback I'm not showing: because the image is square, if we divide
the image by rings, okay, it may not cut the image best. So we cut the image by
squares. That's here.
Now, we cut the image -- so for example, we can cut images by squares, okay:
inside the square, outside the square. And we can -- there are a few -- all right,
these are five features. Now, later we add one change: before we construct the
spatial coding map, we rotate each feature according to its dominant
orientation.
For example, for feature two, its orientation is this one. So before we construct the map we
first rotate, rotate to this dominant angle. Then we cut -- okay, instead
of the ring inside/outside, we construct a square, okay. A square might fit the
image better, okay. We don't know.
And then, okay, because we know this dominant angle, we know the
positions before rotation and after rotation. So from the relative positions after rotation
we can construct this square map.
Further, this square inside/outside can be combined with the spatial
coding map. So we can get a lot of different spatial context.
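A rough sketch of the ring-coding idea described here, under my reading of it (not the published formulation): for each reference feature, record whether every other feature falls inside circles whose radii grow with the reference feature's own scale; because the test only uses distances from the reference feature, the resulting map is unchanged when the image is rotated:

    import numpy as np

    def ring_map(points, scales, num_rings=3, alpha=1.0):
        # R[k][i, j] = 1 if feature j lies within distance (k + 1) * alpha * scale_i
        # of reference feature i, else 0. alpha is an assumed scaling constant.
        n = len(points)
        maps = np.zeros((num_rings, n, n), dtype=np.uint8)
        for i, (xi, yi) in enumerate(points):
            for j, (xj, yj) in enumerate(points):
                d = ((xj - xi) ** 2 + (yj - yi) ** 2) ** 0.5
                for k in range(num_rings):
                    if d <= (k + 1) * alpha * scales[i]:
                        maps[k, i, j] = 1
        return maps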
>>: So I still don't understand why you don't [inaudible] to your spatial coding.
Your spatial coding has already divided the whole image into K orientations.
So if I have one set with my original --
>> Qi Tian: Okay.
>>: When you present this K division of the spatial coding.
>> Qi Tian: K?
>>: The K -- quantization, right?
>> Qi Tian: Okay.
>>: You can already deal with the rotation. So currently you are computing one
spatial coding map with the same orientation. So you can compute with
different orientations. [inaudible].
>> Qi Tian: Oh, okay. So let -- we have some results. I can show you the
results. Like, you can rotate by different angles, okay. But that actually is
[inaudible] roughly. Because --
>>: You don't need the [inaudible] images. And the map is already --
>> Qi Tian: Yeah. That's actually -- that's --
>>: Changes the order of the [inaudible].
>> Qi Tian: Actually I had a slide. That's what we do here in our
extension. So we had [inaudible]. So you do not need to rotate the images, okay;
you rotate this -- so you rotate the positions of the features.
So you just need to construct the feature map for this one. And this is a simple
case, when it is rotated 90 degrees.
Now, this is the code before and after this rotation. And the next is -- now this
rotates 45 degrees, and this is the code before and after. Basically, two
becomes three, three becomes four, four becomes five. This is the rotation.
>>: That [inaudible].
>> Qi Tian: Okay.
>>: You have already -- okay. If you go back to your spatial coding image.
>> Qi Tian: Before? This one?
>>: Right, right. Okay.
>> Qi Tian: Let me just [inaudible].
>>: So you have all of the orientation [inaudible] here. Okay. So you have this
map, you have this map, et cetera. So now you can compare. You compare the
K equals 0 map to the K equals 1 map, and you can already deal with an orientation
change of 15 degrees.
So currently in your comparing [inaudible] geometrically [inaudible].
>> Qi Tian: This one? No, this one. This one?
>>: I guess [inaudible].
>>: You don't need to [inaudible].
>>: Original image can be rotated K times. The image can be rotated K times.
>>: [inaudible] already done that.
>>: You have this K map. And supposedly you can map with destination the
retrieval in which each time [inaudible].
>>: [inaudible].
>> Qi Tian: Okay. So if you rotate -- there are two rotations here. One is -- the
rotation where each quadrant is divided into R parts and we're going to
construct a map for each part. That's one rotation.
The other rotation is the image being rotated -- and how do you construct the spatial
coding map from the rotated image?
>>: Okay. So my [inaudible] here you compare K, right? K equals [inaudible].
>> Qi Tian: K is -- your question is [inaudible]. K is from 1 to 3. R is 3. So it's
just -- 0, 1, 2 -- just 0, 1, 2 is --
>>: So it's different from the orientation K?
>> Qi Tian: Yes, it's different.
>>: [inaudible].
>> Qi Tian: Right. That's --
>>: That's the reason [inaudible].
>> Qi Tian: Right. This is a -- in this case --
>>: That's what I meant.
>>: So you can compare the K equals 0 map to the K equals 1 map and you can deal
with -- deal with a rotation of 30 degrees.
>> Qi Tian: Yeah. I sort [inaudible].
>>: That is exactly what we did [inaudible].
>>: But you don't need it to do it again and again. [inaudible].
>> Qi Tian: Yeah.
[brief talking over].
>>: Okay.
>> Qi Tian: Okay. So that's for query -- for the rotation invariance. In this
particular case [inaudible] rotated to a certain angle. So we constructed -- so the last
one is -- this is -- okay. I didn't talk about all of them, but say SC, this is spatial coding.
SC with rotation is when you rotate [inaudible] by different angles. Okay. So that's the
performance: when you rotate more, you get better performance.
And this is geometric coding, okay, using the square and the -- I haven't talked
about it. So we found -- so later there is this so-called geometric coding. Its time
cost is smaller, okay -- it is similar to when you rotate eight times, but its
performance is close to [inaudible]. Okay.
So basically it is a trade-off with time. If we rotate the spatial coding,
you can rotate by more angles to get better performance, but again, it costs time.
>>: [inaudible]. Okay. Thank you very much.
>> Qi Tian: Okay. Thank you.
[applause]