>> Zhengyou Zhang: Okay. Let's get started. Good morning, everyone. It's my pleasure to introduce Qi Tian. Qi is an associate professor at the University of Texas at San Antonio. Three years ago he spent a one-year sabbatical at MSR Asia, and he has been one of the experts in multimedia information retrieval. He has done a lot of research, from low-level feature extraction to indexing and retrieval. Today he will talk about some new research entitled Spatial Coding for Large-scale Partial-duplicate Image Retrieval. So, Qi. >> Qi Tian: Yes, thanks for the nice introduction. So this is recent -- actually last year's work on large-scale partial-duplicate image retrieval. During my talk, if you have any questions, just feel free to interrupt me any time. First of all, this is joint work with my co-supervised student, Wengang Zhou. And now Wengang is here; he is my current post-doc at UTSA. We also collaborated with Professor Houqiang Li from the University of Science and Technology of China, and Lucy Lu from Texas State University. So here is our talk outline. First, an introduction to the problem. Then I'll talk about the motivation for geometric verification for partial-duplicate image retrieval. Then I propose this spatial coding scheme and how to construct the spatial coding map. We also have four enhancements beyond this spatial coding map. And finally I show the experiments and two demos. One is an online demo with my laptop; the other is a video demo, okay, for the 10 million image database. What's the problem? Our goal is to search images with partial-duplicated patches in a large-scale image database. So what are partial-duplicate images? In our definition these partial-duplicate images are created by editing the original images with some changes, for example in scale, okay, cropping, or partial occlusion. Okay. There is a website called TinEye which does a very similar thing to our work, and it claims it has indexed over one billion images. So these are some example results compared to their cases. Partial-duplicate image retrieval is different from general image-based object retrieval or general object recognition, because the latter, okay, is more challenging, okay, and has more variations due to 3D viewpoint change and object-class variability. So partial-duplicate image retrieval is more mature. This slide lists some of the potential applications. The first is of course to de-dup, to save storage space, and for copyright violation detection. And maybe it can be used on some mobile devices, okay, to search for landmarks, artwork, logos, and some product search. And I list these applications because they depend on the features we use for this work. We used SIFT-based features, so it works well -- works better on rigid objects than on non-rigid objects. >>: Is there any [inaudible] what is [inaudible]. >> Qi Tian: Quantity ->>: [inaudible] duplicate, what [inaudible], I mean, there must be some kind of a criteria where [inaudible]. >> Qi Tian: So some examples are in my slides later. And so for partial-duplicate images we assume some local patches, spatial patches, are similar. >>: Similar meaning exact same? >> Qi Tian: Maybe there's some, like, due to various transformations, or some can be [inaudible] transformation. >>: Okay. >> Qi Tian: Some can be a change in the viewpoint or something like that. >>: Okay. >> Qi Tian: Okay. So this slide shows the general pipeline for this kind of image retrieval. 
So first you collect a large-scale image database. The next step is to perform feature extraction. In earlier days people extracted global features: color, texture, shape. But in the last 10 years local features have become more popular. These are some of the local features: SIFT features, SURF features, MSER-based features. The second step is, you have all these local features extracted from the image database. On average one image could give several hundred to a few thousand local features. Therefore the next step is to build a discriminative, descriptive visual codebook. In the last two years we also have some -- two papers. One is to build descriptive visual words and descriptive visual phrases, and another work is contextual visual vocabulary construction, published in [inaudible] Multimedia 2009 and 2010. The third step is how you index, how to build an efficient indexing structure. For this work, we use an inverted file -- index file. And finally, okay, do the image retrieval. So those are the general key steps for image retrieval. Okay? Let's briefly talk about each step. For the local features, here are some of the desired properties. For example, they should have high repeatability; therefore, make them invariant to illumination, rotation, or scale change. And they should be unique: each feature has a distinctive description. And, for example, they have to be compact and efficient to compute, and also preserve some locality property -- for example, occupy a relatively small patch of the image, robust to occlusion or clutter. To extract local features there are two steps. The first step is local feature detection, called the interest point detector. And the second step is the interest point descriptor, okay? Here are just a few [inaudible] popularly used interest point detectors: Harris and DoG based, Harris-Affine, Hessian-Affine, and MSER. And these are some popular interest point descriptors. The most well known is SIFT, PCA-SIFT, okay, or shape context based. In this work we use SIFT as the descriptor, as our local feature. So this slide gives a brief introduction -- I know many of you may already be very familiar with this content. For the SIFT detector, it is either difference-of-Gaussian based or MSER based. To describe this SIFT, okay: first, centered on the interest point location we get a 16 by 16 patch; for each pixel in this 16 by 16 patch we compute its orientation. Then we find the dominant orientation over all the pixels in this local patch. The next step is we rotate this local patch to the dominant orientation found in this local patch. Then we further cut this 16 by 16 patch into 4 by 4 sub-patches. Then for each sub-patch we use an 8-bin histogram over the orientations. So, therefore, each sub-patch gives 8 dimensions. We have 4 by 4, 16 sub-patches; 16 times 8 is 128 dimensions for one local feature. The second step is, as I said, one image could result in hundreds or even a few thousands of local features. And each feature is a high-dimensional vector, a 128-dimensional vector. Now, consider a large-scale database -- for example, millions of images or even billions of images. So this is a very, very large space of local features. At this step people usually perform feature clustering: K-means, or hierarchical K-means, or other clustering like affinity propagation. And they use these cluster centers, okay, as the visual code words, the so-called visual words. 
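To make the codebook and quantization step concrete, here is a minimal sketch, assuming a flat K-means codebook; the system described in the talk actually uses a hierarchical vocabulary tree for scalability, and the use of scikit-learn's MiniBatchKMeans, the function names, and the array shapes are illustrative assumptions rather than the speaker's implementation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans  # stand-in for (hierarchical) K-means clustering

def build_codebook(sift_descriptors, codebook_size=130_000, seed=0):
    """Cluster 128-D SIFT descriptors; the cluster centers serve as the visual words."""
    kmeans = MiniBatchKMeans(n_clusters=codebook_size, random_state=seed, batch_size=10_000)
    kmeans.fit(sift_descriptors)          # sift_descriptors: (N, 128) array pooled over the database
    return kmeans                         # kmeans.cluster_centers_ is the visual codebook

def quantize(kmeans, image_descriptors):
    """Map each local feature of one image to the ID of its nearest visual word."""
    return kmeans.predict(image_descriptors)   # (M,) array of visual-word IDs
```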
So this is the construction of the visual codebook. The third step is, once the visual codebook is constructed, we need to do the feature assignment, feature quantization, which means mapping a high-dimensional local feature vector to a visual word in the codebook. Okay. Usually label assignment to the nearest visual word is used for this purpose. And step 3 is to build the inverted index file. So this is a list of the visual code words in the visual codebook. For an image in our database, first we extract local features; through this feature quantization we map each local feature to one of the visual words in this list. Then we link this image to that visual word. For example, this image has two local features, so they map to two visual words, and we link this image to these two entries along the list here. >>: [inaudible]. >> Qi Tian: Yes, good question. So, yes, we expect it to have more semantic meaning, but from what we have got, it's still a kind of low-level description. A lot of work in the past has been devoted to how to model the semantic meaning, how to model the spatial context in the visual codebook. So there are a lot of efforts in this direction. So it cannot really be compared to the text domain, like keywords, okay? It's just not that high level. So for the second image, okay, we do the same thing, and link the image through its features to the visual words it contains. Okay. So after we have done this for all the images in the database, okay, we have built this so-called inverted index file structure. So what can we do with this inverted index file? Simply, we can use it to do the retrieval. For example, given a query image, okay, we extract local features and map them to visual words. And we simply vote for each image, okay, containing these visual words -- like voting by term frequency. And we can return images ranked by votes, okay, to achieve the image retrieval. Okay. So the visual word is a concept that was proposed in Video Google, ICCV 2003. And there's another popular work by David Nister, the so-called visual vocabulary tree, in CVPR 2006. After that there has been a lot of work in this direction, and I just list a few related ones: how to construct higher-order visual words, okay; how to construct collocation patterns, the bundled features by [inaudible] from MSRA -- I'll talk about it later in the slides; how to build visual synsets, okay; and our work, how to construct descriptive visual words and descriptive visual phrases, and how to build a contextual visual vocabulary. Okay. So in the next slides I'll talk about the motivation ->>: [inaudible] of the image and visual word. >> Qi Tian: You mean how to construct the visual ->>: [inaudible] because you prepare the visual word and you associate the image with visual words. I think essentially this is a bi-clustering problem. One dimension is the number of the images. Another dimension is the features, which [inaudible] the bi-clustering. Have you tried to do something like this rather than just clustering over the features? >> Qi Tian: We haven't tried that. Nowadays, okay, because of simplicity, [inaudible] especially hierarchical [inaudible] is adopted to generate these so-called visual words for the images. Okay? And certainly there can be different ways to generate the visual codebook. Even in our own work, okay, instead of considering each single local feature, we consider features in pairs. 
And we consider the co-occurrence context, the spatial context, to build visual word pairs as the basic description element. So I just feel like maybe there are lots of ways to do that, yeah. Okay. So now let's talk about why geometric verification is important in this scenario, partial-duplicate image retrieval. Our contribution for this work lies here, okay: after these three steps we construct a spatial coding map and do the spatial verification. And there are two kinds of spatial verification. One is local spatial verification and the other is global verification. Our method is a global spatial geometric verification in this scenario. And finally we perform the image retrieval. So originally, when two images are matched to each other, there are a lot of falsely matched SIFT features, okay. Our contribution is to efficiently remove falsely matched SIFT features. And of course this false match is in terms of the geometric consistency, okay. We preserve the truly matched SIFT features by checking their geometric consistency. Okay. >>: [inaudible] is done in the indexing stage or in the retrieval stage? >> Qi Tian: Retrieval stage. >>: Retrieval stage. >> Qi Tian: In the indexing stage -- I will talk about it later in the slides -- we are going to save some information for the indexed features. So, for example, the image ID: this feature comes from which image? For each SIFT feature, what's its scale -- one number -- what's its orientation, and what's its XY location in the image. That's the information we store in the indexing stage. And during the retrieval stage we're going to use this information to build a spatial coding map online and to do the spatial verification. So this is an example, okay. This shows the SIFT point matches between these two relevant, so-called partial-duplicate images. And you can see these matches, okay, consist of two parts, the true matches and the false matches. And the false matches are mainly because they're geometrically inconsistent. For example, some hair point here matches to here. Okay. So our goal is to design a filtering method to preserve these true matches and filter out these falsely matched features. Okay. So what causes this problem? Okay. There are a few factors. First, the local features: they do not preserve enough spatial information, and also they are not stable under affine transformations, and sometimes they are corrupted by noise. In the feature quantization, because it goes from a high-dimensional feature vector to a single ID, this introduces a large quantization error. And in this bag-of-visual-words model, it's orderless -- no spatial information. And in the indexing, in the past the scale and orientation of the local features are not used. We found orientation and scale of local features are very important for partial-duplicate image retrieval, and we use them in our work. >>: [inaudible]. >> Qi Tian: Uh-huh. >>: [inaudible]. >> Qi Tian: Yeah. From no images. And also I have to point out that when we download the images from the Internet, we reduce the image size to 400 by 400. And so on average there are about 300 local features per image. >>: [inaudible]. >> Qi Tian: Uh-huh. So far we haven't. Yes. We can discuss it later, because we haven't done anything in the compressed domain. So we use local features from the image. Okay. So this is the geometric verification. The goal is, first, to be effective in removing these falsely matched features. Again, false means they are geometrically inconsistent, okay. 
And secondly, it has to be fast, efficient in implementation. Considering, if you do retrieval over millions of images -- 10 million, 100 million images -- it has to be a real-time response. Okay. So it has to be very fast. So there are two kinds of verification. One is local verification; there is plenty of work, for example the bundled features from CVPR 2009 and the local nearest neighbor approach. For global verification, there is the [inaudible] RANSAC and our spatial coding map. This work is based on our ACM Multimedia 2010 publication plus some extensions. For local geometric verification, this is Video Google, ICCV 2003. The idea is simple. Consider two points matched between image 1 and image 2, A matched to B. In order to accept them as a true match, we also consider the neighborhood of each feature, okay? If the neighbors are also mapped to the same visual words, we consider this pair to have local region support. If no neighbors match each other, there is no region support and we're going to reject them. That's the idea. However, the drawback of this approach is that it is kind of sensitive to cluttered background. Okay. The second approach is bundled features. This approach, instead of considering each individual feature alone, considers features in groups. The groups are made from regions. So this shows two bundled features. This bundle has four local features; this one has five. Four of them are matched to the same visual words, okay. This is another example -- there are two bundles; they also have four features mapped to the same visual words. The next step is they use spatial consistency for the bundled features. This information is used to weight the visual words. So besides the traditional, okay, TF-IDF weighting for each visual word, they add this term, okay, and this spatial consistency consists of two parts. The first part is how many visual words are shared between the two bundles. For example, in both of these two cases they share the same four visual words, okay. The second part is the spatial consistency between the two. So in the first example, they check spatial consistency in two directions: one is the X direction, horizontal; one is the vertical, Y direction. For example, in the X, horizontal direction, the increasing order is, okay, circle, triangle, cross, and square, okay, for this bundle. And for the second bundle, in increasing order in the X direction, it is also circle, triangle, cross, and square, which means the order of the bundled features in the X direction is consistent here. Okay. It is also consistent in the Y direction, okay. But for the [inaudible] case, in the X direction, okay, this is the circle, the triangle, the cross, and this one -- but it got an incorrect matching order. The matching order is wrong for the first one and the third one. So you can see it has two degrees of inconsistency. Okay. And they showed good performance for partial-duplicate image detection over a one million image database. However, this method has a drawback. It's going to be infeasible if the bundle is rotated, okay, because the X, Y order will be changed. Okay. For global verification, okay, the most [inaudible] is the RANSAC algorithm. RANSAC stands for Random Sample Consensus. It's an iterative procedure: it iteratively does outlier/inlier classification. An inlier is defined as a truly matched feature; an outlier is a falsely matched feature. So we start by randomly sampling some features. 
We consider them as inliers and estimate the affine transformation model based on these matched features. Then we use this model to test against all the other features in the image, and classify them as inliers and outliers. And then, based on the increased inlier data set, we do the sampling again and estimate the model again, and so on. The drawback of this one is that it is computationally expensive, therefore it's not scalable. Usually in image retrieval, when we have the initial retrieved image list, it takes the top 300 or 500 to do the RANSAC check. Okay. So basically it only reranks images within the top 300 or 500 images, okay. In our case, we basically check all the images returned by the inverted file indexing. So next is how to construct a spatial coding map. The key idea is, we're going to construct a spatial coding map which can record the relative spatial positions between matched features. In the first work in this direction we construct two maps: one is called the Xmap and the second is the Ymap, to record the relative spatial positions in the X and Y directions. For example, okay, this image has four matched features, okay. Now, say in this Xmap we use the reference point I. We consider the relative position of J with respect to I. If J is to the right of I we record it as 0, otherwise we record it as 1, okay. And similarly, if we consider reference point I, if J is above I we record 0, otherwise we record 1. So let me show you this toy example. First we start with reference point 1, okay. Because 2 and 3 are to the right of 1, for the first point 1, the positions of 1, 2 and 3 are recorded as 0, okay, and the rest are recorded as 1, okay. Then consider the Ymap, okay: for point 1, because all the points are below it here, okay, including itself, it's recorded as all 1s in the Ymap, okay. So this is done for the Xmap and Ymap for the first feature. Then we move to the second feature. For reference point 2, point 3 is on the right side of 2, so it's 0 in the Xmap, okay, and in the Y direction, since point 1 is above point 2, okay, at position 1 it's recorded as 0 and the rest are 1s. That is done for the second one. We just continue until we finish this coding for all the matched features in the image. This constructs the so-called Xmap and Ymap for each image, okay. Okay. This is for the last one. So what can we do with this Xmap and Ymap? This is an illustrative, very simple example. This is a query image, and this is a matched image in the database. You can see there are five pairs of matched features. It can be easily verified that four of them are geometrically consistent, except point 5, okay? And these are the Xmap and Ymap of the query image, and the Xmap and Ymap of the matched image. The first four are geometrically consistent, so you can see they have the same Xmap and Ymap values for them, okay. So if we take an exclusive-OR operation between these two maps, it will be all 0s, okay, at these locations, and 1s where there is inconsistency. Then we take the summation of each row and get the overall inconsistency degree for each feature. You can identify that, in the X direction or the Y direction, point 5 has the largest inconsistency degree. So we can identify the largest one and remove that feature -- remove that column and row -- and continue this process for the rest of the features until the inconsistency degrees are all 0. Okay. So this is a very strict condition. Okay. 
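To make the basic Xmap/Ymap scheme concrete, here is a minimal sketch of the verification loop just described, covering only the simple one-part-per-quadrant case; the generalized maps with division factor R are omitted, and the function names, the handling of ties, and summing the X and Y inconsistency degrees together are my own simplifications rather than the exact procedure in the paper.

```python
import numpy as np

def spatial_maps(xs, ys):
    """Binary maps of relative positions among one image's matched features.
    xmap[i, j] = 0 if feature j is to the right of (or level with) feature i, else 1;
    ymap[i, j] = 0 if feature j is above (or level with) feature i, else 1
    (image y is assumed to grow downward)."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    xmap = (xs[None, :] < xs[:, None]).astype(np.uint8)
    ymap = (ys[None, :] > ys[:, None]).astype(np.uint8)
    return xmap, ymap

def spatial_verify(q_xy, m_xy):
    """Iteratively drop the most geometrically inconsistent match until the maps agree.
    q_xy, m_xy: lists of (x, y) locations of the matched features in the query and the
    candidate image, aligned so that pair k is (q_xy[k], m_xy[k]).
    Returns the indices of the surviving, geometrically consistent matches."""
    keep = list(range(len(q_xy)))
    while len(keep) > 1:
        qx, qy = zip(*(q_xy[k] for k in keep))
        mx, my = zip(*(m_xy[k] for k in keep))
        q_xmap, q_ymap = spatial_maps(qx, qy)
        m_xmap, m_ymap = spatial_maps(mx, my)
        vx = np.bitwise_xor(q_xmap, m_xmap)      # 1 where the relative x-order disagrees
        vy = np.bitwise_xor(q_ymap, m_ymap)      # 1 where the relative y-order disagrees
        inconsistency = vx.sum(axis=1) + vy.sum(axis=1)
        worst = int(np.argmax(inconsistency))
        if inconsistency[worst] == 0:            # all relative positions agree: done
            break
        keep.pop(worst)                          # remove the worst feature and re-check
    return keep
```

In the talk, the feature with the largest inconsistency degree in either the Xmap or the Ymap is removed at each iteration; since the check is built from binary exclusive-OR operations, it stays cheap enough to run on every candidate image returned by the inverted file.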
But the previous case is a very simple one: each quadrant has only one part. Now, consider each quadrant uniformly divided into two parts. Because a point located in this corner is still quite different from one located here, okay, we further divide it into two parts. Now, this can be considered as a combination of two divisions, the first one and the second one. Okay. Here we can construct the X, Y map for the first one. But how do we construct the X, Y map for the second one? Because it's not horizontal-vertical in its position, we can rotate it, okay, 45 degrees counterclockwise. This rotates the features, okay. Now we can construct the Xmap and Ymap for this one, okay. In the general case, if we divide each quadrant uniformly into R parts, this layout can be considered as the combination of R different divisions. And if we rotate each division by a different angle, okay, to this horizontal-vertical position, we can construct an Xmap and a Ymap for each of them. Putting them together, we have a generalized Xmap and Ymap, okay. Then with this generalized Xmap and Ymap we can do the spatial verification. So here Q is the query image and M is the matched image, okay. I, J are the indices of the matched local features, okay, from 1 to N. K indexes the divided parts, from 0 to R minus 1. N is the total number of matched features. So for each, again, we take the exclusive-OR: if they are consistent the result will be 0; if they are inconsistent it will be 1. Finally we sum up the inconsistency degree for each feature in the X direction and Y direction and identify the maximum, okay, either in the Xmap or the Ymap, remove it, and reiterate until this gets to 0. These are just some spatial verification examples, okay. This is the original match before verification. These are the identified false matches, okay -- the ones that failed the spatial coding verification. And these are the remaining ones, okay, which means the SIFT matches that passed the spatial verification. This shows the SIFT matches, okay, for two irrelevant images: one is a Chinese document, one is an English document. And you can see there are still SIFT matches on them. These are the identified false matches, okay? And still, okay, you see there are three pairs that passed the spatial verification. At least one pair will always pass verification, because when only one pair is left, it cannot do the verification -- there is no relative position information. Okay. So we filter most of the false matches. Okay. >>: [inaudible]. The idea of [inaudible] is to prove the [inaudible] problem, right, [inaudible] originally which are rotated a certain way. >> Qi Tian: Here ->>: Here you are not doing that actually, you are just matching, you know, with the same keys that you [inaudible] for the rotation. Do you see what I mean? >> Qi Tian: Right now we haven't talked about the rotation [inaudible] here, okay. So we haven't talked about this, okay. It's in later slides. So, okay, this is our indexing structure. Again, this is a list of visual words, and these are the indexed features, okay? For each indexed feature we store this information. First the image ID -- right now we use 4 bytes, which can roughly index up to 1 billion image IDs, okay. And the feature orientation is one byte, the scale is one byte, and the X and Y locations are one byte each. So in total we have 8 bytes for each indexed feature. And now we reduce all the images to 400 by 400, okay, for this demo. On average there are 300 features per image. So, therefore, one image's index size is about 2.4 kilobytes. 
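To illustrate the 8 bytes per indexed feature mentioned here, a plausible packed layout could look like the following sketch; the field order, the use of Python's struct module, and the way coordinates are squeezed into one byte (scaling to 0..255, which is enough resolution for images reduced to 400 by 400) are my assumptions, not the actual on-disk format.

```python
import struct

# One inverted-file entry per indexed feature (8 bytes, matching the sizes in the talk):
# 4-byte image ID, 1-byte orientation bin, 1-byte scale bin, 1-byte x, 1-byte y.
ENTRY = struct.Struct("<IBBBB")          # 4 + 1 + 1 + 1 + 1 = 8 bytes

def pack_entry(image_id, orientation_bin, scale_bin, x, y, width, height):
    """Quantize the feature location to 0..255 and pack one index entry."""
    qx = min(255, int(255 * x / max(width, 1)))
    qy = min(255, int(255 * y / max(height, 1)))
    return ENTRY.pack(image_id, orientation_bin, scale_bin, qx, qy)

def unpack_entry(blob):
    return ENTRY.unpack(blob)            # (image_id, orientation_bin, scale_bin, qx, qy)
```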
And that's, for a one million image index, 2.4 gigabytes. That's the demo I'm going to show. And a 10 million image index is 24 gigabytes; that's why I cannot show it on my laptop -- I have a video. Of course this is a very rough estimation, not optimized; we can maybe further reduce the storage for each indexed feature. Now, beyond the spatial coding we have considered four enhancements. The first one is how to handle rotation invariance. And how to further recover some false negative matched features -- because the filtered features have two parts, false negatives and false positives, and we have a way to re-estimate the model to recover some of the false negatives. And how to do query expansion, and bi-space quantization. So, you know, traditionally, in the traditional visual vocabulary work done by [inaudible], the clustering is in the SIFT space, which means in this descriptor space. And then we further use the orientation information to filter some of the matches -- to do this quantization. Okay. So there's a slide -- actually I have them hidden. When we finish the talk, I can come back to talk about some of the details. Okay. Now, experiments. We first construct a 1 million image database downloaded from the web and construct three smaller data sets, 50K, 200K, 500K, by randomly sampling it. For the ground truth data set we obtain 1100 partially duplicated web images in 23 groups and use 100 of them as the representative queries, okay. We compare to three algorithms. The baseline is the David Nister CVPR 2006 paper -- this is a well cited paper, using the visual vocabulary tree. The second one is Jegou, ECCV 2008, Hamming Embedding plus weak geometric consistency. And the third is full geometric verification using [inaudible]. For performance evaluation, the first measure is mean average precision and the second one is time cost per query. We didn't consider memory cost in this work. Okay. Well, the first experiment uses 16 GB memory and a 2.0 gigahertz CPU. So there are a few parameters to tune. The first one is, okay, the codebook size, okay. In David Nister's paper, to index one million images they found [inaudible] is around a one million size visual codebook. We got a similar observation, okay. The second thing is the quadrant division factor: each quadrant is divided into how many parts, okay? For our case, R. And the next one is the orientation quantization size, okay? And this is considered a trade-off between the precision and the cost, the time cost. So this is the orientation quantization size from 0 to 10 -- to 20, which means, for a circle, 360 degrees, how finely you cut it, how you quantize the orientation space. And this is the mean average precision. We found, okay, when the quantization size is 11, it actually gives the best performance in terms of precision and time cost. Okay. >>: This data set [inaudible] I suppose. >> Qi Tian: Because we tested over a million image data set, yeah. So that's the result from that one. Okay. >>: When is the [inaudible] 360 degrees quantization or ->> Qi Tian: Let me get -- let me [inaudible]. So when two features match to each other, they may match based on the 128-dimensional descriptor. >>: Right. >> Qi Tian: But their orientations might be totally different. Each feature has a dominant orientation. So we only consider a certain angle, okay? We need a certain angle to take the orientation consistency into consideration as well. 
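As one way to picture how the stored dominant orientations could be used -- the exact scheme in the paper may differ; this simple voting on the per-pair rotation difference is my own illustration -- with 11 bins over 360 degrees, as tuned above, truly duplicated regions should put most matched pairs into the same rotation-difference bin.

```python
import numpy as np

def filter_by_orientation(q_theta, m_theta, n_bins=11):
    """Keep matched pairs whose rotation difference falls in the dominant bin.
    q_theta, m_theta: dominant SIFT orientations (degrees) of the matched pairs."""
    diff = (np.asarray(m_theta, dtype=float) - np.asarray(q_theta, dtype=float)) % 360.0
    bins = np.floor(diff / (360.0 / n_bins)).astype(int)      # quantize differences into n_bins sectors
    dominant = np.bincount(bins, minlength=n_bins).argmax()   # most common rotation-difference bin
    return np.flatnonzero(bins == dominant)                   # indices of orientation-consistent pairs
```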
We found that this information actually is quite useful for partial-duplicate image search. >>: [inaudible]. >> Qi Tian: Yes, it's in the SIFT, SIFT match. Okay. >>: That's all interesting if you were -- originally if you were [inaudible] orientation dependent, right, orientation dependent [inaudible] feature. Well actually -- but at the back side -- the downside of that actually causes some mismatches, that's why [inaudible] don't do that [inaudible] SIFT [inaudible] orientation is not -- it's not too far away ->> Qi Tian: Yeah, the orientation I say has to be [inaudible]. >>: Yeah. >> Qi Tian: Degrees. Which means that if this is [inaudible] size label [inaudible] almost circuit degrees. And the second thing is the visual codebook size. We construct different visual codebook sizes: 12K, 130K, 250K and 500K. And actually this observation is different from what was observed in David Nister's paper. Because in his work, he found that when you increase the codebook size the performance gets better, then it gets [inaudible], okay. In our case, actually, it's different, okay. So when you -- you can see here, when we have a smaller codebook we actually get better performance. Why is that? Because in the feature space, when you have a smaller codebook, each bucket is bigger, so more features fall into each bucket, which means we get a lot more falsely matched features, right. But our algorithm [inaudible] very efficiently removes them. Okay. >>: So what happens when you increase the codebook? >> Qi Tian: When you increase the codebook. >>: [inaudible]. >> Qi Tian: Each bucket contains much fewer points. >>: So that means you [inaudible] mismatches. >> Qi Tian: So -- yes. It does not have many mismatches. >>: Okay. >> Qi Tian: So we found the problem for them is, at the beginning, if you do not detect many matched features, it doesn't work. So you want lots of matches; even if there are lots of false matches, we can remove them. So therefore we find, as a trade-off, for one million images, 130K is a good trade-off between precision and time cost. The last one is the quadrant division factor. So again, like I said, this is the quadrant division factor on this data set, okay. We found the quadrant factor R equal to 3, which means each quadrant is divided into three parts, okay, gives the best performance so far. Okay. Okay. This is the summary. The codebook size is for the one million image database; if you have like a 100 million image database, my thought is these values would be increased. So this is a performance comparison on different sizes of image database: 50K, 200K, 500K and one million. And we compare with -- this is the baseline, the David Nister paper. This is Hamming Embedding. This is the [inaudible] full rerank method. This is our spatial coding algorithm. And this is spatial coding plus query expansion. Now, on the one million dataset, the precision compared to the baseline goes from .48 to .73, so nearly a 52 percent improvement in precision. And compared to using full geometric verification with RANSAC, it's improved from .61 to .73, roughly 20 percent. And if we add query expansion on top of the spatial coding, it's further improved from .73 to .88, about 20 percent. And this is the precision -- this is the time cost. This is the time cost per query on this one million dataset, okay, with our [inaudible] configuration. This is the baseline; this is spatial coding. Actually our spatial coding is faster than the baseline, okay. I'll explain later. Okay. 
And when you introduce the query expansion, it introduces an additional .6 second per query. And this is Hamming Embedding. This is reranking using RANSAC for the top 300 images; it's like three seconds for each one. So compared to the baseline, okay, we have a 46 times reduction, okay. Okay. These are some of our sample results. These are some of the queries, and these are the so-called partial-duplicate images found in the database. Our demo can show you this. For example we have various -- some occlusion here for the head. And this is [inaudible], this is Starbucks, this is Starbucks in English, Chinese, and another local cafe. And this is a scale change. This is a change of viewpoint. And this -- I can show the demo. So further we tested the scalability on a 10 million image database, downloaded from the web. Now, we increased the computing server to 2.4 gigahertz and used a server with 32 gigabytes of memory. And this is the performance. In terms of the codebook size, when we have a smaller codebook we have better performance, but with a small codebook you have a longer image list, so the time cost will be higher. Therefore, trading off between these two, we use this one, considered a good trade-off for the 10 million images. I have a video demo for some of the examples. Okay. Before the conclusion let me show you the examples. >>: [inaudible] could you provide us some insight into why the [inaudible] images [inaudible] than your spatial ->> Qi Tian: Because when you do the match, we check the index [inaudible]; for one million images you might return 10K images back. And [inaudible] verification because it takes the top 300 images to do that, not all the images. So RANSAC is [inaudible] returned images. >>: Top 300. >> Qi Tian: 300 or 500, yes. >>: [inaudible]. >> Qi Tian: We are actually checking all the returned images. >>: How much? How many? >> Qi Tian: It depends; for a one million database it varies. Let's say on average it returns a hundred -- 10K ->>: Eight ->> Qi Tian: 8,000 images, yes. >>: So if you do the ->> Qi Tian: [inaudible]. >>: Not considering the time. >> Qi Tian: Right. >>: If you do the full geometric ->> Qi Tian: Yes. >>: [inaudible] better. Would you agree with me? [brief talking over]. >>: [inaudible] so many facts compared to [inaudible]. Because you both consider the order ->> Qi Tian: It's a big difference. Big difference. Because they only use the spatial consistency to weight the -- to weight the visual word. Then so [inaudible]. >>: [inaudible] verify the matching. [inaudible] use that to verify the matching. >>: [inaudible]. >>: But just use it as a weight to ->>: [inaudible]. >>: And if for instance [inaudible]. >> Qi Tian: For location. >>: [inaudible] for location, yeah. I mean, I got a spatial order [inaudible] is the same so [inaudible]. >>: He uses [inaudible]. >>: Yeah, that part is different. I understand. >>: And [inaudible] question, because currently you are [inaudible]. So you have to compare the vertical one with [inaudible]. >> Qi Tian: I have -- I'll show a demo. I opened the slides [inaudible]. >>: [inaudible]. >> Qi Tian: We have -- so this is the first work. Then we propose to handle rotation invariance. We also have a demo for that. This is just the first work in this direction. So actually we have different ways to model this [inaudible] context now. It's not only that we cut the space this way and that way. 
Actually we have a later, so-called geometric ring coding. Instead of considering which quadrant a point is in, we consider, for example, that each point has a scale, okay. Then the relative position with respect to the other points is whether they are inside that scale or outside it, and we can divide the image space into concentric rings with increments of the scale. We get a different spatial coding map. So among those we have geometric ring coding, which further improves, with geometric [inaudible] coding and geometric square coding. And I have some additional slides, okay, because I want to show the basic idea, okay, then we can talk about the parts I'm not discussing. Okay. >>: So this is a [inaudible] and let me ask [inaudible]. >> Qi Tian: Okay. >>: So I mean you talk about the Xmap and the Ymap. >> Qi Tian: Okay. >>: I mean, different quadrants, different directions, right? When you talk about the inverted index, do you also [inaudible] features or the Xmap and Ymap to [inaudible] and store it per image, or basically take [inaudible]. >> Qi Tian: On the spot. On the fly. So actually, as I said, we only use the X, Y and the scale and orientation to generate the spatial coding. So even if on average one image has 300 features, it does not mean we're going to construct a 300 by 300 [inaudible], because the matched features are much fewer -- maybe a hundred, okay, a hundred by a hundred. So this is going to be constructed online. And because most of the operations are binary exclusive-OR operations, it's fast, very fast, okay. >>: I have a question before you go to the demo. >> Qi Tian: Okay. >>: Can I ask why rank 4 is not rank 1 for that query image? >> Qi Tian: Oh, okay. Good question. Actually this can be easily fixed. Because -- okay. Let me show you how they're matched. >>: You have a very sharp eye. [laughter]. >> Qi Tian: Actually ->>: I'm in the back. >> Qi Tian: So if you check here -- okay, the one in the back. These are just the SIFT matches after the spatial verification. And this says they have 20 matches that passed the spatial verification. And this is the first one. The second one, they also have 20 matches. Right now they are displayed by the image index. So that's why you see -- this one actually should be ranked first. Basically the rank is determined by the number of matches between the two images. >>: So it's by the number of matches, not [inaudible]. >> Qi Tian: So they have the same number of matches. >>: No, no. I'm asking a question. So is your [inaudible] by the number of matches, not by the similarity within the matches? >> Qi Tian: No, it's not similarity. Because originally, okay, they could have a hundred SIFT matches; now maybe 20 pass verification. So we use that number as our similarity measurement, because this way we can reduce the further time cost of calculating the image similarity. >>: Okay. >>: [inaudible] use the similarity or the matches, right? >> Qi Tian: Right. Actually the initial SIFT matching is done by looking up the hierarchical visual vocabulary tree. >>: Okay. This is -- this shows the top -- okay. So there are six matches. The database here is 1.2 million images on this disk. And the time cost is 125 milliseconds -- I have to say this does not include the feature extraction time for each image. Okay. This is everything after the feature extraction, okay. And the codebook size we use is 1.39 million. 
And when we display, [inaudible] we return images with at least three matches passing the spatial -- past the spatial verification. So if it's lower than three we don't display it. Okay. So the rank is determined by the number of true matches after verification. So as you can see -- so the small match like -- if we query another one, the match is on the right side. So this red line means a correct match, okay, and this blue line means an incorrect match. >>: Yes. Are all these images [inaudible]. >> Qi Tian: You mean ->>: From the perspective point of view. In your earlier slides you have a Mona Lisa, which [inaudible]. >> Qi Tian: Yeah. I can show you. I can show you. >>: Yeah. So I'm just saying how well ->> Qi Tian: Okay. >>: [inaudible]. >> Qi Tian: Yes. >>: [inaudible]. >> Qi Tian: Because right now we found, of course, it works best on flat images, flat structures. If you have a curved surface, or if you have a large -- very large affine transformation, the SIFT will miss the detection in the first place. So that means we cannot get good results even before the spatial verification. In order to improve that, first you have to use something like affine-SIFT to capture -- to detect first, and then perform the spatial verification. Now, how good or how tolerant it is for those very curved features, we don't know. I have some examples to show you -- I can show them right now. Let me -- so this is a Starbucks logo. It takes 7 milliseconds and returns 50 in total with at least three matches. So this is some of the -- so this one -- there is some real viewpoint change, and this is in the same image. And this one -- for example, even this star matched to here, and we discard that match because it is geometrically inconsistent. And here -- so even if it's a [inaudible], we found the matches here. Now, okay, now, if I use this one to do the search, okay ->>: [inaudible] how does the older version of the [inaudible] version match? >> Qi Tian: Okay. Before you go -- so this is also not -- let's search for this one. Let's do the search. Okay. Now we found this one. Actually this looks like Starbucks but it's not exactly. >>: Is that considered a [inaudible], I mean, just -- [laughter]. >>: Suppose you give it to copyright lawyers, is this considered [inaudible]. >> Qi Tian: Actually I don't know. Actually I don't know. But it should be considered [inaudible] whether we match it or not. So it's the same [inaudible] except -- so even if they match, like the C matched to C and K matched to K, all right, O matched to OFF. Okay. So our method is still not a semantic match, okay, it's still a [inaudible] feature match. >>: So I was thinking that whenever I see a duplication search I always wonder what was the real -- how do you judge -- I mean, the definition is not that clear, right? >> Qi Tian: There are different [inaudible]. I got some requests from companies. For example -- let me give an example. They say, I have a document; it may contain a logo [inaudible], it may contain a logo, like a company logo, for example FedEx or other logos like 7-11. So can you automatically find it? Okay. >>: Of course you could find it ->> Qi Tian: No, automatically find it. >>: You mean find it from ->> Qi Tian: You have an image document here. So this ->>: You mean they have their own documents? >> Qi Tian: Yes, a scanned document; it contains the logo, okay. But you have lots of documents. 
Given this document, can you scan it -- can you identify any logos or trademarks in it? >>: But that's maybe a -- okay. So -- but I -- [inaudible]. >>: Question. >> Qi Tian: Uh-huh. >>: Do you have the SIFT features for the [inaudible], do you have the [inaudible] show what are the SIFT features extracted from that ->> Qi Tian: Oh, to come and show the features. Actually you mean -- no, no, no. Here I show the matched features. >>: Well, [inaudible] you show the matching features. >> Qi Tian: Right. The original features. >>: Whether you have the original 300 features, I mean, [inaudible]. >> Qi Tian: We could have -- it's just not shown here, okay. Now, okay, this is just -- given this picture, okay, this one, we found that the second image actually contains four logos. That's why when we do the spatial coding query expansion we cannot take the whole image to do the query expansion, because it doesn't know which one it is looking for. So it has to be localized. So the way -- when we do the query expansion, we find a minimum bounding box for the matches, and use the information here, or some points matched to [inaudible], added to the original query to do the query expansion. This is a similar idea to what was used before in 2007. Now, if I use this one to do the search, to do the query -- sorry, I pressed the wrong button. So if I just want to do the search it takes 31 milliseconds, okay. And of course it also searches for the Dunkin' Donuts and for the 7-11. And there are some McDonald's here. This McDonald's logo is really smooth, so we didn't get many points detected on McDonald's. Now, this should match on Starbucks. Now if I use this one to do the search, [inaudible], for example we found -- there is an IBM here matched to the second feature, and MSN. So the original feature, this is [inaudible], there's a FedEx image. Now, okay, this is Starbucks. And there's more to search. Of course we [inaudible] Google; there's Google inside. This one you see Google. This one I think this will show [inaudible]. >>: [inaudible] again to cluster results based on the feature point location in your query image. >> Qi Tian: What ->>: Because the query image includes many more objects. >> Qi Tian: Right. >>: And you can do the clustering on the result -- on the search result. >> Qi Tian: Okay. >>: And based on that feature point on the original image. >> Qi Tian: Okay. >>: Because here it shows all of the features on the right bottom corner. >> Qi Tian: No, actually there are lots of points, original points detected here. But in any -- for example these two, they found the correct matches to be here and also passed the spatial verification. And -- okay. Now, I will show you a final -- one more -- okay. This is an old couple. Okay. This is the top 20 -- in the top -- okay. 44. So for example you see a GM here. Why is that? Because there is a GM here. Okay. And so this is -- so maybe there's some notation, not much. This is 4 -- so for example, this is a good example. So actually they show lots of matches here, but because the positions are off, we consider it inconsistent. So that is an old man and a house. This one -- I mean, this should have been matched before, but now it's filtered. Okay. This is Starbucks. Actually when we search for this, using this one, we do not know which one we are searching for. Now, if I use any of them to do the search, okay, now I can show you the precision for this one. So this just shows a random one used to do the search. 
And we will have a very high precision for this one in the top returns. But it's just less than .6. And now I'll show how we can improve it: we can add a query expansion. So we take the top 5 images to do a query expansion. Of course the cost is time. I'll show the precision. >>: You're not [inaudible]. >> Qi Tian: [inaudible] okay. So this was originally 47 milliseconds, but this shows an additional cost of 125 milliseconds for the query expansion. And this [inaudible]. Okay. A little bit higher, okay, 2.6, 2. -- but the precision is still very high, okay. Okay. Due to the time, let me finally show you this video on the 10 million image database. This video is a direct copy of the screen. [music played]. >> Qi Tian: So this shows a 10 million image database. This is the time cost, not including the feature extraction, 453 milliseconds. And this is the codebook size we used here. [music played]. >> Qi Tian: And even reverse some more here. [music played]. >> Qi Tian: It's actually a different category, which is a logo, like a product, and some artwork. [music played]. >> Qi Tian: And this here. [music played]. >> Qi Tian: This means it actually matched to the text. So again, it's not a semantic search, okay. It's not that they match semantically; they're a feature match. >>: So how many of [inaudible]. >> Qi Tian: On which scale? >>: You can match big and small images. >> Qi Tian: Okay. This depends on the image -- okay. One thing is the image resolution; if the image is too small it may not get enough points. So that means it again comes back to the detection part, okay. It's ->>: [inaudible]. >> Qi Tian: This is 200 by 200, yeah. >>: Yeah. >> Qi Tian: Okay. >>: [inaudible]. >> Qi Tian: The average -- so if it's larger than 400 by 400 [inaudible]. >>: [inaudible]. >> Qi Tian: Yeah, it is the original. >>: [inaudible]. >> Qi Tian: No. Because some [inaudible] images are too large, like 2,000 by 2,000, and it takes [inaudible]. >>: The reason you are able to match [inaudible] is due to the SIFT feature? >> Qi Tian: Yes. >>: [inaudible]. >> Qi Tian: Because at least it provides a lot of initial matches, mixed with the good ones. >>: So if you flip the image ->> Qi Tian: Very good question. Yes. [laughter]. We fixed it actually. >>: Okay. So [inaudible]. >> Qi Tian: Because if you flip it, the descriptor is changed. But there is an easy way to fix it. >>: But [inaudible] cutting it in half? >> Qi Tian: Yeah. >>: [inaudible]. >> Qi Tian: Yeah. >>: So that's [inaudible] [laughter]. >> Qi Tian: So I also talked to Google: do you care about flipping, like a mirrored image, and on how many occasions in search do people use or care about it? They don't care about it. So for the search, okay, people do not ask, okay, do the flipping -- can you rotate, like randomly rotate by any degree. >>: [inaudible]. >>: [inaudible]. >> Qi Tian: Depends on the ->>: And the user. I just want to look for similar images. >> Qi Tian: Right. >>: And that's fine. But [inaudible] really pleased I [inaudible]. >> Qi Tian: Yeah. >>: [inaudible] copyright. >> Qi Tian: Yeah. >>: [inaudible] I would want to find a [inaudible] easy ones I really don't care. >> Qi Tian: Yeah. >>: [inaudible] look for the sneaky ones. >> Qi Tian: So we have a version of this one to handle that. Let me tell you the basic idea of how to handle rotation invariance. There are two ways. The first one is we can pre-rotate images, like [inaudible] different angles, okay, and then we construct the spatial coding map for the rotated pictures. 
Because we know the rotation, we can construct the coding map from the existing one. That's very easy. And then we do the check. The cost is more time, but we can handle rotation -- to a certain degree, not totally, okay? Another way is -- let me show this slide, okay. The basic idea -- I hid a lot of slides. Another way is, now we construct the spatial map -- we construct this Rmap. So for example, okay, this is a feature, and this is the scale of this feature. Okay. Considering the positions of the other features, whether they're inside or outside: if they are inside it's 1, outside is 0, okay? Now, you can construct this so-called region map. But this is for one scale. And later we can -- you consider this as one; it just very roughly divides the image into inside and outside. Then we construct a ring. For example we used ->>: [inaudible]. >> Qi Tian: The rotation, yes. The rotation. Because when they fall on the same ring, no matter where they are, it's rotation invariant. Later -- but we also found some -- I'm not showing it here, but the drawback is that the image is square, so if we divide the image by rings, okay, it may not cover the image well. So we cut the image by squares instead. That's here. Now, we cut the image -- so for example, we can cut the image by squares, okay: inside the square, outside the square. And there are a few -- all right, these are five features. Now, later we add one change. Before we construct the spatial coding map, we rotate each feature according to its dominant orientation. For example, for feature two its orientation is this one. So before we construct, we first rotate, rotate to this dominant angle. Then we cut as -- okay, instead of this ring, inside/outside, we construct a square, okay. A square might fit the image better, okay -- we don't know. And then, okay, because we know this dominant angle, we know the positions before rotation and after rotation. So we construct this rotation -- these after-rotation relative positions, and we can construct this square map. Further, this square, inside -- we combine it with the spatial coding map, so we can get a lot of different spatial context. >>: So I still don't understand why you don't [inaudible] to your spatial coding. Your spatial coding has already divided the whole image into K orientations. So if I have one set with my original ->> Qi Tian: Okay. >>: When you present this K division of the spatial coding. >> Qi Tian: K? >>: The K -- quantization, right? >> Qi Tian: Okay. >>: You can already deal with the rotation. So currently you are computing one spatial map, coding map, with the same orientation. So you can compute with different orientations. [inaudible]. >> Qi Tian: Oh, okay. So let -- we have some results. I can show you the results. You can rotate by different angles, okay. But that actually is a [inaudible], roughly. Because ->>: You don't need the [inaudible] images. And the map is already ->> Qi Tian: Yeah. That's actually -- that's ->>: Changes order of the [inaudible]. >> Qi Tian: Actually I had a slide. That's all what [inaudible] do here in our extension. So we had [inaudible]. So you do not need to rotate images, okay, you rotate this -- so you rotate the positions of the features. So you just need to construct the feature map for this one. And this is a simple way -- when this is rotated 90 degrees, this is the code before and after the rotation. 
And the next is -- now this rotates 45 degrees. And this is the code before and after. Basically two becomes three, three becomes four, four becomes five. That is the rotation. >>: That [inaudible]. >> Qi Tian: Okay. >>: You have already -- okay. If you go back to your spatial coding image. >> Qi Tian: Before? This one? >>: Right, right. Okay. >> Qi Tian: Let me just [inaudible]. >>: So you have all of the orientations [inaudible] here. Okay. So you have this map, you have this map, et cetera. So now you can compare. You compare this K equal to 0 to the K equal to 1, and you can already deal with orientation within 15 degrees. So currently in your comparing [inaudible] geometrically [inaudible]. >> Qi Tian: This one? No, this one. This one? >>: I guess [inaudible]. >>: You don't need to [inaudible]. >>: The original image can be rotated K times. The image can be rotated K times. >>: [inaudible] already done that. >>: You have this K map. And supposedly you can map with destination the retrieval in which each time [inaudible]. >>: [inaudible]. >> Qi Tian: Okay. So if you rotate -- there are two rotations here. One is -- the other rotation. One is, each quadrant is divided into R parts, and we're going to construct a map for each part. That's one rotation. The other rotation is the image itself being rotated. And how do you construct the spatial coding map from the rotated image? >>: Okay. So my [inaudible] here you compare K, right? K equals [inaudible]. >> Qi Tian: K is -- your question is [inaudible]. K is from 1 to 3. R is 3. So it's just 4 -- just 0, 2, 3 -- just, 0, 3 is ->>: So it's different from the orientation K? >> Qi Tian: Yes, it's different. >>: [inaudible]. >> Qi Tian: Right. That's ->>: That's the reason [inaudible]. >> Qi Tian: Right. This is a -- in this case ->>: This is what I meant. >>: So you can compare K for 0 to K for the 1s and you can deal with -- deal with the rotation to 30 degrees. >> Qi Tian: Yeah. I sort of [inaudible]. >>: That is exactly what we did [inaudible]. >>: But you don't need to do it again and again. [inaudible]. >> Qi Tian: Yeah. [brief talking over]. >>: Okay. >> Qi Tian: Okay. So that's for query -- for the rotation invariance. In this particular case [inaudible] rotated to a certain angle. So we constructed -- so the last one is -- this is -- okay. I didn't talk about all of them, but say SC, this is spatial coding; SC with rotation is when you rotate [inaudible] by different angles, okay. So that's the performance: when you rotate more angles you get better performance. And this is the geometric coding, okay, using the square and the -- I haven't talked about it. So we found -- so later there is this so-called geometric coding. Its time cost is smaller, okay -- similar to when you rotate eight times, but its performance is close to [inaudible]. Okay. So basically it's a trade-off with time. If we rotate the spatial coding, you can rotate by more angles to get better performance, but again, it costs time. >>: [inaudible]. Okay. Thank you very much. >> Qi Tian: Okay. Thank you. [applause]