>>: Good morning, everybody. So we're going to start this last day of the Virtual Earth Summit, but today it is a combined event, the Virtual Earth Summit and the Location Summit, which is an internal event driven and led by John Cram (phonetic), who is sitting here -- here is John. We will do an introduction to the Location Summit part after the break. So if you are internal to Microsoft and are wondering, am I in the right room, you are. So this is a combination of the Virtual Earth and Location Summit today. I have a few announcements for the people who are external. If you need vouchers to go back to the airport, either you should already have them, or just ask Jennifer, who is at the end of the room there, and she will have one for you. This evening we have a reception at five p.m. The reception is in the atrium, just next to here. Concerning the demos and poster session, it happens at lunchtime. They will be in two rooms, 1915 and 1927, which are next to here. If you are demoing or have a poster, you can go talk to Jennifer at the break or at lunchtime to start setting up your demo or your poster. She has already assigned a room for you. Okay. And the abstracts, actually some of the abstracts for the demos and the posters, are in the proceedings that you should have in the Virtual Earth and Location Summit bag. So now let's start with the first session this morning. And it's a pleasure to have here Steve Seitz to talk about navigating the world's photographs. Thank you. >> Steven Seitz: I'd like to start with the following question. So suppose that you had access to every photo that had ever been taken, and these are photos taken by anyone on earth at any time, anyplace. Suppose you had access to every photo that existed. What could you do with this collection of photos? So you know, a few years ago this question might have seemed completely academic and unrealistic, but now, you know, it's becoming very, very pertinent. So if you look at the growth of photographs on the Internet, this is a graph that Noah Snavely put together. You know, a few years ago there were only a few million images -- this is just looking at Flickr -- there were a few million images on Flickr. It just passed the two billion mark late last year, and you can see where this trajectory appears to be going. So there are lots of photos online. So what do these photos look like? So on Live Search if you look at Notre Dame, Paris, you get back about 60,000 images of the cathedral from every imaginable viewpoint, different illumination conditions. You get inside, outside views, historical photos and so forth. So it's kind of the dream data set for both analysis and visualization. There are lots of photos of celebrities, a quarter of a million images apparently of Britney Spears. There are evidently over three million images of Albert Einstein. And it may seem oddly comforting that there are more photos of Albert Einstein than Britney Spears, at least to this audience. But of course these aren't really all distinct photos, right, so if you look at these two, they appear to be different scans of the same image. I mean, who knows exactly what this is. And then, you know, these are nearly identical. And if you look more closely at photos like that, there's a Website where you can make him say whatever you want just by typing in a phrase. So some of these photos are questionable. And in fact, you know, these searches pale in comparison.
If you do a search on nothing, there are 47 million images of nothing. Okay. But some of these photos are quite interesting. So for instance if you do a search for Rome on, for example, Flickr, you get back a million images of the city of Rome, okay, so you can think of this as a huge historical record of the city. So let's look at what these photos look like. So most of them are just vacation photos. And if you don't know these people, this photo may not be that interesting to you either. But if you look at more of these photos, suddenly they start to get a bit more interesting. So here are 50 images from Flickr of Rome, and so there's the Trevi Fountain, there's the Colosseum, there's the inside of the Pantheon, and there's the Vatican. So we're starting to see the main sites of the city pop out of this. But of course with just 50 images you can't cover the whole city, so you might download say 200. And in this size of data set you will have an image of the Pope, for instance. Does anyone see where the Pope is? That's the Pope. Okay. So but, you know, not everything is here. I heard a yes in the background. So for instance the Sistine Chapel is missing. At least I couldn't find it in this data set. But if you download 800 images there's an image of the Sistine Chapel and actually there are a couple of others. I think I just pressed the wrong key. So there are a couple of others as well. So we're starting to get, you know, multiple views of most things. And if you really want to be safe that you have good coverage, you might download 20,000, and the full data set is this. Okay. So these are the images of Rome on Flickr. And so this is really what I'm talking about, this is a great historical record of the city, but the only problem is you can't actually see anything. So how can we better visualize data sets like this and use them for some applications which I'll talk about? So this is a different visualization of the same data set. This is actually the 20,000 image as opposed to the million image data set. And these are similar to what Blaze (phonetic) presented yesterday. These are image graphs. So every red dot is an image, and the blue edges are edges connecting images that have overlapping content, features in common. And so you see these clusters, and these clusters correspond to the different sites of the city. And you probably can't tell just by looking at this image what these clusters are, but we've been studying them for a long time so we can actually recognize them. So you know there's St. Peter's, the Trevi Fountain, the Colosseum and lots of smaller things as well. So the structure of the city is kind of falling out into these clusters. So what you can do is -- one of the things we've done is developed some algorithms for processing data of this type to do some interesting things. Let me tell you a little about those. So one is scene summarization. So basically the idea is, given this huge collection, can I compute some summary of the city with a smaller number of images organized in some way? So let me show you that. So here is a summary of the city of Rome. And this again was computed automatically. So what you're seeing here are representative images of the main sites of the city. So here you see the Colosseum, the inside and the outside, the Vatican, the Trevi Fountain, St. Peter's, the Pantheon and various other things like this. And you can scroll down this list, and this is computed basically using clustering methods on the graph I just showed you before.
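To make the image-graph and summarization ideas concrete, here is a minimal sketch in Python. It assumes the expensive pairwise feature matching has already been done and is provided as a dictionary of match counts between image pairs; the graph construction, the connected-component clustering, and the choice of the best-connected image as each cluster's representative are simplified stand-ins for illustration, not the actual algorithm used in this work.

```python
from collections import defaultdict

def build_match_graph(match_counts, min_matches=20):
    """Build an image graph: nodes are images, edges connect images that
    share at least `min_matches` feature correspondences."""
    graph = defaultdict(set)
    for (img_a, img_b), n_matches in match_counts.items():
        if n_matches >= min_matches:
            graph[img_a].add(img_b)
            graph[img_b].add(img_a)
    return graph

def connected_clusters(graph):
    """Group images into clusters (connected components); in a city-scale
    collection each cluster tends to correspond to one site."""
    seen, clusters = set(), []
    for start in list(graph):
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            cluster.add(node)
            stack.extend(graph[node] - seen)
        clusters.append(cluster)
    return clusters

def summarize(graph, clusters, top_k=10):
    """Pick one representative image per cluster: here simply the image
    with the most connections, a crude proxy for a 'canonical' view."""
    ranked = sorted(clusters, key=len, reverse=True)[:top_k]
    return [max(c, key=lambda img: len(graph[img])) for c in ranked]

# Hypothetical usage with toy data:
matches = {("colosseum_1.jpg", "colosseum_2.jpg"): 140,
           ("colosseum_2.jpg", "colosseum_3.jpg"): 85,
           ("trevi_1.jpg", "trevi_2.jpg"): 210}
g = build_match_graph(matches)
print(summarize(g, connected_clusters(g)))
```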
And you see different sites, the Hall of Maps, the Sistine Chapel and so forth. Now, this is just giving you one representative image of each site, but if you click on one of these sites, like the outside of the Colosseum, it will show you representative images that we computed, in some sense canonical views of that site from different viewpoints and so forth. So here is the set of canonical views of the Colosseum, and these are basically the different popular views of the site. And if you click on any one of these -- so for instance this is a particular viewpoint at night -- if I click on that one it will show me all the views from this Flickr collection that correspond. So these are all images of about the same viewpoint taken at night. And these are computed basically just by clustering in the space of features. Okay. So we're also computing these tags, and these are consensus tags from the tags on Flickr, and these are a bit more noisy, but overall the top one or two tags tend to give you reasonable results. Okay. So -- okay. So that's one thing we can do with this data set. And here's some follow-on work that we've been doing more recently. This is work with Ian Simon, Noah and myself. Instead of tagging whole images, by analyzing the tags and the fields of view of the images you can identify regions in the scene which should be assigned tags. So here we're taking tags that have been assigned to whole images and we're figuring out, based on co-occurrence of tags and features, what regions of those images should be assigned those tags, so in this case for Prague, this church and the astronomical clock. We can also figure out how popular certain things are, basically what in the scene people take photographs of. And the saturation, how bright these colors are, corresponds to how many images were taken of these regions. And so it's doing both a segmentation of the scene into regions and also displaying how popular these things are, how much they're photographed. So for example the statue is photographed a lot, whereas the scene in the background is less frequently photographed. And similarly down here, this particular -- sorry, this fountain is not photographed as often as this region here, and this one even a little bit less than this region here and so forth. So you can compute how popular things are from this distribution of photos. So you've already heard about Photo tourism and Photosynth, so I'm not going to -- fortunately I don't have to spend that much time on them in this talk. But I would like to talk about where we're going in the space of 3D reconstruction and browsing of large photo collections. And I know I'm going fairly quickly here, but if people have questions at the end, I'll be happy to cover more things in detail. So these are the challenges that we're facing right now in terms of research related to Photo tourism and Photosynth. So one issue is scale. When we first started this research, for instance for the Notre Dame set, we were literally downloading all of the images of Notre Dame on Flickr and reconstructing those sites from those images. But now there are basically one or two orders of magnitude more photos on Flickr, so the old algorithms, you know, they would run for months or even a year to reconstruct these full data sets, so we need more scalable algorithms. We'd also like to reconstruct, instead of a point cloud, a better model of shape, so a dense mesh for instance.
And then third, Blaze yesterday showed you some of the newer interfaces in Photosynth, and coming up with better interfaces and better ways to navigate especially the larger scenes is an open research problem. So let me just touch on some of these different things. Okay. So large scale reconstruction: initially in the Photo tourism project we were working on the order of a few hundred images. And, you know, now there are 40,000 images of the Colosseum, so we simply cannot run at that scale. And ultimately we'd like to be able to operate on, you know, hundreds of thousands or millions of images. So how can we scale up to this level? One nice observation about these Internet collections is that they are very non-uniformly sampled. So, for instance, everyone who visits Rome pretty much takes a photograph of the Colosseum from just about the same position, so there's a huge cluster of photographs from certain positions and then other positions are more sparsely sampled. And so you should be able to take advantage of this, if your goal is to reconstruct the scene or reconstruct camera positions, by focusing on only a subset of the views in cases where they're highly oversampled. So let me show you an example. This is a fun data set. This is the Pantheon, overhead view on the top right, reconstruction down here. And what's neat about this is that there are photographs both from the inside and the outside of the Pantheon. But there are also photographs that go through the door. And the door is large enough that from the outside you can actually see to part of the inside, and that's enough for the algorithm to link the two reconstructions. So we just throw all of the images, the outside images and the inside images, into the mix, and the algorithm reconstructs both of them together. So this is an example where that works. This is pretty cool. But what you'll notice is that there are clusters, you know, even in this data set: there's a bunch of images near the fountain, there's a bunch of images over here, you know, there's a bunch of images right in front of the front door, and the outside, inside and so forth. So you should be able to choose a smaller number of images for the purposes of doing reconstruction. Let me show you a different view. This is the match graph, like the graph of Rome that I showed you before, but these would be red in that version of the graph. So there's a node for every image, there's one of these black edges for every connection between them, and this shows -- I forget which is the outside and which is the inside -- the outside and the inside of the Pantheon clustered into this graph. And the full graph is very dense, there are lots of nodes, there are 500 or so images. But it turns out we can find a subgraph which has some very nice properties. In particular the subgraph has the same shape as the full graph -- by the subgraph I mean these black nodes. And every node of the original graph is connected to one of these nodes in the subgraph. And it turns out that this subgraph, these black nodes, also has very good properties for purposes of scene reconstruction. So we can prove certain things about how accurate the scene would be if we reconstructed just the black nodes and then used a much less expensive pose estimation process to add the gray nodes. So we can optimize for a smaller graph that will reduce the complexity considerably.
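The idea of focusing the expensive reconstruction on a small covering subgraph can be illustrated with a greedy dominating-set heuristic. This sketch only captures the coverage property (every image is either selected or matched to a selected one); the actual skeletal-graph construction also reasons about reconstruction accuracy, which is not modeled here.

```python
def greedy_skeleton(match_graph):
    """Greedily pick 'black' images until every image is either picked or
    matched to a picked one (a dominating set of the match graph).
    `match_graph` maps each image to the set of images it matches."""
    uncovered = set(match_graph)
    skeleton = []
    while uncovered:
        # Pick the image that covers the most still-uncovered images.
        best = max(match_graph,
                   key=lambda n: len(({n} | match_graph[n]) & uncovered))
        skeleton.append(best)
        uncovered -= {best} | match_graph[best]
    return skeleton

# Downstream idea: run the full, expensive reconstruction only on
# `skeleton`, then register the remaining (gray) images afterwards with a
# cheaper pose estimation step.
```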
And I'm not really going to say much about the properties of this approach, but one nice thing about it is that it actually reconstructs all of the cameras in the full reconstruction, not just the black nodes, but focuses most of the operations, the effort, on these black nodes. And it gives you essentially the same quality much more quickly. So here's an example of runtime improvement on a bunch of different data sets. For a larger data set -- this one had about 3,000 images -- the old algorithm would take about two months to run on a single machine. And the new algorithm takes just over a day, so -- which, you know, it would be nice to speed it up further, of course, but we're starting to see -- you know, we can arguably handle thousands of images, probably 10,000 images or a few more. So we're starting to see scalable algorithms come out of this. Okay. So that's on the 3D reconstruction side, on the structure from motion side, how to get these point clouds. The next step is we'd like to get essentially laser scans without using a laser scanner. So here we have images from Flickr again, and just from people's vacation photos we'd like to reconstruct this geometry. So this is a shaded mesh. And this is again not sparse anymore, this is a very dense mesh which has been reconstructed, and it's pretty accurate. So just zooming in on this portal region you can start to see, you know, the figures, these different statues and so forth, so we're starting to get centimeter resolution on these things. So this is computed using an algorithm which I'll tell you just a little bit about in a second. But let me describe some of the challenges. Part of the challenge here is that the images have a ton of variation. Images online have, you know, different times of day, different weather conditions, there's lots of occlusion and so forth. We're not taking these photos ourselves, so they're hard to control. Secondly, the resolution is very different between these photos. So, you know, there are both zoomed out and zoomed in images. So this is both a challenge and a benefit, because since we have such close-ups in the collection we should, in principle at least, be able to get extremely accurate high resolution results if we know how to process them. So let me say a bit about how we cope with these challenges, and of course there are lots of photos. So given this large set of photos under lots of different appearance conditions, how can we reconstruct accurate 3D models? So there's a cool property which we're calling the law of large image collections. And basically it's as follows: suppose that we choose an image from this collection and then we want to match it with other images, which we need to do if we're going to compute shape. And, you know, you could try to match it with an image where the weather or lighting conditions are very different, but that would be hard, so instead what we're going to do is choose a subset of photos which have nearly the same conditions, which appear similar. And the point is that if you have enough images in your collection, for any image you choose it's highly likely that you'll be able to find other photographs with very similar conditions. So if your collection is large enough, you can exploit this to dramatically simplify the matching problem: we don't have to match between images of different times of day, we can just find images that have compatible viewing conditions. So very briefly, here's how the process works; a small sketch of this view selection step follows below, and then the stereo matching itself.
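Here is a small sketch of that view selection step, assuming each photo is summarized by a crude global color histogram and the most similar candidates are taken as compatible views. The real system selects views using shared features and viewing directions rather than a histogram; this is only meant to convey the "match against similar-looking photos" idea.

```python
import numpy as np

def appearance_descriptor(image, bins=8):
    """A crude global appearance signature: a normalized RGB histogram of an
    HxWx3 uint8 image (a stand-in for a proper appearance measure)."""
    hist, _ = np.histogramdd(image.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / (hist.sum() + 1e-9)

def select_compatible_views(reference, candidates, k=10):
    """Return the k candidate images whose appearance is closest to the
    reference image, i.e. likely taken under similar conditions."""
    ref = appearance_descriptor(reference)
    scored = sorted(candidates.items(),
                    key=lambda kv: np.linalg.norm(appearance_descriptor(kv[1]) - ref))
    return [name for name, _ in scored[:k]]

# Hypothetical usage: `photos` maps filenames to HxWx3 uint8 arrays.
# (If the reference itself is in `photos`, it will rank first; drop it.)
# neighbors = select_compatible_views(photos["ref.jpg"], photos, k=10)
```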
This is just a simple tutorial on multi-view stereo for those of you who aren't familiar with it. Given a point in the right image, you consider a window around it, and that determines a set of what are known as epipolar lines in these other images, the lines that that window or point may project to, and then you simply search, kind of in parallel, for the best correspondence along these different epipolar lines. And in this case, this is the best answer. These two are occlusions, because there's a foreground person in front of this window here and there's the tree in here, so we can detect these as outliers and we're left with these two. And then once we have these two, based on the 2D correspondences and the known camera viewpoints which we computed earlier, we can triangulate the 3D point and we're done, so we get a depth for basically every point in this image. And so we're recovering a depth map for every image, and then we can merge the depth maps and we get these 3D models. Okay. So let me say a bit about one of the things we are doing a lot of work on, which is evaluation: how can we evaluate the accuracy of these models that we're getting? So there are a couple of different things that we're doing. One is reconstructing dense shape, and the other is reconstructing sparse shape, or structure from motion, which is the kind of technique that is used in Photo tourism and Photosynth. So we'd like to evaluate both of these. I'll say a couple of words very briefly about both. So here's a data set which was downloaded from Flickr for this Duomo in Pisa, and here's our reconstruction. Now, what's nice about this data set is that there is a laser scan. A group in Italy put it together -- Roberto Scopigno captured this laser scan -- and they graciously allowed us to use it. So we can compare this to the laser scan, and, treating the laser scan as ground truth, we can compute the accuracy, how well they align. And the laser scan is certainly more accurate, but it's not a lot more accurate: 90 percent of the points in our model are within a quarter percent of the ground truth. So for a building which is around 50 meters or so, this is a few centimeters. So it's, you know, not exactly as accurate as the laser scan, but it's probably good enough for a lot of applications. We've also taken a data set of the same scene ourselves with a calibrated camera, and we're trying to use that to evaluate the structure from motion algorithms. I'm not going to go into detail here, also because we have just started this effort and I can only show you some preliminary results, but here are the structure from motion reconstructions of this cathedral next to the laser scan, the same views as the laser scan on the right. And here's a cross-section showing both together. So the red points are -- well, you probably can't even see this, but there are tiny little red points near these green lines. So the green lines are the laser scan, the red points are the reconstruction. I'll zoom in on a little region. Maybe you'll be able to see this. I don't know whether you can see. But there are a bunch of little red points, and it turns out that, again, from our preliminary evaluation the accuracy looks like it is on the order of one to ten centimeters. Ninety percent of the points are within one to ten centimeters and most are closer to the one centimeter mark. So these methods are working pretty well.
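The evaluation just described, comparing a reconstruction against a laser scan treated as ground truth, boils down to a nearest-neighbor distance measurement. The sketch below is a generic version of that measurement, not the exact protocol used in the study, and it assumes the two point sets have already been aligned and are expressed in meters.

```python
import numpy as np
from scipy.spatial import cKDTree

def accuracy_report(reconstruction, ground_truth, thresholds=(0.01, 0.10)):
    """For each reconstructed 3D point (Nx3 array), measure the distance to
    the nearest ground-truth (laser scan) point, then report what fraction
    of points fall within each threshold (here 1 cm and 10 cm)."""
    tree = cKDTree(ground_truth)
    dists, _ = tree.query(reconstruction)
    return {t: float(np.mean(dists <= t)) for t in thresholds}

# Hypothetical usage with Nx3 arrays of points:
# report = accuracy_report(sfm_points, laser_points)
# e.g. {0.01: 0.62, 0.10: 0.93} would mean 93% of points within 10 cm.
```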
All right. Okay. So for the last part I wanted to say a little bit about navigation controls. So once we have these 3D models, and, you know, point clouds and so forth, camera positions, how can we find better ways of navigating through the scene? So what would be, for instance, future controls to use in Photosynth? Blaze already talked a little bit about this notion of discovering objects and rotations and panoramas, and this is another view on that. Can anyone tell what this is, what site this is? Right. Statue of Liberty. It's an overhead view of the Statue of Liberty and we're looking at the distribution of cameras. So first you might think this is an error, you know, people standing out in the water, but it turns out that a lot of people visit the Statue of Liberty from boats. So tour boats go by and you get a great view from out in the water. At least that's our explanation. So basically there's a bunch of views out here, and then of course a lot of people are standing right on the island itself looking at the scene from the ground. And so you can discover these as different orbits of the scene and expose these controls to the viewer. And again, this is similar to what Blaze presented, so maybe I'll just show you a bit of this but not go into detail. So here I'm showing, you know, one orbit control for the Statue of Liberty, and so again just from photos on the Web we're sort of creating an interactive tour of the site, and here's the other orbit from right below the base of the island, and you see tourists pop up in these images as well. Here's another example -- so the key here is to try to understand something about the distribution of photos that people actually take, to derive what the right controls are for exploring these scenes. So for the Pantheon, for example, suppose that we start outside from this view, this one in green, and our goal is to go inside to this view in red. And so we'd like to compute a path that goes from one point to the next. So the simplest -- I mean the obvious thing to do would be just to, you know, interpolate the pose parameters of the two, and as you might guess that doesn't result in a very satisfying transition visually, because you end up going right through the wall. And I'll show you what that looks like. So if you just do this linear interpolation, you get a transition that looks like this. Better would be to choose a transformation like this, where we're following -- so this path was computed so that it is close to input views at every point along the path. So we want to compute a path from this point to this one that hugs the input views everywhere. And there are a couple of reasons to do this. One is that you get better quality renderings if you choose new views which are near the original photos, and the second is that by doing this you can mimic more how a human would move through the scene. So the assumption here is that, well, these images were taken by humans, and so if we move along this path, this is approximating how humans move through the scene. So I'll show you a quick video which illustrates how that works. So now when I click on this view of the inside, the camera's actually going to move through the front door and then it's going to turn around gradually and you'll see this part of the scene where the target is. So now if we click on something else -- these are different views of the scene -- we're clicking on this statue, and the camera moves naturally in toward the statue.
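To give a flavor of how a transition that hugs the input views, as just described, can be computed, here is a simplified sketch: treat the recovered camera positions as nodes of a graph, connect cameras that are close together, and run a shortest-path search from the current view to the target view. The real planner also accounts for camera orientation and expected rendering quality; the connection radius and the plain Euclidean edge weights here are assumptions for illustration only.

```python
import heapq
import numpy as np

def camera_path(cameras, start, goal, radius=5.0):
    """Plan a transition that stays near photographs people actually took:
    connect cameras whose positions are closer than `radius` and run
    Dijkstra over that graph. `cameras` maps camera ids to 3D positions."""
    pos = {cid: np.asarray(p, dtype=float) for cid, p in cameras.items()}
    ids = list(pos)
    neighbors = {i: [(float(np.linalg.norm(pos[i] - pos[j])), j)
                     for j in ids
                     if j != i and np.linalg.norm(pos[i] - pos[j]) < radius]
                 for i in ids}
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for w, v in neighbors[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if goal != start and goal not in prev:
        return None  # no chain of nearby photos links the two views
    # The returned sequence of input cameras is what the transition
    # interpolates through, so the virtual camera "hugs" the real views.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]
```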
If we click on another statue, the camera will first move out and then back in, and again this is approximating how people really experience this scene. All right. How am I doing? Five minutes. Okay. I'll show you one more example. Suppose that you've gone to Rome, you took a bunch of photos yourself and you want to share these with your friends. Well, if this is you and you just shared these photos, it may not make sense to the person looking at them, because they don't really understand the scene and the context of these photos. But what we can do is use everyone else's photos as sort of the backdrop for your photos. So now when I move from your first photo to your second photo, I can use all the images on the Internet to compute the transition between the two. It's playing a little choppy for some reason. And so again, we're using everyone else's photos to facilitate displaying your photos. So the end points are always photos that you took. All right. So just to conclude, you know, our goal at this point, for the short term, is to reconstruct this Rome data set. So we have a million images of Rome, we've downloaded them, we're currently matching them to each other, and then the goal is to run these algorithms and reconstruct as much of the city as we can. And what will come out? We don't know. Probably 3D models of the tourist sites, both the exteriors and the interiors, hopefully the popular statues and structures and artifacts and paintings and so forth. So we're excited to see what comes out of this data set. And also, looking a little bit further forward, there's lots of interesting data other than Flickr obviously, so there are satellite images, aerial imagery, street-side imagery of many sites. So it would be great to combine all these different sources of imagery to reconstruct as much as we can about all these different cities. So that said, I just wanted to mention that this is collaborative work with a bunch of people, both at Washington and Microsoft Research, and with Michael Guslov (phonetic), who was supposed to speak here, and who was in our group and is now at Darmstadt. So thank you. (Applause) >>: Time for questions. >>: So, Steve, have you thought about -- since you now have a notion of what the objects and the things are -- you could remove these people. Have you thought about that? It should be easy, right? >> Steven Seitz: Yeah. Yeah. We started playing around with that idea of removing people. I mean, of course that kind of defeats the purpose for most of these applications. Really, people are most interested in seeing their own photos in context. But certainly for recovering good texture maps and things like that it's very useful, and we started to look into that. And our first experiments are very promising. You know, simply by using the 3D models to get correspondence and, you know, simple median and other statistics you can do it, yeah. >>: Other questions? >>: Almost all the examples we saw yesterday and today are from built environments with corners and features, and I wonder if there's much work yet on natural scene reconstruction and whether it's -- how different it is, easier, more difficult. >> Steven Seitz: Yeah, that's a good question. So we've done a little bit of experimentation with some natural scenes. I'll show you one example from the Photo tourism work. See if it works. So this is -- let me pause it, go back.
This is Yosemite, again from Flickr images of Half Dome, Yosemite, and you're seeing a point cloud reconstruction of Half Dome. Now I circled Half Dome and you see photos that other people took of Half Dome, and here you can basically lock the camera, so stabilize the face of Half Dome, and then, when you view everyone's photos as sort of a slide show, it registers the face. So this is an example of one scene which we've experimented with that seems to work. But basically you need lots of features. I mean, Half Dome is tricky because in different seasons it looks completely different. So registering winter photos is incredibly difficult. But for some scenes we've had -- it's a good question and I'm not sure of the answer in general. >>: Okay, other questions? All right. Thank you, Steve. >> Steven Seitz: All right. (Applause) >>: So we mentioned that one of the themes of this Virtual Earth Summit was looking at breadth of topics. The next talk is going to illustrate that with a talk that's going to focus on the quality of the output of a system. The talk will be given by Zhi Quan Zhou on testing non-testable information retrieval systems with geographic components on the Web. >> Zhi Quan Zhou: Good morning. It's on. Testing. Testing. Okay. It's on. Okay. Good morning everyone. Our team is from Australia, Hong Kong and Beijing. And here is a little bit of information about myself. I'm George. I'm from the University of Wollongong, where I lecture in software engineering. Wollongong is a 40-minute drive from Sydney. And I'm also advising the United States Air Force Office of Scientific Research on software reliability. Okay. So Wollongong. Wollongong is an Aboriginal word which means the song of the sea. So we have many beautiful beaches. There is another explanation, but most people believe it means the song of the sea. All right. Here is my presentation outline. I will first talk about the fundamental problem that researchers are trying, not to solve, but to alleviate. And then we'll talk about assessing two important quality factors, that is, assessing the quality of counts and, most importantly, the quality of the ranking of search (inaudible). We'll also make some recommendations to Microsoft's Live Search. And at the end we will talk about our ongoing project. Okay. First, the fundamental problem is called the oracle problem. Oracle in English means the word from God. Here, in software testing, an oracle means something that tells you whether the result is right or wrong. How many of you have seen this movie before? No? None -- no one has seen it? Okay. It's last year's movie and it -- really, no one? (Laughter) Okay. It's about someone who claims that he is from 14,000 years ago, so he has been living on the earth for 14,000 years. But no one can tell whether he's telling the truth or lying. So the story is just a group of people sitting there and chatting from the beginning to the end in one room. That's the movie. But it's interesting. (Laughter) So there is an oracle problem, because no one tells you whether something is right or wrong, right? And now here we have a map of Wollongong. How do you know I'm really from Wollongong? Right? So you cannot verify it. That's the same problem in software testing. How do you know I'm from Wollongong? Okay. We still have some ways. All (inaudible) is that we continue to ask you questions.
Although I cannot tell whether your individual answer to an individual question is correct or wrong, we can still check whether your answers are consistent among themselves, between the multiple answers. So that is our approach: to check the logical consistency of multiple outputs. More formally, in software testing an oracle refers to a way, a method, a mechanism by which people can know whether the outcomes of the test executions are correct. The oracle problem means that either there is no oracle or it is very, very expensive. Just a couple of examples. First is search. Suppose we are searching for a hotel, okay, on certain days, satisfying certain criteria. And the search engine returns "no hotels found." The question is, how do you know there is really no hotel? Right? You have no access to the database. Is there really no hotel in the database, or is it because of some software fault in the search service? We don't know. Okay. Ranking quality: how can we assess the ranking quality of search results? Another example is the search problem of searching for the shortest path from A to B. Okay. If the problem is simple, then we can verify the result. But suppose we have a very big, comprehensive graph. Given that our input is A and B and the whole graph, and the system returns that this is the shortest path, how can people verify whether this is really the shortest path? This is a combinatorial problem, also very difficult to verify. There is an oracle -- we can brute force all the possible combinations -- but practically it is too expensive. So our method is to use what we call metamorphic relations, which are relations among multiple inputs and outputs, not of a single execution but of multiple executions. We named this kind of relation a metamorphic relation because it's changing: the input is always changing. For example, consider again the shortest path problem. A metamorphic relation can be this: in the first execution you enter the whole graph and the starting point A and the destination B into the program and you get an output, right? And then, as a follow-up, you transform this graph into another graph, but this graph is actually a permutation of the original graph. And this A prime corresponds to the original A, and B prime corresponds to the original B. And then you enter the graph into the system again, with A prime as the starting point and B prime as the destination, and you expect that the program should still return the same. Certainly the paths may be different, but the length should be the same, right? So this is a kind of metamorphic relation. And for a given problem we can identify many metamorphic relations. For example, another metamorphic relation is that we can exchange the direction. Originally here is A and here is B. But now we can put the starting point here and the destination here, so the system will search in the reverse direction. But we also expect that the length of the paths should be the same. All right. So we have just been talking about the general problem in software testing, the oracle problem. Now, let's look at Web search engines. What we want is to improve them. But how can you improve? Only if you can assess the quality can you improve the quality, right? So we cannot control unless we can evaluate. And also, big commercial search engines often upgrade their programs, often revise the algorithms, and sometimes change the representation of the data.
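Before turning to search engines in detail, here is a minimal sketch of the two shortest-path metamorphic relations just described, written against a hypothetical shortest_path_length(graph, a, b) function under test; the graph is a plain weighted adjacency dictionary, and the relabeling is a random permutation of the vertex names. The point is that no oracle for the true shortest distance is needed, only consistency between the runs.

```python
import random

def relabel(graph, mapping):
    """Apply a vertex permutation to a weighted adjacency-dict graph."""
    return {mapping[u]: {mapping[v]: w for v, w in nbrs.items()}
            for u, nbrs in graph.items()}

def check_metamorphic_relations(shortest_path_length, graph, a, b):
    """MR1: a relabeled (isomorphic) graph must give the same path length.
    MR2: swapping source and destination must give the same length
    (this second relation assumes an undirected graph)."""
    original = shortest_path_length(graph, a, b)

    nodes = list(graph)
    mapping = dict(zip(nodes, random.sample(nodes, len(nodes))))
    permuted = shortest_path_length(relabel(graph, mapping),
                                    mapping[a], mapping[b])

    reversed_length = shortest_path_length(graph, b, a)

    # We never needed the true shortest distance: we only check that the
    # three answers are consistent with one another.
    assert original == permuted == reversed_length, \
        (original, permuted, reversed_length)
```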
But every time we make a change we have to reassess our system; we have to assess whether the algorithm is better, whether the new data representation is better. So every time we make some change we need to test the system. So it's a very expensive process. Conventionally we have three approaches to evaluating information retrieval systems. Precision and recall are the most traditional approach, but precision is very expensive for Web search engines and recall is impossible to apply to Web search engines. Another approach is to use human judgment to evaluate the quality of ranking; that is the basic approach people use, but human judgment has two problems. In addition to its cost, different people will have different opinions, so different people give inconsistent judgments. That is also a big problem. Okay. Let's quickly review what precision is. Suppose this is the whole database, and then, given the query, this one represents the set of all relevant pages. So we name this set A, and we have another set B which is the returned result. So there is an intersection between A and B, which is C. That is, all the useful pages are in this intersection, right? So what is precision? Precision is the size of C divided by the size of B. And recall is the size of C divided by the size of A. But because of the huge amount of data on the Internet, it's impossible to evaluate recall, because we have no knowledge about the actual set A, the actual set of all the pages satisfying the query. We don't know it, so we cannot evaluate recall. How about precision? Okay. Theoretically speaking we can still go through every returned page and count whether they are relevant, but practically, even if you can do it, it's very expensive. Okay. Then how about ranking? Ranking quality is more difficult; it's rather subjective. Okay. Now here is our method. Our method is to find relations among multiple responses. Let's look at this example. It's an Australian yellow pages. We enter a query, university, and the location is New South Wales. New South Wales is a state of Australia. And then we search. 265 results are returned for university in the state of New South Wales. Okay. Next we do the search again using the same key word, university. Okay. But this time we change the location to all states. And we see 789 results. Is this reasonable or not reasonable? Not reasonable? No? Is it reasonable? I don't know how many states there are in Australia. Okay. It's reasonable, right? Because, anyway, as long as this number is greater than or equal to this number we cannot say it's wrong, right? Because this is all states, but here is only a particular state. So we see nothing wrong here. Okay. So this is a successful example. But in our testing experiments we also found this kind of situation. Okay. Shopping is our query. Location, New South Wales. How many? 3,827 pages. Now we change it: shopping is still the same, but we change it to all states, and the system returns 651. So now my question is, is this reasonable? No. Right? And we entered the query over a long time, I think for at least several weeks, right, several weeks or several months for this one. Okay. And I should also say that only recently did this error disappear; it remained there for a long time in the Australian yellow pages. I don't know whether they're using American yellow pages software, but they look similar. Okay.
So our original plan, in our project proposal, was to use this -- to develop this kind of method to test search services with geographic components based on Virtual Earth. But after a few weeks we found some problems. That is, when accessing the yellow pages from the programming interface in Australia it is too slow, because every time we call a function we have to open a map, and it takes a long time. And also, when the program repeats the call automatically, it's not stable, because if two calls are too close together then sometimes we don't receive any return value. So the system is not stable. So in the end we decided to change: instead of Virtual Earth we changed to Microsoft Live Search, because that is much quicker, because that's a conventional text based search engine. But the idea, the fundamental methodology, is equally applicable to Virtual Earth. It is just because of the speed and the stability issues when accessing from Australia that we changed to Live Search. Okay. So when we're talking about quality, what is quality? There are many qualities for search services: the response time, the size of the database, and so on. Here we are most concerned about ranking. And also related to ranking are the counts, the number of pages returned. For example here, this is a Google screen shot; we are searching for two key words and 59 pages are returned. All right. So first let's look at counts, and after discussing counts we will look at the ranking quality, which is more interesting. Counts. Why counts? Why are we interested in counts? Okay. First of all, count is a kind of ranking, so count quality reflects, to a certain degree, ranking quality. And secondly, some users use counts to estimate the probability or frequency of certain phenomena. For example, in natural language processing people use Web search engines to find the frequency of certain combinations of letters and phrases. And I have read papers in ACM Transactions in different areas that use these search engine counts. And we can also find out whether the search engines cheat by exaggerating the size of their database. Okay. On its own, we don't know whether 59 is the correct count for these two key words. But with our method we can identify a metamorphic relation; that is, if A is a subset of B, then the size of A should be no larger than the size of B. It is straightforward. So we conducted a follow-up search. Now we get 169 pages. My question is, is this reasonable? You see, the first two key words are identical, but we added one more key word. Yet the number of pages is much larger than the previous one. But before we draw a conclusion we should have some discussion. First, is this caused by the dynamic nature of the Web? So we repeated the first search but got the same result. So it's not because of the dynamics of the database. Okay, second, how about filters -- are there any filters? No filters. (Inaudible) So these are complete results. And thirdly, we know that all search engines return approximate counts. Is this due to the approximate nature? But actually in our experiments we detected even larger differences. For example, this one: for cutlery, 589; for cutlery and another word together, 2,660. The two searches are consecutive, so they are within a couple of seconds. And we also repeated all the experiments. All the results reported here are repeatable. Okay.
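Here is a sketch of the count metamorphic relation used in these experiments: if one query's result set is by construction a subset of another's (an extra required keyword, or a location restricted to one state instead of all states), then its reported count must not be larger. The get_result_count function is a hypothetical wrapper around whichever search API is being tested, not an actual Live Search or yellow pages client, and the repetition mirrors the practice above of ruling out the dynamic nature of the index.

```python
def check_count_relation(get_result_count, broad_query, narrow_query, repeats=2):
    """Metamorphic check: the narrow query's results are a subset of the
    broad query's, so its reported count must never be larger. The pair of
    searches is repeated so a one-off fluctuation of the index is not
    mistaken for a fault; only a repeatable violation is flagged."""
    violations = []
    for _ in range(repeats):
        broad = get_result_count(broad_query)
        narrow = get_result_count(narrow_query)
        if narrow > broad:
            violations.append((broad, narrow))
    repeatable_violation = len(violations) == repeats
    return repeatable_violation, violations

# Hypothetical usage mirroring the talk's examples:
#   AND adds a constraint, so it narrows:
#     check_count_relation(count, "cutlery", "cutlery AND <other word>")
#   OR relaxes the query, so the single word is the narrow side:
#     check_count_relation(count, "glif OR 554W", "glif")
#   Restricting the location narrows:
#     check_count_relation(count, "shopping (all states)", "shopping (NSW)")
```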
And also the counts may change between pages because of estimation. So we clicked next, next, next, until the last page: still 58 versus 165. Not much change. And in the end we collected all the returned pages and found that they really are different. All right. So this is an example of this phenomenon in search engines. Okay. How about Live Search? The first time we enter the key word g-l-i-f, glif. All right. How many? 11,783 pages are found. And then we enter two key words, glif and 554W. We select "any of these terms," so it is an OR relationship. But no results are found. And we repeated it several times. So this obviously shows something wrong in the search engine. So for Google, Live Search, Yahoo, all the search engines, we found similar problems. So we did an experiment using the AND, OR and exclude relationships. And here are the experimental results. So for AND, there are two series of experiments. In the first series we use random words from an English dictionary, random. And in the second series we use random strings: combinations of English upper case letters, lower case letters, and the digits zero to nine, and the length is four. So let's look at the first series, the AND anomaly. Okay. Only Live Search has 0.5 percent. But we'll talk about AND later; we did some other experiments. Okay. So let's look at OR. Okay. Yahoo is the lowest and Live Search is the highest. For exclude, Google is the lowest and Live Search is the highest. Okay. And in the second series of experiments you see that Google's performance becomes worse, but for Live Search here it is 13, here it is two. So Live Search becomes better. Okay. Different search engines have very different performance. Google's performance becomes worse, but generally speaking Live Search's performance becomes better. Okay. And from this observation we can see that the difference between the two series is because of the random strings, so maybe the way Live Search handles semantics and word frequency needs to be studied more. Because we have no access to the internals we cannot dig further. All right. We did one more series of experiments on AND, because AND is the most important. We used multiple rather than two key words, and here are the anomaly rates. Basically they are similar. But now let's talk about the ranking quality. Most users are concerned not with counts but with the quality of ranking. All right. So here is our method to assess the quality of ranking. We do two queries. In the first query we allow all kinds of pages, right, and then we collect, for example, TXT files only -- we collect the top 20 TXT files. In the next query we do the same, but we use the filetype command to limit the file type to TXT only, like entering the key word filetype:TXT. All right. So we do two comparisons. For an ideal search engine the two lists would be identical: not only the contents but also the order of the contents should be identical. That is the relation we identify. And we also define two measures. One is the common line rate. We know that in real life it's not possible for the two lists to always be identical, because of some other practical issues, but the more lines they have in common, the better the performance. So we have defined the common line rate; that is, the number of pages they share in common. We only collect the top 20. All right. And the second measure is the offset.
That is, suppose there are 20 pages but they have only 10 pages in common; then we take the common pages in the original order, but they may be in a different order, right? One is A, B, C, D, the other may be D, C, B, A, and then we count the offset, that is, the difference in their order. So we have the offset measure. And we also apply some weighting, because in different positions the weight is different. The top pages have a higher weight: for, say, the top page and the second page, if their order is changed then there is a big impact, but if the 19th page and the 20th page change position, the impact is lower. So we have a weighting mechanism, but I won't show the formula here. And then I'll show you an example. For example, in Google, searching for this key word, this Website is ranked third in one of the lists but 13th in the other list. And the total number of common URLs is 18. So we see the difference is quite big, three and 13. Okay. And then we did another series of experiments to compare the common line rate. Here is the result. The higher the better. So here Yahoo is the best -- Yahoo is the highest, followed by Live Search, followed by Google. All right. So, for example, for this word there are only five lines in common out of 20. And for Yahoo, for this one, out of the 20 pages there is nothing in common. And for Live Search it is this key word. But in the best case both lists are identical in contents, though not in order. All right. And here it shows how the average common line rate changes as the number of tests increases. So basically we see it's quite stable when it comes to 100. All right. So the top, the higher the better. So here. And then this is for TXT files. We also did the experiment for PDF and HTML. I will tell you immediately why we use TXT, PDF and HTML. So let's look at the performance. This line is Live Search, all right, this line is Live Search. So for TXT its relative position is here. But for PDF it goes down. The higher the better. But for HTML it is higher. Okay. Let's take a closer look here. Okay. This is the performance of Live Search on the three types of files. The best performance is on HTML, the middle one is PDF, and the worst performance is on the TXT files. So our analysis here shows that it seems that Live Search relies heavily on hyperlink structure, because the TXT files and the PDF files have less hyperlink structure, less of the topology of the Web, while HTML has a richer link structure. So maybe the performance on semi-structured files, on this kind of file, needs to be improved. Okay. And after the common line rate we also computed the offset, and the offset performance is similar to the common line rate. That is, the lower the better. So Yahoo is the best, followed by Google, followed by Live Search for TXT files. And we also did the experiment with PDF and HTML files. So if we take a look here, we can see that Live Search -- this one is for TXT files -- is not good, but when it comes to PDF files it becomes better. The best performance is on HTML. So again, when there is a hyperlink structure in the file, it becomes better. PDF files also have some hyperlink structure, so better than TXT but less than HTML. All right. Okay.
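A sketch of the two ranking-consistency measures just described, under stated assumptions: the common line rate is taken as the fraction of the top 20 URLs the two result lists share, and the offset is a position-weighted sum of rank differences over the shared URLs, with more weight near the top of the ranking. The 1/min(rank) weighting below is an assumed scheme for illustration; the exact formula used in the study is not shown in the talk.

```python
def common_line_rate(list_a, list_b, top_k=20):
    """Fraction of the top-k results the two lists have in common."""
    a, b = set(list_a[:top_k]), set(list_b[:top_k])
    return len(a & b) / float(top_k)

def weighted_offset(list_a, list_b, top_k=20):
    """Position-weighted disagreement over the shared URLs: a swap near the
    top of the ranking (e.g. positions 1 and 2) costs more than a swap near
    the bottom (e.g. positions 19 and 20). Lower is better."""
    a, b = list_a[:top_k], list_b[:top_k]
    common = [url for url in a if url in b]
    total = 0.0
    for url in common:
        rank_a, rank_b = a.index(url) + 1, b.index(url) + 1
        weight = 1.0 / min(rank_a, rank_b)   # assumed weighting scheme
        total += weight * abs(rank_a - rank_b)
    return total

# Example from the talk: a URL ranked 3rd in one list and 13th in the other
# contributes (1/3) * |3 - 13| to the offset.
```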
So the recommendation is the same as for the common line rate. Okay. Our future work -- actually we are doing the work currently -- is that we are studying the impact of these inconsistencies in the quality of ranking from the users' perspective: how do they impact the perceived quality of ranking? And we are also going to study, because the inconsistencies appear to be a consequence of an optimization procedure -- you want to increase the response speed -- how to minimize the impact of the optimization procedure from the users' perspective and how to balance the optimization and the perceived quality. And we also want to develop a user-focused optimization procedure, for example with the weighting mechanism in it. And here is a proposal for further collaboration with Microsoft. First, we wish to see whether it's possible for us to get some data set from Microsoft's database, because crawling the Web is very, very time-consuming from our (inaudible), so we want to know whether we can share some data. And, secondly, to maximize the outcome, there is an Australian Research Council grant scheme which encourages collaboration between universities and industry. If industry can put in one dollar, the government will match it with more dollars. So that's all. Yeah. Thank you. (Applause) >>: Time for questions. >>: Yes. If it would have been feasible to look at maps, what kind of inconsistencies would you have looked for? >> Zhi Quan Zhou: Yes, yes, it is very feasible. The reason we didn't look at maps is because of the speed. But we know that we can identify many relations; all these kinds of relations are just some examples, and we can identify many more. But we also need to know which relations are the best, so we're also studying how to identify the good relations. >>: Other questions? So I want to mention that -- oh, you have a question. Okay. Go ahead. >>: Yeah. When you look at counts and compare the different search engines here, the first question is how many samples you are using. I understand you actually use one key word, and when you use two key words, you actually find the number of counts is more. Actually, different search engines have internal triggers, you know, based on internal ranking -- of course that's the, you know, dynamic ranking, static ranking. So for example in our search engine when you put two key words there, my confidence is actually higher; that's why I give you more counts. And if you give me one key word, my confidence is actually lower, because many of the documents have just one key word and I don't have confidence in the, you know, in the result. That's why sometimes I give you a lower count. But different search engines set up different triggering rates, and we don't know, for example, how Google sets that up. So purely basing this on the counts can be biased. >> Zhi Quan Zhou: Yes. This is a threat to the internal validity, because there is -- actually we had a concern about whether there is an internal threshold, so that after you give more information you get more pages. So we plan to test this, because we don't know, and we didn't talk with people in Microsoft, and also we don't know what happens inside Google. So my point is, first we want to know whether they have such an internal mechanism.
And, secondly, if there is such a mechanism, I think from the users' perspective they should know; rather than saying nothing is found, the engine should tell the user how the pages are found. And certainly, actually, count is a minor issue. The most important focus of our research is on the ranking part. >>: Thank you. Thanks again to our speaker. (Applause)