>>: Good morning, everybody. So we're going to start this last day of the Virtual
Earth Summit, but today we have the combined event, the Virtual Earth Summit
and the Location Summit, which is an internal event driven and led by John
Cram (phonetic), who is sitting here. We will do an introduction to the Location
Summit part after the break. So if you are internal to Microsoft and are wondering
whether you are in the right room, you are. This is a combination of the Virtual
Earth and Location Summit today.
I have a few announcements for the people who are external. If you need
vouchers to go back to the airport, either you should already have one, or just ask
Jennifer, who is at the end of the room there, and she will have one for you.
This evening we have a reception at five p.m. in the atrium, just next to here.
Concerning the demos and the poster session, they happen at lunchtime. They
will be in two rooms, rooms 1915 and 1927, which are next to here. If you are
demoing or have a poster, you can go talk to Jennifer at the break or at lunchtime
to start setting up; she has already assigned a room for you. And some of the
abstracts for the demos and posters are in the proceedings that you should have
in the Virtual Earth and Location Summit bag.
So now let's start with the first session this morning. And it's a pleasure to have
here Steve Seitz to talk about navigating the world's photographs. Thank you.
>> Steven Seitz: I'd like to start with the following question. Suppose that you had access to
every photo that had ever been taken, and these are photos taken by anyone on
earth at any time, anyplace. Suppose you had access to every photo that
existed.
What could you do with this collection of photos? A few years ago this question
might have seemed completely academic and unrealistic, but now it's becoming
very, very pertinent.
So if you look at the growth of photographs on the Internet, this is a graph that
Noah Snavely put together. A few years ago there were only a few million
images, and this is just looking at Flickr. It just passed the two billion mark late
last year, and you can see where this trajectory appears to be going.
So there are lots of photos online. What do these photos look like? On Live
Search, if you look at Notre Dame, Paris, you get back about 60,000 images of
the cathedral from every imaginable viewpoint and different illumination
conditions. You get inside and outside views, historical photos and so forth. So
it's kind of the dream data set for both analysis and visualization.
There are lots of photos of celebrities, a quarter of a million images apparently of
Britney Spears.
There are evidently over three million images of Albert Einstein. And it may
seem oddly comforting that there are more photos of Albert Einstein than Britney
Spears, at least to this audience.
But of course these aren't really all distinct photos, right? If you look at these
two, they appear to be different scans of the same image. I mean, who knows
exactly what this is. And these are nearly identical. And if you look more closely
at photos like that, there's a Website where you can make him say whatever you
want just by typing in a phrase.
So some of these photos are questionable. And in fact these searches pale in
comparison: if you do a search on nothing, there are 47 million images of nothing.
Okay. But some of these photos are quite interesting. For instance, if you do a
search for Rome on Flickr, for example, you get back a million images of the city
of Rome, so you can think of this as a huge historical record of the city. So let's
look at what these photos look like.
Most of them are just vacation photos. And if you don't know these people, this
photo may not be that interesting to you either. But if you look at more of these
photos, suddenly they start to get a bit more interesting. So here are 50 images
from Flickr of Rome, and there's the Trevi Fountain, there's the Colosseum,
there's the inside of the Pantheon, and there's the Vatican. So we're starting to
see the main sites of the city pop out of this.
But of course with just 50 images you can't cover the whole city, so you might
download say 200. And in this size of data set you will have an image of the
Pope, for instance. Does anyone see where the Pope is? That's the Pope.
Okay. But not everything is here. I heard a yes in the background. For instance,
the Sistine Chapel is missing; at least I couldn't find it in this data set. But if you
download 800 images, there's an image of the Sistine Chapel, and actually
there's a couple of others. I think I just pressed the wrong key. So there's a
couple of others as well. So we're starting to get multiple views of most things.
And if you really want to be safe that you have good coverage, you might
download 20,000, and the full data set is this. Okay. So these are the images of
Rome on Flickr.
And so this is really what I'm talking about, this is a great historical record of the
city, but the only problem is you can't actually see anything.
So how can we better visualize data sets like this and use them for some
applications, which I'll talk about? This is a different visualization of the same
data set. This is actually the 20,000-image as opposed to the million-image data
set. And these are similar to what Blaise presented yesterday. These are image
graphs. Every red dot is an image, and the blue edges connect images that have
overlapping content, features in common. And so you see these clusters, and
these clusters correspond to the different sites of the city. You probably can't tell
just by looking at this image what these clusters are, but we've been studying
them for a long time so we can actually recognize them. So you know, there's
St. Peter's, the Trevi Fountain, the Colosseum and lots of smaller things as well.
So the structure of the city is kind of falling out into these clusters.
So one of the things we've done is develop some algorithms for processing data
of this type to do some interesting things. Let me tell you a little about those.
One is scene summarization. Basically the idea is, given this huge collection, can
I compute some summary of the city with a smaller number of images organized
in some way? So let me show you that. Here is a summary of the city of Rome.
And this again was computed automatically. So what you're seeing here are
representative images of the main sites of the city. Here you see the Colosseum,
the inside and the outside, the Vatican, the Trevi Fountain, St. Peter's, the
Pantheon and various other things like this. You can scroll down this list, and this
is computed basically using clustering methods on the graph I just showed you.
And you see different sites, the Hall of Maps, the Sistine Chapel and so forth.
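To make the summarization step concrete, here is a minimal sketch of the kind of
pipeline described, assuming pairwise feature-match counts between images are
already available. The match_counts input, the min_matches threshold and the
choice of representative are illustrative assumptions, not the actual system.

```python
# Sketch: cluster an image match graph into "scene clusters" and pick one
# representative image per cluster. match_counts is a hypothetical dict mapping
# (image_id_a, image_id_b) pairs to the number of matched features.
import networkx as nx

def summarize_scene(match_counts, min_matches=20):
    g = nx.Graph()
    for (a, b), n in match_counts.items():
        if n >= min_matches:                      # keep only solid overlaps
            g.add_edge(a, b, weight=n)
    summary = []
    for component in nx.connected_components(g):  # each cluster ~ one site
        sub = g.subgraph(component)
        # representative: the image with the largest total match weight
        rep = max(sub.nodes, key=lambda v: sub.degree(v, weight="weight"))
        summary.append((rep, len(component)))
    return sorted(summary, key=lambda t: -t[1])   # biggest sites first
```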
Now, this is just giving you one representative image of each site, but if you click
on one of these sites, like the outside of the Colosseum, it will show you
representative images that are computed, in some sense canonical views of that
site from different viewpoints and so forth.
So here is the set of canonical views of the Colosseum, and these are basically
different popular views of the site. And if you click on any one of these, for
instance this particular viewpoint at night, it will show me all the views from this
Flickr collection that correspond. So these are all images of about the same
viewpoint taken at night. And these are computed basically just by clustering on
the space of features.
Okay. We're also computing these tags, and these are consensus tags from the
tags on Flickr. These are a bit noisier, but overall the top one or two tags tend to
give you reasonable results.
Okay. So that's one thing we can do with this data set. Here's some follow-on
work that we've been doing more recently. This is work with Ian Simon, Noah
and myself. Instead of tagging whole images, by analyzing the tags and the fields
of view of the images you can identify regions in the scene which should be
assigned tags. So here we're taking tags that have been assigned to whole
images and we're figuring out, based on co-occurrence of tags and features,
which regions of those images should be assigned those tags, so in this case for
Prague, this church and the astronomical clock.
We can also figure out how popular certain things are, basically what in the
scene people take photographs of. The saturation, how bright these colors are,
corresponds to how many images were taken of these regions. So it's doing both
a segmentation of the scene into regions and also displaying how popular these
things are, how much they're photographed.
So for example the statue is photographed a lot, whereas the scene in the
background is less frequently photographed. And similarly down here, this
particular -- sorry this fountain is not photographed as often as this region here,
and this one even a little bit less than this region here and so forth.
So you can compute how popular things are from this distribution of photos.
You've already heard about Photo tourism and Photosynth, so fortunately I don't
have to spend that much time on them in this talk.
But I would like to talk about where we're going in the space of 3D reconstruction
and browsing of large photo collections. And I know I'm going fairly quickly here,
but if people have questions at the end, I'll be happy to cover more things in
detail.
So these are the challenges that we're facing right now in terms of research
related to Photo tourism, Photosynth.
So one issue is scale. When we first started this research, for instance for the
Notre Dame set, we were literally downloading all of the images of Notre Dame
on Flickr and reconstructing the site from those images. But now there are
basically one or two orders of magnitude more photos on Flickr, so the old
algorithms would run for months or even a year to reconstruct these full data
sets, and we need more scalable algorithms.
We'd also like to reconstruct instead of a point cloud a better model of shape, so
a dense mesh for instance.
And then third, Blaise yesterday showed you some of the newer interfaces in
Photosynth, and coming up with better interfaces and better ways to navigate
especially larger scenes is an open research problem. So let me just touch on
some of these different things.
Okay. So, large scale reconstruction. Initially in the Photo tourism project we
were working on the order of a few hundred images. But now there are 40,000
images of the Colosseum, so those algorithms simply cannot run at that scale.
And ultimately we'd like to be able to operate on hundreds of thousands or
millions of images. So how can we scale up to this level?
One nice observation about these Internet collections is that they are very
non-uniformly sampled. For instance, everyone who visits Rome pretty much
takes a photograph of the Colosseum from just about the same position, so
there's a huge cluster of photographs from certain positions, and other positions
are more sparsely sampled. So if your goal is to reconstruct the scene or the
camera positions, you should be able to take advantage of this by focusing on
only a subset of the views where they're highly oversampled.
So let me show you an example. This is a fun data set. This is the Pantheon:
overhead view on the top right, reconstruction down here. And what's neat about
this is that there are photographs both from the inside and the outside of the
Pantheon. But there are also photographs that go through the door. And the
door is large enough that from the outside you can actually see part of the inside,
and that's enough for the algorithm to link the two reconstructions.
So we just throw all of the images, the outside images and the inside images, into
the mix, and the algorithm reconstructs both of them together. So this is an
example where that works. This is pretty cool. But what you'll notice is that there
are clusters, even in this data set: there's a bunch of images near the fountain,
there's a bunch of images over here, there's a bunch of images right in front of
the front door, and on the outside, the inside and so forth.
So you should be able to choose a smaller number of images for the purposes of
doing reconstruction.
Let me show you a different view. This is the match graph, like the graph of
Rome that I showed you before, except these nodes would be red in that version
of the graph. So there's a node for every image, and there's one of these black
edges for every connection between them. I forget which is the outside and
which is the inside; this is the outside and this is the inside of the Pantheon,
clustered into this graph. The full graph is very dense; there are lots of nodes,
500 or so images. But it turns out we can find a subgraph which has some very
nice properties. In particular the subgraph has the same shape as the full graph.
By the subgraph I mean these black nodes.
Every node of the original graph is connected to one of these nodes in the
subgraph. And it turns out that this subgraph, these black nodes, also has very
good properties for purposes of scene reconstruction. So we can prove certain
things about how accurate the scene would be if we reconstructed just the black
nodes and then used a much less expensive pose estimation process to add the
gray nodes.
So we can optimize for a smaller graph that reduces the complexity considerably.
I'm not going to say much about the properties of this approach, but one nice
thing about it is that it actually reconstructs all of the cameras in the full
reconstruction, not just the black nodes, while focusing most of the operations,
the effort, on these black nodes. And it gives you essentially the same quality
much more quickly.
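As a rough sketch of the idea of focusing effort on a small subgraph, here is a
greedy dominating-set selection in Python. This is only a stand-in for the actual
selection criterion, which also reasons about reconstruction accuracy; the match
graph g is assumed to be an undirected networkx graph of image connections.

```python
# Sketch: pick a "skeleton" of images such that every image in the match graph
# is either in the skeleton or adjacent to it (a greedy dominating set).
import networkx as nx

def skeleton_images(g: nx.Graph):
    uncovered = set(g.nodes)
    skeleton = []
    while uncovered:
        # greedily take the image that covers the most still-uncovered images
        best = max(g.nodes, key=lambda v: len((set(g[v]) | {v}) & uncovered))
        skeleton.append(best)
        uncovered -= set(g[best]) | {best}
    return skeleton

# Idea: run full structure from motion on skeleton_images(match_graph) only,
# then register the remaining images with cheaper per-image pose estimation.
```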
So here's an example of the runtime improvement on a bunch of different data
sets. For a larger data set, this one had about 3,000 images, the old algorithm
would take about two months to run on a single machine. The new algorithm
takes just over a day. It would be nice to speed it up further, of course, but we
can arguably handle thousands of images now, probably 10,000 images or a few
more.
So we're starting to see scalable algorithms come out of this. Okay. So that's on
the 3D reconstruction side, on the structure from motion side, how to get these
point clouds.
The next step is that we'd like to get essentially laser scans without using a laser
scanner. So here we have images from Flickr again, and just from people's
vacation photos we'd like to reconstruct this geometry. This is a shaded mesh.
And this is not sparse anymore, this is a very dense mesh which has been
reconstructed, and it's pretty accurate. Just zooming in on this portal region you
can start to see the figures, these different statues and so forth, so we're starting
to get centimeter resolution on these things.
This is computed using an algorithm which I'll tell you just a little bit about in a
second. But let me describe some of the challenges. Part of the challenge here
is that the images have a ton of variation. Images online have different times of
day, different weather conditions, there's lots of occlusion and so forth. We're not
taking these photos ourselves, so they're hard to control.
Secondly, the resolution is very different between these photos; there are both
zoomed-out and zoomed-in images. This is both a challenge and a benefit,
because since we have such close-ups in the collection we should, in principle at
least, be able to get extremely accurate high resolution results if we know how to
process them.
So let me say a bit about how we cope with these challenges. And of course
there are lots of photos. So given this large set of photos under lots of different
appearance conditions, how can we reconstruct accurate 3D models?
There's a cool property which we're calling the law of large image collections.
Basically it's as follows. Suppose that we choose an image from this collection
and then we want to match it with other images, which we need to do if we're
going to compute shape. You could try to match it with an image where the
weather or lighting conditions are very different, but that would be hard, so
instead what we're going to do is choose a subset of photos which have nearly
the same conditions, which appear similar.
And the point is that if you have enough images in your collection, for any image
you choose it's highly likely that you'll be able to find other photographs with very
similar conditions. So if your collection is large enough, you can exploit this to
dramatically simplify the matching problem: we don't have to match between
images at different times of day, we can just find images that have compatible
viewing conditions.
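Here is a minimal sketch of exploiting that observation: for a reference photo,
restrict matching to the k photos whose global appearance is most similar. The
coarse color-histogram descriptor is an illustrative assumption, not the descriptor
the actual system uses.

```python
# Sketch: pick matching candidates with similar appearance conditions.
import numpy as np

def appearance_descriptor(image_rgb: np.ndarray, bins: int = 8) -> np.ndarray:
    # coarse joint RGB histogram as a stand-in appearance descriptor
    hist, _ = np.histogramdd(image_rgb.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    hist = hist.ravel()
    return hist / (hist.sum() + 1e-9)

def similar_subset(reference: np.ndarray, others: list, k: int = 30):
    d_ref = appearance_descriptor(reference)
    dists = [np.linalg.norm(appearance_descriptor(im) - d_ref) for im in others]
    return np.argsort(dists)[:k]   # indices of the k most similar photos
```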
So very briefly, here's how the process works. This is just a simple tutorial on
multi-view stereo for those of you who aren't familiar with it. Given a point in the
right image, you consider a window around it. That determines a set of what are
known as epipolar lines in these other images, the lines that the window, or that
point, may project to, and then you simply search in parallel for the best
correspondence along these different epipolar lines.
And in this case, this is the best answer. These two are occlusions, because
there's a foreground person in front of this window here and there's the tree in
here, so we can detect these as outliers and we're left with these two.
And then once we have these two, based on the 2D correspondences and the
known camera viewpoints which we computed earlier, we can triangulate the 3D
point and we're done. So we get a depth map, basically, for every point in this
image.
And so we're recovering a depth map for every image and then we can merge
the depth maps and then we get these 3D models.
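The triangulation step at the end of that pipeline is standard; a minimal sketch
using the linear (DLT) method is below, assuming 3x4 projection matrices P1 and
P2 recovered by structure from motion and a matched pixel pair (x1, x2).

```python
# Sketch: triangulate one 3D point from a 2D correspondence in two views.
import numpy as np

def triangulate(P1: np.ndarray, P2: np.ndarray, x1, x2) -> np.ndarray:
    # standard linear (DLT) triangulation: stack the four constraints and
    # take the null space via SVD
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]            # homogeneous -> Euclidean 3D point
```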
Okay. Let me say a bit about one of the things we are doing a lot of work on,
which is evaluation: how can we evaluate the accuracy of these models that we're
getting? There are a couple of different things we're doing. One is reconstructing
dense shape, and the other is reconstructing sparse shape, or structure from
motion, which is the kind of technique used in Photo tourism and Photosynth. So
we'd like to evaluate both of these. I'll say a couple of words very briefly about
both.
Here's a data set which was downloaded from Flickr for the Duomo in Pisa, and
here's our reconstruction. Now, what's nice about this data set is that there is a
laser scan. A group in Italy, Roberto Scopigno's group, captured this laser scan,
and they graciously allowed us to use it. So we can compare our result to the
laser scan and, treating the laser scan as ground truth, compute the accuracy
and how well they align.
And the laser scan is certainly more accurate, but it's not a lot more accurate:
90 percent of the points in our model are within a quarter percent of the ground
truth. So for a building which is around 50 meters or so, this is a few
centimeters.
So it's, you know, it's not exactly as accurate as the laser scan, but it's probably
good enough for a lot of applications.
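One simple way such an accuracy figure can be computed is sketched below,
under the assumption that the reconstruction and the laser scan are already
aligned point sets: take nearest-neighbor distances and report a percentile. This
only illustrates the style of statistic quoted above, not the exact evaluation
protocol.

```python
# Sketch: "90 percent of points are within distance d of the ground truth".
import numpy as np
from scipy.spatial import cKDTree

def accuracy_at_percentile(reconstructed: np.ndarray,   # Nx3 points
                           ground_truth: np.ndarray,    # Mx3 laser-scan points
                           percentile: float = 90.0) -> float:
    tree = cKDTree(ground_truth)
    dists, _ = tree.query(reconstructed)        # nearest ground-truth point
    return float(np.percentile(dists, percentile))
```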
We've also taken a data set of the same scene ourselves with a calibrated
camera, and we're trying to use that to evaluate the structure from motion
algorithms. I'm not going to go into detail here either, because we have just
started this effort and I can only show you some preliminary results, but here are
the structure from motion reconstructions of this cathedral next to the laser scan,
same views as the laser scan on the right.
And here's a cross-section showing both together. You probably can't even see
this, but there are tiny little red points near these green lines. The green lines are
the laser scan, the red points are the reconstruction. I'll zoom in on a little region;
maybe you'll be able to see this, I don't know. But there are a bunch of little red
points, and from our preliminary evaluation it looks like the accuracy is on the
order of one to ten centimeters: 90 percent of the points are within one to ten
centimeters, and most are closer to the one centimeter mark. So these methods
are working pretty well.
All right. So for the last part I wanted to say a little bit about navigation controls.
Once we have these 3D models, point clouds and so forth, and camera positions,
how can we find better ways of navigating through the scene? What would be,
for instance, future controls to use in Photosynth?
Blaise already talked a little bit about this notion of discovering objects and
rotations and panoramas, and this is another view on that. Can anyone tell what
site this is? Right, the Statue of Liberty. It's an overhead view of the Statue of
Liberty, and we're looking at the distribution of cameras.
So at first you might think this is an error, that people are standing out in the
water, but it turns out that a lot of people visit the Statue of Liberty from boats.
Tour boats go out and you get a great view from the water. At least that's our
explanation.
So basically there's a bunch of views out here, and then of course a lot of people
are standing right on the island itself looking at the scene from the ground. You
can discover these as different orbits of the scene and expose these controls to
the viewer. And again, this is similar to what Blaise presented, so maybe I'll just
show you a bit of this but not go into detail. So here I'm showing one orbit control
for the Statue of Liberty, and again just from photos on the Web we're sort of
creating an interactive tour of the site, and here's the other orbit from right below
the base on the island, and you see tourists pop up in these images as well.
Here's another example. The key here is to try to understand something about
the distribution of photos that people actually take, to derive what the right
controls are for exploring these scenes. So for the Pantheon, for example,
suppose that we start outside from this view, the one in green, and our goal is to
go inside to this view in red. And we'd like to compute a path that goes from one
point to the next.
The obvious thing to do would be just to interpolate the pose parameters of the
two, and as you might guess that doesn't result in a very satisfying transition
visually, because you end up going right through the wall. I'll show you what that
looks like.
So if you just do this linear interpolation, you get a transition that looks like this.
Better would be to choose a transition like this, where we're following a path that
was computed so that it is close to the input views at every point along the path.
So we want to compute a path from this point to this one that hugs the input
views everywhere. And there are a couple of reasons to do this. One is that you
get better quality renderings if you choose new views which are near the original
photos, and the second is that by doing this you can better mimic how a human
would move through the scene. The assumption here is that these images were
taken by humans, so if we move along this path, we are approximating how
humans move through the scene.
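A minimal sketch of that idea follows, under the simplifying assumption that
"hugging the input views" can be approximated by a shortest path on a
k-nearest-neighbor graph of camera centers; the real objective also accounts for
rendering quality and camera orientation.

```python
# Sketch: a camera transition that stays close to existing photos, computed as
# a shortest path on a k-nearest-neighbor graph of camera centers.
import numpy as np
import networkx as nx

def hugging_path(camera_centers: np.ndarray, start: int, goal: int, k: int = 6):
    g = nx.Graph()
    for i, c in enumerate(camera_centers):
        d = np.linalg.norm(camera_centers - c, axis=1)
        for j in np.argsort(d)[1:k + 1]:        # skip self, connect k nearest
            g.add_edge(i, int(j), weight=float(d[j]))
    # indices of the input cameras to pass through between start and goal
    return nx.shortest_path(g, start, goal, weight="weight")
```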
So I'll show you a quick video which illustrates how that works. Now when I click
on this view of the inside, the camera actually moves through the front door and
then turns around gradually, and you see this part of the scene where the target
is.
Now if we click on something else, these are different views of the scene. We're
clicking on this statue and it moves naturally to the statue. If we click on another
statue, the camera will first move out and then back in, and again this is
approximating how people really experience this scene.
All right. How am I doing? Five minutes. Okay. I'll show you this one more
example. Suppose that you've gone to Rome, you took a bunch of photos
yourself and you want to share these with your friends.
Well, if this is you and you just shared these photos, it may not make sense to
the person looking at them because they don't really understand the scene in
context of these photos. But what we can do is we can use everyone else's
photos as sort of the backdrop for your photos. So now when I move from your
first photo to your second photo, I can use all the images on the Internet to
compute the transition between the two. It's playing a little choppy for some
reason.
And so we're -- again, we're using everyone else's photos to facilitate displaying
your photos. So the end points are always photos that you took.
All right. So just to conclude, our goal at this point, for the short term, is to
reconstruct this Rome data set. We have a million images of Rome, we've
downloaded them, we're currently matching them to each other, and then the
goal is to run these algorithms and reconstruct as much of the city as we can and
see what will come out. We don't know. Probably 3D models of tourist sites, both
the exteriors and the interiors, hopefully the popular statues and structures and
artifacts and paintings and so forth. So we're excited to see what comes out of
this data set.
And looking a little bit further forward, there's lots of interesting data other than
Flickr, obviously: there's satellite imagery, aerial imagery, street-side imagery of
many sites. So it would be great to combine all these different sources of imagery
to reconstruct as much as we can about all these different cities.
So that said, I just wanted to mention that this is collaborative work with a bunch
of people, both at the University of Washington and Microsoft Research, and with
Michael Guslov (phonetic), who was in our group and is now at Darmstadt. So
thank you.
(Applause)
>>: Time for questions.
>>: So, Steve, since you now have a notion of what the objects and the things in
the scene are, have you thought about removing these people? It should be
easy, right?
>> Steven Seitz: Yeah. Yeah. We started playing around with that idea of
removing people. I mean of course that kind of defeats the purpose for most of
these applications. Really people are most interested in seeing their own photos
in context. But certainly for recovering good texture maps and things like that it's
very useful, and we started to look into that. And the first -- our first experiments
are very promising. You know, simply by using the 3D models to get
correspondence and, you know, simple median and other statistics you can do it,
yeah.
>>: Other questions?
>>: Almost all the examples we saw yesterday and today are from built
environments with corners and features, and I wonder if there's much work yet on
natural scene reconstruction and how different it is, whether it's easier or more
difficult.
>> Steven Seitz: Yeah, that's a good question. So we've done a little bit of
experimentation with some natural scenes. I'll show you one example from the
Photo tourism work, if it works. Let me pause it and go back. This is Yosemite,
again from Flickr images of Half Dome. You're seeing a point cloud
reconstruction of Half Dome. Now I've circled Half Dome and you see photos that
other people took of it, and here you can basically lock the camera, so stabilize
the face of Half Dome, and now when you view everyone's photos as sort of a
slide show, it registers the face.
So this is an example of, you know, one scene which we've experimented with
that seems to work. But basically you need lots of features. I mean Half Dome is
tricky because different seasons it looks completely different. So registering
winter photos is incredibly difficult. But for some scenes we've had -- it's a good
question and I'm not sure of the answer in general.
>>: Okay questions? All right. Thank you, Steve.
>> Steven Seitz: All right.
(Applause)
>>: So we mentioned that one of the themes of this Virtual Summit was looking
at breadth of topics. The next talk is going to illustrate that with a talk that's going
to focus on quality of the output of a system. So the talk will be given by Zhi
Quan Zhou on testing non-testable information retrieval systems with geographic
components on the Web.
>> Zhi Quan Zhou: Good morning. It's on. Testing, testing. Okay, it's on.
Good morning, everyone. Our team is from Australia, Hong Kong and Beijing.
Here is a little bit of information about myself. I'm George. I'm from the University
of Wollongong, where I lecture in software engineering. Wollongong is a
40-minute drive from Sydney. I'm also advising the United States Air Force Office
of Scientific Research on software reliability.
Okay. So, Wollongong. Wollongong is an Aboriginal word which means the song
of the sea. So we have many beautiful beaches. There is another explanation,
but most people believe it means the song of the sea.
All right. Here is my presentation outline. I will first talk about the fundamental
problem that researchers are trying not to solve but to alleviate. Then we'll talk
about assessing two important quality factors, that is, the quality of counts and,
most importantly, the quality of ranking of search (inaudible). We'll also make
some recommendations to Microsoft's Live Search. And at the end we will talk
about our ongoing project.
Okay. First, the fundamental problem is called the oracle problem. In English,
oracle means the word from God. Here, in software testing, an oracle means
something that tells you whether the result is right or wrong. How many of you
have seen this movie before?
No? No one has seen it? Okay. It's last year's movie. Really no one?
(Laughter) Okay. It is about someone who claims that he is from 14,000 years
ago, so he has been living on the earth for 14,000 years. But no one can tell
whether he's telling the truth or lying.
So the story is just a group of people sitting there and chatting and chatting from
the beginning to the end in one room. That's the movie. But it's interesting.
(Laughter)
So there is an oracle problem, because no one can tell you whether something is
right or wrong, right? Now, here we have a map of Wollongong. How do you
know I'm really from Wollongong? Right? You cannot verify it. That's the same
problem in software testing. How do you know I'm from Wollongong?
Okay, we still have some ways. One (inaudible) is that we continue to ask you
questions. Although I cannot tell whether your answer to an individual question is
correct or wrong, we can still check whether your answers are consistent
between the multiple answers. So our measure is to check the logical
consistency of the outputs.
More formally, in software testing an oracle refers to a way, a method, a
mechanism by which people can know whether the outcomes of the test
executions are correct. The oracle problem means there is no oracle, or the
oracle is very, very expensive. Here are just a couple of examples. First is
search. Suppose we are searching for a hotel on certain days satisfying certain
criteria, and the search engine returns "no hotels found." The question is, how do
you know there is really no hotel? Right? You have no access to the database.
Is there really no hotel in the database, or is it because of some software fault in
the search service? We don't know. Okay, ranking quality. How can we assess
the ranking quality of search results?
Another example is the graph search problem, searching for the shortest path
from A to B. If the graph is small, then we can verify the result. But suppose we
have a very big, comprehensive graph. Given that our input is A and B and the
whole graph, and the system returns that this is the shortest path, how can
people verify whether this is really the shortest path? This is a combinatorial
problem, also very difficult to verify.
There is an oracle: we can brute-force all the possible paths, but practically it is
too expensive.
So our method is to use what we call metamorphic relations, which are relations
among the inputs and outputs of multiple executions, not a single execution. We
named this kind of relation a metamorphic relation because the input is always
changing.
For example, consider the shortest path problem again. A metamorphic relation
can be the following. In the first execution you enter the whole graph, the starting
point A and the destination B into the program, and you get an output, right?
Then, as a follow-up, you transform this graph into another graph, where the new
graph is actually a permutation of the original graph: A prime corresponds to the
original A, and B prime corresponds to the original B. Then you enter the new
graph into the system again, with A prime as the starting point and B prime as the
destination, and you expect that the program should still return the same result.
Certainly the paths may be different, but the length should be the same, right?
So this is one kind of metamorphic relation, and for a given problem we can
identify many metamorphic relations. For example, another metamorphic relation
is that we can exchange the direction. Originally here is A and here is B, but now
we can put the starting point here and the destination here, so the system will
search in the reverse direction. We still expect that the length of the paths should
be the same.
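As an illustration, here is a minimal sketch of how those two relations could be
checked automatically against a shortest-path routine under test. The function
spl is a placeholder for the system under test, and the graph encoding is an
assumption made for this example.

```python
# Sketch: metamorphic testing of a shortest-path routine. spl(graph, a, b)
# stands in for the system under test and returns the length of a shortest
# a-b path; graph is a dict {node: {neighbor: edge_length}} (undirected).
import random

def relabel(graph, mapping):
    return {mapping[u]: {mapping[v]: w for v, w in nbrs.items()}
            for u, nbrs in graph.items()}

def check_metamorphic_relations(spl, graph, a, b):
    original = spl(graph, a, b)

    # MR1: permuting the vertex labels must not change the path length
    nodes = list(graph)
    mapping = dict(zip(nodes, random.sample(nodes, len(nodes))))
    assert spl(relabel(graph, mapping), mapping[a], mapping[b]) == original

    # MR2: for an undirected graph, swapping source and destination must not
    # change the path length either
    assert spl(graph, b, a) == original
```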
All right. So far we have been talking about the general problem in software
testing, the oracle problem. Now let's look at Web search engines. People want
to improve them, but how can you improve? Only if you can assess the quality
can you improve the quality, right? We cannot control what we cannot evaluate.
Also, in big commercial search engines we often upgrade the programs, we often
revise the algorithms, and sometimes we change the representation of the data.
But every time we make a change we have to reassess the system: we have to
assess whether the algorithm is better, whether the new data representation is
better. So every time we make some change we need to test the system.
So it's a very expensive process. Conventionally we have three approaches to
evaluating information retrieval systems. Precision and recall are the most
traditional approaches, but precision is very expensive for Web search engines
and recall is impossible to apply to them.
The other approach is to use human judgment to evaluate the quality of ranking.
That is the basic approach people use, but human judgment has two problems.
In addition to its cost, different people have different opinions, so different people
give inconsistent judgments. That is also a big problem.
Okay. Let's briefly review what precision is. Suppose this is the whole database.
Then, given the query, this one represents the set of all relevant pages; we name
this set A. We have another set B, which is the returned result. There is an
intersection between A and B, which is C. That's what is useful: all the useful
pages are in this intersection, right? So what is precision? Precision is the size of
C divided by the size of B, and recall is the size of C divided by the size of A. But
because of the huge amount of data on the Internet, it's impossible to evaluate
recall, because we have no knowledge of the actual set A, the actual set of all the
pages satisfying the query. We don't know it, so we cannot evaluate recall.
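Restating those definitions in code, where relevant is the set A of all relevant
pages and returned is the set B of returned pages:

```python
# Precision = |C| / |B| and recall = |C| / |A|, where C = A ∩ B.
def precision(relevant: set, returned: set) -> float:
    return len(relevant & returned) / len(returned) if returned else 0.0

def recall(relevant: set, returned: set) -> float:
    return len(relevant & returned) / len(relevant) if relevant else 0.0
```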
How about precision? Theoretically speaking we can still go through every
returned page and count whether it is relevant, but practically, even if you can do
it, it's very expensive.
Okay. Then how about ranking? Ranking quality is more difficult; it's rather
subjective.
Okay. Now here is our method. Our method is to find relations among multiple
responses. Let's look at this example. It's the Australian yellow pages. We enter
the query "university" and the location New South Wales. New South Wales is a
state of Australia. Then we search, and 265 results are returned for university in
the state of New South Wales.
Next we do the search again using the same key word, university, but this time
we change the location to all states, and we see 789 results. Is this reasonable or
not reasonable? I don't know how many states there are in Australia, but it's
reasonable, right? Because as long as this number is greater than or equal to
this number we cannot say it's wrong, right? Because this is all states, but here is
only a particular state. So we see nothing wrong here.
Okay. So this is a successful example. But in our testing experiments we also
found this kind of situation. "Shopping" is our query, location New South Wales.
How many? 3,827 pages.
Now we change it: shopping is still the same, but we change the location to all
states, and the system returns 651. So now my question is, is this reasonable?
No, right? And we kept entering the query for a long time, I think for at least
several weeks, or several weeks to several months for this one. Only recently did
this error disappear; it remained there for a long time in the Australian yellow
pages.
I don't know whether they're using American yellow pages software, but they look
similar. Okay. So our original plan in our project proposal was to develop this
kind of method to test search services with geographic components based on
Virtual Earth.
But after a few weeks we found some problems. That is, when accessing the
yellow pages from the programming interface in Australia, it is too slow, because
every time we call a function it has to open a map, and that takes a long time.
Also, the program repeats the calls automatically, and it's not stable: if two calls
are too close together, sometimes we don't receive any return value. So the
system is not stable. In the end we decided to change from Virtual Earth to
Microsoft Live Search, because that is much quicker, being a conventional
text-based search engine. But the fundamental methodology is equally applicable
to Virtual Earth; it was just because of the speed and stability issues when
accessing it from Australia that we changed to Live Search.
Okay. So when we're talking about quality, what is quality? There are many
quality factors for search services: the response time, the size of the database,
and so on. Here we are most concerned about ranking, and also related to
ranking is the counts, the number of pages returned. For example, here is a
Google screen shot: we are searching for two key words and 59 pages are
returned.
So first let's look at counts, and after discussing counts we will look at the ranking
quality, which is more interesting. Why counts? Why are we interested in counts?
First of all, count is a kind of ranking, so count quality reflects ranking quality to a
certain degree. Secondly, some users use counts to estimate the probability or
frequency of certain phenomena. For example, in natural language processing,
people use Web search engine counts to find the frequency of certain
combinations of words and phrases, and I have read papers in ACM transactions
in different areas that use these search engine counts. And we can also tell
whether a search engine cheats by exaggerating the size of its database.
So using our method, we don't know whether 59 is the correct count for these two
key words. But we can identify a metamorphic relation; that is, if A is a subset of
B, then the size of A should be smaller than or equal to the size of B. It's
straightforward.
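A minimal sketch of how that relation can be checked automatically is below;
search_count is a hypothetical helper wrapping whichever search engine API is
under test and returning the reported result count.

```python
# Sketch: AND-ing in an extra term can only restrict the result set, so the
# reported count should not increase.
def check_conjunction_count(search_count, terms: list) -> bool:
    broad = search_count(" ".join(terms[:-1]))   # e.g. "w1 w2"
    narrow = search_count(" ".join(terms))       # e.g. "w1 w2 w3"
    return narrow <= broad                       # False indicates an anomaly
```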
So we conducted a follow-up search, and now we get 169 pages. My question is,
is this reasonable? You see, the first two key words are identical, but we added
one more key word, and yet the number of pages is much larger than before.
But before we draw a conclusion, we should have some discussion. First, is this
caused by the dynamic nature of the Web? We repeated the first search and got
the same results, so it's not because of the dynamics of the database. Second,
how about filters, are there any filters? No filters. (Inaudible) So these are
complete results.
Secondly, we know that all search engines return approximate results. Is this
due to that approximate nature? Actually, in our experiments we detected many
even larger differences. For example, this one: for cutlery, 589; for cutlery and
another word together, 2,660. The two searches are consecutive, so they are
within a couple of seconds. We also repeated all the experiments, and all the
reported anomalies are repeatable.
Also, the counts may change between pages because of estimation, so we
clicked next, next, next, until the last page: still 58 versus 165, not much change.
And in the end we collected all the returned pages and found that they really are
different. So this is an example of this phenomenon in search engines.
Okay. How about Live Search? First we enter the key word g-l-i-f, glif. How
many? 11,783 pages are found. Then we enter two key words, glif and 554W,
and we select "any of these terms," so it is an OR relationship. But no results are
found, and we repeated it several times. So this obviously shows something
wrong in the search engine. For Google, Live Search, Yahoo, all the search
engines, we found similar problems.
So we did experiments using the AND, OR and exclude relationships, and here
are the experimental results. There are two series of experiments. In the first
series we used random words from an English dictionary. In the second series
we used random strings, combinations of English upper case letters, lower case
letters, and the digits zero to nine, of length four. So let's look at the first series.
For the AND anomaly, only Live Search has 0.5 percent, but we'll talk about AND
later; we did some other experiments. So let's look at OR.
For OR, Yahoo is the lowest and Live Search is the highest. For exclude, Google
is the lowest and Live Search is the highest. And in the second series of
experiments you see that Google becomes worse, but Live Search, here it is 13
and here it is two, so Live Search becomes better. Different search engines have
very different performance: Google's performance becomes worse, but generally
speaking Live Search's performance becomes better.
From this observation we can see that the difference between the two series is
because of the random strings, so maybe the way Live Search handles the
semantics and the frequency of terms needs to be studied more. Because we
have no access to the internals, we cannot dig further.
We did one more series of experiments on AND, because AND is the most
important. We used multiple key words rather than two, and here are the
anomaly rates; basically they are similar.
Now let's talk about the ranking quality. Most users are concerned not with
counts but with the quality of ranking. So here is our method to assess the quality
of ranking. We do two queries. In the first query we allow all kinds of pages, and
then from the results we collect, for example, TXT files only; we collect the top 20
TXT files. In the next query we do the same search, but we use the command
that limits the file type to TXT only, like entering the key word filetype:txt. Then
we compare the two lists. For an ideal search engine the two sides would be
identical: not only the contents but also the order of the contents should be
identical. That is the relation we identify.
We also define two measures. One is the common line rate. We know that in
real life it's not possible for the two sides to always be identical, because of some
practical issues, but the more lines they have in common, the better the
performance. So we define the common line rate, which is based on the number
of pages they share in common; we only collect the top 20.
The second measure is the offset. Suppose there are 20 pages but they have
only 10 pages in common. We take the common pages in their original order, but
they may be in different orders, right? One list is A, B, C, D, the other may be D,
C, B, A, and then we compute the offset, that is, the difference in their order. We
also apply a weighting, because different positions have different weights. The
top pages have a higher weight: if the top page and the second page change
order, there is a big impact, but if the 19th page and 20th page change positions,
the impact is lower. So we have a weighting mechanism, but I won't show the
formula here.
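Since the exact formula is not shown in the talk, here is a minimal sketch of the
two measures with a simple reciprocal-rank weighting standing in for the actual
weighting scheme; each argument is a top-20 list of URLs.

```python
# Sketch: compare the top-20 lists from the unrestricted and the
# filetype-restricted queries.
def common_line_rate(list_a: list, list_b: list) -> float:
    common = set(list_a) & set(list_b)
    return len(common) / max(len(list_a), len(list_b))

def weighted_offset(list_a: list, list_b: list) -> float:
    common = set(list_a) & set(list_b)
    total = 0.0
    for url in common:
        ra, rb = list_a.index(url), list_b.index(url)
        # rank differences near the top of the lists count more
        total += abs(ra - rb) / (1 + min(ra, rb))
    return total / len(common) if common else 0.0
```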
Then I'll show you an example. For example, in a Google search for this key
word, this Website is ranked third in one of the lists but 13th in the other, and the
total number of common URLs is 18. So we see the difference is quite big: three
versus 13.
Then we did another series of experiments to compare the common line rate.
Here is the result, and the higher the better. Yahoo is the best: Yahoo is highest,
followed by Live Search, followed by Google. So for example, for this word in
Google there are only five lines in common out of 20, and for Yahoo, for this word,
out of the 20 pages there is nothing in common; for Live Search it is this key word.
But in the best case the two sides are identical in contents, though not in order.
And here it shows how the average common line rate changes as the number of
tests increases. Basically we see it's quite stable by the time it reaches 100 tests.
Again, the higher the better.
This was for TXT files. We also did the experiment for PDF and HTML, and I will
explain immediately why we did TXT, PDF and HTML. Let's look at the
performance. This line is Live Search. For TXT its relative position is here, but
for PDF it goes down, and the higher the better. For HTML it is higher.
Let's take a closer look here. This is the performance of Live Search on the three
types of files. The best performance is on HTML, the middle is PDF, and the
worst performance is on TXT files. So our analysis here suggests that Live
Search relies heavily on hyperlink structure, because text files and PDF files have
less hyperlink structure, less of the topology of the Web, while HTML has richer
link structure. So maybe the performance on semi-structured files, these kinds of
files, needs to be improved.
After the common line rate, we also computed the offset, and the offset results
are similar to the common line rate. Here, the lower the better: Yahoo is the best,
followed by Google, followed by Live Search for TXT files. We also did the
experiment with PDF and HTML files. If we take a look here, we can see that
Live Search for TXT files is not good, but when it comes to PDF files it becomes
better, and the best performance is on HTML.
So again, when there is hyperlink structure in the file, it becomes better. PDF
files also have hyperlink structure, so they are better than TXT but less than
HTML.
So the recommendation is the same as for the common line rate. Our future
work, and actually we are doing this work currently, is to study the impact of
these inconsistencies on the quality of ranking from the users' perspective: how
do they affect the perceived quality of ranking? We are also going to study,
because the inconsistencies appear to be a consequence of an optimization
procedure to increase the response speed, how to minimize the impact of the
optimization procedure from the users' perspective and how to balance the
optimization and the perceived quality. And we also want to develop a
user-focused optimization procedure, for example with the weighting mechanism
in it.
And here is a proposal for further collaboration with Microsoft. First, we wish to
see whether it's possible for us to get some data sets from the Microsoft
database, because crawling the Web is very, very time-consuming from our
(inaudible), so we want to know whether we can share some data. And secondly,
to maximize the outcome, there are Australian Research Council research grants
which encourage collaboration between universities and industry: if industry can
put in one dollar, the government will match it with more dollars.
So that's all. Yeah. Thank you.
(Applause)
>>: Time for questions.
>>: Yes. If it would have been feasible to look at maps, what kind of
inconsistencies would you have looked for?
>> Zhi Quan Zhou: Yes, yes, it is very feasible. The reason we didn't look at
maps is because of the speed. But we know that we can identify many relations;
the relations here are just some examples, and we can identify many more. But
we also need to know which relations are the best, so we are also studying how
to identify the good relations.
>>: Other question? So I want to mention that -- oh, you have a question.
Okay. Go ahead.
>>: Yeah. When you look at counts and compare the different search engines
here, the first question is how many samples you are using. I understand you
actually use one key word, and when you use two key words, you actually find
the count is higher. Different search engines have internal triggers, you know,
based on internal ranking, of course, that's dynamic ranking and static ranking.
So for example, in our search engine, when you put two key words there, my
confidence is actually higher; that's why I give you more counts. And if you give
me one key word, my confidence is actually lower, because many of the
documents have just one key word and I don't have confidence in the result;
that's why sometimes I give you a lower count. But different search engines set
up different triggering rates, and we don't know, for example, how Google sets
them up. So judging purely based on the counts can be biased.
>> Zhi Quan Zhou: Yes, this is a threat to the internal validity, because there is a
concern about whether there is an internal threshold: after you give more
information, you get more pages. We plan to test this, because we don't know;
we haven't talked with people in Microsoft, and we also don't know what happens
inside Google.
So my point is, first we want to know whether they have such an internal
mechanism. And secondly, if there is such a mechanism, I think from the users'
perspective they should know: rather than saying nothing is found, the engine
should tell the user how the pages are found. And certainly the count part is
minor; the most important focus of our research is on the ranking part.
>>: Thank you. Let's thank our speaker again.
(Applause)