
>> Sing Bing Kang: Good morning, everybody. It's a pleasure for me to introduce Christian Richardt. He's a Ph.D. student at the University of Cambridge. His advisor is Neil Dodgson. His research interests are in stereo and non-photorealistic rendering.

So...

>> Christian Richardt: Thank you very much for the introduction. So today I'd like to talk about two pieces of recent work that both relate to coherent depth in stereo vision.

And they take slightly different views. The first part is going to be about how to extract coherent disparity maps from stereo videos. And the second part is going to be about how we perceive coherent depth from stereo images and what causes viewing discomfort in these images.

So the first part is joint work with Douglas Orr, Ian Davies and Tony Criminisi at Microsoft Research Cambridge and my advisor, and the second part is joint work with Lech Swirski, Ian Davies and, again, my advisor.

So starting with the first part of the work: our work was motivated by adaptive support weights, which is a very popular local stereo matching technique that produces good results but is also very slow; it takes about one minute to process a single stereo image. Our aim was to speed this up by several orders of magnitude.

And this is to make it practical for real time use and to make it fast enough so that coherent stereo matching from videos could be looked into.

So to briefly describe their technique: for computing the support weight of a particular disparity hypothesis at a pixel, they consider the distance between pixels in space, as well as in color.

And they use an exponential fall-off. So pixels that are further away within the support window, for example pixels in the corner, receive lower weight, and pixels that are different in color, like those pixels that are not actually on the orange lamp, also receive a lower support weight.

And the aggregated support is computed by summing over all pixels and normalizing again. This looks very much like a bilateral filter. So we reformulated the technique as a bilateral filter, using Gaussian weights instead of these exponential weights.

And the result is what we call dual-cross-bilateral cost aggregation, because the cost aggregation happens with both stereo images in mind, and color differences in both stereo images are considered. These are the first two Gaussian weights. And there's the distance weight as well, but it's only considered once instead of squared, as in the case of Yoon & Kweon.

So we found this produces slightly better results in almost all cases in our implementation.
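As a rough illustration of the weighting just described, here is a minimal sketch of a dual-cross-bilateral support weight: one Gaussian on the color difference in each of the two views, plus a single Gaussian on spatial distance. The function name, the standard deviations and the boundary handling are illustrative assumptions, not the exact values from the talk.

```python
import numpy as np

def dcb_weight(left, right, p, q, d, sigma_c=10.0, sigma_s=21.0):
    """Support weight of neighbor q for center pixel p under disparity d.

    left, right : (H, W, 3) color images (boundary handling omitted)
    p, q        : (y, x) pixel coordinates in the left image
    d           : disparity hypothesis (horizontal shift into the right image)
    """
    color_l = np.linalg.norm(left[q].astype(float) - left[p].astype(float))
    color_r = np.linalg.norm(right[q[0], q[1] - d].astype(float)
                             - right[p[0], p[1] - d].astype(float))
    spatial = np.hypot(p[0] - q[0], p[1] - q[1])
    return (np.exp(-color_l**2 / (2 * sigma_c**2))     # color similarity in the left view
            * np.exp(-color_r**2 / (2 * sigma_c**2))   # color similarity in the right view
            * np.exp(-spatial**2 / (2 * sigma_s**2)))  # spatial proximity, counted once

# The aggregated cost for disparity d at p is then the weighted, normalized sum
# of per-pixel matching costs over the support window around p.
```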

But the bilateral filter is also very slow. So we looked into different techniques for accelerating this aggregation approach, and we used the bilateral grid to approximate the aggregation.

And what we achieved is more than 200 times speed-up compared to the full kernel implementation. And this is more than 30 frames per second on all the Middlebury datasets.

And the way it works, roughly outlined, is that the original signal is embedded in a higher-dimensional space and subsampled; this space is then smoothed, and then the signal is sliced out of it again, so that it is filtered.

So it's an approximation of the bilateral filter that works very well for large standard deviations, which is the case in our approach.
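To make the splat, blur and slice steps concrete, here is a minimal sketch of grid-based filtering of one cost slice, guided by a grayscale image. The grid resolutions, sigmas and function name are assumptions for illustration; they are not the settings used in the talk.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def bilateral_grid_filter(cost, guide, sigma_s=16.0, sigma_r=0.1):
    """Edge-aware smoothing of a per-pixel cost slice via a 3D bilateral grid.

    cost  : (H, W) cost values for one disparity hypothesis
    guide : (H, W) grayscale guide image with values in [0, 1]
    """
    H, W = guide.shape
    yy, xx = np.mgrid[0:H, 0:W]

    # Splat: accumulate (cost, weight) into a coarse grid over (y, x, intensity).
    iy = np.round(yy / sigma_s).astype(int)
    ix = np.round(xx / sigma_s).astype(int)
    ir = np.round(guide / sigma_r).astype(int)
    grid_shape = (iy.max() + 1, ix.max() + 1, ir.max() + 1)
    data = np.zeros(grid_shape)
    weight = np.zeros(grid_shape)
    np.add.at(data, (iy, ix, ir), cost)
    np.add.at(weight, (iy, ix, ir), 1.0)

    # Blur: Gaussian smoothing in the downsampled grid (one cell of standard deviation).
    data = gaussian_filter(data, sigma=1.0)
    weight = gaussian_filter(weight, sigma=1.0)

    # Slice: trilinear interpolation back at each pixel's continuous grid position.
    coords = np.stack([yy / sigma_s, xx / sigma_s, guide / sigma_r])
    num = map_coordinates(data, coords, order=1, mode='nearest')
    den = map_coordinates(weight, coords, order=1, mode='nearest')
    return num / np.maximum(den, 1e-8)
```

Because the grid is coarse, splatting, blurring and slicing are far cheaper than evaluating the full kernel at every pixel, which is where the speed-up comes from.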

And here are some disparity maps for comparison. So this is the original adaptive support weights by Yoon & Kweon, in our implementation. It ranks about halfway down the Middlebury stereo evaluation website, and the runtime for this image was about 14 seconds on a GPU.

After our reformulation, the rank is slightly improved. The runtime increases a little because the computation is a bit more expensive. But after the speed-up, some of the quality has been lost. This is because the stereo matching now happens in grayscale only and is an approximation of the full kernel bilateral filter. So the rank drops a bit, but the runtime is improved by 180 times, so it's now very fast to run this. And at the time of publication, at ECCV 2010 last year, this was the fastest technique on the Middlebury website.

Now, these artifacts are obviously very undesirable, so we investigated ways of reducing them. We came up with a dichromatic approach: we used an additional color distance to get some of the discriminability back into our approach.

The problem is that the bilateral grid has exponential memory requirements in terms of the number of dimensions, so only one extra dimension could be accommodated in the memory on the graphics card. We analyzed a few different ones, and interestingly we found that the U channel is the most discriminative, by only a slight margin over the others that we considered.

But what it does restore is, for example, these stakes, which are very similar in the grayscale image: it does restore them, and it also removes some of the artifacts on the lampshade compared to the bookcase in the background.

>>: Question. I'm not sure this would work, but couldn't you run your cost aggregation, your support window, on all three color channels, or something like that, and just add them together, so you don't take this exponential blowup? In other words, run three sort of grayscale problems and then just merge them?

>> Christian Richardt: That's very interesting. So it would have to be run three times, and then it could presumably work. That's very interesting, yeah.

>>: You'd have to think about how the costs get added, what it means to do a multiplication or an addition or something like that. But basically there's some evidence in each of the three color channels, whichever slice you choose to use, or two in this case. But you didn't say what the runtime is here, but I presume --

>> Christian Richardt: Sorry, I glossed over that. The interesting thing was that the dichromatic approach is halfway, in terms of runtime, between the full kernel approach and the grayscale-only DCB grid approach.

>>: It means that instead of the 200 times speed-up, you get --

>> Christian Richardt: It's only one order of magnitude faster than the full kernel. So halfway, in logarithmic space.

>>: It's not bad, whereas if you just ran the problem three times in the three color channels, it would be three times slower --

>> Christian Richardt: Yeah, I hadn't actually considered this. An application that we looked into is spatial-depth super-resolution. This was proposed by Yang et al. in 2007, and they used Yoon & Kweon's aggregation approach to create high-resolution disparity maps from low-resolution disparity maps and high-resolution stereo input images.

But that technique essentially uses Yoon & Kweon as a single step. So where Yang et al. used Yoon & Kweon's technique, we can also use our technique by just slotting it in. And the improvement is more than 100 times faster. But, again, you lose just a little bit of quality in all these cases; this may be an acceptable trade-off in some cases.

But making the stereo matching so fast makes it possible to look into incorporating temporal evidence. Because if you apply stereo correspondence on a per-frame basis only, then the result flickers: video is noisy and can result in different disparity values over time.

So our approach is inspired by space-time stereo approaches, where we add another time dimension to our support window. We have a spatiotemporal support window where we consider five frames and a Gaussian fall-off over time.
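As a small illustration of how that fall-off could be combined with the spatial weights, here is a sketch of temporal Gaussian weights over a five-frame window; the sigma and the use of past frames only are assumptions for illustration.

```python
import numpy as np

def temporal_weights(num_frames=5, sigma_t=1.5):
    """Gaussian fall-off over time: frames further from the current one count less."""
    offsets = np.arange(num_frames) - (num_frames - 1)   # e.g. -4 .. 0 for past frames
    return np.exp(-offsets.astype(float)**2 / (2 * sigma_t**2))

# Each spatial support weight (as in the earlier sketch) is multiplied by the
# temporal weight of the frame its neighbor pixel comes from, so matching costs
# are aggregated over a spatiotemporal window instead of a purely spatial one.
```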

And the result of this can be seen in these videos. You can put on your glasses; the top video will be shown in [inaudible], so you might be able to see it.

The bottom two videos show disparity maps, the middle one for the per-frame approach and the bottom one for our temporal approach. You can see the flickering is reduced, in particular in the sky region of the skydiving video.

Some limitations persist, though, like disparities in [inaudible] areas, which are difficult to compute accurately using this stereo matching technique.

So these are qualitative results, but for these videos no ground-truth disparity maps are available. And we couldn't find any videos with ground-truth disparity maps available, so we decided to create our own using Blender. We created these five videos. Hopefully they'll start playing at some point, yes.

So they show very simple objects, because we are not really good modelers, but they exhibit different kinds of motion: for example, an object rotating, or object motion, or panning across the street, or moving around in this tunnel.

>>: So these disparity maps are incredibly high quality, right; is that because you didn't --

>> Christian Richardt: Actually rendered right from the depth --

>>: Oh, it's not -- this is the ground truth image.

>> Christian Richardt: Sorry, these are the ground truth disparities.

>>: No problem. Right.

>> Christian Richardt: Too many videos. I went one too far. Let's go back. So here you can see the results of our techniques. The ground truth is on the bottom left, and the DCB grid creates a fairly good result there, also on the [indiscernible] slanted surface, even though the aggregation approach only assumes frontoparallel planes.

And for our temporal technique, you can see it doesn't improve much, or at all, over the per-frame approach. That's because these videos are kind of the perfect case for stereo matching: there's no noise in them, it's ideal for stereo matching.

So in the next step we actually added some Gaussian noise to model this. And in this case our temporal technique improves quite a bit above the per frame technique, and also the original Yoon & Kweon approach and kernel reformulation show quite a few artifacts.

So some of the spatial sub sampling that happens in the bilateral grid seems to also produce some better results in the end.

So this is for the books video, and then also for this video.

>>: I mean, the amount of noise is higher than you would normally expect, but there's also something fishy about how badly Yoon & Kweon is falling off. There's always a [inaudible] inside every stereo algorithm, the adaptive noise. It seems like you didn't quite set that one right.

>> Christian Richardt: We actually ran this at different noise levels. And, of course, there are five videos. For most of the videos, our temporal approach improves on the per-frame approach quite noticeably in terms of the proportion of bad pixels.

So just to sum this up again: we rewrote the adaptive support weights as a dual-cross-bilateral aggregation filter. This enabled it to be accelerated using the bilateral grid, making it more than 200 times faster compared to a full kernel implementation, with only a small loss of precision. This made it possible to incorporate temporal evidence into the stereo matching process in real time.

Source code and datasets are available on the project website. So this marks the end of the first part. Are there any questions on this part so far?

>>: Do you have any insight into why changing those kernels improves things, or why the temporal data improves things?

>> Christian Richardt: I don't have any insights on that, unfortunately. I believe that using a Gaussian fall-off, so a quadratic exponential, gives more weight to local pixels that are closer, rather than the single exponential fall-off, which fits the assumption that similar colors produce similar disparities.

So the support windows tighten up a little bit. But the improvement is only very small compared to the adaptive support weights technique.

Okay. So the second part is more recent work that was presented at Computational Aesthetics two weeks ago, just before SIGGRAPH. And this is work that tries to predict the viewing comfort of stereoscopic imagery. So the problem is that visual comfort in stereoscopic imagery is very subjective, if not subconscious.

And it's an issue in visual perception. The assessment of this has traditionally been done by expert viewers or a panel of naive viewers. This is in the context of stereoscopic movie production, for example, where every day the produced content would be shown in a cinema on site, so they can check the stereo is correct and doesn't introduce any discomfort. This assessment is very time-consuming and also very costly.

Our goal is to provide an objective assessment of this visual comfort. And to do that, we rely on a computational model of the human visual system and analyze how a range of image manipulations, in our case photo filters, influence viewing comfort.

So there are many sources of discomfort when looking at stereoscopic imagery on displays or screens, and this part is only concerned with image discrepancies, so differences between the two stereo input images. But there's also a range of other issues that affect visual comfort, and they can be related to the physiology of vision, so the separation of the eyes, for example, but also to the particular display device and display technology used.

So everything but the first one is assumed to be constant for the purposes of this work.

So our model consists of three stages. The first stage starts with the left and right input images and applies an optical blur to mimic the optical properties of the human eye. This blur removes very high-frequency details.

The second step computes disparity maps using normalized cross-correlation.

This is computed from left to right and from right to left, and the consistency of the disparities is checked using a left-right check. This map, where white pixels indicate consistent areas, is summarized in a single coherence score that describes the proportion of pixels that are consistent across views; the inconsistent ones are all the black pixels here.

But let me go into more detail on all the stages. The first stage uses a weighted sum of two Gaussians to mimic the optical blur of the eye. This has been described by Gosslin [phonetic], Davila and Meshert [phonetic] for a particular scenario, with best focus and a three-millimeter pupil. It is also based on the assumption that one pixel in the image corresponds to 1.6 minutes of visual arc, which corresponds to the spacing between [inaudible].
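As a sketch of that first stage, here is a two-Gaussian point spread function; the particular weights and standard deviations below are illustrative assumptions, not the values from the model just cited.

```python
import numpy as np

def eye_psf(radius_px=8, w1=0.6, sigma1=0.5, w2=0.4, sigma2=2.0):
    """A normalized 2D PSF built as a weighted sum of a narrow and a broad Gaussian."""
    y, x = np.mgrid[-radius_px:radius_px + 1, -radius_px:radius_px + 1]
    r2 = x**2 + y**2
    psf = w1 * np.exp(-r2 / (2.0 * sigma1**2)) + w2 * np.exp(-r2 / (2.0 * sigma2**2))
    return psf / psf.sum()

# Each input image is convolved with this PSF before matching, e.g.:
#   blurred = scipy.ndimage.convolve(image, eye_psf())
```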

So this is what the point spread function looks like: one larger Gaussian and one smaller Gaussian in the middle.

The second step uses normalized cross-correlation, which has been proposed as a model of human stereo vision by Banks et al. and has been refined by Filippini and Banks in 2009.

I'm just going to try to explain normalized cross-correlation without showing the formula. I think some of you will know very well how this works, but I will try to describe it using images.

Starting from the input images, we apply the optical blur. The blur has now been applied; fine details, the high frequencies, have been removed, and the model also works in grayscale, so from now on everything is in grayscale.

So for a particular pixel and a particular disparity hypothesis, two windows are considered in the two views. And these windows are first normalized: zero-mean normalized.

>>: You're describing the Banks model here.

>> Christian Richardt: Yes.

>>: You're saying it works in grayscale?

>> Christian Richardt: This is my reading.

>>: No, that's fine. I haven't read it. But does that -- you make that assumption. Does that mean that those of us who actually see the Tsukuba scene, the things that pop out, like the pencil popping out, are somehow abnormal, because in grayscale we can see there's no difference between the pencil and the background, right?

>> Christian Richardt: That's a very interesting point. So this is actually work from the visual perception literature, and they only consider grayscale random-dot stereograms, so there was no color component. This is why I use the model as it is. But, yes, different --

>>: Just send an e-mail to Marty Banks, ask him: do you truly believe that this system only works on luminance? And therefore if we have an isoluminant stereo display -- I think we know in general a lot of perceptual systems don't work particularly well in the isoluminant case. But has he quantified this? Maybe there are follow-up papers or something like that.

>> Christian Richardt: That's interesting.

>>: It jumped out at me. It's not about your work, it's about his, because you had just showed us how the color makes a difference, right, in the performance.

>> Christian Richardt: Yes, but this also results in a limitation of our particular approach: color differences between the images are not considered, so if they are isoluminant, they will not show up as issues.

So after normalizing these patches, they're multiplied together on a per-pixel basis. And what Filippini and Banks contributed on top of this is to apply a Gaussian fall-off to mimic cortical receptive fields.

Again, pixels that are further away are given less weight. This is then averaged together, the average over all the pixels, and this results in the correlation score.

And this correlation score of about 0.04 is fairly low for this particular disparity hypothesis. The highest score, in this case about 0.9 -- well, the disparity with the highest correlation score is then taken to be the disparity of this particular pixel. So it's a very simple stereo matching technique by the standards of the stereo vision literature, but it's very interesting that it correlates very well with human stereo vision.
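Read literally, the matching step amounts to picking, per pixel, the disparity with the highest Gaussian-weighted normalized cross-correlation. The sketch below follows that reading; window size, sigma, disparity range and the unweighted mean subtraction are simplifying assumptions, and the loops are written for clarity rather than speed.

```python
import numpy as np

def ncc_disparity(left, right, max_disp=32, radius=6, sigma=3.0):
    """Per-pixel winner-take-all disparity from Gaussian-weighted NCC (left to right)."""
    H, W = left.shape
    gy, gx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    w = np.exp(-(gx**2 + gy**2) / (2.0 * sigma**2))      # receptive-field fall-off
    padL = np.pad(left.astype(float), radius, mode='edge')
    padR = np.pad(right.astype(float), radius, mode='edge')
    best_score = np.full((H, W), -np.inf)
    best_disp = np.zeros((H, W), dtype=int)

    for d in range(max_disp + 1):
        for y in range(H):
            for x in range(W):
                if x - d < 0:
                    continue                               # matched window would leave the image
                pl = padL[y:y + 2*radius + 1, x:x + 2*radius + 1]
                pr = padR[y:y + 2*radius + 1, x - d:x - d + 2*radius + 1]
                pl = pl - pl.mean()                        # zero-mean normalization
                pr = pr - pr.mean()
                denom = np.sqrt((w * pl**2).sum() * (w * pr**2).sum()) + 1e-8
                score = (w * pl * pr).sum() / denom        # weighted correlation
                if score > best_score[y, x]:
                    best_score[y, x] = score
                    best_disp[y, x] = d
    return best_disp
```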

And the third step of our model checks the consistency of the disparity maps and identifies inconsistent pixels. The disparity maps computed from left to right and from right to left should describe roughly the same view, so the disparities of corresponding pixels should sum to 0. If they don't, they're considered to be inconsistent.
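A minimal sketch of that left-right check and the resulting coherence score, under the sign convention stated above (corresponding disparities should sum to zero); the one-pixel tolerance is an illustrative assumption.

```python
import numpy as np

def coherence_score(disp_lr, disp_rl, tol=1.0):
    """Map of consistent pixels plus the fraction of pixels that are consistent.

    disp_lr[y, x] maps left pixel x to right pixel x - disp_lr[y, x]; consistency
    requires disp_lr[y, x] + disp_rl[y, x - disp_lr[y, x]] to be close to zero.
    """
    H, W = disp_lr.shape
    ys = np.arange(H)[:, None].repeat(W, axis=1)
    xs = np.arange(W)[None, :].repeat(H, axis=0)
    xr = np.clip(xs - disp_lr.astype(int), 0, W - 1)       # matched column in the right view
    consistent = np.abs(disp_lr + disp_rl[ys, xr]) <= tol  # white pixels in the talk's map
    return consistent, consistent.mean()                   # score between 0 and 1
```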

So these are the three steps of our model: we start with the input images, compute the two disparity maps, and then check their consistency and summarize it in a single coherence score. This score is necessarily between zero and 100 percent. Our hypothesis was that this score correlates with human comfort ratings. So to evaluate this, we ran a perceptual study with 20 participants, and we showed them 80 stereo images each on our passive polarized stereo display.

>>: Is this independent of the size of the disparity? Shouldn't it be dependent on the size of the disparity, because that would [inaudible].

>> Christian Richardt: Yes. In particular, we deliberately showed images with disparities in very similar ranges, a fairly comfortable range. Excessive disparity is one of the sources of discomfort that I listed in the beginning; we're not assessing the impact of excessive disparities.

>>: But what are you really assessing in this case? It seems like, for example, distribution or something like that, there's a big factor that will, in your model, make something -- because then you can't match.

>> Christian Richardt: Yeah. This is definitely one of the limitations. It's a limitation of our model, because the stereo matching is very simple. It has these issues in occlusion areas. I'm showing some examples of this shortly.

So we showed these stereo images on our stereo display using polarized glasses, everything in color, unlike here. And we asked our participants to rate the visual comfort, how comfortable they perceived the particular image that we showed, on a five-point Likert scale from very comfortable to very uncomfortable. The 80 images are composed of four original images -- you can use your glasses for this -- three images from the Middlebury dataset, Books, Moebius and Teddy, and then one that we created ourselves, a rendered city model.

And all of these have been edited so that most of the content is behind the screen plane, so it's more comfortable to look at. This is also to counterbalance some of the other sources of discomfort.

And all the anaglyph images in this part are shown in grayscale, so as to reduce the discomfort that you would see if they were actually in full color.

But all the images that we showed to participants are actually in color. So we showed these four original images; for each image we applied 19 photo filters and also showed the original image, to compare how much a particular filter influenced viewing comfort relative to it. Here are the 19 photo filters applied to an original image, for example the city image.

And this should be, out of all the effects, should be the most comfortable, because it's just rendered straightforward. There's no inconsistencies there.

But, for example, the track haul [phonetic] effect should still look fine. One issue is that these light and dark strokes actually have slightly random lengths, so they're not entirely consistent across the two views, but it should not matter too much.

>>: So you're applying the filters independent of the images?

>> Christian Richardt: Independent of the images, exactly. So originally we considered implementing a range of non-photorealistic rendering techniques, but then used Photoshop filters instead. They implement a much larger range of image manipulations, like segmentation and thresholding and all kinds of things. And, in fact, we first looked at a larger set of 51 photo filters and then identified these 19 to show to a larger group of people.

And another example is this glass style. It displaces pixels in both views, but it displaces them consistently. So what it actually looks like is as if you're looking through a glass pane. This is known as the shower door effect; I'll use this term later on.

But in comparison, this ocean ripple effect applies different displacements to the two views. So it's quite difficult to actually fuse the two views. And this is one of the most uncomfortable images that we had in our dataset according to human ratings.

And this also -- well, actually two more examples. So this is the photocopy effect. And it worked surprisingly well, in my opinion, because it just extracts high frequencies. It's like an edge detector, essentially, but it's very coherent. It works very well.

And that's the last example. This texturizer effect applies the same texture to both views, so it's like looking through a slightly translucent layer onto the scene.

And we analyzed our 1,600 viewing comfort ratings, first using correlation, to find out how they relate to our model's ratings. We found that our model is as good as any particular participant: the correlation between pairs of participants is about the same as the correlation between our model and any participant. This is the first conclusion we arrived at. And the other one is that our model correlates strongly with the mean rating.

So that means taking the ratings of all 20 participants and merging them together, because that's more indicative of how the average human feels about stereo content than how any individual feels about the images presented.

But we are also interested in looking at where our model differs from human perception. So we scale our coherence score, between 0 and 100 percent, into the same range as the human comfort ratings, between 1 and 5 on the Likert scale, using a least-squares fit. Using this fit, our model predicts 70 out of 80 images within one unit of comfort. And in this histogram you can see the distribution of the differences between the predicted and the human comfort ratings.
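For concreteness, this is the kind of linear least-squares mapping meant here; the numbers below are made up purely for illustration and are not the study's data.

```python
import numpy as np

coherence = np.array([0.95, 0.84, 0.60, 0.30])   # model coherence scores (hypothetical)
mean_rating = np.array([4.6, 4.1, 2.9, 1.8])     # mean Likert ratings, 1..5 (hypothetical)

slope, intercept = np.polyfit(coherence, mean_rating, deg=1)   # least-squares line
predicted = slope * coherence + intercept
within_one_unit = np.abs(predicted - mean_rating) <= 1.0       # "within one unit of comfort"
print(predicted, within_one_unit.mean())
```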

And most of them are centered around a difference of 0, which is very good. But there are also some outliers. For example, these five outliers on the positive end, where our model overestimates visual comfort, are due to the shower door effect, which I showed some examples of earlier; it seems that our model can tolerate this much better than human observers can. On the other end, where our model underestimates visual comfort, the effects rely largely on noise, and the human visual system is tolerant to small levels of noise, whereas our model seems to be more brittle in that area.

>>: I'm wondering why you assume that the scale is linear there?

>> Christian Richardt: Well, for the correlation I didn't actually assume that the scores relate linearly; it was just the simplest fit I could think of, and the fit actually worked pretty well. But all the statistical data is available on our project website, so I'd be interested in whether there's more to be found there in the data.

Before running this perceptual study, like I mentioned before, we actually looked at a larger set of images using six input images and 51 PhotoShop filters. We looked through all these 306 images with two viewers. And we looked at -- tried to identify the issues that we saw in those stereo images.

And we found that there were only three major issues in our dataset. And they're due to binocular rivalry, the shower door effect, and also randomness in the images.

And I'm going to describe what we mean by these three categories, each with a small example, and also show some example images later. So binocular rivalry is an alternation in perception due to mismatched stimuli. In this particular example you should see vertical lines in one eye and horizontal lines in the other eye. This is impossible to fuse, because such a stimulus is not physically possible. What you should see instead is that one eye dominates for a few seconds, so you see, for example, horizontal lines, and then after a few seconds the other eye will dominate and you'll see vertical lines instead.

And this crops up whenever there are strongly conflicting image regions between the stereo half-images. A range of effects can cause this, for example segmentation with logical operators, or color quantization, because all of these can influence the boundaries of objects, and they can be modified differently in the two views, not necessarily consistently.

And one particularly strong example of this is the palette knife effect, where large areas of flat color are applied, resulting in quite a few areas of binocular rivalry.

>>: I'm wondering if you've done any experiments where you keep the left image constant and do something to the right image; for example, you know, compress the right image, or have a different resolution, affect the resolution.

>> Christian Richardt: I think there's been work that has done that. There was some work that considered a wide range of different manipulations as well, like blur or vertical disparities or compression, all lumped into that.

>>: Just to clarify what the model predicts in those situations.

>> Christian Richardt: It's interesting. I will look into that. I can run that on this data; I can actually try this. So far we only applied it on images that are manipulated in both views.

The second category in our taxonomy of stereo issues is the shower door effect. This is a term used in the non-photorealistic rendering literature; it describes the effect as if you're looking through some frosted glass.

So here there are some lines behind the screen and these dots on the screen plane, and they actually detract, making it quite difficult to fuse the lines in the background. This example is as if a texture is composited identically into both stereo images, because that results in a zero-disparity plane that effectively sits on the screen, and you have to kind of look through that before you can actually see the content of the scene.

The example I showed you before was this glass effect. But in general the shower door effect doesn't have to be in front of the content. It could also be behind. If there's some content that actually comes out of the screen in terms of disparity, then the disparity -- then the shower door effect on top would actually contradict the depth cues, and this would be more uncomfortable than looking through this layer of glass.

The third category I identified is randomness. This is a category that applies whenever the same effect, applied twice to an image, produces different results.

But as I mentioned before, the human visual system is actually quite good at tolerating small levels of noise. It's just when there's more noise, like these wiggly lines that wiggle differently in the two views, that they're difficult to fuse, which makes it uncomfortable.

And the simplest example of this is per-pixel noise, like this film grain effect; you can actually see it quite well here. It should still be possible to fuse it quite well, because noise at this level can be [inaudible] by the human visual system.

Building on this, we were also wondering how we could detect and localize these issues in the images. So we came up with a set of computational tools. The rivalry detector is already largely there, using the left-right consistency check, because it flags up the inconsistent pixels. What we do then is apply a Gaussian blur to mimic receptive fields. And here you can see these white areas are the inconsistent areas. These are the occlusion areas in this example, because it's the original input image and there shouldn't be any other inconsistencies.

So the score given to this image is something like 84 percent, not 100 percent, as you would expect in a perfect model.

It's also similar for this particular style. But for the palette knife effect I showed earlier, the large areas of inconsistency are flagged up.

And in particular for the stained glass effect, which is very inconsistent because it uses different [inaudible] in the two views, it's really difficult to fuse, and rivalry is flagged up across the entire image. This was in fact the most uncomfortable image we had in our dataset.

The second issue we identified was the shower door effect. We built a detector that uses likely disparities, those which have locally maximal correlation rather than only the globally maximal one. This is because the shower door effect will still show up as a local maximum in the correlation, but in most cases it will not be the strongest correlation score, because the actual scene background usually results in a larger correlation score.

So we identified all likely disparities in all pixels and then accumulate them in a histogram throughout the entire image. And this is the histogram shown here.

For the original image nothing is flagged up, and similarly for this cut-out effect.

But the glass effect that I showed earlier has a very strong peak at zero disparity, so it detects something in front of the scene: many pixels have a likely disparity of 0, which indicates that something is in front of the scene. This shows up particularly strongly in the texturizer case, where the texture is identical in both images.
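A minimal sketch of that detector, under my reading of the description: accumulate, over all pixels, a histogram of disparities whose correlation score is a local maximum along the disparity axis; an isolated spike, for example at the screen plane, then hints at a composited flat layer. The function name and the local-maximum test are illustrative assumptions.

```python
import numpy as np

def shower_door_histogram(scores):
    """scores: (D, H, W) correlation scores per disparity hypothesis.

    Returns a histogram over disparities counting, for each disparity, how many
    pixels have a locally maximal correlation score there.
    """
    D = scores.shape[0]
    hist = np.zeros(D, dtype=int)
    for d in range(D):
        below = scores[d - 1] if d > 0 else -np.inf
        above = scores[d + 1] if d < D - 1 else -np.inf
        local_max = (scores[d] >= below) & (scores[d] >= above)
        hist[d] = int(local_max.sum())
    return hist   # a pronounced, isolated spike (e.g. at disparity 0) flags a flat layer
```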

And for randomness, we compare the colors of corresponding pixels, and only of corresponding consistent pixels. If the color difference is above some just-noticeable difference, then it is considered an issue. So in this example, the map is scaled so that black is 0 just-noticeable differences and white is 2. So any light colors indicate inconsistencies between the two images on a very fine scale.
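A minimal sketch of that comparison: color differences of corresponding, consistent pixels expressed in multiples of a just-noticeable difference (JND). The JND value and the Euclidean color distance are assumptions for illustration.

```python
import numpy as np

def randomness_map(left, right, disp_lr, consistent, jnd=2.5):
    """left, right: (H, W, 3) float images; disp_lr: left-to-right disparities;
    consistent: boolean map from the left-right check."""
    H, W, _ = left.shape
    ys = np.arange(H)[:, None].repeat(W, axis=1)
    xs = np.arange(W)[None, :].repeat(H, axis=0)
    xr = np.clip(xs - disp_lr.astype(int), 0, W - 1)
    diff = np.linalg.norm(left - right[ys, xr], axis=2)   # per-pixel color distance
    jnd_units = np.where(consistent, diff / jnd, 0.0)     # only consistent pixels count
    return np.clip(jnd_units, 0.0, 2.0)                   # 0 maps to black, 2 to white
```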

The cut-out effect also looks largely fine. Some of these artifacts are due to the lack of texture in areas which local stereo techniques are not very good at handling. But for this reticulation style, there's quite a bit of randomness between the two images, because these blobs are placed randomly, so a lot is flagged up. And particularly for the film grain effect, where there's a lot of per-pixel noise, inconsistencies are seen throughout the image.

So, in summary, I described the first computational model for predicting visual comfort for stereoscopic imagery. This is ideal for automatic assessment, without any need for costly and lengthy perceptual studies. We can see this being used wherever an artist edits stereo content, as a small tool on the side that analyzes the content being created and flags areas that need further attention.

We also introduced a taxonomy of stereo coherence issues and tools to detect and localize these issues. The project website contains all the images that we showed to our participants, as well as the ratings and our statistical analysis.

Before we wrap up, let me use this opportunity to say that I'm looking for a post-doc position or something similar from spring next year. Thank you very much for your attention.

[applause]

>>: Can you comment on the second part? How do you actually see it being used? Is this specific to Photoshop filters?

>> Christian Richardt: This is a particular approach we used, just as a proxy for different techniques.

Where I actually see this used: it could be useful for any content, because it doesn't rely on having the original images available. It's not an image quality metric, which would compare how the disparity maps deteriorate when a modification is applied.

You can use it on just the content that you have. And you could conceivably use it, for example, for stereoscopic image editing, where you retarget disparities, and check that the edits applied to certain areas, like added noise, are made less prevalent or more consistent.

>>: You could use it, instead of for the next step, to give the people editing some feedback as they create scenes [inaudible].

>> Christian Richardt: Yeah.

>>: There are inconsistencies. You have occlusion. You have disparities, and all sorts of things. So just -- yeah.

>>: Just seems --

>> Christian Richardt: There are lots of things; you'd have to consider how the coherence changes over time. It's not just per frame, as in this case; we just do it on single images.

>>: If I was a PhotoShop effect designer, I would actually take the ground truth disparities that come with a bunch of images and actually apply it to that and use that as my correlation, see if this is something consistent rather than doing cross-correlation.

Now, the cross-correlation is good for images which you don't have any depth for, a random shot; then you need some sort of baseline. And it just -- you feed it, as part of your editing.

>>: Could apply to just a small window, too.

>>: Yeah.

>>: But then you do the editing, and you're going to need your -- it just seems unclear how it gets incorporated into the workflow of someone who is essentially working on a stereo scene, say, trying to manipulate it, and how this may be used.

>> Christian Richardt: Because, for example, it could be used in post production, when you edit both images independently for compositing effects onto something, making sure that the presentation in stereo is comfortable.

>>: Do people do that? Don't they render it in 3-D --

>> Christian Richardt: I think there's a lot of manual work involved as well. There are visual artists who edit images by hand and paint depth maps by hand. And obviously, if you can render it and create it in stereo, that's preferable, because it's most likely to be consistent.

>>: I have a question. Typically in most stereo pairs you have semi-visible regions due to occlusions. Right? And would this model predict that those cause the discomfort?

>> Christian Richardt: In the current incarnation it does, yes. It's a limitation of the model. One of the items of future work that we have is using a stereo matching technique with occlusion handling, or stereo --

>>: Why do you say it's a limitation; is it possible that people just find it uncomfortable to see those semi-fused regions?

>> Christian Richardt: Yeah. I've been wondering about that, but I haven't arrived at a conclusion yet, really. I think we probably have some mechanisms for coping with this, because we see occlusions every day.

>>: I mean, it's hard to say without running an actual test, right? But we see them every day, but if we sort of stare intently at a region which has a semi-occluded region something seems a little flat or strange. So something you could investigate, right?

And, I mean, ultimately people who care about this work are people who are producing, let's say, stereo movies and they want to minimize fatigue and things like that, right, so that would be an interesting question to look at.

>>: Stereo, so how would you extend your work to video? Because that's what we want to [inaudible], and are there other things that have to be taken into consideration?

>> Christian Richardt: I think there's some, there might be some masking that comes into effect as well if things move, if objects move very fast or depth perception also deteriorates in that case.

So that might be necessary for that. But I don't have any concrete ideas on how to extend this to video for now. I think the most promising approach is to actually model the stereo coherence between views, as well as the temporal coherence between frames, in one framework, rather than having independent components, because I don't think they're actually orthogonal to each other.

>>: It may be possible that the score is pretty low on a single frame, but played as a video it may look okay.

>> Christian Richardt: Yes.

>> Sing Bing Kang: Any other questions? Well, if not, let's thank the speaker once more.

[applause]
