>> Philip Chou: So I'm delighted that Jacob Chakareski has been able to come back to MSR. He almost didn't make it. His flight was canceled yesterday, and he crawled in at 3:00 in the morning today. Jacob was an intern here many more years ago than I care to convey. He got his master's and PhD degrees at Worcester Polytechnic, Rice and Stanford, and besides Microsoft, the was an intern at HP Labs, and he's had full-time positions at Vidyo and EPFL, and now as a professor at the University of Alabama. He's got very wide interests, from wireless communication to social networking, and today he'll talk to us about multicamera networks. Jacob. >> Jacob Chakareski: Thank you, Phil. Thank you for the nice introduction. It's a really big pleasure for me to be here. I would say that my whole career started with my internship here, and I'm very grateful to Phil for that, so I always felt like home coming back here, and I really enjoyed being in an environment where you can be so productive and interact with people from so many different backgrounds. So since Phil introduced me well, I will skip this slide. I have a slide on my interests, which span graph-based signal processing, immersive communication, meaning like visual communication with video signals captured from multiple viewpoints, and I do a lot of stuff related to computer networking related to datacenters and network coding. And I will be very happy to talk to you on all these topics if you are interested. How many people are doing computer networking here? Probably no one, right? So an acknowledgement is due at this point. The work that I will be presenting was supported by this career award that I won in Switzerland when I was there from the NSF, and an acknowledgement is also due to my collaborators, who helped make it happen, and last but certainly not least, as I said, to Phil Chou, because my career really changed dramatically after I did my internship here with him. And he has always been there, providing like really good feedback and advice whenever I need it, so thank you, sincerely. Now, what is immersive communication? Immersive communication is, as I briefly said, the act of communicating visual information captured from different perspectives simultaneously. It has a number of applications in remote control, entertainment and so forth. The two major points of interest that arise there is how do you reconstruct efficiently a viewpoint from where you actually don't have a physical camera installed based on the captured signals? And another one is how to efficiently communicate this information to a client that interacts with the scene? And in addition to a variety of applications, this whole field has a more broad societal impact, I would say, because it could lead to advances in energy conservation -- for instance, people could telecommute rather than work at the office, and that would save, I think according to the Department of Energy, up to 25% of greenhouse gas emissions. It could lead to improvements in quality of life and also to advances in the global economy. And the National Academy of Engineering lists it as one of the 14 grand challenges of the 21st century. So in terms of like a high-level overview of a multicamera system, there are three stages that one could eliminate. One is capturing the content and encoding the content, then transmitting the content, and then finally how the content is reconstructed at the client. 
And what typically happens in such interactive environments, there is a client who interactively switches from one viewpoint to another. And then the server would either send a whole set of captured data points -- viewpoints -- and then the client basically selects the desired viewpoint from its local buffer, or the server may only send those viewpoints that are necessary to reconstruct a virtual viewpoint that the client is interested in. And that is illustrated here, where a client requests a viewpoint v, which is in between -- we don't have a laser pointer, right? So in essence, the client requests a viewpoint v, which sits -- I always think that I am attached to the computer. So it requests a viewpoint v that is in between the two actual viewpoints that are captured, and then the server actually sends those two referent viewpoints, called right view or left view, and then the client uses it to synthesize these virtual viewpoints, or it could display both of them on a stereo display. Now, the way this virtual view synthesis is done is via depth image-based rendering, and I imagine many people here are familiar with it. The way it happens is that for each captured viewpoint, there is a video signal and a depth signal, and the depth signal basically describes the distances of different objects in the scene from the camera. And then using a procedure called 3D warping, these reference viewpoints, these reference signals, being the video signal, is warped to the virtual camera's viewpoint perspective. And since we have two reference viewpoints that are warped to this virtual viewpoint, then we need to blend them, and there are procedures to do that. So just an illustration, how do these signals look like? The video signals looks like a regular image, and then the depth signal looks like a low-pass filter version of it, where there is a single intensity value assigned to all objects that are at the same distance from the camera. So these signals could be either estimated from the reference signals, video signals, using stereo matching, or they could be also captured by time of flight sensors. So the specific scenario that I will consider as part of this presentation in the rest of the talk, is as follows. We have a set of cameras, capturing a viewpoint, capturing a 3D scene of interest. And there is a client that is interested in interacting with the scene dynamically. Now, there an intermediate compression step that is involved before the content is actually sent to the other end, and these are like the viewpoints that the client can reconstruct. Now, the sender may decide, instead of encoding all captured viewpoints, it may decide to encode only a subset of the viewpoints, if the sender thinks that only a subset of them would be sufficient to provide good reconstruction quality of any viewpoint within V1 and Vn. Oh, I'm so stupid. Sorry. I'm really sorry. So we have a continuous range of viewpoints, starting from V1 to Vn in which v could be. Okay. Now, the specific setup may involve transfer bandwidth constraints, whether the net bandwidth may be unknown at streaming time and also may be limited and time varying. And, also, there may not be feedback from the user to begin with, and our goal would be to maximize reconstruction quality of the content for any possible viewpoint that the viewer may select to reconstruct. So one typical and very good solution that people have approached in situations of similar nature is to do scalable coding. 
So scalable coding of content, in particular video, is convenient because it automatically adapts to the network conditions. If the network conditions are better, you could send more. If the network conditions are not that good, you could send less. And then the reconstruction quality of the content at the client scales with how much data it receives. So it's optimal also in that sense, meaning you could reconstruct the content in multiple ways depending on how much bandwidth you have between the sender and the receiver. And people have looked at various codecs in the past, like JPEG 2000 and H.264 SVC, where scalability has been introduced. Are people familiar with the concept of scalability? Okay, that's good. So our take on scalability in this case is to provide joint view and rate scalability. So what does this mean? Let's say we have already a set of cameras that we want to encode, and then at compression, at any point where we want to encode more content, we could add an additional chunk of rate called delta-R to the new data that is being encoded. And as an output, we create a binary stream that we send to the client. Now, we could do two things when we encode this next subsequent chunk of data called delta-R. We could either refine the subset of views that are already encoded in the bit stream, meaning we have already encoded some viewpoints, video signals associated with this capture viewpoint. There's a lot of terminology, so please forgive me if I make an error there. And we could add this increment of rate delta-R to improve their quality, meaning if you had more data rate, the video quality when they reconstruct it will improve. Or what we could do instead, we will not refine the already-encoded viewpoints in the bit stream, but we could insert another viewpoint that has not been encoded by then. So we have like two axes of scalability, view or rate refinement. So the framework that he used comprises three components. The first one is a multi-view video coder that we apply to the video and depth signals that are captured. Okay, and then we use depth image-based rendering to synthesize views that the user is interested but they're not captured. And we have an optimization procedure, which would basically decide whether we code a new viewpoint, meaning insert a new viewpoint in the screen, we refine the existing viewpoints that are already encoded or we interpolate, meaning we don't decide to insert a viewpoint at all. So in terms of coding the video, what we use is a shape-adaptive wavelet transform, and it is adaptive because it adapts the wavelet filtering to the object boundaries in the image, so that will allow us to avoid highmagnitude coefficients that could arise if you applied the wavelet transform across object boundaries, across edges, in essence. Now, once we have that, we apply set partitioning in hierarchical trees, which is another algorithm, so that the resulting bit stream exhibits this fine rate granularity. If you have any related questions, you can always stop me along the way. I'll be happy to answer your questions. Now, for view prediction and interpolation, we use a 3D warping algorithm that has been proposed somewhere else, and in order to avoid background pixels, so we're writing foreground pixels. We maintain a depth buffer, meaning when we do the warping, different objects in the scene that are projected to the virtual viewpoint may lie on different depths, so we choose the one that is the closest. 
And then we do this weighted blending of projected views, according to the distance to the synthesized view location, and I'll explain that in a moment. So here is how the problem looks like, if you start describing informally. We have a set of captured views. And then we denote by capital V-sub note as a subset of viewpoints that we choose to encode. Okay, now since each viewpoint comprises texture, a video signal and depth signal, we will need to assign encoding rates to each one of them, and our end result will be a compressed bit stream phi. So what we need to do is to figure out how good each of these viewpoints will be for the reconstruction of a general virtual viewpoint. We need to figure out what's the reconstruction error for a viewpoint, given a bit stream, and that's this quantity that I will refer to. >>: Quick question. >> Jacob Chakareski: Yes, please. >>: Does this assume that the existing viewpoints are stationary, or can it adapt when they move? >> Jacob Chakareski: It can adapt if they move. It can adapt. But in the first setup here, we assume that they are stationary. I don't see a reason why it cannot be [plied] when they adapt, when they change location. Now, the quantity that we will -- since we don't know what viewpoint the user may select in the end, what we are interested in looking at is the aggregate distortion, meaning the viewer can pick any viewpoint between Vn and V1. What's the total reconstruction error that the viewer may observe? Now, in order to go about it, like to figure out what is this, we have developed in another work a model that basically quantifies the reconstruction error for a virtual viewpoint given an encoded bit stream as a function of the two nearest reference viewpoints, meaning those signals that we captured, and as a function of the location, basically where relatively this viewpoint is compared to the two reference viewpoints, where X basically characterizes how close it is to one or the other reference viewpoints. And I'll show you how the model works. Here, on the x-axis, I show the relative location of the virtual viewpoint, zero meaning it's aligned completely with the left reference, one meaning it's completely aligned with the right one, and on the y-axis, I show the reconstruction of mean square error, and this is the Middlebury data set. So the red graph shows our cubic model, and the blue graph shows the sample values. So we could see that we have a really good match between the actual values and the predicted values by the model. Now, what is nice about having a model, then, we could actually include it into an optimization framework where we could decide how the rate allocation should be performed. So additional things to consider within the setup is we say we have a certain network bandwidth, which is constrained between R-min and R-max, and we can set up a parameter called the rate granularity level, delta-R. So based on that we could figure out how many layers of scalability we could have, and then for every layer that we need to additionally encode within the bit stream, there are a few variables that we will assign. One is what are the encoded capture views by that point that are present in the bit stream. And rate vector basically denotes how much rate we have assigned to each of these captured viewpoints by that point. 
And then the view vector comprises then what were the viewpoints that are present in the bit stream at every new subsequent layer, so this is a vector and it's a collection, basically, of these quantities. And since our procedure is embedded, these sets also are embedded, meaning naturally in the beginning, if you included viewpoints three and six, these subsequent sets will include viewpoints three and six and maybe something more there. It's exactly what I said there. So then, using our aggregated distortion formula, we could compute what's the aggregate or the total distortion that is observed at the client, given the rate allocation and given the selection of viewpoints to be encoded up to layer L. So typically what we would like to have when quote things in a scalable way -- oh, sorry. I don't know why this is up here. Whoops, I think I slipped one slide further, so I apologize for that. So we would like to minimize across the rate vector and the viewpoint selection vector the aggregate rate summed up to layer L, such that at every subsequent point of adding a new layer, that rate doesn't exceed this constraint, where this constraint says whatever we have added up to that point, it should not be greater than R-min, which is the minimum rate that we started, plus all this L minus 1 delta-R increments, meaning every new layer should not exceed this value of delta-R. So using our old friend, the Lagrange multiplier method, we could transform this problem into a non-constrained optimization where we would use Lagrangian multipliers to include this constraint within the objective function. And what that allows us, we could then reformulate this aggregate distortion as a distortion depending -- as a sum of incremental distortion or reductions in distortions that are introduced whenever a new layer is added. Okay. So with a little bit more math, if we set R-min to be delta-R, and there is no reason why we could not do that, we could reformulate the problem as follows, and then we could group everything into one big Lagrange term, which depends on this delta-F and the rate assignments, where now we have introduced different Lagrange multipliers that are indexed by L and K, and they're dependent on the regional Lagrange multipliers. Would there be any questions at this point? Yes, Phil? >> Philip Chou: Just trying to check whether I'm understanding what's going on. so you're trying to -- so you have these different rates, and you have a set of cameras, and you're trying to determine for each rate that it's L bit rates how many bits to allocate to each camera. >> Jacob Chakareski: And which cameras to select to encode. So you do two things, because remember now we could be scalable and do two, like, let's say orthogonal axes. The first is like what viewpoints do we select to encode and how much data do we allocate to each encoded viewpoint. >> Philip Chou: So I guess, in the script, V0 is the set of cameras to encode. >> Jacob Chakareski: Yes, yes, exactly. >> Philip Chou: And R sub -- okay, that thing. >> Jacob Chakareski: R sub L means like ->> Philip Chou: The vector of rates for each of the cameras -- >> Jacob Chakareski: Exactly, that I encode. Yes, yes. There is a lot of notation. I apologize for that. So that's why I would like you to feel free to ask questions whenever they arise. >>: When you are trying to optimize, given some delta-R, there is an existing state and you are given some additional [indiscernible], and you are trying to see how the delta-R will be distributed. 
>> Jacob Chakareski: How I distribute the delta-Rs, exactly. Do I distribute the delta Rs to views that are already encoded? Or do I say, okay, fetch another non-encoded viewpoint and introduce it into the bit stream? >>: Or do a combination of both, part of the delta-R goes to a new view and ->> Jacob Chakareski: No, delta-R could only be used for either encoding new view or refining existing views. >>: Depending on how ->> Jacob Chakareski: Yes, yes, and you could break down this into such a small delta-R, so you could basically separate the two decisions. So solving this jointly is a very complex problem, as you can imagine, and what we resort to is a greedy optimization where we say if you fix all our prior decisions up to layer L, we could then compute what would the optimal decision be for encoding the next subsequent layer L plus one, because doing this over, across layers, it's really complex. Yes. >>: What you just said, that's strange, because if your delta is strong, then to encode a new layer, you're not going to get anything if you have just a small thing. So you're never going to start encoding a new layer if your sector is too small. >> Jacob Chakareski: Right, right, right. So it's not like tiny, tiny small delta. There's a certain impact on delta, but in the paper, we actually look at a range of values, like where the delta could make an impact on how the optimization works, and there is a flat region over which it's safe to choose this delta value. And I will show you, actually, that's a good point. So what is interesting also now -- so keep in mind that I could start now from a layer zero, compute the optimal decision, and then go to layer L plus one, compute the optimal decision and so forth, given the previous decisions. So what we could do, even like computation complexity is still manageable, we could go for two layers and compute them jointly, given previous decisions. And we observe that this is better, because we now have video and depth signals, so what the optimization -- if you are constrained to adding just one layer at a time, the optimization may really decide to encode new viewpoints, because what it wants to do is add typically like a video enhancement rate and the depth enhancement rate simultaneously. So if you do these things jointly, like with two subsequent layers together, that problem goes away. And what we do, in effect, as I denoted here in blue, we actually compute the current and the subsequent layer jointly. We could still do that. But computing all of them together, it's impossible. Yes. >>: You can do the opposite. Instead of adding layers, you could remove layers. >> Jacob Chakareski: Yes, that's true. That's true. That's true. You could like encode all the way to the end and then maybe start throwing layers. That would be another greedy approach. Which would be the least distractive layer? Yes, which would be the least distractive either rate increment being refining encoded view or like a viewpoint to be removed. You can also do that. >>: How does your distortion change with the depth precision in this model? >> Jacob Chakareski: How does it change with the distortion? >>: So, for example, the depth is not [indiscernible] if I'm exactly looking at the view that I'm interested in. If that view is coded and if I want the exact same view angle, then I don't need the depth information. So even if we didn't code the depth, the distortion on that would be only the compression distortion. 
>> Jacob Chakareski: Well, the thing is like, since we are looking at coding both depth and distortion, they both have impacts, like how viewpoints are reconstructed. So what we observe that in some points, at some stages, the optimization actually decides to throw in a new depth signal first before introducing new video signal first, because it realized that adding this new depth value there, it's beneficial. >>: But how does that incorporate into the distortion function? I might have missed that. >> Jacob Chakareski: Sure, no problem. We have this model which basically computes what reconstruction distortion as a function of this distortion of the two reference viewpoints. >>: You also need the distortion for the one viewpoint and the other viewpoint, and you also need the quality of depth information. >> Jacob Chakareski: That is in the distortion. >>: That's not just covering the distortion. >> Jacob Chakareski: Yes, it's both. It's both, it's both. Thank you for that question. Where did I stop? Okay, so the way our algorithm works then, to run this, we initialize this set, V0, and we initialize with the two like furthest most reference viewpoints, and we set a value to delta-R. And then we set all the assigned rates up to that point to be zero, and since we just have two viewpoints included in the bit stream, we just have four values there. And we start with the index L, meaning this will be the first layer, and then we iterate as long as l is less than capital L, what we could do, we choose from one of these two options, refine, meaning pick one of the viewpoints that are already encoded, either image or depth, assign an additional delta-R to it, or we could insert V, where V is one of the viewpoints that have not been encoded yet, and again, we could insert image or depth. And the action is the one that leads to the smallest incremental distortion. And then we increment L and we run this until we are done. So this is just an illustration how the algorithm works, so these bars, the vertical bars, the wide bars basically denote how much rate we allocate to each viewpoint, and then the axis here, R, basically denotes the progression of rates as the rate increases. So let's say we start, and we choose where we need to throw in this delta-R, so it could be in any of this bar, so to speak, so for instance, here, we assign it to V2, and then at the next rate increment, this one goes to V1, and the next rate increment, this goes again to V1 and so forth. This is just a graphical illustration how the optimization works. Now, then, we said how about we go one step forward. How about if we know actually something about the user in terms of its actions and we could even try to speculate what views the user is interested in selecting. Can we combine this with our optimization? So in order to do that, first we said, let's go first quantize this space of viewpoints that a user can select, and we generate a discrete set, V-bar, which is bigger than the set of captured viewpoints, and then we say as a first try, let's see if you can model the user actions as a Markov chain on this set V-bar. So accordingly, we could assign a transition matrix that would basically describe the probabilities of the user selecting the different viewpoints, given where the user is at that point of time in the viewpoint space. And this has like an analogies with what people have observed, for instance, in IPTV channel switching, how users flip between different channels in this kind of a model. 
So what will such a model allow us? First of all, if you could anticipate the user's actions, we could reduce the application latency, meaning we could send or encode only those viewpoints where we expect the user may be selecting from, and also it could enhance coding efficiency, meaning we don't expect the user to be checking certain viewpoints, we might not include them into the optimization at all and not spend bits there. So the way the formulation then proceeds, we have a horizon over the user actions, over which we say we will speculate what views the user may be selecting, and then given a viewpoint I that the user selected at time TI, we could define a state in the Markov state space, which is basically determined by the set of viewpoints that the user selected up to that point. And then, using our transition matrix, we could describe the probability of that state, how likely it is if the user samples viewpoints across this horizon it tends in a certain state. Starting from initial state, we say at some point maybe we receive feedback from the user and we know where the user has been at that point. So then we could use this approach to compute the aggregate expected distortion that can be associated to the user interacting with the scene, where we multiply the probabilities of the user being in each state times the cumulative distortion that accounts for all the viewpoints that the user experienced along that trajectory. So then, what we would like to do is basically say we would like to minimize this distortion of the whole horizon that we are looking at such that the array that we send to the user at any point of time doesn't exceed the channel capacity. In this case, the capacity of the channel may be the user's downlink bandwidth. The nice thing about this is the objective function is actually separable, so we could find these allocations of rate across all layers that we may encode for every time instance separately. And then another thing is that typically, like these user actions may lead to very sparse matrices P, simply because there's a viewpoint that they select may be concentrated in a narrow range, so that makes computing things easier. So rather than summing over all views, we could only sum over a smaller set of views that we believe the user may select from as time propagates. So another interesting point to note is that if we have, let's say, a frequency at which the user may send feedback to us, saying I have selected this viewpoint, and then after some time we select another feedback saying I am at this point, we could start to such a framework, the impact of this frequency of receiving feedback from the user, meaning that if this H is zero, meaning the horizon which is zero, that means we know the user actions at every TI, and then if age is infinity, we basically go back that first model where the server receives no feedback. So we could see how the optimization scales as the frequent at which we could hear from the user changes. So going now to experiments that I would like to show you, the results from experiments that I would like to show that illustrate the efficiencies of our methods, we will examine the coding efficient when you apply our framework to multi-image data, as well as to multi-view video sequences. And we'll use two reference schemes. 
The first one basically says I will do uniform allocation across all captured viewpoints, and then I will encode it and see what is the reconstruction quality of the user, meaning each of the viewpoints is treated equally. And then I'll use another method called H.264 SVC. It basically uses the latest scalable video coding extension of the H.264 standard to code across views, independent in every time instance, so that the user can switch at every viewpoint, basically to allow random access to the user. So in terms of multi-image data sets, I will use these sets that are shown here, Rocks and Middlebury that people have typically used in experiments and resolutions are given here, and they comprise five captured viewpoints. And then for the video data sets, we will use data sets that have been captured here in Microsoft many years ago, and they are Breakdancer and Ballet, and they feature eight camera viewpoints, and the frames are captured at 15 frames per second frequency. So further details about the experimental setup, we basically assumed that for the Markov model that we started, there could be three virtual viewpoints between every pair of captured viewpoints, and the way we measured video quality so the luminance mean square error. Now, for the capture viewpoints, remember, the user may also select capture viewpoints, and then we could measure what's the reconstruction error, because we have a reference. Now, how do you go about measuring reconstruction error for an interpolated view? And the way we do that, we basically synthesize that interpolated view using the original, non-coded signals, using the original signals for the two reference views, and then we compare that to reconstructing that viewpoint from encoded versions of those signals. And then in terms view selection models, we use two models. One is a balanced one, meaning like if a user is sitting at a viewpoint at some point of time, probability 0.5 the user may stay in that same viewpoint and probability 0.25 it may switch to the left or to the right of where the user is sitting. And the unbalanced one is basically pushing the user into selecting one of the views to the left, preferentially, with much higher probability. And when we explore the impact of user feedback, we say that for image data, we say that users can send feedback to us every five time slots, and for video, we extend that to 50 frame intervals. So now the graph is like really complex here, so I'll give my best to explain it to you. On the x-axis, I show rate in bits per pixels, and on the Y axis I show PSNR, and this is for the coding efficiency for the data set Rocks, so remember we are talking about image data now. Another thing to keep in mind, we are dealing with a balanced view switching model that I introduced. Now, the different graphs, what they represent is measuring video quality as a function of how much encoding rate we can spend at every time instance, as time involves. Remember the feedback that we could receive from a user is five slots, so we have a time instance from zero to four at which we measure video quality, which is described by different colors. And in this case, we compare the adaptive approach. >>: Could you talk about in video, do you do any motion prediction across planes? >> Jacob Chakareski: I will come to that, I will come to that, I will come to that. >>: These are all images? >> Jacob Chakareski: Yes, these are all images. These are all images. 
So the two things that we compare, the adaptive approach, meaning we say we know this probability of how a user selects views, and non-adaptive approach, where we say we'll use our optimization, but we say we don't know anything about the user, meaning each of the viewpoints is selected uniformly. And one thing to notice is that knowing these view preferences associated with the user can help a lot in terms of coding efficiencies, so you can see, for instance, that this blacker here denotes video quality at time points T1 and T2 for uniform P, and this one here denotes the same video quality for non-uniform P. And then we could also see differences in terms of whether we measure video quality for capture viewpoints or for synthesis viewpoints, which also makes sense, simply because of the fact that for the captured viewpoints, we don't do any synthesis, so their video quality will respectively be higher, as is illustrated by this red graph here with the boxes, compared to this magenta curve. Yes. >>: Sorry to be so slow. >> Jacob Chakareski: No, no, please. There are so many parameters. >>: The time, for example, so these are two different -- let's say you have T=1 and T=2, what does that mean? >> Jacob Chakareski: Exactly. So let's say time zero, like I see feedback from the user, I selected viewpoint V=4. Okay? And then at time slot T=1, we don't know where the user could be. >>: Because your horizon is five. You're going to know every five. >> Jacob Chakareski: Exactly, exactly, so then at time instance V=1, we basically say, if I sweep the rate axis, how much rate I allocate, and measure video quality of the client? This is what kind of curve you will obtain. And then at time is equal to 2, I do the same thing. Time is equal to 3, time is equal to 4, time is equal to 5. And there are also two other parameters that come into play. I could do that for uniform P, where the user could pick any viewpoint along this range of viewpoints that it could select, or I could select this adaptive. I would know what the user -- I would have a model of what the user may be likely to select. And if I use that in the optimization algorithm, there will be different rate allocations compared to the uniform case, so I'll have like two graphs for that, as well. Does it make sense? >>: So far. But you're showing one rate graph, so when you say the curve for T=1 and T=2, that means it's the same curve for both. >> Jacob Chakareski: Yes, yes, yes. Yes. >>: And V=4 means the user was at V=4 at time zero? >> Jacob Chakareski: So what V=4 means, like, rather than showing you like an aggregate of all views, what this shows is we pick one viewpoint, and then we measure video quality that the user may observe at that viewpoint, for that particular time index. So that's why I said there are so many parameters. I apologize for that. So when I say V is 4.25, that means I'm dealing with a virtual viewpoint, because we index the capture viewpoints with integers, and we index the virtual viewpoints with non-integers, meaning between 4 and 5, I could have 4.25, 4.5 and 4.75. >>: It's one quarter of the way between? >> Jacob Chakareski: Excuse me? >>: It's one quarter of the way between 4 and 5. >> Jacob Chakareski: Yes, exactly. So between 4 and 5, I have three virtual viewpoints, and then I could measure their video quality at any point in time. There are a number of parameters that come into play, and maybe this was not the most ideal way of representing them. 
In the paper, you can see all this information and read it at the same time. Yes? >>: [Indiscernible] four, for example, then you're doing Monte Carlo on what was on the region of the last information that the viewer sent? >> Jacob Chakareski: I could measure it anywhere. These are just a few viewpoints that I included here. I could measure it anywhere. I could measure it at viewpoint 0.25, 1, 2 and so forth. >>: Could it [indiscernible] squared? I'm measuring that T2, view 4, adaptive scheme. >> Jacob Chakareski: Yes. >>: The last time you received information about the ->> Jacob Chakareski: It was at T0, yes. >>: Was at T0. What did you receive at T0? >> Jacob Chakareski: I received the information saying at this T0 I selected -- I have to look at the paper. Maybe I selected viewpoint 4 or 3.5, I'm not sure. I have to look at it up. So this basically shows, at viewpoint 4 what would be the video quality if the user selected at viewpoint 4 at time T -- T=2, I am sorry. Balance sheet, I'm giving you what view ->>: [Indiscernible]. >> Jacob Chakareski: Yes, sure, sure. Yes, that is true. That is true. That is true. But what I'm showing here for a given viewpoint that the user selected at zero, how video quality would involve, like if I use the uniform or the adaptive model. Yes, you know what the user selected, it makes certainly a difference. But there are so many parameters here, it's really hard to combine everything. And I don't know how meaningful it will be to -- just let me finish the point, to maybe show an aggregate distortion of all the views, because we talked about this with [indiscernible], basically, some old views, and they maybe show there. Maybe you could have done that. I think maybe I have a graph later on that shows it, when I try to talk about motion compensation, but there are a lot of parameters into play, but I hope the setup is understandable by now. The user selects some ticket T0, and then its actions evolve over time, and every time point, like every sub-segment time, I say from the range of views that the user may view, what typical video qualities the user may observe is a function of how data rate we can assign to the bit stream. Maybe, I don't know, like having a 3D or 4D graph maybe would help. >>: You might have alluded to that in the last statement. So the graph, the x-axis then is basically if you were to run the exponent over and over again with different rates of publication, with different delta at each time, that would be the point. So if you got to do a delta-R with only 0.2 at each instant, that slice would be what I would observe. >> Jacob Chakareski: No, so to correct -- and I apologize, I'm not trying to be rude. Did you finish your statement? >>: Yes, I'm trying to understand. >> Jacob Chakareski: This is the total rate. That's why we're jumping ahead. This is the total rate. >>: That's the total rate, so what does it mean to say -- what I'm trying to say, are time and strength T=0 for V4, I wouldn't expect a graph. I would expect a single point. What does the graph mean. So at time T=0, V=4, depending on your settings, there will be a fixed rate associated at that time. >> Jacob Chakareski: Yes, yes. >>: In that case, then what is the x-axis representing? >> Jacob Chakareski: The x-axis represents what is that rate. The rate could take different values, the total rates. >>: So you're running this multiple times for different rate values? >> Jacob Chakareski: This is the total length of the bit stream. 
The bit stream may be like 0.6 pixels, maybe it would be like 1.2, 1.4 and so forth. >>: But that allocation depends on how you allocated the rate for a single occurrence. So the total rate could be 1.5, but at some point, you may not have allocated the entire thing. So at T=0, maybe you will allocate it for only 0.5. >> Jacob Chakareski: Can you clarify that? >>: This is image, right? The way I understand it is, in the beginning, you allocated all the bits needed for the picture. >> Jacob Chakareski: In the beginning, I haven't allocated anything. And then I say, as you know, increment is L, delta-L, two times delta-L, three times delta-L and so forth. I may choose to refine or insert new viewpoints, and in the end, I may end up with a bit stream that is long, I don't know, like let's say 10 megabytes. So then I compute what is the equivalent rate in bits per pixel, and then I show that on the x-axis. This is not delta-R. This is the total rate. >>: Yes, it's probably not delta-R, but at T=0, do you expect the curve to be -- I'm still trying to see. >> Jacob Chakareski: So at T=0, the video quality may be different. I may allocate 5 megabits per second, I may allocate 10 megabits per second, I may allocate 20 megabits per second to the total bit stream, right? Remember, T and delta-R are orthogonal things. It's not that I start at time T and I assign delta-R and then I go to two times T and I assign two times delta-R. They're two independent things. T describes the evolution of the user over time. He's sitting in a Markov chain, and he says I will go to select this viewpoint and this probability. I will go to select this viewpoint and this probability and so forth. Now, delta-R is orthogonal, meaning you know that your bit stream decreases by delta-R up to this value, irrespective of time. >>: I think that makes sense from a video point of view, but from an imagine point of view, once you have transferred some data, then that data is already there. It's not changing, and all you can do is to continue adding more data as time progresses, basically, and since this is an image example, that's where I'm trying to -- let's say at T=0, if you had transmitted everything. Let's say you had bits, where at T=0 you had transferred everything. Then, at later stages of evolution, you don't have anything more to send, and the graph in that case would be a flat line. Is that the correct understanding? >> Jacob Chakareski: Yes. We have spent a lot of talk about that, so maybe we could discuss it offline. It may be that the content changes itself, and also another thing that we considered is that, even though it's images, like it was hard to incorporate that feature in how the compressed bit stream is created, to have that scalability as well. Given that you know what is at the user, can you encode something further so that still you maintain the scalability? We can talk about that offline. Yes, we had that consideration in mind. Yes, sir. >>: Could you talk about the bump there on the black T1, T2 line -- the black line with pluses. >>: You have 1.2 bits per pixel. There is that bump on the black line. >>: It just jumps. >> Jacob Chakareski: Here. That's a good point. >>: Why? >>: Presumably the RT curves are roughly convex, and that's when it jumps out from ->> Jacob Chakareski: That's true, that's true, that's true. >>: It's one of the views, and then that waits to transmit enough ->> Jacob Chakareski: I suspect that's what happened. 
>>: You don't use it ->>: Until a point where it makes a difference, and then it goes up. >> Jacob Chakareski: Yes, yes, yes, yes. >>: The rates seem low. The quality seems low for the rate, so that's like half a bit per pixel you get at T=28. >> Jacob Chakareski: I have to see what the equivalent number is in total that you are spending for the data. Maybe it's in the paper. We can look it up. Any other questions? Thank you for all the interaction. I really appreciate it. So I have a similar set of results for the unbalanced view switching model, where now the user has a certain preference for switching let's say to the left compared to staying in the same view or going to the right. But in general we observed -sorry. We observed the same let's say differences between knowing the user action, knowing this adaptive transition probability model and the uniform P and, again, the same difference between video quality observed when the user selects a capture viewpoint compared to when the separate selects an interpolative viewpoint. Now, switching to video data, here are some results for Breakdancer, and again, we are going back to the balanced view switching model, so the three systems that we compare is ours, who I knew the probability distribution of view preferences selected by the user, a uniform P model and H.264 SVC. So what I show here on the axis is total A, but now it's not bits. It's pixels, but rather like a total bitrate is megabits per second. On the y-axis, I show the average distortion. And again, if we have like a lot of discussions with my coauthors on what to show on such a graph. You could think of multiple perspectives, so you could see all the possible trajectories that a user might select and find the average, and what we have shown here is I think the average distortion that a user experiences along the typical trajectory. Meaning we looked at how the user selects the different viewpoints along time, and then we computed what is the average video quality or the average distortion that the user will experience along that trajectory. And that is a function, of course, of how much data rate we can send. So that's what we show on the y and the x-axis. And we can see that not knowing anything about the user leads to the worse performance. Now, H.264 is in between. And we could see that our approach outperforms it, too. >>: Not knowing anything about the user, it's the same optimization without having any idea. >> Jacob Chakareski: Yes, exactly, what the user does. Exactly. >>: And how does the user select? >> Jacob Chakareski: The user selects this viewpoint according to this model. The user is a selected viewpoint, and it says with probability 0.5, at time is T plus one, I may select the same viewpoint, and with 0.25, I may switch left or right. >.: With the circles, you assume that you know that model, you assume that he is at least following that model. >> Jacob Chakareski: Yes, yes. >>: What if the user does not follow that model? >> Jacob Chakareski: Probably I don't have that here included in the slides, but we did also some analysis of that. That's a good point. Yes. Yes, John. >>: So the H.264 effectively also sort out follows the uniform distribution of the user? There's a gap between H.264 and the other curve is because of your effectiveness of scalable coding? >> Jacob Chakareski: Well, the thing is like, here is what H.264 does. It takes all the captured viewpoints and it encodes them predictively, as it will encode video, but across space, not time. 
And that's what it does. So it encodes all of them. Now, in our approach, we may choose not to encode all of the views, and then we may choose not to assign the same data rates to each of the encoded views. >>: So if P is uniform, why would you not encode all of the views. >> Jacob Chakareski: Well, because you could still use the capture viewpoints to interpolate other viewpoints, like you may still get some benefits, throwing away captured viewpoint and then just restructuring it from a non-captured viewpoint. >>: So why the gap between the H.264 and the uniform P? >> Jacob Chakareski: Well, in this case, recall that here the uniform allocation is basically allocating equal data to each of the views, and that's what it encodes. >>: There's no prediction of ->> Jacob Chakareski: No, no, no. So it treats all of the views equally here. Whereas here, we apply all of our optimization and take advantage of what the user may select. Yes? >>: Why is H.264 measured from 4 to 15 bitrate, and everyone else was measured ->> Jacob Chakareski: Yes, that's a good point. Well, the thing is, like the way the codec is configured, we explain that in the paper, if you use six MGS layers, like so they have predefined configurations, how many layers you can encode and how many temporal and rate layers you could have, that's what the codec provided. It's not as fine rate as ours. >>: So again, even in the blue curve, you're not using any prediction between views in the coding? >> Jacob Chakareski: We do? >>: Prediction spatially between views in the compression? >> Jacob Chakareski: We do, yes. >>: And in the red curve also there is prediction across views spatially? >> Jacob Chakareski: Here, what we do, we encode all views, but using uniform rate allocation. >>: Uniform rate allocation, but does it use prediction across views spatially? Does one view predict from the other? >> Jacob Chakareski: Yes, yes, that's a good point. I think we do. I think we do. I think we do. >>: And still we use uniform allocation for the views, though there is predictive coding between views. >> Jacob Chakareski: Actually, let me take that back, because I think we did a number of let's say uniform systems, so I think this is the most referenced one you can think of, where each of the views is encoded independently, assigning the same data rate to each viewpoint. And they're encoded H.264, like with no spatial prediction. I can look it up in the TAP paper. >>: That makes sense, because with no spatial prediction, I would expect the two graphs to go ->> Jacob Chakareski: Yes, I think this is the simplest one, uniform reference, that we could think of. And I have another graph where it shows where all these gains come from. Are there any other questions? So here is another graph that basically illustrates the temporal evolution of video quality, so I have shown you a graph where we saw video quality versus rate, and now I show you a graph which shows video quality versus time. So again, there are like three colored graphs which basically show the performance of the three reference schemes, ours, a uniform allocation and H.264 SVC for three encoded data rates. One is 14, the other one is 9.4, and the other one is 4.7, and he is like the -- I think I don't. Okay, so it's nice. So here we have also labels of one trajectory that the user selected over time, so these numbers here, 4.25, say at time zero the user selected a virtual viewpoint index 4.25. 
Then it went back to 4, then it went to 3.75, 4, 3.75, 3.75 and so forth, and it does this with a temporal frequency that we described. And you can see that, for instance, video quality of course improves when you add more data rate. But then, depending where the user is, like whether it selects a captured of interpolated viewpoint, that quality may vary, which is expected, because typically the video quality of interpolated view is much lower. And at the bottom, we also show the corresponding curves for SVC, so SVC is actually this dotted line, so we could see how we also output form as we see over time. And here below is that uniform approach that I described. Did you have a question, or someone else was raising his hand? Oh, it was you. Okay, please. >>: Something that doesn't sound very intuitive is that the rates, when you switch from 4 to 3.75, there's a big loss of PSNR. In 3.75 to 3.5, now the additional loss is much less. And my intuition says it should be the opposite, because 3.75 is somewhat close to 4. See, when you jump from 4 to 3.75, boom, you go down. >> Jacob Chakareski: Wait, there is no 3.75. There is 4.75. >>: No, no. The position, the user view selection. See, when you are at position 3.75 ->> Jacob Chakareski: Okay, here? >>: There. So you had a big dip in PSNR. You back to 4, the PSNR goes up again in many of the curves. >> Jacob Chakareski: Yes, yes. >>: And then when you go to 3.75, it down again. That's expected. >> Jacob Chakareski: Yes. >>: But then from 3.75, when the user two steps later goes to 3.5, there is only a small drop. And intuitively that sounds strange, because there is a big drop, because I'm interpolating. >> Jacob Chakareski: Right. This one is also interpolated. 3.5 is interpolated. >>: Right, but what I'm thinking is the asymmetry, how much more loss in PSNR in moving from 4 to 3.75 than it is from 3.75 to 3.5. >>: Well, four is the captured curves. >> Jacob Chakareski: 4 is a captured view, 4 is a captured view. And another thing to keep in mind. >>: Which means that on this one in particular, the interpolation hurts quite a bit. As soon as you switch off even a little bit from the capture view. >> Jacob Chakareski: Yes, yes, the quality drops. And it depends on the distance -- 3.75 is closer to a reference view, so you saw how the curve looks like, error versus distance. In the middle, it's typically the highest. >>: Never mind, you're right. It should be -- sorry. >> Jacob Chakareski: No problem, no problem. You also have to keep in mind that since we all have this frequency of the feedback receiving, sometimes it may even happen that the interpret view may improve in quality if it's just after we know what the user selected. Because then we have a narrower range of views over which we speculate, and then we can improve that same interpolative view quality, just because we are closer in time. You're welcome. So the last thing about performance of the framework is basically, as I said, how do all of these pieces contribute in terms of improving performance? So on the x-axis, I show you rate -- again, total rate at which we encode, and on the y-axis, I show video quality. The content is Breakdancer in the balanced view model. So the simplest, most referenced is uniform allocation, and we call it simple allocation now. That's here. 
Now, if we use our optimal allocation, but we say the user model is uniform, then we get bumped up up to here, and then here's H.264 SVC in between, and then we have the full optimal model we take advantage also of the user model, as well as we run the optimization. >>: I thought H.264 does not have a user model. >> Jacob Chakareski: No, no. It doesn't have. I mean, in the encoding process. >>: So your optimal allocation without the user model loses to H.264 without a user model? >> Jacob Chakareski: Yes. >>: Why? >> Jacob Chakareski: You know like in H.264, what we do here, like we encode the views, and in the optimal allocation, we don't take advantage of the user model. We don't take advantage of the user model. Everything is encoded uniformly. Yes. The reason why we believe -- I think it's maybe even explained in the paper. Here, the user model is such that the encoding is still not very efficient, like when the user switches according to this model. So knowing the model, introducing into it improves a lot. And H.264 uses all the tools that you can think of, like motion prediction and so forth, and we are using a compression algorithm that is ->>: The underlying compression is simpler than the ->> Jacob Chakareski: It is simpler. We don't use any of the HVC and SVC models. We are much better in terms of how fine you can be in terms of rate granularity, but we don't use any of the tools that they have. And I think this last slide will answer what John was asking. Then we said, okay, can we actually try to introduce temporal prediction while maintaining this finegrained scalability? And it turns out, depending on what user model you use, that may be beneficial or not. And remember, since we are using the wavelet codec, wavelet codec is actually not very good in encoding over time, because these images that you obtain when you try to find the differences between the two viewpoints, they look very noisy. So you don't get a lot of benefits when you try to include also time across here, unless you try to also estimate motion prediction and so forth. But then the codec becomes very complicated and in the bit stream it's hard to manage. So anyhow, to make the long story short, we looked at two cases. One, we said if you have the balanced view switching model, and the user doesn't switch a lot, and we try to do simple temporal prediction, meaning we take the next video frame and then we subtract it from the previous video frame, and we encode the differences. On top of what we do, like so far, we could see that with temporal prediction, we do get some gain, if we include that into or model. Because recall our model encodes things in an embedded way, but it doesn't exploit temporal relation between video frames of the same viewpoint. So if we add that, without doing motion search rendering, just like simple temporal prediction, we do get some gain maybe around 1 dB, a little bit less. Okay, but since everything is done with a wavelet codec, the wavelet codec has been shown that they are not very good in dealing with motion compensation. Now, once we switch to the unbalanced view switching model, meaning that the user switches frequently, we actually get hurt by if you try to introduce a temporal prediction. And the reason being is like when you try to do temporal prediction and then we encode these image differences, they require so much rate in order to be encoded, you actually become less efficient. 
And this is illustrated here on the graph, where the black curve shows how much PSNR versus rate we could achieve if I do temporal prediction compared to the case when I don't do temporal prediction. So this is still a topic to investigate and maybe try to incorporate more complicated, more motion compensation models such as those used in typical video encoders. But then, this whole concept of fine-grained scalability may be at a danger, so to speak, because it may not be possible to maintain it. So, in conclusion, how much more time do we have? Okay, perfect. I'll wrap up. I wanted to have a few more slides, but then it's not an excuse, my flight got canceled and I had some other things. I wanted to tell you how maybe this framework could be used to tackle some other problems, so first, to summarize, I have shown you a rate and view scalable multi-view coding framework, which comprised a scalable rate coder and an optimization algorithm that, on top of it, also took advantage of a view switching model that may exist about the user view selection actions. We have observed that at least to enhance efficiency and interactivity. Now, this is why I wanted to have an illustration. One could imagine that we could use this model to basically figure out how many cameras do we need to cover certain scenes sufficiently well enough. So there is something called Cisco Stadium Solutions, and what they provide is a framework where you are able to place a number of cameras, and even Comcast has this, like for sports matches, and then they deliver this to the client, where the client could only switch between the captured viewpoints. So you could think that we could use our own framework there, where we could basically figure out how many cameras Cisco needs to place in order to deliver video quality that would not go below certain criteria. And there are other applications where one could imagine using this analysis framework to figure out how many cameras we could place in a remote monitoring camera sensor networks and so on and so forth. So some of the things that we are working on is how to extend this to multicast, so let's say we have a population of clients that are interacting with the scene, and then the user, like I said, in the stadium solution, you are at a stadium and then you want to flip between views on your tablet or on your, let's say, mobile phone. And then if let's say Cisco, using the stadium solution, wants to broadcast a stream to everyone, such that everyone is able to switch between different viewpoints, how do you do this? And how do you interact, incorporate the fact that some users may have correlated actions, like they're members of the same social network? Another thing we investigate is how to add on top of it channel coding methods like FEC and network coding, if let's say this delivered over wireless environments. And there are other things that one could consider here is like doing distributed coding methods and so forth. And another issue that we tackled in the presentation is how one could maybe design efficient temporal prediction methods. Now, if you guys don't -- are there any other related questions? I wanted to show you something completely non-related. Is it okay? >>: [Indiscernible]. >> Jacob Chakareski: Yes, yes, yes. Very, very high level. It's just to tackle your brains a little bit. So this is like I'm also biased, because what I will show you, it involves me, but still, it's also a bit of a philosophical question for me, how computing has evolved. 
So in the past -- I was not born then -- people say there was the mainframe, and people would go there, do their jobs, finish their computing, and then go back home. Then there was the era of the personal computer, where people could take their computers home and do their computing there. Now, there was another development that evolved sort of independently: the network. The network appeared in between, and what the network allows us to do is use the computer not only as a device for computing but also for communication. And that allows us, if my animation is correct, to go back to the mainframe, because now we can do things remotely and basically compute everything in the cloud. So what we would like from the cloud is to run everything on the go and to do these things completely in the browser. One could imagine a number of possibilities: for example, people who do graphics and are, let's say, at the airport in Singapore and want to edit an image, but their laptop doesn't have the GPU capabilities and so forth, could connect remotely and do it there. This also allows us to charge users for a software application per use. The user doesn't need to install the application locally anymore and pay for it for life; maybe he wants to use Microsoft Word for two hours and pay for that. And on top of that, what we would like, or what is desirable, is that if you have your own app, can you embed it in a browser? So when I go to the web page, I click on it, and I want to run a resource-intensive application. So with that, I would like to -- okay, I think I would like to show you a quick demo that suggests maybe this is all possible. >>: [Indiscernible] Photoshop, because Photoshop has changed the model, where you can't buy Photoshop on the client anymore. >> Jacob Chakareski: Yes, exactly. They have this Adobe Cloud solution for Photoshop. >>: Hi. My name is Nikola, and I'll give you a quick preview of Mainframe2. What makes Mainframe2 special is that everything just works. >> Jacob Chakareski: Yes. I hope the audio is -- you cannot see it. Why doesn't the video play? >>: There is nothing to install. You just click. >> Jacob Chakareski: Let me try. When the guy was here, we could actually do it -- I mean, the guy from system support. I cannot see it either, which is really bizarre. I think we changed the resolution, and then it disappeared. Let me try to go back to the original resolution. When the person who was helping with the computer was here, I could see it, because I played it and we tested the audio. There it is. >>: Why don't I suggest that we -- people who are interested in -- >> Jacob Chakareski: Let me try. >>: An application they want to use [Emcasa]. >> Jacob Chakareski: I'm sorry for this. >>: My name is Nikola, and I'll give you a quick preview of Mainframe2, which is a new type of cloud that lets you run any software in the browser. What makes Mainframe2 special is that everything just works. There's nothing to install or download. You just click on an application you want to use, and it comes up. All applications behave exactly as you would expect, because they're exactly the same, which also means that there are no code changes required if you want to put your application on Mainframe2. Everything is really snappy. Typing feels natural, mouse clicks are instant. While we're in private beta, you'll get a real-time response from the Western US.
Everything works fine, even if you're on the other side of the world. It's just a tiny bit slower. Very soon, we'll be launching servers in many more places. It's easy to bring your own files from cloud storage, like Box. We use it as a cloud hard drive, which means that your files are never synced to Mainframe2. They remain secure, so you don't have to worry about someone else accessing them. And since your files are already in the cloud, you don't have to upload them or manually sync them to your devices. Your cloud storage already does everything for you. From a Mainframe2 application, you open a file directly from Box and then do your work. When done, you just save it back to the cloud. Everything is automatic. In fact, software often runs much better in the cloud than on a local computer. That's because our servers have faster processors, better graphics, more memory and faster disks than most computers in the world. So hopefully this gives you an idea of how much easier it is to run applications in the cloud and connect from any browser with no plug-ins required. We're still in private beta, but you can sign up and get early access at www.mainframe2.com. Thanks. >> Jacob Chakareski: I'm sorry for a little bit of PR, but since I'm involved in it. >>: You probably know, but we're in this business as well. >> Jacob Chakareski: One thing to remark on about Rico's question: Adobe has this model, but they still require you to install a small client on your end when you run it. Ours is completely HTML5. Exactly, and we have signed up a number of customers, but we still have a few licenses for research purposes. So if people who do graphics are interested, I could set you up with an account on the website. Then you can go there, run your things, install apps and see how your favorite graphics application works in the cloud. Thank you.