>> Kori Quinn: So we're really excited today to have Sasa Junuzovic -- it's taken me a long time to practice that name -- here to give us a talk today as part of his interview loop. Sasa just recently defended his Ph.D. at the University of North Carolina Chapel Hill. His advisor is Prasun Dewan. Sasa is no stranger to Microsoft Research. He is an MSR fellow. He has done four internships here, as well as two on the product side. So he has lots of insights about things around here. I'm really excited to have him. >> Sasa Junuzovic: Okay. Thank you for the introduction, Kori. Welcome, everybody, to my talk. It's about collaborative applications. So let me start by defining what I mean by that. A collaborative application is an application shared among users in such a way that the input of one user affects the screens of all other users. Consider a shared checkers game. If one user enters a command to move a piece, then the piece moves not only on that user's machine but also on all of the other users' computers. The performance of these applications is extremely important. To illustrate, consider the Candidates' Day at UNC. Each year the computer science department at UNC invites a number of potential graduate students to view demos of the research done in the school. Unfortunately, some of these students are also invited to the Candidates' Day at Duke, which happens on the same day. So to give all of the invited students a taste of research at UNC, UNC shares the demo applications using an application-sharing system. So as long as the students at Duke have some sort of Internet-connected portable device, they can participate remotely. If the shared application doesn't respond quickly to the actions of the students at Duke or notify them of the actions of others in a timely fashion, a student may get bored and quit the session. As a result, the student may not end up coming to UNC. What's much worse, the student may end up going to Duke. Now, whether we've used collaborative systems or not, we've all experienced these kinds of performance issues in real life. Consider the commute from Seattle to Bellevue, which many people do by taking Interstate 90. Sometimes during the day, like rush hour after work, there are too many cars on the road and a traffic jam results. Nobody's happy and there's nothing we can do about it. Other times, such as early in the morning, there's hardly anybody on the road. Everybody's happy, so there's nothing we need to do about it. The rest of the time there's a sufficient number of cars on the road that the traffic is not moving as well as it could be, so some people are unhappy. But there are some things we can do. We can put in HOV or express lanes and at least improve the commute for some of them. From a computer science perspective, we would look at this performance issue in terms of available resources. So when there are too many cars on the road and a traffic jam results, we would say that the resources are insufficient and performance is always poor. When there's hardly anybody on the road, we would say that resources are abundant and performance is always good. These two cases bracket the case when resources are sufficient but scarce and it is possible to improve performance from poor to good. This case is called the window of opportunity, and my work focuses on the window of opportunity in collaborative systems, which brings me to my thesis.
For certain classes of applications, it is possible to meet performance requirements better than existing systems through a new collaborative framework without requiring hardware, network, or user-interface changes. So before I tell you how I built the system and what kinds of applications it actually helps, let me give you a flavor of what it can do. So this is an actual experimental result of a collaborative session with and without my system. The blue line shows the performance without and the red line shows the performance with my system. And as the vertical axis shows, the lower these lines, the better the performance. If we look at this initial period in purple, there's no difference in performance with or without the system I've created. But then shortly after this initial period, in this orange area, the self-optimizing system that I've built improves performance, and then sometime after that it improves it again. And in fact the performance goes flat, almost to zero, parallel to the X axis. And one question is what aspects of performance are being improved. Two important aspects of performance are local and remote response times. There are other important aspects, such as jitter, throughput, task completion time, and bandwidth. Like I say, all of them are important, but my focus is on the response times. The local response time is defined as the amount of time that elapses from the moment a user enters a command to the moment that user sees the output for the command. The remote response times, on the other hand, are defined as the amount of time that elapses from the moment some user enters a command to the moment a different user sees the output for the command. So what this graph here is really showing are the response time differences with and without my system. And, as we can see here, the system improved response times by 21 milliseconds here, by 80 milliseconds here, by about 180 milliseconds here, and by 300 milliseconds here. So the next question is when are performance differences noticeable. Well, studies of human perception have shown us that people cannot distinguish between local or remote response times below 50 milliseconds, but they can notice 50-millisecond increments and decrements in either local or remote response times. So according to that data, the system is noticeably improving performance. Now, the system I've created doesn't actually look at just the absolute response time differences when it decides what to do. The reason is that these differences may not tell the whole story. To illustrate, consider a two-user scenario in which we have two optimizations, A and B. And optimization A is going to give 100-millisecond response times to user one and 200-millisecond response times to user two, and the other optimization does the opposite. If we do a simple response time comparison, say the average response times, we would say that the two optimizations are equally good. But the response times of some users may be more important than those of others. So suppose user two is more important than user one. In this case we see that optimization A doesn't give as good response times to user two as optimization B, therefore optimization B is actually better. So to actually decide which optimization is better, we need external criteria, which are user dependent. Now, one use for the criteria, as I mentioned, is to favor important users. And the data needed to enforce that criterion is the identity of the users.
Another potential use for criteria is to favor local or remote response times, and the data we need for that is the identity of the users who input and those who observe. In general, the criteria can be arbitrary and require arbitrary data. So there's no way we can encapsulate all of them into any particular system. So what my system does is basically rely on the users to provide a response time function that encapsulates the users' criteria or requirements. And then the system is going to call that function with predicted response times for various optimizations and pass in some additional required data. In its current form it supports arbitrary response time functions provided by the users, but it only provides the identity of the users and the identity of those users who input and observe. So now the question is how does the system meet response time requirements better than existing systems. Well, I studied three important response time factors: collaboration architecture, multicast, and scheduling policy. Of these three factors, only the impact on response times of the collaboration architecture has been studied before. In particular, Chung has created a system that supports dynamic transitions between architectures and has shown that these dynamic transitions can improve performance in a collaborative session. I extended his work by actually automating the decision of when to make the switch. He never provided that. Now, the only other such work, by Wolfe et al., also presented a system that can choose the collaboration architecture automatically. That work was done concurrently with mine. It was published just recently. But the more important thing is that they use developer hints and rules of thumb to decide which architecture to use, which is different from the approach taken in my work, as you will see later. Now, the idea of using multicast in collaborative systems is not new. The T.120 protocol advocated the use of multicast to reduce the amount of data transmitted on the network. However, how such a tree was built or what impact it had in various collaboration scenarios was never studied. I'm the first to study the impact of multicast on response times and to create a system that can automatically decide whether or not to use multicast and what kind of multicast tree to deploy in collaborative systems. I'm also the first to study the impact of scheduling policies on response times and to create a system that can automatically decide which scheduling policy to use in order to best meet users' response time requirements. To give you a flavor of my work, let me focus and dig deep into one of these factors, the scheduling policy. And let's just assume for now that a single core is available to carry out the tasks in the collaborative system. Well, if we're going to talk about scheduling, we better define the tasks that we're trying to schedule. In general, there can be two kinds of tasks: those of the collaborative system application and those of external applications. And one interesting issue is how the scheduling of external application tasks impacts the scheduling of the collaborative application tasks. Unfortunately, the typical working set of applications isn't really known. So rather than guess a wrong working set, I just assumed a working set of zero. In other words, I didn't consider external application tasks. I focused only on collaborative system tasks.
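To make the response time function mentioned a moment ago concrete, here is a minimal sketch in Python. The function name, the matrix layout, and the helper data are hypothetical illustrations of the interface described in the talk, not the actual API of the system.

```python
# Hypothetical sketch of a user-supplied response time function. The system
# predicts response times for each candidate optimization and passes them in;
# the function encodes the users' criteria and picks the optimization it prefers.
from typing import Dict, List

# predicted[optimization][inputting_user][observing_user] = predicted response time (ms)
Matrix = Dict[str, Dict[str, float]]

def favor_important_user(predicted: Dict[str, Matrix],
                         users: List[str],
                         important_user: str) -> str:
    """Pick the optimization that minimizes the response times seen by one
    especially important user (e.g., the presenter in a shared demo)."""
    def cost(matrix: Matrix) -> float:
        # Average response time observed by the important user over commands
        # entered by every user (local when src == important_user, else remote).
        times = [matrix[src][important_user] for src in users]
        return sum(times) / len(times)
    return min(predicted, key=lambda opt: cost(predicted[opt]))

# The two-user example from the talk: optimization A gives user one 100 ms and
# user two 200 ms; optimization B does the opposite. Favoring user two picks B.
predicted = {
    "A": {"u1": {"u1": 100.0, "u2": 200.0}, "u2": {"u1": 100.0, "u2": 200.0}},
    "B": {"u1": {"u1": 200.0, "u2": 100.0}, "u2": {"u1": 200.0, "u2": 100.0}},
}
print(favor_important_user(predicted, ["u1", "u2"], important_user="u2"))  # -> B
```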
Now, these collaborative system tasks are defined by the collaboration architecture. A collaboration architecture views an application as consisting of two components: the program component and the user-interface component. The program component manages the shared state, while the user-interface component allows users to interact with the shared state. Therefore, the user-interface component must run on each user's machine, while the program component may or may not run on each device. Regardless of the number of program components, each user-interface component must be mapped to some program component to which it's going to send inputs and from which it's going to receive outputs. Two popular mappings have been used in the past. One is the centralized mapping, in which the program component runs on the computer belonging to one of the users, receiving inputs from and sending outputs to all of the users. The computer that runs the program component is called the master, while all of the other computers are called slaves. Another popular mapping is the replicated mapping, in which the program component runs on each user's machine, receiving inputs from and sending outputs to only the local user-interface component. Therefore, in this case, all of the computers are masters. Now, to keep all of the program components running on different computers in sync, whenever a program component receives an input from the local user, it sends that input to all of the other program components. Regardless of the actual program to user-interface mapping that we use, the masters perform the majority of the communication tasks. So in the replicated case, a master computer must send inputs to all other computers, and in the centralized case it must send outputs. One question is how is this communication achieved? With push-based communication, the information producer, the master, sends information to the information consumers, the other computers, whenever information is ready. In pull-based communication, which is a dual of the push-based one, the information consumer basically pulls information from the producer whenever the consumer is ready. And in a third type of communication, called streaming communication, the producer simply sends data to the consumers at some regular rate. All of these models are used in real applications we see today. And my work focused on push-based communication. Now, when you use push-based communication, one question that pops up is whether the commands or data are unicast or multicast. To illustrate the difference between the two, consider this ten-user scenario, in which user one's computer has to send commands to all of the other computers. With unicast this transmission is performed sequentially. And if there are a large number of destinations, this transmission can take a long time. Okay. Multicast divides the communication task among multiple computers. So when user one's computer sends a command to user two, these two computers can start transmitting in parallel. Basically, what multicast does is relieve any single computer from performing the entire communication, or transmission, task. The net effect is that with unicast only the source has to both process and transmit commands, while with multicast any computer may have to both process and transmit commands. Yes. >> Can I ask a clarifying question? >> Sasa Junuzovic: Yes.
>> So multicast in this case is not the multicast routing protocols that were big in the '90s for collaborative applications? >> Sasa Junuzovic: So I'm looking at application layer multicast. >> Okay. >> Sasa Junuzovic: Okay? Like I said, any computer may have to perform both of the tasks with multicast. Now, in addition to these two mandatory collaborative tasks, there could be other tasks related to concurrency control, consistency maintenance, and awareness mechanisms. While all of these mechanisms are important, in general they're optional, so I focus on the mandatory tasks. These tasks are in fact independent: in a replicated architecture a computer can process an input command and transmit it independently, and in the centralized case it can do the same with outputs. Now, the processing task is kind of a simple thing to understand; it's the job the CPU must do. The transmission task is slightly more complicated. So let's see what has to happen in order for a command to be transmitted to a destination. First, the CPU must copy some data buffers into some memory location from which the network card can then read the data, and then the network card can transmit it along the network. So there are really two kinds of transmission tasks: the CPU transmission task and the network card transmission task. If we think of scheduling, the CPU transmission task is schedulable with respect to the CPU processing task, while the network card transmission task is not. It always follows the CPU transmission. And if we use nonblocking communication, in fact, it can run in parallel with the CPU processing task. So in terms of scheduling, like I say, the CPU transmission task is schedulable while the network card transmission task is not. Now, the order in which we execute the CPU processing and transmission tasks can impact the response times. To illustrate, consider this three-user scenario, in which user one has to send data to user two and user three. And suppose user one enters a command to move a piece. Intuitively, to minimize the local response times, user one's computer should process the command first and transmit second. The reason is that the local response times then don't include the CPU transmission time. Alternatively, to minimize the remote response times, the CPU should transmit the command first and process it second, because in this case the remote response times don't include the CPU transmission time. So we know an intuitive way of scheduling these tasks when either local or remote response times are more important: we're either going to process first or transmit first, respectively. What about the case when they are equally important? Well, in this case, it may make sense to use some sort of concurrent scheduling policy. Now, one issue with these existing policies is that they trade off local for remote response times in sort of an all-or-nothing fashion. If we use process-first scheduling, users experience good local response times but poor remote response times. And if we use transmit-first scheduling, then users experience good remote response times but poor local response times. Yes. >> [inaudible] >> Sasa Junuzovic: So before I started getting into scheduling, I said let's suppose there's a single core that can schedule these tasks. And I'll come back to the multicore case after this. And I'll address that question particularly.
So then the remote response times are good, but the local response times are -- yes? >> Sorry. [inaudible] talking in the abstract at this point, but what's the ballpark of the difference between these two scheduling policies in terms of time? Like you said 50 milliseconds matters. Is this a one millisecond or a hundred millisecond difference? >> Sasa Junuzovic: So in some scenarios it can be seconds. If you -- yeah. It can be pretty high. Sometimes it's not very high, depending on the scenario. But in some scenarios, it can be on the order of seconds. And I'll give results, by the way, that show the kind of absolute value differences. So if we use this concurrent policy, then both the local and remote response times can be poor. So ideally we would like a policy that can improve response times without this tradeoff. Now, the existing scheduling policies don't allow us to really control this tradeoff, and we need to control this tradeoff. We need a new policy. And since the systems approach gave us these policies, we need a new approach. So I turned to psychology. And I designed a policy called Lazy. And the idea behind this policy is to trade off unnoticeable increases in the local response times for noticeable decreases in the remote response times. In this case users stay happy with their local response times because they can't tell the difference. But they become happier with the remote response times because they can notice the improvement. Now, you can imagine a dual of this policy that reverses this tradeoff. It's just as valid as this one, but my focus was on this particular version of the Lazy policy. So let me give you the basic idea behind its implementation. The basic idea is that the computer is going to temporarily delay processing. And during this processing delay, it's going to transmit to as many computers as it can. But as soon as it realizes that if it transmits any more the delay will become noticeable, it stops transmitting and processes the command. And only when the command is processed does it resume any transmissions that were left. So in this particular scenario, the benefit compared to, say, process-first is that user two's remote response times improved, and others didn't notice a difference. More generally, the Lazy policy can give us improvements in the response times of some users without noticeably degrading the response times of others. Now, if we compare it to the process-first policy, the local response times of Lazy are not noticeably worse, and this is by design. But the Lazy remote response times can be noticeably better than those of the process-first policy. So it seems here like we're getting a free lunch. Yes, Scott. >> This is probably a naive question, but in order to keep the processing below a noticeable threshold, do you need to know how long it's going to take you to do the processing? >> Sasa Junuzovic: Yes. Well, actually, no, no, you don't need to -- you could look at it that way, but you could also say, look, however long the processing takes, the fastest it's going to be is X, right? I don't really care what X is. But if I delay X by 50 milliseconds, users won't notice. If I increase X by 50 milliseconds, users won't notice. Okay. So the benefits of this Lazy policy, I've shown them rigorously in a mathematical model in my thesis.
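A minimal sketch of the Lazy idea as just described, in Python; the names, the explicit delay budget, and the 50-millisecond threshold constant are illustrative assumptions rather than the actual implementation, which is discussed later in the talk.

```python
# Illustrative sketch of the Lazy scheduling idea on one computer: delay
# processing while the delay stays unnoticeable, transmit to as many
# destinations as that budget allows, process, then finish transmitting.

NOTICEABLE_MS = 50.0  # perception threshold cited earlier in the talk

def lazy_handle(command, destinations, cpu_tx_ms, delay_so_far_ms,
                transmit, process):
    """transmit(cmd, dest) and process(cmd) are framework-supplied callbacks;
    cpu_tx_ms is the CPU time to hand one copy of the command to the NIC."""
    budget = NOTICEABLE_MS - delay_so_far_ms   # maximum allowed processing delay
    remaining = list(destinations)
    # Transmit first only while the processing delay would stay unnoticeable.
    while remaining and budget >= cpu_tx_ms:
        transmit(command, remaining.pop(0))
        budget -= cpu_tx_ms
    process(command)                           # produce the local output
    for dest in remaining:                     # resume leftover transmissions
        transmit(command, dest)
```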
And the model is fairly large and sort of complex, and it supports not only single commands but also concurrent commands and type-ahead. And I'm not going to go through all of it, but let me give you a flavor of this model by showing you how it captures the tradeoff between Lazy and transmit-first. So suppose we have a replicated multicast architecture, like I've shown here, and consider just a single command. And suppose that we have a path from the source to a destination as shown here. We call this path pi and we denote the computers on this path pi 1 to pi 6. Basically, the i-th computer on the path is pi i. And in this case the remote response time of the last user is given by this equation here. And let me break it down really quickly. The first term in this equation accounts for all of the latencies on this path. The second term in this equation accounts for the delays the message experiences on all of the intermediate computers; that is, all of the computers on the path except the destination. And the third term in this equation accounts for the delay of the destination computer. Now, I have to consider the intermediate and destination delays differently, because the former affect the response times of remote users, while the latter affects the response time of the local user. Obviously the network latencies are going to be independent of the scheduling policy. But let's see what happens with these delays, starting with the transmit-first intermediate delay, for which the equation is shown here. Basically it states that the transmit-first delay is equal to the amount of time the computer requires to forward the command to the next computer on the path. And this depends on several factors. It depends on the CPU transmission time, the network card transmission time, and the order in which this intermediate computer transmits to its destinations -- the destination order. Now -- let me see here. Oh. It must include the CPU transmission time, because the network card can't start transmitting until the CPU transmits at least once. And in all of my work, the network card transmission time was always bigger than the CPU transmission time, perhaps intuitively. And therefore I have to multiply the network card transmission time by the destination index of the next computer on the path in order to get this time. This is pretty simple. Now, let's look at the Lazy intermediate delays, which are slightly more complicated, because there are two cases I have to consider. I have to consider the case when transmission happens before processing and the case when transmission happens after processing. So if transmission happens before processing, the equation is given by this. I just took the one case and plugged it in here. As you can see, it's equal to the transmit-first intermediate delay. And this makes sense, because in both cases the processing doesn't get accounted for in the delay, therefore the two should be the same. And basically this case is going to occur when the total CPU time required to transmit the command to the next computer in the path is less than the maximum allowed processing delay. Now, this processing delay is equal to the noticeable threshold minus the sum of the intermediate delays encountered so far. So what we have -- I've tried to clean this up a little bit. What we have is that if the processing delay is larger than the total CPU transmission time required to transmit to the next computer in the path, then the Lazy intermediate delay is equal to the transmit-first intermediate delay.
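Since the slide equations themselves are not in the transcript, here is a plausible reconstruction of what is being described, with notation introduced here: l for network latency, t_cpu and t_nic for the CPU and network card transmission times, T for the noticeable threshold, and k_i for the destination index of the next path computer in pi_i's transmission order.

```latex
% Editor's reconstruction of the equations described above (notation is not from the talk):
\begin{align*}
\mathit{rrt}(\pi_n) &=
  \underbrace{\sum_{i=1}^{n-1} l(\pi_i,\pi_{i+1})}_{\text{network latencies}}
  + \underbrace{\sum_{i=1}^{n-1} d_{\mathrm{int}}(\pi_i)}_{\text{intermediate delays}}
  + \underbrace{d_{\mathrm{dest}}(\pi_n)}_{\text{destination delay}} \\[4pt]
d_{\mathrm{int}}^{\mathrm{TF}}(\pi_i) &= t_{\mathrm{cpu}} + k_i\, t_{\mathrm{nic}} \\[4pt]
d_{\mathrm{int}}^{\mathrm{Lazy}}(\pi_i) &= d_{\mathrm{int}}^{\mathrm{TF}}(\pi_i)
  \qquad \text{when } k_i\, t_{\mathrm{cpu}} \;\le\;
  \underbrace{T - \sum_{j<i} d_{\mathrm{int}}(\pi_j)}_{\text{maximum processing delay}}
\end{align*}
```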
But let's see what happens to the Kth computer in the path, pi K, as K increases. So as K increases, the sum of the intermediate delays experienced so far also increases. Whoops. So it's going to get bigger. And as a result the processing delay is going to get smaller. Eventually this processing delay is going to be less than the total CPU time required to transmit to the next destination on the path, in which case we have to consider the second case of the Lazy policy, in which transmission happens after processing. Question. >> [inaudible] centralized [inaudible] to figure out whether the delay is -- it doesn't sound like it's a local policy. >> Sasa Junuzovic: It isn't. It's a distributed algorithm, yeah. >> [inaudible] >> Sasa Junuzovic: You can do it -- there are several ways of implementing [inaudible]. >> You don't want centralized because then you'll have the delays of figuring out whether -- you know what I mean? It's just [inaudible] just as bad. So centralized is [inaudible]. But I don't see [inaudible]. >> Sasa Junuzovic: So maybe we're thinking -- talking about a different thing. But when I say distributed algorithm, each computer has to make a local decision based on what it knows so far. >> But it doesn't have enough information to make the decision. >> Sasa Junuzovic: Right. So I will show you how I actually implemented this policy when I talk about the implementation. But there are various ways of implementing this Lazy policy. So one way is to say, look, forget about this distributed algorithm. Whenever you get a command, delay it by the noticeable threshold but no more. Okay, that's one way of doing it. >> But that's not the math that you just showed. >> Sasa Junuzovic: This is not what I'm doing. So this is not the case I'm considering. The reason I didn't consider that case, although it's valid, is that it kind of favors downstream computers rather than upstream computers. So the upstream computers are always going to delay their processing to give data to the downstream computers. And if you add up all of the delays, it can actually become noticeable. If everybody delays by the noticeable amount, eventually the sum of those delays becomes noticeable. It favors downstream computers. The other option is to say, look, just take the noticeable threshold and subtract the difference between now and the time the command was input. If that difference is bigger than noticeable, then don't delay, just process. This is, again, a distributed algorithm. How it's done, you need some clock synchronization, which can be done relatively simply, I mean, nowadays. But that's, again, not what I did. The reason is that if network latencies are high, then what's going to happen is there's not going to be much difference between this Lazy and the process-first policy, since only the first computer is going to delay its processing. So that's not what I did. I took the middle ground where I say, look, subtract from the noticeable threshold only the sum of the intermediate delays encountered so far. And I chose this kind of middle ground where there will be benefits to some users without noticeably hurting the response times of others. >> And you'll talk about how you measured that later? >> Sasa Junuzovic: Yeah. >> Okay. >> Sasa Junuzovic: Okay. So where was I? Oh. So this is the case in the Lazy policy when transmission happens after processing.
So obviously it's going to be at least the transmit-first intermediate delay, but it's also going to include the processing costs, in this case the costs of processing the input and the output. Now, eventually the delay I was talking about, the processing delay, is going to go down to zero. And therefore these two terms that are left in this equation are going to become zero, in which case we get that the Lazy intermediate delay is simply equal to the transmit-first intermediate delay plus the processing time. Now, if we combine the results we have so far, we will see that basically the transmit-first intermediate delays dominate those of Lazy. In other words, transmit-first is always better than Lazy so far. But let's consider the destination delays, and, again, let's look at transmit-first. The equation is given here. It's going to equal at least the processing time, since this is a destination computer and it must process the command. But since it's transmit-first, it's also going to include the total CPU transmission time that the destination takes to transmit to all of the other destinations. In general, each computer may have to forward to other computers. Like in this case, the destination forwards to four other machines, and that takes some time. Okay. So basically as a simplified equation, I said, look, the transmit-first destination [inaudible] delay is equal to the total CPU transmission time it has to do, plus the processing time. Now, let's see what happens to the last computer on the path, to the destination, as the number of computers to which it forwards increases. As the number of computers to which it forwards increases, so does the total CPU transmission time, and therefore so does the transmit-first destination delay. Now, if this number of destinations gets really high, the delay can become very high, as big as you want, depending on the number of users in this scenario. Now, let's look at the Lazy destination delay. Here's the equation. Of course it must include the processing time, since this is the destination. But it can be more than the processing time. By how much? Well, by the minimum of the total CPU transmission time and the maximum processing delay. Remember, we can never delay more than the maximum processing delay, which is captured by this minimum function here. So, again, I've just cleaned up this slide really quickly to make some room. Let's see what happens to the destination delay with Lazy as the number of computers to which the destination forwards increases. As with the transmit-first policy, as the number of destinations to which the destination forwards increases, so does the Lazy destination delay. However, in this case, the extra delay is capped. It can never be more than the maximum processing delay, which is by definition never more than the noticeable threshold. So now if we combine this result with our previous finding, we see that the Lazy and the transmit-first policies actually don't dominate each other. So let me give you an example to illustrate how these results actually hold true. And let's consider a 12-user scenario. And suppose we're using a replicated multicast architecture in which user one transmits to user two, then three, four, five, six, and seven, and user two transmits to user eight, then nine, ten, eleven, and twelve. And suppose user one enters all the commands.
The model has given us all the parameters we need to consider. The CPU transmission time, say it's 12 1/2 milliseconds. This is a theoretical example. Or hypothetical example, I should say. The network card transmission time is 25 milliseconds. The processing time of input and output combined, say it's 75. And the network latencies are 25 milliseconds, and of course the noticeable threshold is 50 milliseconds. So let's analyze the remote response times of user eight with the process-first and Lazy policies. So with the process-first policy, when user one enters a command, user one's computer is first going to process the command. Then the CPU is going to transmit to user two, then the network card has to transmit to user two, and the command has to reach user two's computer. And as soon as it reaches user two's computer, user two's computer is going to process it. Then it's going to transmit to user eight and the network card will transmit to user eight. The command has to reach user eight's computer, and then user eight's computer is going to process the command. Okay. Now, let's see what happens with the Lazy policy. When user one's computer receives the command, the first thing it's going to do is check: can I delay processing and transmit instead? Well, since the transmission time is less than the noticeable delay, 12 1/2 versus 50 milliseconds, it says yes, I can. So it transmits first to user two. Then the network card transmits to user two, and the command has to reach user two's computer. And when it reaches user two's computer, it does the same thing that user one's computer just did; it says, can I delay processing and transmit to at least one destination? So it takes the noticeable threshold of 50 minus the sum of the intermediate delays so far, which is the CPU transmission time plus the network card transmission time on user one's computer, and says yes, I have enough time to transmit once before I process, so it does. And then the network card transmits on user two's computer, and the command has to reach user eight's computer. And because user eight's computer doesn't forward, it simply processes. So as we see here, Lazy gives noticeably better remote response times than the process-first policy. And in fact a more interesting thing is that the Lazy improvement in the remote response times compared to process-first is additive. So as we see here, there are three processing times included in the process-first policy response time while only one in the Lazy response time. In fact, the more computers in the path that delay processing, the more additive this improvement is. So that was process-first versus Lazy. Let's look at transmit-first. And let's look at the remote response time of user two. So with the transmit-first policy, when user one's computer receives the command, it's going to transmit it and then the network card is going to transmit it, then the command has to reach user two's computer. And when it receives it, it's going to transmit to five destinations and then finally process. With Lazy, what's going to happen is user one's computer will say, can I delay processing and transmit first? Yes, as before. So it's going to transmit to user two. The command must reach user two's computer. And when it gets it, it says, can I delay processing by transmitting? Well, yes. So it transmits to user eight. Then it asks the question again: can I still delay processing before transmitting?
And it says no, because you've used up all of the noticeable threshold -- the maximum processing delay. And therefore instead of transmitting it's going to process. And, as we see here in this example, the Lazy remote response times for user two are noticeably better than those of transmit-first. But let's hold on. This was a hypothetical example. I've carefully constructed it to illustrate the benefits of Lazy versus the other policies. The question is can these response time differences be noticeable in realistic scenarios. So to answer that question, I simulated the performance using, again, the analytical model I've shown you. But of course I need realistic simulations. So to do that, I simulated a distributed PowerPoint presentation, for which I require realistic values of the processing and transmission parameters. So to get those, I measured their actual values from logs of real PowerPoint presentations. Now, in these particular logs there were no concurrent commands and there was no type-ahead, so I couldn't test that part of my model. I could test the part of the model that talks about single commands, the one we've seen. And I measured these costs on a number of single-core machines: a netbook, a P4 desktop, and a P3 desktop. Why a P3 desktop? To simulate next-generation mobile devices. Now, suppose that the presenter is using the netbook to give this talk and that the presentation is being given to 600 people who are using a variety of P4 desktops and next-generation mobile devices. Now, suppose that these computers are organized in the following way: six computers forward to clusters of 99 computers. So in this case, these six computers here are each going to forward to 99 other computers. The forwarders themselves can communicate any which way they want. Suppose that the forwarders are on the same LAN, so the latencies between them are low. And suppose that the rest of the users are at various places around the world. So to get realistic latencies from the forwarders to the users around the world, I took a subset of actual latencies measured between 1740 machines at random locations around the world by other people, not by me. And here are the results. Let's compare first the process-first and Lazy policies. So for the local response times we see here, the Lazy local response time is worse than the process-first local response time, but not noticeably worse -- by about 49 milliseconds. And here are the remote response time results. What this table shows is the number of users whose Lazy remote response times are equal to, noticeably better than, noticeably worse than, or not noticeably different from those of process-first. And as we see here, for all but one user, the Lazy response times are noticeably better than the process-first response times -- in this particular scenario by as much as 600 milliseconds, although it can be more in others. And the one user for whom the performance was not noticeably different still got better performance, but only by 36 milliseconds. >> What's special about that? >> Sasa Junuzovic: What happened in this particular case? So it can happen that these guys are not noticeably different because imagine that you're using process-first, right? And the processing costs are low. And I transmit to you first. I have to transmit to a bunch of guys, but I transmit to you first. So with process-first, I have a low processing cost, and then you're my first destination.
And you also have to, say, transmit to a bunch of other people. As soon as you get the message you're going to process. Now, with Lazy, I'm going to delay by perhaps 50 milliseconds, and you're going to delay by a little bit. So if the sum of those delays is -- can be in fact a little bit greater than the sum of the processing delays on my machine, you can get that these guys are not noticeably different. Yes. >> Is network topology on the table as far as reconfiguring it for each particular situation? Like you chose six transmitters [inaudible] what if you had chosen to build more of a tree? >> Sasa Junuzovic: So that's a good question. So my system by itself can deploy any tree it wants. In this particular simulation, to get something realistic -- if you guys are familiar with WebEx, that's kind of the architecture they use. They have a bunch of forwarder machines which have low latencies between them, and they forward to everybody else. I chose to use that kind of WebEx-like architecture in this particular simulation. But you're right, the results will be different if there's more of a tree that's deployed. Right now what I'm talking about is just the scheduling policy. There's a part of my system that can also calculate what tree should be deployed, what kind of communication architecture. Okay. So now let's look at Lazy versus the other two policies, transmit-first and concurrent. Oh, sorry, let me just finish. Looking at the interesting results here, we see that the intuition is confirmed in this scenario. Lazy dominates process-first. In other words, the process-first policy is obsolete. Now let's compare Lazy with transmit-first and concurrent. Here are the local response times. And as we can see, the Lazy local response time is noticeably better than the transmit-first local response time, by about 70 milliseconds. But in this scenario it's not noticeably better than the concurrent one, although it can be in other scenarios. And here are the remote response times -- again, the same kind of table, just compared to transmit-first. For 407 users, the response times are equal with transmit-first and Lazy. For five users, Lazy gives noticeably better response times, by as much as 158 milliseconds. But for 187 users it gives noticeably worse response times, by as much as 240 milliseconds. And in this simulation the results were the same with the concurrent policy. But what we can take out of this, again, if we look at the interesting things, is that none of these policies dominates the others. But let's look closer at the five users that get better remote response times with Lazy than with the transmit-first and concurrent policies. It turns out that four of these five users are the forwarders. These are the guys that are doing all the work. So the Lazy policy is in some sense more fair than the other policies, because it keeps the remote response times of the guys who are doing all the work of transmitting the commands low. So if we combine all of these results from this simulation, what we have is that if the users want to improve as many response times as possible, the policy that should be used is transmit-first or concurrent. But if the users say, look, improve as many remote response times as you can but don't noticeably hurt local response times, then we should use the Lazy policy. And if the users say something like improve as many remote response times as possible without noticeably degrading local response times or the remote response times of those users who forward, then again we should use the Lazy policy. So we know which decision to make.
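To make the earlier hypothetical 12-user numbers concrete, here is a small Python sketch that recomputes the remote response times walked through above, using the stated parameter values (12.5, 25, 75, 25, and 50 milliseconds). This is the editor's arithmetic following the talk's walkthrough, not output from the actual simulator.

```python
# Recomputes the hypothetical 12-user example: user 1 forwards to user 2,
# which forwards to user 8 (its first destination in the multicast tree).
CPU_TX, NIC_TX, PROC, LATENCY, THRESHOLD = 12.5, 25.0, 75.0, 25.0, 50.0

# Remote response time of user 8.
process_first_u8 = (PROC + CPU_TX + NIC_TX + LATENCY    # user 1 processes, then forwards
                    + PROC + CPU_TX + NIC_TX + LATENCY  # user 2 processes, then forwards
                    + PROC)                              # user 8 processes
lazy_u8 = (CPU_TX + NIC_TX + LATENCY     # user 1 forwards first (12.5 <= 50 ms budget)
           + CPU_TX + NIC_TX + LATENCY   # user 2 forwards first (12.5 <= 50 - 37.5)
           + PROC)                        # only user 8's processing is on the path
print(process_first_u8, lazy_u8)          # 350.0 vs 200.0 ms: two processing times saved

# Remote response time of user 2.
transmit_first_u2 = (CPU_TX + NIC_TX + LATENCY  # user 1 forwards
                     + 5 * CPU_TX + PROC)        # user 2 CPU-transmits to 5 others, then processes
lazy_u2 = (CPU_TX + NIC_TX + LATENCY            # user 1 forwards
           + CPU_TX + PROC)                      # user 2 forwards once (budget used up), then processes
print(transmit_first_u2, lazy_u2)                # 200.0 vs 150.0 ms
```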
Now, I've actually built a system that can automatically make this decision. And this system is a collaborative system first and a self-optimizing system second. If it's going to automatically choose which scheduling policy to use, the collaborative [inaudible] better support all of the transmission and processing tasks that are possible. Therefore, this collaborative functionality must include support for centralized and replicated architectures. It must support unicast and multicast. And of course all of the scheduling policies. And for now the only self-optimizing functionality is to automatically choose which scheduling policy to use. So let's see how this system shares applications to start off with. To share an application we somehow have to intercept inputs and outputs. To do so, the system interposes a client-side component on each machine between the user-interface component and the program component. And it does so on all machines. Consider the centralized architecture first. When the client-side component on the master intercepts an input from the local user, it simply forwards it to the local program component. But when it intercepts an output from the program component, it doesn't just send it to the local user-interface component; it also sends it to the client-side components [inaudible] on all of the slaves. And then all of those client-side components forward the output to their local user interfaces. And when the client-side component on a slave intercepts an input from its local user, it cannot forward it to a local program component -- there isn't an active one. It has to forward it to the client-side component on the master, and then the same thing happens as before. In the replicated architecture, whenever a client-side component intercepts an input, it doesn't forward it only to the local program component; it forwards it to the client-side components on all of the other machines, which then forward it to their program components, and the rest is as you can see on this slide. Now, it also supports multicast. Suppose that we want to have user one transmit to user two and user three, but we want user two to transmit to user four. Okay. So as you can see here, you can just map the client-side components to support this multicast [inaudible] any arbitrary multicast tree. And it also supports all of the scheduling policies. Each client-side component creates separate threads for the transmission and processing tasks. To support the transmit-first policy, it simply gives higher priority to the transmission task and lower priority to the processing task. It switches those priorities to enforce the process-first policy, and it gives them equal priorities to enforce the concurrent policy. And this is going to address your question next. So how does it implement the Lazy policy? This is more complicated, since it's a distributed algorithm. Each client-side component on a forwarding machine has to delay processing while the delay is not noticeable. So it must know the delay so far. So how do we do this? Well, to help each computer deduce this information, whenever a user enters a command, the source computer timestamps the command with its local time and forwards that time along with the command.
Each forwarder then calculates the difference between its current time and that command input time. And if this difference is not noticeable, it's going to continue delaying processing. Of course, this needs some sort of clock synchronization -- >> [inaudible] >> Sasa Junuzovic: Yes. Well, it has to be fairly accurate. >> Well, if you don't have it accurate to milliseconds, then it's all messed up, right? I mean, otherwise you won't be able to tell what a noticeable delay was if you can't tell more than 50 milliseconds. >> Sasa Junuzovic: Yeah, 50. [inaudible] If the errors are on the order of tens of milliseconds, forget it, it's not going to work, right? But it turns out in my experiments, okay, I was able to use a very simple scheme that seemed to work. So more robust schemes are possible, and if greater accuracy is desired, we can use them. But a simple scheme worked for me. Okay. Now, let's see. Oh. So of course now to choose one of these policies, we have to use the self-optimizing part of the framework. And let's see what that looks like. It basically has a server-side component and a client-side component. The server-side component has four components of its own. One is the analytical model component, which applies the model. There's the response time function component, which applies the user's response time function. Now, the results from the analytical model are sent to the response time function in what I call a predictive response time matrix. And when we're choosing the scheduling policy, what this matrix has is, for each scheduling policy, the predicted response times of all users for commands entered by each user. It sends that to the response time function, and the response time function says, okay, the scheduling policy you should use is this, and then the system manager on the server side kind of forces all of the clients to use that policy. Now, of course to apply the model in the first place, we have to somehow gather parameter values. That's the job of the parameter collector. It can either measure them dynamically, on the fly, or it can use historical values. Now, to measure them dynamically, the system basically interposes a client-side component of the optimization framework, which is part of the client-side component we've seen before. And that optimization client-side component simply times all of the commands. Now, I have slides in the back, for time, that talk about how I measure all of them, and I can get back to them. But for now I'm just going to say that the client-side component is able to measure them. Okay. Now, finally, how do we switch between policies? Well, when the system manager is told which policy to switch to, it simply tells each client-side component: look, if you want to use transmit-first, make sure your priorities are set so the transmission has higher priority than processing. To use process-first, it switches them; for concurrent, it makes them equal; and so forth. And to force the Lazy policy, it tells it to apply the Lazy algorithm. Yeah. >> So let's say that I have a client that is compromised and picks a policy that deliberately screws up the system. Can your system adjust to that or detect it and pull that person out? >> Sasa Junuzovic: So I didn't look at -- what's the word -- basically compromised issues of security or issues of somebody actively trying to cheat the system. That was not part of my research. But it's a very good question. I don't -- my system at the moment is not designed to handle it, and it would be kind of cool, actually, to try and figure out if there are ways to -- I mean, you can do it in some scenarios in different kinds of applications. Perhaps you can do it here also.
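As a rough sketch of the policy enforcement just described, here is what the system manager's instruction to each client-side component might look like in Python. The class and method names are invented for illustration, and the numeric priorities stand in for whatever OS thread priorities the actual prototype uses.

```python
# Sketch of policy switching: the server-side system manager tells every
# client-side component which policy to enforce; each client enforces it by
# adjusting the relative priorities of its transmission and processing threads
# (and by enabling the Lazy algorithm when that policy is chosen).
from enum import Enum

class Policy(Enum):
    PROCESS_FIRST = 1
    TRANSMIT_FIRST = 2
    CONCURRENT = 3
    LAZY = 4

class ClientSideComponent:
    def __init__(self):
        self.transmission_priority = 1
        self.processing_priority = 1
        self.use_lazy_algorithm = False

    def apply_policy(self, policy: Policy) -> None:
        self.use_lazy_algorithm = (policy is Policy.LAZY)
        if policy is Policy.TRANSMIT_FIRST:
            self.transmission_priority, self.processing_priority = 2, 1
        elif policy is Policy.PROCESS_FIRST:
            self.transmission_priority, self.processing_priority = 1, 2
        else:  # concurrent and Lazy run the two task threads at equal priority
            self.transmission_priority = self.processing_priority = 1

class SystemManager:
    """Server-side component that pushes the chosen policy to every client."""
    def __init__(self, clients):
        self.clients = clients

    def switch_policy(self, policy: Policy) -> None:
        # Clients apply the new policy as the instruction reaches them, so a
        # mix of old and new policies can be in effect temporarily -- a
        # performance issue, not a semantics issue, as discussed next.
        for client in self.clients:
            client.apply_policy(policy)
```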
Okay. Now, there's an interesting implementation issue that arises during the switch of scheduling policies, in particular regarding the performance of the commands entered during the switch. Suppose we're switching from transmit-first to Lazy. This switch in general takes time. So computers may temporarily use a mix of the old and new policies. Now, this isn't a semantics issue, because the scheduling policy doesn't determine which computers receive what commands, but it is a temporary performance issue. The reason is that the analytical model never predicted the performance for this mix of scheduling policies. So we could temporarily degrade performance, not improve it. Fortunately, eventually all of the computers switch to the new policy, and we get the performance predicted by the model. Now, so far I've shown you how I can basically choose the scheduling policy automatically, but on single cores. What about multicores? Well, if we have multiple cores available to carry out these tasks, and the local response times are more important than the remote response times, it makes sense to carry out these tasks in parallel on separate cores, in what I call the parallel policy. If remote response times are more important, I'll use the same policy, because in either case the processing and transmission tasks don't affect each other, and I'll get the optimum response times. Now, the interesting question is, okay, can I parallelize the processing task? In general, the processing task is defined by the application, and it's a black box [inaudible]. We don't understand its semantics, so we cannot parallelize it explicitly. So I don't do that. Well, what about the transmission task? My framework defines the transmission task. It can obviously use multiple cores to transmit in parallel. But there's no actual remote response time benefit to doing so. A network card can't keep up with a single core, let alone multiple cores. And there's another reason why you should be careful about parallelizing the transmission task: it makes predicting remote response times difficult. In particular, the operating system can schedule the send calls by the cores in arbitrary order. So we don't know the order in which the network card is actually going to transmit, and therefore we can't predict the remote response times. So I don't do this either. Now, I'll just briefly say that my analytical model predicts, simulations show, and experiments confirm that if you have multiple cores, regardless of what the users' requirements are, run the parallel policy. So those are my scheduling contributions. Let's briefly look -- I'm kind of running out of time. Let me briefly go into multicast and the processing architecture, and then I'll go to my contributions. As I said, with multicast, the transmission is performed in parallel. And basically this parallel distribution can result in quicker distribution of commands, which should help response times. On the other hand, multicast paths are longer than unicast paths, so the added network latencies can actually hurt response times. So multicast may improve or degrade response times.
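Here is a small Python sketch of that tradeoff, using the transmission and latency values from the earlier examples and ignoring processing entirely; the uniform costs and the specific tree shapes are simplifying assumptions for illustration.

```python
# Compares when the last destination receives a command under sequential
# unicast versus a two-level multicast (forwarding) tree, ignoring processing.
CPU_TX, NIC_TX, LATENCY = 12.5, 25.0, 25.0   # milliseconds, as in the earlier examples

def unicast_last_arrival(n_dests: int) -> float:
    # The source's network card sends the copies back to back, so the last
    # destination waits for all n transmissions plus one network hop.
    return CPU_TX + n_dests * NIC_TX + LATENCY

def two_level_tree_last_arrival(n_forwarders: int, per_forwarder: int) -> float:
    # The source sends to the forwarders; the forwarders then relay to their
    # clusters in parallel, but each leaf is now two hops from the source.
    reach_last_forwarder = CPU_TX + n_forwarders * NIC_TX + LATENCY
    relay_to_last_leaf = CPU_TX + per_forwarder * NIC_TX + LATENCY
    return reach_last_forwarder + relay_to_last_leaf

# With many destinations the parallelism wins by a wide margin...
print(unicast_last_arrival(594), two_level_tree_last_arrival(6, 99))  # 14887.5 vs 2700.0
# ...but with few destinations the extra hop makes multicast worse.
print(unicast_last_arrival(2), two_level_tree_last_arrival(1, 1))     # 87.5 vs 125.0
```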
Another interesting issue with multicast is that the traditional multicast schemes don't consider all collaboration parameters -- in particular, the scheduling policy. So suppose we're using the process-first policy. And let's look at the remote response times in this unicast and multicast scenario of the user down here in the corner. With unicast the remote response time includes the processing time on only the source and destination, while with multicast it includes the processing time of all of the computers on the path. So this is yet another reason why multicast can hurt response times. But I know what you're thinking. You're thinking just use the transmit-first policy, and that's supposed to help remote response times. Well, let's see what happens there. With transmit-first, unfortunately, since each computer may also have to forward commands, it has to forward before it processes. And if this transmission time is high, the response times can actually be worse than when we use the process-first policy. So, again, it's yet another reason why multicast can hurt response times, and therefore the conclusion is we have to support both unicast and multicast in these systems. Unfortunately, traditional collaboration architectures couple the transmission and processing tasks. Remember, the masters pretty much perform all of the communication. So we have to basically decouple these tasks in order to support multicast, and I did that in what I call the bi-architecture model of collaborative systems, in which the processing architecture dictates the processing tasks each computer must carry out, and the communication architecture dictates the transmission tasks each computer must carry out. I'm not going to go too much into the results, but let me show you that there's an interesting issue that happens with the self-optimizing system when it switches architectures. Again, the switch takes time -- switching the communication architecture. So what do we do during the switch? We can't simply turn the old architecture off, because there may be messages in transit. So what we do is during the switch we simply use the old architecture, deploy the new one in the background, and when every computer has deployed the new architecture, we switch to it. Now, commands entered during the switch are going to use the old architecture, so they may experience poor -- that is, old -- performance. Now, these two architectures may have to run in parallel for a while, because there may still be messages in transit even when the new architecture is deployed. But eventually we can just get rid of the old architecture and switch to the new one. Okay. And let's quickly look at the processing architecture. Like I said, I focused on replicated and centralized architectures. Which one should be used to favor local response times? Well, it makes sense that the replicated architecture should be used. The reason is that the local response time then doesn't include the cost of remote communication. But Chung has shown experimentally that this intuition is wrong: sometimes a centralized architecture can give better local response times. So maybe we should always use experiments and then, based on the experimental data, decide what to do. Unfortunately, experiments are situational, and there are infinitely many collaboration scenarios. It's not practical to gather data for all of them. So what we really need is some sort of an analytical model.
And then of course we need a system that can automate the switch. Again, there's an issue with commands entered during the switch, as with the scheduling policy and the communication architecture. In particular, a new master may suddenly receive an output. Now, this is not consistent with the notion of centralized and replicated architectures. So one way to combat this is to simply say I'm going to pause input during the switch. Now, while this is a simple approach, it hurts response times, because any command entered during the switch is going to be delayed until the switch is done. So Chung actually showed an interesting way of running the old and new configurations in parallel, kind of like how I run the old and new communication architectures in parallel. And the benefit is that it doesn't hurt performance. But it's not really simple. So since my focus wasn't really on optimizing the time it takes to do this switch -- I was just focusing on showing that the switch is beneficial -- I used a simple approach. And if we really think about it, it may not actually hurt response times. I mean, we like to stretch and we like to get a drink of water, go to the bathroom. So if we use these break times to actually perform the switch, the performance won't actually be hurt. Okay. And remember that Duke scenario I gave you guys at the very beginning, the Candidates' Day scenario. So suppose now that the professor is using the P4 desktop, one student is using a next-generation mobile device, and then a latecomer comes in with a Core 2 Duo. The result I basically showed you earlier -- the performance improvements -- is what my system did in an actual experiment for that scenario. So what happened initially? We started off with a centralized architecture in which the next-generation mobile device is the master. The system took some time to measure the powers of these machines and other parameters and then said, oh, you should really switch to the centralized architecture in which the P4 is the master. At some point here the latecomer joined, and shortly after, as soon as the system was basically able to measure the power and the other parameters related to the latecomer, it switched to the centralized architecture in which the latecomer's computer is the master. And that's how the performance actually improved. Okay. So -- yes. >> [inaudible] go to Duke? >> Sasa Junuzovic: I hope he went to UNC. I don't know. >> [inaudible] >> Sasa Junuzovic: This wasn't actually done during Candidates' Day. But it was an experiment. It wasn't a simulation. I took a log of checkers played by multiple users against the computer, and I replayed that log under various conditions with and without my system. And then I looked at the performance differences. So it's an actual experiment, just not -- I just replayed a log because you can't really rely on the users to repeat everything they did every time if they do it multiple times. So let me summarize my contributions. I focused on this window of opportunity in collaborative systems. And I studied three important performance factors: collaboration architecture, multicast, and scheduling policy. In the process I was the first to evaluate the impact of multicast on response times. And in the process I introduced this new bi-architecture model, because the traditional collaboration architectures don't support multicast.
I also studied the impact of scheduling policy on response times and came up with this Lazy policy that combines psychology and scheduling results to give us interesting benefits. I also developed a model that can predict the response times for any combination of these factors. Now, this model by itself is useful. You can give it to users who are locked into a particular configuration, say a centralized architecture, and the users can use the model to decide who should be the master. It's also helpful to users who have a choice of configurations, say the centralized and replicated architectures, to decide which architecture to use. And users who have some sort of system like mine or the one developed by Goopeel Chung, which can dynamically switch architectures, can use the model to decide when to make the switch. Of course, manually applying the model is tedious and error prone, so I built this self-optimizing system that can apply the model automatically, and in the process I identified several new implementation issues, some of which you've seen; there are many more in my dissertation, but I didn't have time to go into all of them. And if I combine all of these contributions, I give you the proof for my main thesis: for certain classes of applications, it is possible to meet performance requirements better than existing systems through a new collaborative framework without requiring hardware, network, or user-interface changes. Now, as with any framework, I haven't validated it with all applications. I don't know whether it improves performance of all applications. That wasn't my goal. My goal was simply to defend the ideas that the collaboration architecture, communication architecture, and scheduling policy matter for response times and that it is possible to automatically and dynamically change them to improve response times. To show that, I focused on three driving problems: a collaborative board game, a distributed presentation, and instant messaging. Instant messaging is pervasive, collaborative board games are popular, and entire industries have been built around distributed presentation problems. So these problems are important. More importantly, the instances of these problems I've used -- the checkers, PowerPoint, and instant messaging tools I built -- all adhere to all of the assumptions I made. For example, they all use push-based communication. None of them had concurrency control, consistency maintenance, or awareness mechanisms. And they all supported both centralized and replicated semantics. But you could still say the system I've created is very complex. I mean, it's a complex prototype that can dynamically adjust these parameters. So the question is, would I recommend it to software designers as the one-to-replace-them-all system? Well, it depends on a couple of questions. First of all, is the added complexity worth it? There's going to be initial cost, deployment, bugs, you name it. And many applications use only centralized semantics because they're easier to implement than replicated semantics. That's not to say that replicated architectures are not used -- for example, PowerPoint sharing on Live Meeting, I think -- or, no, WebEx -- and for sure Google Docs and Spreadsheets use replicated and semi-replicated architectures respectively, so they're possible.
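As an illustration of the "who should be the master" use of the model mentioned above, here is a small sketch of how a predicted-response-time model could be applied to pick a master in a centralized architecture. The formulas are deliberately simplified stand-ins for the real model, and all names and numbers are hypothetical.

```python
# Illustrative sketch: use a (simplified) response-time predictor to choose the
# master in a centralized architecture according to a user-supplied criterion.

def predict_times(master, computers, proc_time, latency):
    """Predict the response time each user sees for their own commands if
    `master` hosts the application.

    proc_time[c]:   time for computer c to process one command (ms)
    latency[a][b]:  one-way network latency from a to b (ms)
    """
    times = {}
    for user in computers:
        if user == master:
            times[user] = proc_time[master]   # local command: just process it
        else:
            # remote user: input travels to the master, is processed, output returns
            times[user] = latency[user][master] + proc_time[master] + latency[master][user]
    return times

def choose_master(computers, proc_time, latency,
                  criterion=lambda t: sum(t.values()) / len(t)):
    # Pick the candidate master whose predicted times score best (lower is better).
    return min(computers,
               key=lambda m: criterion(predict_times(m, computers, proc_time, latency)))

# Toy numbers loosely echoing the Candidates' Day scenario.
computers = ['P4', 'mobile', 'Core2Duo']
proc = {'P4': 8, 'mobile': 60, 'Core2Duo': 5}
lat = {a: {b: 25 for b in computers} for a in computers}
print(choose_master(computers, proc, lat))   # -> 'Core2Duo' with these toy numbers
```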
And then, to answer my own question, yes, the complexity is worth it if the performance of current systems is an issue. The other question is, does this window of opportunity that I talked about exist? Well, it does in multimedia networking. If the bandwidth utilization is high but not too high, we can do some clever things and move performance from poor to good. While I also believe it exists in collaborative systems, further analysis is needed to verify that claim. Now, if it exists now, will it exist in the future? Okay. So processing powers are going up, which means that processing costs are going down, which means that the choice of processing architecture won't matter. I don't quite believe that. And the reason is that as processor powers have gone up, so have the processing costs. We simply demand more complex applications. And this has been true for at least 30 years now. And as proof I offer you the fact that if I do an edit operation in something as powerful as Word, it can still take a long time to actually get the result. So therefore I believe the processing architecture is going to continue to matter. What about the communication architecture? I mean, network speeds are going up, therefore transmission costs are going down, so the choice of communication architecture will not matter. I don't think that's the case either. In particular, the transmission costs are going to go up because the demand for more complex applications comes into play again. If you think of, for example, telepresence applications coming up, we want HD video of the other people now, and that's going to require a lot of bandwidth. And cellular networks are still slow. And they're not getting that much faster anytime soon, for many reasons, one of them being power consumption, because faster network connections drain more power. So I believe that the choice of communication architecture will stay important. And what about the scheduling policy? Okay. We're in the age of multicore. And my results say that whenever you have a multicore computer, use the parallel policy, so why is the choice of scheduling policy important? Well, multiple cores usually also mean more power, which is one of the reasons why at least most cell phones and maybe all PDAs and netbooks still use single cores. Also, even if you have multiple cores, let's come back to that high-definition video example. Just last summer I worked on a project here at MSR in which there was an application that was supposed to render HD video of two remote parties. And on a quad-core machine, it was having trouble. So maybe most of the cores will be needed to render the video, perhaps leaving only one core to do the rest of the things, in which case the scheduling policy is still going to matter. Now, in addition to these contributions, I've made several others. The simulator I've built is very important because large-scale experiments are difficult to set up, while large-scale simulations are not. And, in fact, you may not have the resources to do large-scale experiments, whereas you don't need many resources to do large-scale simulations. The simulator I've presented is sort of like the network simulator used by networking researchers. And it's the first such system for collaborative systems. I also have a couple of teaching contributions.
For example, the students can use the simulator to learn the impact of collaboration architecture, multicast, and scheduling policy on response time, and then they can actually use the system to experience the difference. Now, in addition to these -- most of my dissertation contributions were theory and systems oriented. But I've also made a number of application contributions, and basically user studies, when I was an intern with MSR. Early in my MSR career I studied awareness issues in multiuser editors, and I found that users would like to know when others are going to make conflicting operations and when others are reading what the users are writing. So to provide that information to the users, I created awareness mechanisms based on the real-life metaphor of shadows. And a user study evaluating these mechanisms showed that they can help users coordinate their activities. The summer after, I returned to my roots and made a collaborative framework for interactive Web applications, and this was actually my only systems-oriented work at MSR. The summer after that I studied awareness in meeting replay systems. I found -- or we found -- that users don't want to know only who the current speaker is; they also want to know who the speaker is talking to, looking at, and whether there are any side conversations ongoing. And this information you usually get from the spatial relationship of the participants in the room. So I built a 3D interface that positions the videos of all the participants in a way that preserves that spatial relationship. We did a user study and found the system is better at providing that information, though there were other issues -- it was a little confusing with all the rotations. But, more importantly, based on the user study, we found requirements and recommendations for future meeting viewer systems. And then just recently I focused on telepresence. In the summer of 2008 Kori and Zhengyou and I studied window layouts in multiway telepresence applications that also share applications, and based on a couple of user studies we found that users prefer certain layouts while they dislike others, and we've proposed several guidelines for the future design of these systems in terms of what kind of layouts to use. Just recently I studied how people can catch up on parts of meetings they missed in real time, even while the meeting is ongoing, and we evaluated several mechanisms: text, video, and audio. And that data is currently being published. But let me just then finally -- oh, these are [inaudible] publications, patents, several patent applications, and a tech doc, so it was a very successful time while I was here. But let me tell you just briefly about my future work. Based on everything I've done, I've truly started to believe that we need to look at performance from the user requirement perspective. A lot of systems today are built as: I have all this equipment, let me give the best performance I can. But I really think, based on what I've seen, that it's important to see what the users exactly want and then optimize the way the system provides them the data so that it optimally meets their requirements. I proposed several potentially useful criteria, such as important users and [inaudible] local or remote response times. What do the users really want? I don't know. And user studies need to be done to discover that. It's also going to be interesting to adjust the configuration of a system based on user activity. Okay.
So, for example, facial expressions or attention level. If you think of something like Second Life, when the user is like this in the world, that means they haven't done anything for a while and they're probably not at their desk; therefore we can sacrifice their performance to improve the performance of others. And if you look at facial expressions, say in telepresence applications, if someone starts to frown or scratch their head or something, it could be a sign of confusion. And one way that confusion could arise is because whatever I'm saying doesn't seem to be in sync with whatever is appearing on the screen. Therefore, if that user is important, maybe we should reorganize the system to improve the performance for that particular user. Now, speaking of these other applications, I think it's really important to continue to extend my work to other scenarios, something like virtual worlds. In something like today's Second Life, which is fairly popular, performance issues arise with as few as ten users in a single location. And people want to do conferences and group meetings in Second Life. So there's something that needs to be done. Now, Second Life is a centralized architecture, so perhaps we can change that. And maybe we can then [inaudible] deploy multicast and change the scheduling policy, now that everybody has to transmit and process at the same time -- not just the [inaudible] but everybody. And that's a step closer to solving these performance issues that arise. I also think it's important to apply some of these ideas to telepresence meetings. So in telepresence you have HD video and perhaps a CPU-intensive application, and the two are going to hurt each other's performance. So there's a video quality and response time tradeoff, and it would be interesting to look at aspects of the Lazy policy -- for example, to trade off unnoticeable response time increases for perhaps noticeably better video quality, or vice versa. In fact, it's not quite clear. I mean, in all videoconferencing systems we have results that say audio is the most important, text -- or the application -- is the next most important, and video, if you can do it. But with these telepresence things, high-quality video is becoming really important. And therefore we have to be able to either provide it or sacrifice something else to maintain the quality that's there, and maybe this Lazy approach will work. Again with telepresence meetings -- this idea just came to me recently, when I thought about doing meeting replay while the meeting is ongoing -- one of the ways people can review what happened in the meeting is if there's an automatic speech recognizer somewhere. Now, these ASRs, as they're called, are fairly CPU intensive, or can be. So the question is, where does it run? Does it run on each user's machine? That takes some processing power, but then all each machine has to do is transmit text to everybody else. Or do we run the processing on some fast, fast machine? The tradeoff there is that even though, yes, it can be faster and take less time, now suddenly everybody has to send audio to that machine instead of everybody sending text to each other. So this kind of tradeoff is basically what my system was designed for. Anytime there's a tradeoff that can be exploited, my system tries to do it, in the scenarios that I have. But I'd love to extend it to these other applications.
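The ASR placement question above is exactly the kind of tradeoff such a system would weigh, so here is a back-of-the-envelope sketch of how the two placements could be compared. All of the numbers and function names are hypothetical; real costs would have to come from measurements like the ones the self-optimizing system takes.

```python
# Back-of-the-envelope sketch (hypothetical numbers) of the ASR placement tradeoff:
# recognize speech on every participant's machine and ship small text messages,
# or ship audio to one fast machine that runs the recognizer for everyone.

def local_asr_delay(local_asr_time, text_send_time):
    # Each machine recognizes its own audio, then fans out tiny text messages.
    return local_asr_time + text_send_time

def central_asr_delay(audio_send_time, fast_asr_time, text_send_time):
    # Audio is shipped to the fast machine, recognized there, text shipped back.
    return audio_send_time + fast_asr_time + text_send_time

# Hypothetical costs in milliseconds; whichever is lower wins for this session.
print('local  :', local_asr_delay(local_asr_time=400, text_send_time=10))
print('central:', central_asr_delay(audio_send_time=150, fast_asr_time=120, text_send_time=10))
```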
And I'd like to look at mobile devices. In particular, I want to design energy-aware architectures and scheduling policies. For example, what if I take longer to perform a task but use less power, and the increase in the task time is not noticeable to users? Some people have already started doing this in other applications. I'd like to investigate it also. So I think -- oh, simulations and experiments. I want to do a user study with my simulator to see if it actually helps people allocate resources. So in large-scale online systems, administrators kind of use rules of thumb, or kind of just try to allocate, oh, this many servers and that will be enough. What if they had my simulator? Could they make more accurate resource or provisioning decisions? And, finally, in my work I found that it's almost impossible to do large-scale experimental testing, simply because the clusters that are available, say, PlanetLab or Amazon's EC2, don't give you enough control to make sure that you always use the same computers, that nobody's sharing them with you, and so forth. Therefore, I'd like to start some sort of collaborative effort that designs a large-scale experimental testbed just for collaborative applications. And I'd like to acknowledge all of my committee members, and many people at Microsoft Research, but I only listed Kori and Zhengyou and Rajesh. But many, many others. And I'd like to acknowledge my funding. And I'm sorry I went a little bit over time. I hope it's not a big deal. Thanks, everyone. [applause] >> Sasa Junuzovic: Yes. >> So you mentioned video a lot in your talk, and I'm actually wondering, in most of your applications, was the processing [inaudible] actually fairly light? I mean, the talk, sending over text, sending over, you know, changes in PowerPoint are few and infrequent. Have you thought about really processing-intense applications? I mean, 3D games? The reason I'm saying this is that if you think about the processing requirements there and if you think about running at 30 hertz, which is the standard these days, that means that your frame comes every 33 milliseconds. >> Sasa Junuzovic: Right. >> Which means your delay of 50 milliseconds is not even a remote possibility. >> Sasa Junuzovic: Right. >> Right. So have you thought about how you would address something like that, where you're more bound by the continuous stream of processing and the chance of interjecting anything is almost not possible? >> Sasa Junuzovic: So that's a good question. And, like I said, as you pointed out, I focused on push-based communication, not quite streaming-based communication. Now, saying the delay is not possible is not quite right, because if I just offset all of the frames by a little bit, okay, I'll still get the frames at 30 hertz on my local machine, right? I can have perhaps an initial delay. Now, how can I use that initial delay? Well, not all frames require the same processing power, not the same amount of processing time. Some frames are simple to process; some are not. At least that's my understanding. So perhaps sometimes when I suddenly go into less expensive frames, let's call them -- less expensive in terms of processing power -- perhaps there's some sort of delay I can interpose there and say, look, they're not that important anyway, so maybe I can render them a little bit later.
Now, I don't know enough about signal processing and I don't have any experience to tell you whether it would be useful, but the delay -- >> The changes could be done in the time frame of a frame? >> Sasa Junuzovic: Changes as in reconfiguring the system? >> Right. >> Sasa Junuzovic: No, no. No, no. No way. Especially if it's a distributed system. I mean, just think of the latencies involved. There's no way. All I can say is eventually it will switch. >> Yes, I have a question. So this research was motivated from [inaudible]. >> Sasa Junuzovic: The part I focused on, yeah. >> Right. So that motivation's really nice. [inaudible] did experiments to show that people indeed don't really notice these things? >> Sasa Junuzovic: So that's another good question. And I've relied on previous work to get those noticeable thresholds. It would be very interesting to see if people -- I'm banking on the fact, with this system or these ideas, that all else equal, if there's a noticeable performance difference, people will prefer systems that perform better. Now, whether they'll notice an improvement of performance during the session, that would be an interesting study. For example, what if you tell the users you're optimizing performance but you're really not -- would they be willing to tolerate lower performance, right? Or what if you tell them, look, we're working on it actively -- would they be willing to tolerate initially poor performance that improves a little bit later? Now, I haven't done any of these studies -- simply, I didn't have time -- to validate whether people notice these 50-millisecond differences in my scenarios, but that's an excellent idea and I'd love to work on it in the future. Yes. >> So the applications you chose were sort of dominated, I think -- not only are they sort of [inaudible] but they tend to be dominated by human delays as opposed to [inaudible] delays? Because in all those things you have to wait for a person to type, you have to wait for a person to go [inaudible] talk about the slide first. I guess my question goes to the question that was talked about before, which is sort of considering shorter-latency types of applications, or maybe one where you have to drag something across the screen, so that there's actually like a video sort of component to that. And then the other question I guess that kind of goes along with it -- that really wasn't a question, but was really like, do people notice performance skew? So if some people are getting a 20-millisecond delay and other people are getting a 100-millisecond delay, do they notice amongst themselves that something's wrong, and might that influence the perception of the delay? >> Sasa Junuzovic: Okay. So -- >> [inaudible] absolute delay for everyone? >> Sasa Junuzovic: Okay. So two questions. Let's go to the first one. Suppose that there's a dragging motion. So a telepointer. Okay. The issue that's going to come up there, and usually comes up as soon as you think about it, is jitter. How can you handle jitter? Well, from my understanding from the networking guys at my school, there's no real analytical model for jitter; we don't know how big it can get, when it's going to get big, and how long it's going to stay big. So in some sense there's not much I can do when it comes to jitter -- and nobody really can. So it's a constant across all applications.
But now let's think of something like a drag motion. The model I have presented can tell you which architecture to switch to when think times are low. Okay. So we can say when think times are low, maybe I should fan out my tree a little bit less so that each computer has to transmit to fewer people. I should make my tree deeper so that the transmission task is minimal, because even though a message comes every 33 milliseconds, processing a telepointer action on a Pentium 4 takes a millisecond. So I have about 32 milliseconds worth of time to do stuff. If I can bound my transmission time to 32 milliseconds, let's say, then everybody will be able to chug along and meet all of the requirements. Right? That's -- okay. Now, the second question was -- oh, absolute response time improvement versus -- so there are several metrics, right? There's local response time, there's remote response time, and then there's the difference in response times. That's another metric. And I'm not familiar with much work that's looked at that. Although people have speculated that it's important that everybody gets everything at approximately the same time in order to stay coordinated, this system doesn't do that. But it would be interesting to add algorithms that optimize configurations based on that requirement. I know some people do it -- [inaudible] and some of his students did it at Saskatchewan, where they said, I'm going to send a command to a remote person, but I'm going to delay processing that command locally and see how much I can delay it. Like, as latencies increase, see how much I can delay it so that we don't start making mistakes and we don't complain too much. But basically I click something and it doesn't happen for 200 milliseconds. >> So, sorry, just -- >> Sasa Junuzovic: It was a little -- >> Just to consider like a [inaudible] application, some people shouldn't get faster access to the [inaudible] -- it should try to be even across everybody? >> Sasa Junuzovic: Right. Very useful scenario. The system is not designed for that at the moment, although it's not designed against it either. You could plug in a multicast tree or plug in an analytical model that tries to minimize the difference in absolute response times. >> I was really intrigued by sort of using the human cognition threshold. And I realize it's not the contribution, but it is important for a lot of the decisions that were made. And kind of building on that question a little bit, you know, could you leverage it even more? For example, is it linear, such that a hundred milliseconds is twice as noticeable and twice as bad as 50? And if it's not, you might be able to leverage that. So let's say 80 is not really that noticeable from 60. Well, actually, then maybe you don't need to be as precise with the clock synchronization. That would be a good way in which you could leverage that. Do you know, is it linear? >> Sasa Junuzovic: Again, I can only say what I've seen in other people's results. And they've basically measured tolerance levels of various skews and response times. And I don't think it's quite linear. It's more like -- so from your side, it's more like that. And eventually it reaches a threshold where people say, I can't tolerate this, I'm not going to use this system anymore. And somewhere far before that there's a point of, oh, I can notice it, but I may not care.
Now, what the slope of that curve is exactly, I don't know. I think it approximately looks like the first half of a normal curve of some sort. But you could probably leverage it. You could say, look -- again, it depends on what the user's requirements are. Maybe the user's function says, I'm willing to tolerate response time increases of as much as 200 milliseconds as long as some of the other guys don't get response times that are better than 100 milliseconds. So, again, that's going to be incorporated into that function. >> [inaudible] >> Sasa Junuzovic: Yes. >> If you're just sharing text, if you're doing text messages, I suspect a second delay is not really going to matter. The cognitive processing of reading the [inaudible] is going to overpower your slight delay [inaudible] somebody presses enter and sends the text. But if you're shooting somebody in a video game, then I think, you know, then you're talking about 50 to 100 milliseconds of response time, which is really going to matter. >> Yeah, that is. I agree. Although, even with text, if it's a conversation, if it suddenly feels funky or weird or whatever because -- yeah. But -- yeah. >> Sasa Junuzovic: Definitely is [inaudible] -- [multiple people speaking at once] >> So around slide 114 or something, you talked in your future work about the kind of novelty of looking at these user-driven performance goals. And yet, you know, if you talk to the search guys, they are all -- and most kinds of server [inaudible] applications -- driven by quality-of-service contracts. And those quality-of-service contracts are largely driven by the user requirements. So Google or Bing or whatever is talking about, we need to service a query within a millisecond. >> Or Amazon, right, like they know sales go down by this much more for every 10 milliseconds. >> Exactly. >> Sasa Junuzovic: So I'm not -- yes, you're right. And I'm not saying this is a new field. This is not a groundbreaking approach to study in the future. Everything's kind of been [inaudible] to some degree, and this is, to some degree, you're right. To a large degree people have considered, let me see what the users want, and then I'm going to try to build a system that meets it. But in some cases that's not really done, especially if you assume that some users are more important than others. How many systems actually try to optimize the performance of the important users versus the less important users? I don't know many. Maybe there are, but I don't know. You know, you could maybe use the idea of -- I was talking to Kori earlier about this -- where in virtual worlds there's this idea of interest. So if I'm interested in something, I should have good quality of it. And the stuff I'm not interested in, I don't care if it's not the best quality in the world. And it's kind of done with the ideas of foci and nimbi and auras around people. So if my aura and your aura intersect, then perhaps we're interested in each other. Or if you're in front of me, then I'm interested in you, but less [inaudible], and even less in Kori, type thing. That idea has been kind of bounced around. But, again, it was based on user behavior. And this is a requirement: if the users say, what's in front of me, give me the good quality, that's exactly this kind of approach. But there are not many systems that have been able to take these arbitrary requirements that the users have.
At least none that I know of -- not many that I know of. >> And also to the point of doing tradeoffs, I mean, you know, any Skype or Communicator, any of them, is trying to fit the audio-video compression, [inaudible] compression, network streaming, and all that stuff within both the network pipe and the CPU, and so is doing those kinds of delicate tradeoff-balancing things. >> Sasa Junuzovic: They are, yes. That's a good thing. >> Yes [inaudible]. >> Sasa Junuzovic: Right. Yes. But just to extend that, let's look at that scenario, right? Suppose that all of us are in an audio -- in a conference. And I'm doing something else and I don't really care about seeing everybody. In fact, I'm happy with just getting the text from everybody. So instead of sending audio and video to me, you can save all that by sending me just the transcript, an auto-generated transcript. Or maybe I'm interested in the audio, so send me the audio. Or maybe I have a PDA and all I can get is audio; maybe I can't get anything else. I don't know if systems today automatically decide what to do. They might say, look, everybody gets poor audio, or everybody gets this much quality of video based on the current setup. But, again, on a 1:1 basis, I don't know many that make these tradeoffs. And the cool thing is that if you do have somebody important and somebody less important, you can sacrifice the performance of the less important person to improve the performance of the more important person, right? Nothing else changing. You can just say, oh, I'll give you better quality video, and Sasa, he doesn't really work here, give him poor quality video, you know, type thing. >> [inaudible] how do you know that this guy, this person, doesn't care about [inaudible] detect that [inaudible]? >> Sasa Junuzovic: You can use some hints. Yes, it's a problem. Which is why I kind of [inaudible] to that user response time function. I said, you guys tell me what you want, and I won't try to guess what you want. But I kind of hinted at it here. You can maybe use attention level or facial gestures to automatically decide if the user needs better performance or can afford to have poorer performance. >> Kori Quinn: Okay. Thank you. [applause] >> Sasa Junuzovic: Thank you, everybody.