>> Kori Quinn: So we're really excited today to have Sasa Junuzovic -- it's taken me a
long time to practice that name -- here to give us a talk today as part of his interview
loop.
Sasa just recently defended his Ph.D. at the University of North Carolina Chapel Hill.
His advisor is Prasun Dewan. Sasa is no stranger to Microsoft Research. He is an MSR
fellow. He has done four internships here, as well as two on the product side. So he has lots
of insights about things around here. I'm really excited to have him.
>> Sasa Junuzovic: Okay. Thank you for the introduction, Kori. Welcome, everybody,
to my talk. It's about collaborative applications. So let me start by defining what I mean
by that.
A collaborative application is an application shared among users in such a way that the
input of one user affects the screens of all other users.
Consider a shared checkers game. If one user enters a command to move a piece, then the
piece moves not only on that user's machine but also on all of the other users' computers.
The performance of these applications is extremely important. To illustrate, consider the
Candidates' Day at UNC. Each year the computer science department at UNC invites a
number of potential graduate students to view demos of the research done in the school.
Unfortunately, some of these students are also invited to the Candidates' Day at Duke,
which happens on the same day.
So to give all of the invited students a taste of research at UNC, UNC shares the demo
applications using an application-sharing system. So as long as the students at Duke have
some sort of Internet-connected portable device, they can participate remotely.
If the shared application doesn't respond quickly to the actions by the students at Duke, or notify them of the actions of others in a timely fashion, a student may get bored and quit the session.
As a result, the student may not end up coming to UNC. What's much worse, the student
may end up going to Duke.
Now, whether we've used collaborative systems or not, we've all experienced such performance issues in real life. Consider the commute from Seattle to Bellevue,
which many people do by taking Interstate 90. So sometimes during the day, like rush
hour after work, there are too many cars on the road and a traffic jam results. Nobody's
happy and there's nothing we can do about it.
Other times, such as early in the morning, there's hardly anybody on the road.
Everybody's happy, so there's nothing we need to do about it.
The rest of the time there's a sufficient number of cars on the road that the traffic is not
moving as well as it could be, so some people are unhappy. But there are some things we
can do. We can put in HOV or express lanes and at least improve the commute for some
of them.
From a computer science perspective, we would look at this performance issue in terms
of available resources. So when there are too many cars on the road and the traffic jam
results, we would say that the resources are insufficient and performance is always poor.
When there's hardly anybody on the road, we would say that resources are abundant and
performance is always good.
These two cases bracket the case when resources are sufficient but scarce, and it is possible to improve performance from poor to good. This is called the window of opportunity, and my work focuses on the window of opportunity in collaborative systems, which brings me to my thesis.
For certain classes of applications, it is possible to meet performance requirements better
than existing systems through a new collaborative framework without requiring
hardware, network, or user-interface changes.
So before I tell you how I built the system and what kind of classes of applications that
actually helps, let me give you a flavor of what it can do.
So this is an actual experimental result of a collaborative session with and without my
system. The blue line shows the performance without and the red line shows the
performance with my system.
And as you can see from the vertical axis, the lower these lines are, the better the performance.
If we look at this initial period in purple, there's no difference in performance with or
without the system I've created. But then shortly after this initial period in this orange
area, the self-optimizing system that I've built improves performance, and then sometime
after that it improves it again. And in fact the performance goes flat, almost to zero
parallel to the X axis.
And one question is what aspects of performance are being improved. Two important
aspects of performance are local and remote response times. There are other important
aspects, such as jitter, throughput, task completion time, and bandwidth. Like I say, all of
them are important, but my focus is on the response times.
The local response time is defined as the amount of time that elapses from the moment a
user enters a command to the moment that user sees the output for the command. The
remote response times, on the other hand, are defined as the amount of time that elapses
from the moment some user enters a command to the moment a different user sees the
output for the command.
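As a small sketch of these two definitions in symbols (the notation is mine, not from the slides): if user u enters command c at time t_in(u, c) and user v sees its output at time t_out(v, c), then

\[
RT_{\mathrm{local}}(u, c) = t_{\mathrm{out}}(u, c) - t_{\mathrm{in}}(u, c),
\qquad
RT_{\mathrm{remote}}(u, v, c) = t_{\mathrm{out}}(v, c) - t_{\mathrm{in}}(u, c), \quad v \neq u.
\]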
So what this graph here is showing really are the response time differences with and
without my system. And, as we can see here, the system improved response times by 21 milliseconds here, by 80 milliseconds here, by about 180 milliseconds here, and by 300 milliseconds here.
So the next question is when are performance differences noticeable.
Well, studies of human perception have shown us that people cannot distinguish
between local or remote response times below 50 milliseconds, but they can notice
50-millisecond increments and decrements in either local or remote response times. So
according to that data, the system is noticeably improving performance.
Now, the system I've created doesn't actually look at just the absolute response time
differences when it decides what to do. The reason is that these differences may not tell
the whole story.
To illustrate, consider a two-user scenario in which we have two optimizations, A and B. Optimization A is going to give 100-millisecond response times to user one and 200-millisecond response times to user two, and the other optimization does the opposite.
If we do a simple response time comparison, say the average response times, we would
say that the two optimizations are equally good.
But the response times of some users may be more important than those of others. So suppose user two is more important than user one. In this case we see that optimization A doesn't give user two as good response times as optimization B does, and therefore optimization B is actually better.
So to actually decide which optimization is better, we need external criteria, which are user dependent.
Now, one use for the criteria, as I mentioned, is to favor important users. And the data needed to enforce that criterion is the identity of the users. Another potential use for
criteria is to favor local or remote response times, and the data we need for that is the
identity of the users who input and those who observe.
In general, the criteria can be arbitrary and require arbitrary data. So there's no way we can encapsulate all of them into any particular system. So what my system does is it basically relies on the users to provide a response time function that encapsulates the users' criteria or requirements. And then the system is going to call that function with predictive response times for various optimizations and pass in some additional required data.
In its current form it supports arbitrary response time functions provided by the users, but it only provides the identity of the users and the identity of those users who input and observe.
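To make this concrete, here is a minimal sketch of what such a user-provided response time function might look like; the function name, the dictionary shape, and the selection criterion are all illustrative assumptions, not the system's actual API.

```python
# Hypothetical sketch of a user-supplied response time function.
# predicted_rts: {optimization: {(inputting_user, observing_user): response_time_ms}}
def response_time_function(predicted_rts, important_users):
    """One possible criterion: favor important users by minimizing the worst
    predicted response time observed by any user in important_users."""
    def worst_for_important(optimization):
        rts = predicted_rts[optimization]
        return max(rt for (src, dst), rt in rts.items() if dst in important_users)
    return min(predicted_rts, key=worst_for_important)

# The two-user example from above: A gives 100 ms to user 1 and 200 ms to user 2,
# B does the opposite. If user 2 is more important, B should win.
predicted = {
    "A": {(1, 1): 100, (1, 2): 200},
    "B": {(1, 1): 200, (1, 2): 100},
}
print(response_time_function(predicted, important_users={2}))  # -> "B"
```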
So now the question is how does the system meet response time requirements better than
existing systems. Well, I studied three important response time factors: collaboration
architecture, multicast, and scheduling policy. Of these three factors, only the impact on response times of the collaboration architecture has been studied before.
In particular, Chung has created a system that supports
dynamic transitions between architectures and has shown that these dynamic transitions
can improve performance in a collaborative session.
I extended his work by actually automating the decision when to make the switch. He
never provided that.
Now, the only other related work, by Wolfe et al., also presented a system that can choose the collaboration architecture automatically. That work was done concurrently with mine and was published just recently. But the more important thing is that they use developer hints and rules of thumb to decide which architecture to use, which is different from the approach taken in my work, as you will see later.
Now, the idea of using multicast in collaborative systems is not new. The T.120 protocol advocated the use of multicast to reduce the amount of data transmitted on
the network. However, how such a tree was built or what impact it had in various
collaboration scenarios was never studied.
I'm the first to study the impact of multicast on response times and to create a system that
can automatically decide whether or not to use multicast and what kind of multicast tree
to deploy in collaborative systems.
I'm also the first to study the impact of scheduling policies on response times and to create a system that can automatically decide which scheduling policy to use in order to
best meet users' response time requirements.
To give you a flavor of my work, let me focus and dig deep into one of these factors, the
scheduling policy. And let's just assume for now that a single core is available to carry
out the tasks in the collaborative system.
Well, if we're going to talk about scheduling, we better define the tasks that we're trying
to schedule. In general, there can be two kinds of tasks: those of the collaborative
system application and those of external applications.
And one interesting issue is how the scheduling of external application tasks impacts the scheduling of the collaborative application tasks.
Unfortunately, the typical working set of applications isn't really known. So rather than guess a wrong working set, I just assumed a working set of zero. In other words, I
didn't consider external application tasks. I focused only on collaborative systems tasks.
Now, these tasks are defined by the collaboration architecture. Collaboration architecture
views an application as consisting of two components: the program component and the
user-interface component.
The program component manages a shared state, while the user-interface component
allows users to interact with the shared state. Therefore, the user-interface component
must run on each user's machine while the program component may or may not run on
each device.
Regardless of the number of program components, each user-interface component must
be mapped to some program component to which it's going to send inputs and from
which it's going to receive outputs.
Two popular mappings have been used in the past. One is the centralized mapping in
which the program component runs on the computer belonging to one of the users,
receiving inputs from and sending outputs to all of the users.
The computer that runs the program component is called the master while all of the other
computers are called slaves.
Another popular mapping is the replicated mapping in which the program component
runs on each user's machine receiving inputs from and sending outputs to only the local
user-interface component. Therefore, in this case, all of the computers are masters.
Now, to keep all of the program components running on different computers in sync,
whenever a program component receives an input from the local user it sends that input
to all of the other program components.
Regardless of the actual program-to-user-interface mapping that we use, the masters perform the majority of the communication tasks. So in the replicated case, a master computer must send inputs to all other computers, and in the centralized case it must send outputs.
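As a tiny sketch (my own representation, not the system's) of these two mappings, where each user-interface component is mapped to the computer hosting the program component it talks to:

```python
# Illustrative only: map each user's UI component to a program component host.
def centralized_mapping(users, master):
    # One program component, on the master; every UI component maps to it.
    return {user: master for user in users}

def replicated_mapping(users):
    # A program component on every machine; each UI maps to its local one.
    return {user: user for user in users}

users = ["user1", "user2", "user3"]
print(centralized_mapping(users, master="user1"))  # all map to user1, the master
print(replicated_mapping(users))                   # every computer is a master
```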
One question is how is this communication achieved? With push-based communication, the information producer, the master, sends information to the information consumers, the other computers, whenever information is ready.
In pull-based communication, which is a dual of the push-based one, the information consumer basically polls the producer for information whenever the consumer is ready.
And then a third type of communication, called streaming communication, the producer
simply sends data to the consumers at some regular rate.
All of these models are used in real applications we see today. And my work focused on
the push-based communication.
Now, when you use push-based communication, one question that pops up is whether the
commands or data are unicast or multicast. To illustrate the difference between the two,
consider this ten-user scenario, in which user one's computer has to send commands to all of the other computers.
With unicast this transmission is performed sequentially. And if there are a large number of destinations, this transmission can take a long time.
Okay. Multicast divides the communication task among multiple computers. So when
user one's computer sends a command to user two, these two computers can start
transmitting in parallel.
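Here is a rough timing sketch of that difference, under the simplifying assumption of a fixed per-destination transmission cost t and no network latency; it is only meant to show how a second sender working in parallel shortens the time to reach the last destination.

```python
# Assumed per-destination transmission cost t (milliseconds); illustrative only.
def unicast_finish_times(n_dest, t):
    # The source sends to every destination one after another.
    return [t * (i + 1) for i in range(n_dest)]

def two_sender_multicast_finish_times(n_dest, t):
    # The source sends to destination 0 first; after that, the source and
    # destination 0 transmit to the remaining destinations in parallel.
    times = [t]
    source_next, forwarder_next = 2 * t, 2 * t
    for _ in range(n_dest - 1):
        if source_next <= forwarder_next:
            times.append(source_next)
            source_next += t
        else:
            times.append(forwarder_next)
            forwarder_next += t
    return times

print(max(unicast_finish_times(9, t=25)))              # 225: last of nine destinations
print(max(two_sender_multicast_finish_times(9, t=25))) # 125: two senders share the work
```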
Basically, what multicast does is it relieves any single computer from performing the entire communication task, or transmission task. The net effect is that with unicast only the
source has to both process and transmit commands, while with multicast any computer
may have to both process and transmit commands. Yes.
>> Can I ask a clarifying question?
>> Sasa Junuzovic: Yes.
>> So multicast in this case is not the multicast routing protocol that was big in the '90s for collaborative applications?
>> Sasa Junuzovic: So I'm looking at application layer multicast.
>> Okay.
>> Sasa Junuzovic: Okay?
Like I said, any computer may have to perform both of the tasks with multicast.
Now, in addition to these two mandatory collaborative tasks, there could be other tasks
related to concurrency control, consistency maintenance and awareness mechanisms.
While all of these mechanisms are important, in general they're optional, so I focus on the
mandatory tasks.
These tasks are in fact independent and in a replicated architecture a computer can both
process an input command and transmit it independently. And in the centralized case it
can do the same with outputs.
Now, the processing task is kind of a simple thing to understand: it's the job the CPU must do. The transmission task is slightly more complicated. So let's see what has to happen in order for a command to be transmitted to a destination.
First, the CPU must copy some data buffers into some memory location from which the
network card can then read the data, and then the network card can transmit it along the
network. So there are really two kinds of transmission tasks: the CPU transmission task
and the network card transmission task.
If we think of scheduling, the CPU transmission task is schedulable with respect to the CPU processing tasks, while the network card transmission task is not. It always follows the CPU transmission. And if we use nonblocking communication, in fact, it can run in parallel with the CPU processing task.
So in terms of scheduling, like I say, the CPU transmission task is schedulable while the network card transmission task is not.
Now, the order in which we execute the CPU processing and transmissions can impact
the local response times.
To illustrate, consider this three-user scenario, in which user one has to send data to user
two and user three. And suppose user one enters a command to move the piece.
Intuitively, to minimize the local response times, user one's computer should process the
command first and transmit second. The reason is that the local response times don't
include the CPU transmission time.
Alternatively, to minimize the remote response times, the CPU should transmit the
command first and process it second because in this case the remote response times don't
include the CPU transmission time.
So we know an intuitive way of scheduling these tasks when either the local or the remote response times are more important: we're going to process first or transmit first, respectively. What about the case when they are equally important? Well, in this case, it may make sense to use some sort of concurrent scheduling policy.
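A rough way to write down this intuition (my notation, consistent with the model that comes later): for a source with processing time p, per-destination CPU transmission time t, and n destinations,

\[
\text{process-first:}\quad RT_{\mathrm{local}} \approx p, \qquad \text{first command leaves the CPU after} \approx p + t,
\]
\[
\text{transmit-first:}\quad RT_{\mathrm{local}} \approx n\,t + p, \qquad \text{first command leaves the CPU after} \approx t,
\]

so each policy buys one kind of response time at the expense of the other.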
Now, one issue with these existing policies is that they trade off local for remote response
times in sort of an all-or-nothing fashion. If we use process-first scheduling, users
experience good local response times but poor remote response times. And if we use
transmit-first scheduling, then users experience good remote response times but poor
local response times. Yes.
>> [inaudible]
>> Sasa Junuzovic: So before I get started getting into scheduling, I said let's suppose
there's a single core that can schedule these tasks. And I'll come back to the multicore
case after this. And I'll address that question particularly.
So then the remote response times are good, but the local response times are -- yes?
>> Sorry. [inaudible] talking in the abstract at this point, but what's the ballpark of the difference between these two scheduling policies in terms of time? Like you said 50 milliseconds matters. Is this one or is this a hundred millisecond difference?
>> Sasa Junuzovic: So in some scenarios it can be seconds. If you -- yeah. It can be
pretty high. Sometimes it's not very high depending on the scenario. But in some
scenarios, it can be on the order of seconds.
And I'll give results, by the way, that show the kind of absolute value differences. But if we use the concurrent policy, then both the local and remote response times can be poor.
So ideally we would like a policy that can improve response times without this tradeoff. Now, since the existing scheduling policies don't allow us to really control this tradeoff, we need a new policy.
And since the system's approach gave us these policies, we need a new approach. And I
turned to psychology. And I designed a policy called Lazy. And the idea behind this
policy is to trade off unnoticeable increases in the local response times for noticeable
decreases in the remote response times.
In this case users stay happy with their local response times because they can't tell the
difference. But they become happier with the remote response times because they can
notice the improvement.
Now, you can imagine a dual of this policy that reverses this tradeoff. And it's also -- it's
just as valid as this one, but my focus was on this particular version of the Lazy policy.
So let me give you the basic idea behind the implementation. The basic idea is that the computer is going to temporarily delay processing. And during this processing delay, it's going to transmit to as many computers as it can. But as soon as it realizes that if it transmits any more the delay will become noticeable, it stops transmitting and processes the command. And only when the command is processed does it resume any transmissions that were left.
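A minimal sketch of this idea for a single forwarding computer (the names and the call signature are mine; the real system's interfaces differ), using the rule described later that the allowed processing delay is the noticeable threshold minus the intermediate delays accumulated so far:

```python
NOTICEABLE_MS = 50  # the perception threshold from the studies cited earlier

def lazy_schedule(command, next_hops, cpu_tx_ms, delay_so_far_ms, process, transmit):
    # Transmit to as many next hops as fit within the unnoticeable budget,
    # then process, then resume whatever transmissions are left.
    budget = NOTICEABLE_MS - delay_so_far_ms     # maximum allowed processing delay
    pending = list(next_hops)
    while pending and budget >= cpu_tx_ms:
        transmit(command, pending.pop(0))        # one more send fits unnoticed
        budget -= cpu_tx_ms
    process(command)                             # any further delay would be noticed
    for hop in pending:                          # resume the remaining transmissions
        transmit(command, hop)
```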
So in this particular scenario, user two's -- the benefit compared to, say, process-first is
that user two's remote response times improved, and others didn't notice a difference.
More generally, the Lazy policy can give us improvements in the response times of some users without noticeably degrading the response times of others.
Now, if we compare it to the process-first policy, the local response times of Lazy are not
noticeably worse, and this is by design.
But the Lazy remote response times can be noticeably better than those of the
process-first policy. So it seems here like we're getting a free lunch. Yes, Scott.
>> This is probably a naive question, but in order to keep the processing below a
noticeable threshold, do you need to know how long it's going to take you to do the
processing?
>> Sasa Junuzovic: Yes. Well, actually, no, no, you don't need to -- you could look at it
that way, but you could also say, look, however long the processing takes, the fastest it's
going to be is X, right? I don't really care what X is. But if I delay X by 50 milliseconds,
users won't notice. If I increase X by 50 milliseconds, users won't notice.
Okay. So the benefits of this Lazy policy, I've shown them rigorously in a mathematical
model in my thesis. And it's fairly large and sort of complex and it supports not only
single commands but also concurrent commands and type-ahead.
And I'm not going to go through all of it, but let me give you a flavor of this model by
showing you how it captures the tradeoff between Lazy and transmit-first.
So suppose we have a replicated multicast architecture, like I've shown here, and consider just a single command. And suppose that we have a path from the source to a destination as shown here. We call this path pi, and we denote the computers on this path from pi 1 to pi 6; basically, the i-th computer on the path is pi i.
And in this case the remote response time of the last user is given by this equation here.
And let me break it down really quickly.
The first term in this equation accounts for all of the latencies on this path. The second
term in this equation accounts for the delays the message experiences on all of the
intermediate computers; that is, all but the destination computer. And the third term
in this equation accounts for the delay of the destination computer.
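Written out in my own notation (a reconstruction of what the slide's equation describes, not a copy of it), with the path \(\pi = \pi_1, \dots, \pi_n\) from the source to this destination:

\[
RT(\pi) \;=\; \underbrace{\sum_{k=1}^{n-1} \ell(\pi_k, \pi_{k+1})}_{\text{network latencies}}
\;+\; \underbrace{\sum_{k=1}^{n-1} d_{\mathrm{int}}(\pi_k)}_{\text{intermediate delays}}
\;+\; \underbrace{d_{\mathrm{dest}}(\pi_n)}_{\text{destination delay}},
\]

where \(\ell\) is the latency between consecutive computers, \(d_{\mathrm{int}}\) is the delay a command experiences on a computer before it leaves it, and \(d_{\mathrm{dest}}\) is the delay at the destination itself.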
Now, I have to consider the intermediate and destination delays differently, because the
former affects the response times of remote users, while the latter affects the response
times of the local users.
Obviously the network latencies are going to be independent of the scheduling policy.
But let's see what happens with these delays starting with the transmit-first intermediate
delay, for which the equation is shown here.
Basically it states that the transmit-first delay is equal to the amount of time the computer
requires to forward the command to the next computer on the path. And this depends on
several factors. It depends on the CPU transmission time, the network card transmission
time, and the order in which this intermediate computer transmits commands. The order,
the destination order.
Now -- let me see here. Oh. It must include the CPU transmission time, because the
network card can't start transmitting until the CPU at least transmits once.
And in all of my work, the network card transmission time was always bigger than CPU
transmission time, perhaps intuitively. And therefore I have to multiply the network card
transmission time by the destination index of the next computer on the path in order to
get this time. This is pretty simple.
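In the same sketch notation, and as my reading of what is being described here: if \(t_{\mathrm{cpu}}\) and \(t_{\mathrm{nic}}\) are the CPU and network card transmission times, and \(j\) is the position of the next computer \(\pi_{k+1}\) in \(\pi_k\)'s destination order, then roughly

\[
d_{\mathrm{int}}^{\,\mathrm{TF}}(\pi_k) \;\approx\; t_{\mathrm{cpu}} \;+\; j \cdot t_{\mathrm{nic}},
\]

since the network card cannot start until the CPU has transmitted at least once, and, with \(t_{\mathrm{nic}}\) larger than \(t_{\mathrm{cpu}}\), the card's sends to the earlier destinations pace when the command leaves for \(\pi_{k+1}\).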
Now, let's look at the Lazy intermediate delays, which are slightly more complicated,
because there are two cases I have to consider. I have to consider the case when
transmission happens before processing and when transmission happens after processing.
So if transmission happens before processing, the equation is given by this. I just took that case and plugged it in here. As you can see, it's equal to the transmit-first
intermediate delay. And this makes sense because in both cases the processing doesn't
impact -- doesn't get accounted for in the delay, therefore the two should be the same.
And basically it's going to occur when the total CPU time required to transmit the
command to the next computer in the path is less than the maximum allowed processing
delay.
Now, this processing delay is equal to the noticeable threshold minus the sum of the
intermediate delays encountered so far.
So what we have -- I've tried to clean this up a little bit. What we have is that if the
processing delay is larger than the total CPU transmission time required to transmit to the
next computer in the path, then the Lazy intermediate delay is equal to the transmit-first
intermediate delay.
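In the same sketch notation, the maximum allowed processing delay at the k-th computer, and this first Lazy case, can be written roughly as

\[
D_k \;=\; T_{\mathrm{noticeable}} \;-\; \sum_{m < k} d_{\mathrm{int}}^{\,\mathrm{Lazy}}(\pi_m),
\qquad
d_{\mathrm{int}}^{\,\mathrm{Lazy}}(\pi_k) \;=\; d_{\mathrm{int}}^{\,\mathrm{TF}}(\pi_k)
\quad \text{when the CPU transmission to } \pi_{k+1} \text{ fits within } D_k.
\]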
But let's see what happens to computer -- the Kth computer in the path, pi K, as K
increases.
So as K increases, the sum of the intermediate delays experienced so far also increases.
Whoops. So it's going to get bigger. And as a result the processing delay is going to get
smaller. Eventually this processing delay is going to be less than the total CPU time
required to transmit to the next destination on the path, in which case we have to consider
the second case of Lazy policy in which transmission happens after processing.
Question.
>> [inaudible] centralized [inaudible] to figure out whether the delay is -- it doesn't sound
like it's a local policy.
>> Sasa Junuzovic: It isn't. It's a distributed algorithm, yeah.
>> [inaudible]
>> Sasa Junuzovic: You can do it -- there are several ways of implementing [inaudible].
>> You don't want centralized because then you'll have the delays of figuring out
whether -- you know what I mean? It's just [inaudible] just as bad. So centralized is
[inaudible]. But I don't see [inaudible].
>> Sasa Junuzovic: So maybe we're thinking -- talking about a different thing. But when
I say distributed algorithm, each computer has to make a local decision based on what it
knows so far.
>> But it doesn't have enough information to make the decision.
>> Sasa Junuzovic: Right. So I will show you how I actually implemented this policy
when I talk about the implementation. But there are various ways of implementing this
Lazy policy. So one way is to say, look, forget about this distributed algorithm.
Whenever you get a command, delay it by the noticeable threshold but no more. Okay, that's one way of doing it.
>> But that's not the math that you just showed.
>> Sasa Junuzovic: This is not what I'm doing. So this is not the case I'm considering.
The problem -- the reason I didn't consider that case, although it's valid, is that it kind of
favors downstream computers rather than upstream computers. So the upstream
computers are always going to delay their processing to give data to the downstream
computers. And if you add up all of the delays, it can actually become noticeable. If
everybody delays by the noticeable amount, eventually the sum of those delays becomes
noticeable.
It favors downstream computers. The other option is to say, look, just take the noticeable
threshold and subtract -- calculate the difference from now, from the time the command
was input to now. If that difference is bigger than noticeable, then don't delay and just
process.
This is, again, a distributed algorithm. How it's done, you need some clock
synchronization, which can be done relatively simply, I mean, nowadays.
But that's, again, not what I did. The reason is that if network latencies are high, then what's going to happen is there's not going to be much difference between this Lazy and the process-first policy, since only the first guy is going to delay.
So that's not what I did. I took the middle ground where I say, look, subtract from the
noticeable threshold only the sum of the intermediate delays encountered so far. And I
gave this kind of middle ground where there will be benefits to users without noticeably
hurting the response times of others.
>> And you'll talk about how you measured that later?
>> Sasa Junuzovic: Yeah.
>> Okay.
>> Sasa Junuzovic: Okay. So where was I? Oh. So this is the case when the -- in the
Lazy policy when transmission happens after processing. So obviously it's going to be at
least the transmit-first intermediate delay, but it's also going to include the processing
costs, in this case the costs of processing the input and the output.
Now, eventually the delay I was talking about, the processing delay, it's going to go down
to zero. And therefore these two terms that are left in this equation are going to become
zero. In which case we get that the Lazy intermediate delay is equal to simply the
transmit-first intermediate computer delay plus the processing time.
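Putting the two cases together in the same sketch notation, with p the processing time of the command (input plus output in the replicated case):

\[
d_{\mathrm{int}}^{\,\mathrm{Lazy}}(\pi_k) \;\approx\;
\begin{cases}
d_{\mathrm{int}}^{\,\mathrm{TF}}(\pi_k), & \text{transmission to } \pi_{k+1} \text{ fits within } D_k,\\[4pt]
d_{\mathrm{int}}^{\,\mathrm{TF}}(\pi_k) + p, & \text{otherwise (processing comes first).}
\end{cases}
\]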
Now, if we combine the results we have so far, we will see that basically transmit-first
intermediate delays dominate that of Lazy. In other words, transmit-first is always better
than Lazy so far.
But let's consider the destination delays, and, again, let's look at transmit-first.
The equation is given here. It's going to equal at least the processing time. Since this is a
destination computer, it must process the command. But since it's transmit-first, it's also
going to include the total CPU transmission time that the destination takes to transmit to
all of the other destinations.
In general, each computer may have to forward to other computers. Like in this case, the
destination forwards to four other machines, that takes some time.
Okay. So basically a simplified equation, I said, look, the transmit-first destination
[inaudible] delay is equal to the total CPU transmission time it has to do, plus the
processing time. Now, let's see what happens to the last computer on the path, to the
destination as the number of computers to which it forwards increases.
As the number of computers to which it forwards increases, so does the total CPU
transmission time, and therefore so does the transmit-first destination delay.
Now, if this number of destination goes really high, the delay can become very high, as
big as you want, depending on the number of users in this scenario.
Now, let's look at the Lazy destination delay. Here's the equation. Of course it must
include the processing times, and this is the destination. But it can be more than
processing time. By how much? Well, by the minimum of the total CPU transmission
time and the maximum processing delay. Remember, we can never delay more than the
maximum processing delay, which is captured by this minimum function here.
So, again, I've just cleaned up this slide really quickly to make some room. Let's see
what happens to the last destinations -- to the destination delay with Lazy as the number
of computers to which the destination forwards increases.
As with the transmit-first policy, as the number of destinations to which the destination
forwards increases, so does the Lazy destination delay.
However, in this case, it's capped. It can never be more than the maximum processing
delay, which is by definition never more than the noticeable threshold.
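And the two destination delays, again as a reconstruction in my notation, with \(t_{\mathrm{cpu}}^{\,\mathrm{tot}}\) the total CPU transmission time the destination needs for its own forwards and D its maximum allowed processing delay:

\[
d_{\mathrm{dest}}^{\,\mathrm{TF}} \;\approx\; t_{\mathrm{cpu}}^{\,\mathrm{tot}} + p,
\qquad
d_{\mathrm{dest}}^{\,\mathrm{Lazy}} \;\approx\; \min\!\bigl(t_{\mathrm{cpu}}^{\,\mathrm{tot}},\, D\bigr) + p,
\]

so the transmit-first destination delay grows with the number of forwards, while the Lazy destination delay is capped by the noticeable threshold.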
So now if we combine this result with our previous finding, we see that the Lazy and the
transmit-first policies actually don't dominate each other.
So let me give you an example to illustrate how this actually -- how these results actually
hold true. And let's consider a 12-user scenario. And suppose we're using replicated
multicast architecture in which user one transmits to user two, then three, four, five, six,
and seven, and user two transmits to user eight, then nine, ten, eleven, and twelve.
Now, suppose user one enters all the commands. The model has given us all the parameters we need to consider. The CPU transmission time, say it's 12 1/2 milliseconds. This is a hypothetical example.
Network card transmission time is 25 milliseconds. Processing time of input and output
combined, say it's 75. And the network latencies are 25 milliseconds and of course the
noticeable threshold is 50 milliseconds.
So let's analyze the remote response times of user eight with the process-first and Lazy
policies.
So with the process-first policy, when user one enters a command, user one's computer is
first going to process the command. Then it's going to -- the CPU is going to transmit to
user two, then the network card has to transmit to user two, the command has to reach
user two's computer. And as soon as it reaches user two's computer, user two's computer
is going to process it.
Then it's going to transmit to user eight and the network card will transmit to user eight.
It has to -- the command has to reach user eight's computer, and then user eight's
computer is going to process the command.
Okay. Now, let's see what happens with the Lazy policy. When user one's computer
receives the command, first thing it's going to do is it's going to check can I delay
processing and transmit instead. Well, since the transmission time is less than a
noticeable delay, 12 1/2 versus 50 milliseconds, it says yes, I can.
So it transmits first to user two. Then the network card transmits to user two, the
command has to reach user two's computer. And when it reaches user two's computer, it
does the same thing that user one's computer just did; it says can I delay this before I
process, can I delay processing and transmit to at least one destination.
So it says, okay, the noticeable threshold of 50 minus the sum of the intermediate delays so far, which is the single CPU transmission time plus the network card transmission time on user one, and says yes, I have enough time to transmit once before I process, so it does.
And then the network card transmits on user two's computer, command has to reach user
eight. And because user eight's computer doesn't forward, it simply processes.
So as we see here, Lazy gives noticeably better remote response times than the process-first policy. And in fact a more interesting thing is that the Lazy improvement in the remote response times compared to process-first is additive.
So as we see here, there are three processing times included in the process-first policy
response time while only one in the Lazy response time.
In fact, the more computers in the path that delay processing, the better this -- the more
additive this improvement is.
So that was process-first versus Lazy. Let's look at transmit-first. And let's look at the
remote response time of user two. So with the transmit-first policy, when user one's
computer receives the command, it's going to transmit it and then the network card is
going to transmit it, then the command has to reach user two's computer. And when it
receives it, it's going to transmit to five destinations and finally then process.
With Lazy, what's going to happen is user one's computer will say can I delay processing
before I transmit. Yes, as before. So it's going to transmit to user two. The command
must reach user two's computer. And when it gets it, it says can I delay processing by
transmitting. Well, yes. So transmits to user eight.
Then it asks the question again: Can I still delay processing before transmitting. And it
says no because you've used up all of the noticeable threshold -- the maximum processing
delay. And therefore instead of transmitting it's going to process.
And, as we see here in this example, the Lazy remote response times for user two are
noticeably better than those of transmit-first.
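To make the two walkthroughs concrete, here is a back-of-the-envelope script using the hypothetical numbers above; the hop-by-hop accounting is my reading of the walkthrough, so treat it as a sketch rather than the exact model.

```python
# Hypothetical parameters from the 12-user example (milliseconds).
T_CPU, T_NIC, LATENCY, P_PROC, NOTICEABLE = 12.5, 25, 25, 75, 50
HOP = T_CPU + T_NIC + LATENCY   # time for a command to cross one hop once it is sent

# Budget left at user two's computer under Lazy: 50 - (12.5 + 25) = 12.5 ms,
# which is just enough for one more CPU transmission before processing.
budget_at_user2 = NOTICEABLE - (T_CPU + T_NIC)

# User eight's remote response time (path: user 1 -> user 2 -> user 8).
process_first_u8 = (P_PROC + HOP) + (P_PROC + HOP) + P_PROC  # process, then forward, at each hop
lazy_u8          = HOP + HOP + P_PROC                        # both forwarders delay processing
print(process_first_u8, lazy_u8)   # 350.0 200.0 -> the 150 ms gap is two processing times

# User two's remote response time.
transmit_first_u2 = HOP + 5 * T_CPU + P_PROC   # user 2 forwards to five computers, then processes
lazy_u2           = HOP + 1 * T_CPU + P_PROC   # only one send fits in the 12.5 ms budget
print(transmit_first_u2, lazy_u2)  # 200.0 150.0 -> a noticeable 50 ms improvement
```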
But let's hold on. This was a hypothetical example. I've carefully constructed it to
illustrate the benefits of Lazy versus the other policies. The question is can these
response time differences be noticeable in realistic scenarios.
So to answer that question, I simulated the performance using, again, the analytical model I've shown you. But of course I need realistic simulations. So to do that, I simulated a distributed PowerPoint presentation, for which I require realistic values of the processing and transmission parameters.
So to get those, I measured their actual values from logs of
real PowerPoint presentations. Now, in these particular logs there were no concurrent
commands and there was no type-ahead, so I couldn't test that part of my model. I could
test the part of the model that talks about single commands, the one we've seen.
And I measured these costs on a number of single-core machines, a netbook, a P4
desktop, and a P3 desktop. Why a P3 desktop? To simulate next-generation mobile
devices.
Now, suppose that the presenter is using the netbook to give this talk and that the
presentation is being given to 600 people who are using a variety of P4 desktops and
next-generation mobile devices.
Now, suppose that these computers are organized in such a way that six computers forward to clusters of 99 computers. So in this case, these six computers here are going to each forward to 99 other computers. The forwarders themselves can communicate any which way they want.
Suppose that the forwarders are on the same LAN, so the latencies between them are low.
And suppose that the rest of the users are at various places around the world. So to get
realistic latencies from the forwarders to the users around the world, I took a subset of
actual latencies measured between 1740 machines at random locations around the world
by other people, not by me.
And here are the results. Let's compare first the process-first and Lazy policies. For the local response time we see here, the Lazy local response time is worse than the process-first local response time, but not noticeably worse, by about 49 milliseconds. And here are the remote response time results. What this table shows is the number of users whose Lazy remote response times are equal to, noticeably better than, noticeably worse than, or not noticeably different from those of process-first.
And as we see here, for all but one user, the Lazy response times are noticeably better than the process-first response times, in this particular scenario by as much as 600 milliseconds, although it can be more in other scenarios.
And the one user for whom the performance was not noticeably different still got better performance, but only by 36 milliseconds.
>> What's special about that?
>> Sasa Junuzovic: What happened in this particular case? So it can happen that these
guys are not noticeably different because imagine that you're using process-first, right?
And the processing costs are low. And I transmit to you first. I have to transmit to a
bunch of guys, but I transmit to you first. So with process-first, I have a low processing
cost, and then you're my first destination. And you also have to say transmit to a bunch
of other people.
As soon as you get the message you're going to process. Now, with Lazy, I'm going to
delay by perhaps 50 milliseconds, you're going to delay by a little bit. So if the sum of
those delays is perhaps -- can be in fact a little bit greater than the sum of the processing
delays on my machine, you can get that these guys are not noticeably different. Yes.
>> Is network topology on the table as far as reconfiguring it for each particular
situation? Like you chose six transmits [inaudible] what if you had chosen to build more
of a tree?
>> Sasa Junuzovic: So that's a good question. So my system by itself can deploy any tree it wants. For this particular simulation, in some sense to get something realistic -- if
you guys are familiar with WebEx, that's kind of the architecture they use. They have a
bunch of forwarder machines which have low latencies between them, and they forward
to everybody else.
I chose to use that kind of WebEx-like architecture in this particular simulation. But
you're right, the results will be different if there's more of a tree that's deployed.
Right now what I'm talking about is just the scheduling policy. There's a part of my system that can also calculate what tree should be deployed, what kind of communication architecture to use.
Okay. So now let's look at Lazy versus the other two policies, transmit-first and
concurrent. Oh, sorry, let me just finish. By looking at the interesting results here, we
see that the intuition is confirmed in this scenario. Lazy dominates process-first. In other
words, process-first policy is obsolete.
Now let's compare Lazy with transmit-first and concurrent. Here are the local response times. And as we can see, the Lazy local response time is noticeably better than the transmit-first local response time, by about 70 milliseconds. But in this scenario, it's not noticeably better than the concurrent one, although it can be in other scenarios.
And here are the remote response times, again in the same table, just compared to transmit-first. For 407 users, the transmit-first and Lazy response times are equal. For five users, Lazy gives noticeably better response times, by as much as 158
milliseconds. But for 187 users it gives noticeably worse response times, by as much as
240 milliseconds.
And in this simulation the results were the same with the concurrent policy. But what we
can take out of this is again if we look at the interesting things, none of these policies
dominate each other. But let's look closer at the five users that get better remote response
times with Lazy than with the transmit-first policy and concurrent. It turns out that four
of these five users are the forwarders. These are the guys that are doing all the work.
So the Lazy policy is in some sense more fair than the other policies. Because it keeps
the remote response times low of the guys who are doing all the work of transmitting the
commands.
So if we combine all of these results here from this simulation, what we have is that if the
users want to improve as many remote response -- as many response times as possible,
the policy that should be used is transmit-first or concurrent. But if the users say, look,
improve as many remote response times as you can but don't noticeably hurt local
response times, then we should use the Lazy policy.
And if the users say something like improve as many remote response times without
noticeably degrading local response times or the remote response times of those users
who forward, then again we should use the Lazy policy. So we know what decision to make.
Now, I've actually built a system that can automatically make this decision. And this
system is a collaborative system first and a self-optimizing system second. If it's going to
automatically choose which scheduling policy to use, the collaborative functionality had better support all of the transmission and processing tasks that are possible. Therefore, this collaborative functionality must support both centralized and replicated architectures. It must support unicast and multicast. And of course all of the scheduling policies. And for
now the only self-optimizing functionality is to automatically choose which scheduling
policy to use.
So let's see how this system shares applications to start off with. To share an application
we somehow have to intercept inputs and outputs. To do so, the system interposes a
client-side component on each machine between the user-interface component and the program
component. And it does so on all machines.
And consider the centralized architecture first. When the client-side component of the
master intercepts an input from the local user, it simply forwards it to the local
program component. But when it intercepts an output from the program component, it doesn't just send it to the local user-interface component, it also sends it to the client-side components on all of the slaves. And then all of those client-side
components forward the output to their local user interfaces.
And when the client-side component on a slave intercepts an input from its local user, it cannot forward it to a local program component -- there isn't an active one. It has to forward it to the client-side component on the master, and then the same thing happens as before.
In the replicated architecture, whenever a client-side component intercepts an input, it
doesn't forward it only to the local program component, it also forwards it to the client-side components on all of the other machines, which then forward it to their program components, and the rest is as you can see on this slide.
Now, the system also supports multicast. Suppose that we want to have user one transmit to user two and user three, but we want user two to transmit to user four. Okay. So as you can see here, you can just map the client-side components to support this multicast scheme, or any arbitrary multicast tree.
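A minimal sketch of this interposition (class and method names are mine, not the system's actual API), showing how one component can route an intercepted input under either mapping and along an arbitrary forwarding tree; output distribution in the centralized case would follow the same pattern in reverse.

```python
class ClientSideComponent:
    # Sketch only: sits between the local user-interface component and the
    # (possibly absent) local program component.
    def __init__(self, user, local_program=None, master=None, children=()):
        self.user = user
        self.local_program = local_program  # None on a centralized slave
        self.master = master                # the master's component, if this is a slave
        self.children = list(children)      # components this computer forwards to (multicast tree)

    def on_local_input(self, command):
        if self.local_program is None:      # centralized slave: hand the input to the master
            self.master.on_forwarded(command)
        else:                               # replicated machine, or the centralized master
            self.on_forwarded(command)

    def on_forwarded(self, command):
        if self.local_program is not None:
            self.local_program.process(command)
        for child in self.children:         # relay along the configured multicast tree
            child.on_forwarded(command)
```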
And it also supports all of the scheduling policies. Each client-side component creates
separate threads for the transmission and processing tasks. To support the transmit-first
policy, it simply gives higher priority to the transmission task, lower priority to the
processing task. It switches those priorities to enforce a process-first policy and then
gives them equal priorities to enforce the concurrent policy. And this is going to address
your question next.
So how does it implement the Lazy policy? This is more complicated since it's a
distributed algorithm. Each client-side component on a forwarding machine has to delay
processing while the delay is not noticeable. So it must know the delay so far.
So how do we do this? Well, to help each computer deduce this information, whenever a user enters a command, the source computer timestamps the command with its local time and forwards that timestamp along with the command. Each forwarder then calculates the difference between its own current time and that command input time.
And if this difference is not noticeable, it's going to continue delaying processing.
Of course, this needs some sort of clock synchronization --
>> [inaudible]
>> Sasa Junuzovic: Yes. Well, it has to be fairly accurate.
>> Well, if you don't have it accurate to milliseconds, then it's all messed up, right? I
mean, otherwise you won't be able to tell what a noticeable delay was if you can't tell
more than 50 milliseconds.
>> Sasa Junuzovic: Yeah, 50. [inaudible] order of tens of milliseconds, if the errors are
on the orders of tens of milliseconds, forget it, it's not going to work, right? But it turns
out in my experiments, okay, I was able to use a very simple scheme that seemed to
work. So more robust schemes are possible, and if greater accuracy is desired, we can
use them. But a simple scheme worked for me.
Okay. Now, let's see. Oh. So of course now to choose one of these policies, we have to
use the self-optimizing part of the framework. And let's see what that looks like. It basically has a server-side component and a client-side component. The server-side component has four components of its own. There's the analytical model component, which applies the model, and the response time function component, which applies the user's response time function.
Now, the results from the analytical model are sent to the response time function in what
I call a predictive response time matrix. And when we're choosing the scheduling policy, what this matrix has is, for each scheduling policy, the predictive response times of all users for commands entered by each user.
It sends that to the response time function, and the response time function says, okay, the
scheduling policy you should use is this, and then the system manager on the server side
kind of forces all of the clients to use that policy.
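A hedged sketch of the loop just described (the names, and the analytical_model.predict call in particular, are illustrative assumptions): the analytical model fills the predictive response time matrix, the user's response time function picks a policy, and the system manager pushes that choice to the clients.

```python
POLICIES = ["process-first", "transmit-first", "concurrent", "lazy"]

def choose_scheduling_policy(analytical_model, response_time_function, parameters, clients):
    # Predictive response time "matrix": for each policy, the predicted response
    # time of every observing user for commands entered by every inputting user.
    matrix = {
        policy: analytical_model.predict(policy, parameters)  # {(inputter, observer): ms}
        for policy in POLICIES
    }
    chosen = response_time_function(matrix)     # user-supplied criteria make the call
    for client in clients:                      # the system manager enforces the choice
        client.set_scheduling_policy(chosen)
    return chosen
```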
Now, of course to apply the model in the first place, we have to somehow gather
parameter values. That's the job of this parameter collector. It can either measure them dynamically, on the fly, or it can use historical values.
Now, to measure them dynamically, the system basically interposes a client-side
component of the optimization framework which is part of the client-side component
we've seen before. And that optimization, client-side component simply times all of the
commands.
Now, I have slides in the back, left out for time, that talk about how I measure all of them, and I can get back to them. But for now I'm just going to say that the client-side component is able to measure them.
Okay. Now, finally, how do we switch between policies? Well, when the system manager is told which policy to switch to, it simply tells each client-side component: look, if you want to use transmit-first, make sure your priorities are set so the transmission has higher priority than processing. To use process-first, it simply switches them; for concurrent, it gives them equal priorities; and so forth. And to force the Lazy policy, it tells it to apply
the Lazy algorithm. Yeah.
>> So let's say that I have a client that is compromised and picks a policy that
deliberately screws up the system. Can your system adjust to that or detect it and pull
that person out?
>> Sasa Junuzovic: So I didn't look at -- what's the word, basically compromised issues
of security or issues of somebody actively trying to cheat the system. That was not part
of my research. But it's a very good question.
I don't -- my system at the moment is not designed to handle it, and it will be kind of
cool, actually, to try and figure out if there's ways to -- I mean, you can do it in some
scenarios in different kind of applications. Perhaps you can also do it here also.
Okay. Now, there's an interesting implementation issue that arises during the switch of scheduling policies, related in particular to the performance of the commands entered during the switch.
Suppose we're switching from transmit-first to Lazy. This switch in general takes time. So computers may temporarily use a mix of the old and new policies. Now, this isn't a semantics issue, because the scheduling policy doesn't determine which computers receive what commands, but it is a temporary performance issue. The
reason is that the analytical model never predicted the performance for this mix of
scheduling policies.
So we could temporarily degrade performance, not improve it. Fortunately, eventually
all of the computers switch to the new policy, and we get the performance predicted by
the model.
Now, so far I've shown you how I can basically choose the scheduling policy
automatically, but on single cores. What about multicores? Well, if we have multiple
cores available to carry out these tasks, and the local response times are more important
than the remote response times, it makes sense to carry out these tasks in parallel on separate cores in what I call the parallel policy.
If remote response times are more important, I'll use the same policy because in either
case the processing or transmission tasks don't affect each other, and I'll get the optimum
response time.
Now, the interesting question is, okay, can I parallelize the processing task. In general,
processing task is defined by the application, and it's a black box [inaudible]. We don't
understand its semantics, so we cannot parallelize it explicitly. So I don't do that.
Well, what about the transmission task? My framework defines the transmission task. It
can obviously use multiple cores to transmit in parallel. But there's no actual response
time benefit to doing so, remote response time benefit to doing so.
A network card can't keep up with even a single core, let alone multiple cores. And there's another reason why you should be careful about parallelizing the transmission task: it makes predicting remote response times difficult.
In particular, the operating system can schedule the sent calls by the cores in arbitrary
order. So we don't know the order in which the network card is actually going to
transmit, and therefore we can't predict the remote response times. So I don't do this
either.
Now, I'll just briefly say that my analytical model predicts, simulations show, and
experiments confirm that if you have multiple cores, regardless of what the user's
requirements are, run the parallel policy.
So those are my scheduling contributions. Let's briefly look -- I'm kind of running out of
time. Let me briefly go into multicast and the processing architecture, and then I'll go to
my contributions.
As I said, with multicast, the transmission is performed in parallel. And basically this
parallel distribution can result in quicker distribution of commands, which should help
response times.
On the other hand, multicast paths are longer than unicast paths, so the added network
latencies can actually hurt response times. So multicast may improve or degrade
response times.
Another interesting issue with multicast is that the traditional multicast schemes don't
consider all collaboration parameters, namely the scheduling policy. So suppose we're using the process-first policy. And let's look at the remote response times, in this unicast and multicast scenario, of the user down here in the corner.
With unicast the remote response time includes the processing time on only the source
and destination, while with multicast it includes the processing time of all of the
computers on the path. So this is yet another reason why multicast can hurt response
times. But I know what you're thinking. You're thinking just use transmit-first policies,
and that's supposed to help remote response times. Well, let's see what happens there.
With transmit-first, unfortunately, since each computer may also have to forward commands, it has to forward before it processes. And if this transmission time is high, the response times can actually be worse than when we use the process-first policy.
So, again, it's yet another reason why multicast can hurt response times, and therefore the
conclusion is we have to support both unicast and multicast in these systems.
Unfortunately, traditional collaboration architectures couple the transmission and
processing tasks. Remember, the masters pretty much perform all of the communication.
So we have to basically decouple these tasks in order to support multicast, and I did that
in what I call the bi-architecture model of collaborative systems, in which the processing
architecture dictates the processing tasks each computer must carry out, and the communication architecture dictates the transmission tasks that each computer must carry out.
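A small sketch of the bi-architecture idea in code (my own representation, not the system's data structures): each computer's processing role comes from the processing architecture, and its forwarding list comes, independently, from the communication architecture.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ComputerRole:
    runs_program_component: bool                           # set by the processing architecture
    forwards_to: List[str] = field(default_factory=list)   # set by the communication architecture

# Example: centralized processing on "p4" combined with a multicast tree in which
# both "p4" and "netbook" forward commands (machine names follow the earlier scenario).
roles: Dict[str, ComputerRole] = {
    "p4":        ComputerRole(True, forwards_to=["netbook", "mobile"]),
    "netbook":   ComputerRole(False, forwards_to=["latecomer"]),
    "mobile":    ComputerRole(False),
    "latecomer": ComputerRole(False),
}
```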
I'm not going to go too much into the results, but let me show you that this -- there's an
interesting issue that happens with the self-optimizing system when it switches
architectures.
Again, the switch takes time -- switching the communication architectures. So what do we do during the switch? We can't simply turn the old architecture off, because there may be messages in transit. So what we do is during the switch we simply use the old
architecture, deploy the new one in the background, and when every computer has
deployed the new architecture, we switch to it.
Now, commands entered during the switch are going to use the old architecture, so
therefore they may experience poor -- that is, old -- performance.
Now, these two architectures may have to run in parallel for a while, because there may
be still messages in transit, even when the new architecture is deployed. But eventually
we can just get rid of the old architecture and switch to the new one.
Okay. And let's quickly look at the processing architecture.
Like I said, I focused on replicated and centralized architectures. Which one should be
used to favor local response times? Well, it makes sense that the replicated architecture
should be used. The reason is that the local response time doesn't include the cost of
remote communication.
Chung has shown experimentally that this intuition is sometimes wrong: a centralized architecture can sometimes give better local response times. So maybe we should always
use experiments and then based on the experimental data decide what to do.
Unfortunately, experiments are situational, and there are infinitely many collaboration
scenarios. It's not practical to gather data for all of them. So what we really need is some
sort of an analytical model. And then of course a system that can automate the switch.
Again, there's an issue of commands entered during the switch as with scheduling policy
and communication architecture. In particular, a new master may suddenly receive an
output. Now, this is not consistent with the notion of centralized and replicated
architectures.
So one way to combat this is to simply say I'm going to pause input during the switch.
Now, while this is a simple approach, it hurts response times because any command
entered during the switch is going to be delayed until the switch is done.
So Chung actually showed an interesting way of running the old and new configurations
in parallel, kind of like I run my old and new communication architectures in parallel. And the benefit is that it doesn't hurt performance. But it's not really simple.
So since my focus wasn't really on optimizing the time it takes to do this switch, I was
just focusing on showing that the switch is beneficial. I use a simple approach.
And if we really think about it, it may not actually hurt response times. I mean, we like
to stretch and we like to get a drink of water, go to the bathroom. So if we use these
break times to actually perform the switch, the performance won't actually be hurt.
Okay. And remember that Duke scenario I gave you guys at the very beginning, the
Candidates' Day scenario. So suppose now that the professor is using the P4 desktop,
one student is using a next-generation mobile device, and then a latecomer comes in with
a Core 2 Duo.
Let me show you -- the result I basically showed you earlier, the performance
improvement, is what my system did in an actual experiment for that scenario. We
started off with a centralized architecture in which the next-generation mobile device is
the master. The system took some time to measure the powers of these machines and
other parameters and then said, oh, you should really switch to the centralized architecture in
which the P4 is the master.
At some point here the latecomer joined, and shortly after, as soon as the system was
basically able to measure the power and the other parameters related to the latecomer, it
switched to the centralized architecture in which the latecomer's computer is the master.
And that's how the performance actually improved.
Okay. So -- yes.
>> [inaudible] go to Duke?
>> Sasa Junuzovic: I hope he went to UNC. I don't know.
>> [inaudible]
>> Sasa Junuzovic: This wasn't actually done during Candidates' Day. But it was
an experiment. It wasn't a simulation. I took a log of checkers played by multiple
users against the computer, and I replayed that log under various conditions with and
without my system. And then I looked at the performance differences.
So it's an actual experiment, just not -- I just replayed a log because you can't really rely
on the users to repeat everything they did every time if they do it multiple times.
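The replay methodology can be sketched roughly as follows; the log format and function names are made up for illustration, but the idea is that every configuration sees exactly the same command stream with the users' original think times preserved.

```python
# Rough sketch of log replay: feed the same recorded command stream to the
# system under each configuration so runs are comparable. The log format and
# the send_command callback are hypothetical.

import time

def replay(log, send_command):
    """log: list of (offset_seconds, command) pairs recorded from real users."""
    start = time.monotonic()
    issued = []
    for offset, command in log:
        # Preserve the users' original think times between commands.
        delay = offset - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        send_command(command)                 # inject into the shared application
        issued.append((command, time.monotonic()))
    return issued                             # response times measured elsewhere
```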
So let me summarize my contributions. I focused on this window of opportunity in
collaborative systems. And I studied three important performance factors: collaboration
architecture, multicast, and scheduling policy.
In the process I was the first to evaluate the impact of multicast on response times. And
in the process I introduced this new bi-architecture model because the traditional collaboration architectures don't support multicast.
I also studied the impact on response times of scheduling policy and came up with this
Lazy policy that combines psychology and scheduling results to give us interesting
benefits.
I also developed a model that can predict the response times for any combination of these
factors. Now, this model by itself is useful. You can give it to users who are locked into
a particular configuration, say a centralized architecture. And the users can use the model
to decide who should be the master.
It's also helpful to users who have a choice of configurations, say the centralized and
replicated architecture, to decide which architecture to use.
And for users who have some sort of system, like mine or the one developed by Goopeel Chung,
that can dynamically switch architectures, they can use the model to decide when to
make the switch.
Of course manually applying the model is tedious and error prone, so I built this
self-optimizing system that can apply the model automatically and in the process
identified several new implementation issues, some of which you've seen, and there are
many more in my dissertation, but I didn't have time to go into all of them.
And if I combine all of these factors, all of these contributions, I give you the proof for
my main thesis. For certain classes of applications, it is possible to meet performance
requirements better than existing systems through a new collaborative framework without
requiring hardware, network or user-interface changes.
Now, as with any framework, you know, I haven't validated it with all applications.
I don't know whether it improves performance of all applications. That wasn't my goal.
My goal was simply to defend the ideas that the collaboration architecture,
communication architecture, and scheduling policy matter for response times and that it
is possible to automatically and dynamically change them to improve response times.
So to show that I focused on three driving problems: a collaborative board game, a
distributed presentation, and instant messaging. Instant messaging is pervasive,
collaborative board games are popular, and entire industries have been built around
distributed presentation problems. So these problems are important.
More importantly, all of the applications I've used, the instances of these problems -- the
checkers, the PowerPoint, and the instant messaging tool I built -- all of them adhere to all of
the assumptions I made. For example, they all use push-based communication. None of
them had concurrency control, consistency maintenance, or awareness mechanisms. And
they all supported both centralized and replicated semantics.
But the system I've created -- I mean, you could still say it's very complex. I mean,
it's a complex prototype that can dynamically adjust these parameters. So the question is
would I recommend it to software designers as the one-to-replace-them-all system.
Well, it depends on a couple of questions. First of all, is the added complexity worth it.
There's going to be initial cost, deployment, bugs. You name it. And many
applications use only the centralized semantics because they're easier to implement than
replicated semantics. That's not to say that replicated architectures are not used -- for
example, PowerPoint sharing on Live Meeting, I think -- or, no, WebEx -- and for sure
Google Docs and spreadsheets use replicated and semi-replicated architectures,
respectively, so they're possible. And then to answer my own question, yes, the
complexity is worth it if the performance of current systems is an issue.
The other question is does this window of opportunity exist that I talked about. Well, it
does in multimedia networking. If the bandwidth utilization is high but not too high, we
can do some clever things and improve performance from poor to good. While I also believe
it exists in collaborative systems, further analysis is needed to verify that claim.
Now, if it exists now, will it exist in the future? Okay. So processing powers are going
up, which means that processing costs are going down, which means that the choice of
processing architecture won't matter. I don't quite believe that. And the reason is that as
processor powers have gone up, so have the processing costs. We simply demand more
complex applications. And this has been true for at least 30 years now.
And as proof I offer you the fact that if I do an edit operation in a powerful editor -- something
like Word -- it can still take a long time to actually get the result. So therefore I believe the
processing architecture is going to continue to matter.
What about communication architecture? I mean, network speeds are going up, therefore
transmission costs are going down. Choice of communication architecture will not
matter.
I don't think that's the case either. In particular, the transmission costs are going to go up
because, again, the demand for more complex applications comes into play. If you
think of, for example, telepresence applications coming up, we want HD video now of
the other people, and that's going to require a lot of bandwidth. And cellular networks
are still slow. And they're not getting that much faster anytime soon, for many reasons,
one of them being power consumption, because faster network connections drain more power.
So I believe that the choice of communication architecture will stay important.
And what about the scheduling policy? Okay. We're in the age of multicore. And my
results say that whenever you have a multicore computer, use the parallel policy, so why
is the choice of scheduling policy important?
Well, multiple cores usually also mean more power consumption, which is one of the reasons why at
least most cell phones, and maybe all PDAs and netbooks, still use single cores.
Also, even if you have multiple cores, let's come back to that high-definition video
example. Just last summer, I worked on a project here at MSR in which
there was an application that was supposed to render HD video
of two remote parties. And on a quad-core machine, it was having trouble. So maybe
most of the cores will be needed to render the video, perhaps leaving only one core to do
the rest of the things, in which case the scheduling policy is still going to matter.
Now, in addition to these contributions, I've made several others. The simulator I've built
is very important because large-scale experiments are difficult to set up, while
large-scale simulations are not. And, in fact, you may not have the resources to do
large-scale experiments. Well, you don't need many resources to do large-scale
simulations.
The simulator I've presented is sort of like the network simulator used by networking
researchers. And it's the first such system for collaborative systems.
I also have a couple of teaching contributions. For example, the students can use the
simulator to learn the impact of collaboration architecture, multicast, and scheduling
policy on response time, and then they can actually use the system to experience the
difference.
Now, in addition to these -- most of my dissertation contributions were theory and
systems oriented -- I've also made a number of application contributions, basically
user studies, when I was an intern with MSR.
Early in my MSR career I studied awareness issues in multiuser editors, and I found that
users would like to know when others are going to make conflicting operations and when
others are reading what the users are writing. So to provide the information to the users, I
created these awareness mechanisms based on the real-life metaphor of shadows. And a user
study evaluating these mechanisms showed that they can help users coordinate their
activities.
The summer after I returned to my roots and I made a collaborative framework for
interactive Web applications, and this was actually my only systems-oriented work at
MSR.
The summer after that I studied awareness in meeting replay systems. I found that -- or
we found that users don't want to know only who the current speaker is, they also want to
know who the speaker is talking to, looking at, and whether there are any side
conversations ongoing.
And this information you usually get from the spatial relationship of the participants in
the room. So, therefore, I built this 3D interface that positions the videos of all the
participants in a way that preserves that spatial relationship. We did a user study and
found the system is better at providing that information, though it had some issues -- it was a little
confusing with all the rotations. But, more importantly, based on the user study, we
found requirements and recommendations for all future meeting viewer systems.
And then just recently I focused on telepresence. In the summer of 2008 Kori and
Zhengyou and I studied window layouts in multiway telepresence applications that also
share applications. Based on a couple of user studies, we found that users prefer certain layouts while they dislike
others, and we've proposed several guidelines for
the future design of these systems, for what kinds of layouts to use.
Just recently I studied how people can catch up on parts of meetings they missed in real
time even while the meeting is ongoing, and we evaluated several mechanisms: text,
video, and audio. And there's also data currently being published.
But let me just then finally -- oh, these are [inaudible] publications, patents, several
patent applications, and a tech doc, so it was a very successful time while I was here.
But let me tell you just briefly about my future work. Based on everything I've done, I've
truly started to believe that we need to look at performance from the user requirement
perspective. So a lot of systems today are built as: I have all this equipment, let me give the
best performance I can.
But I really think, based on what I've seen, that it's important to see what the users
exactly want and then optimize the way the system provides them the data in a way that
basically optimally meets their requirements.
But I propose several useful -- potentially useful criteria, such as important users and
[inaudible] local or remote response times. What do the users really want? I don't know.
And user studies need to be done to discover that.
It's also going to be interesting to adjust configuration of a system based on user activity.
Okay. So, for example, facial expressions or attention level. If you think of something
like Second Life, when the user is like this in the world, that means they haven't done
anything for a while and they're probably not at their desk, therefore we can sacrifice
their performance to improve the performance of others.
And if you look at facial expressions, say in telepresence applications, if someone starts
to frown or scratch their head or something, it could be sign of confusion. And one way
that confusion could arise is because whatever I'm saying, it doesn't seem to be in sync
with whatever is appearing on the screen. Therefore maybe we should -- if that user is
important, we should reorganize the system to improve the performance to that particular
user.
Now, speaking of these other applications, I think it's really important to continue to
extend my work to other scenarios, something like virtual worlds.
In something like today's Second Life, which is fairly popular, performance issues arise
with as few as ten users in a single location. And people want to do conferences and
group meetings in Second Life. So there's something that needs to be done.
Now, Second Life is -- it's a centralized architecture, so perhaps we can change that. But
we can maybe then [inaudible] deploy multicast and change the scheduling policy now
that everybody has to transmit and process at the same time. Not just the [inaudible] but
everybody has to.
And it's a step closer to solving these issues, the performance issues that arise.
Now, of course it's important to look at -- I think it's important to apply some of these
ideas to telepresence meetings. So in telepresence you have HD video and perhaps a
CPU-intensive application and the two are going to hurt each other's performance.
So there's a video quality and response time tradeoff, and it would be
interesting to look at aspects of the Lazy policy. For example, to trade off unnoticeable
response time increases for perhaps noticeably better video quality. Or vice versa. In
fact, it's not quite clear. I mean, in all videoconferencing systems we have results that say
audio is the most important, text is the -- or the application is the next most important,
and video, if you can do it.
But with these telepresence things, high-quality video is becoming really important. And
therefore we have to be able to either provide it or sacrifice something else to maintain
the quality that's there, and maybe this Lazy approach will work.
Again, telepresence meetings, for example, if I think of -- this just came to me recently,
this idea when I thought about doing meeting replay, while the meeting is ongoing, one
of the ways people can review what happened in the meeting is if there's an automatic
speech recognizer somewhere.
Now, these ASRs -- they're called ASRs -- are fairly CPU intensive, or can be. So the
question is where does it run. Does it run on each user's machine? That takes some
processing power, but then all the machine has to do is transmit text to everybody else.
Or do we run the processing on some fast, fast machine, but the tradeoff there is that even
though, yes, it can be faster and take less time, now suddenly everybody has to send
audio to that machine instead of everybody sending text to each other.
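A back-of-the-envelope version of that tradeoff, with purely illustrative numbers and names, might look like this.

```python
# Back-of-the-envelope comparison of the two ASR placements discussed above.
# All numbers are illustrative assumptions, not measurements.

def local_asr_cost(local_recognize_ms, text_bytes, uplink_bytes_per_ms, peers):
    # Recognize on my own machine, then fan out small text messages.
    return local_recognize_ms + peers * (text_bytes / uplink_bytes_per_ms)

def central_asr_cost(audio_bytes, uplink_bytes_per_ms, fast_recognize_ms,
                     text_bytes, peers):
    # Ship audio to one fast machine, recognize there, fan out the text.
    return (audio_bytes / uplink_bytes_per_ms) + fast_recognize_ms \
           + peers * (text_bytes / uplink_bytes_per_ms)

# Example: slow local CPU but cheap text vs. fast central CPU but bulky audio.
print(local_asr_cost(400, 200, 50, peers=9))       # 400 + 36        =  436 ms
print(central_asr_cost(80_000, 50, 50, 200, 9))    # 1600 + 50 + 36  = 1686 ms
```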
So this kind of tradeoff is basically what my system was designed for. Anytime there's a
tradeoff that can be exploited, my system tries to do it. In the scenarios that I have. But
I'd love to extend it to these other applications.
And I'd like to look at mobile devices. In particular I want to design these energy-aware
architectures and scheduling policies. For example, what if I take longer to perform a
task but use less power, and the increase in the completion time of the task is not
noticeable to users? Some people have already started doing this in other
applications. I'd like to investigate it also.
So I think -- oh, simulations and experiments. I want to do a user study with my
simulator to see if it actually helps people allocate resources. So in large-scale online
systems, administrators kind of use rules of thumb or kind of just try to allocate, oh, this
many servers and there will be enough. What if they had my simulator? Could they
make more accurate resource or provisioning decisions?
And, finally, in my work I found that it's almost impossible to do large-scale
experimental testing, simply because the clusters that are available, say, PlanetLab or
Amazon's EC2, they don't give you enough control to make sure that you always use the
same computers, that nobody's sharing them with you and so forth.
Therefore, I'd like to start some sort of collaborative effort that designs a large-scale
experimental testbed, but just for collaborative applications.
And I'd like to acknowledge all of my committee members, and many people at
Microsoft Research, but I only listed Kori and Zhengyou and Rajesh. But many, many
others. And I'd like to acknowledge my funding. And I'm sorry I went a little bit over
time. I hope it's not a big deal. Thanks, everyone.
[applause]
>> Sasa Junuzovic: Yes.
>> So you mentioned video a lot in your talk, and I'm actually wondering, most of your
applications, was the processing [inaudible] actually fairly processing what? I mean, the
talk, sending over text, sending over, you know, changes in PowerPoint are few and
infrequent. Have you thought about really processing-intensive applications? I mean, 3D
games?
The reason I'm saying is that if you think about the processing requirements there and if
you think about running at 30 hertz, which is the standard these days, that means that
your frame comes every 33 milliseconds.
>> Sasa Junuzovic: Right.
>> Which means your delay of 50 milliseconds is not even a remote possibility.
>> Sasa Junuzovic: Right.
>> Right. So have you thought about how you would address something like
that, where you're more bound by the continuous stream of processing and the chance of
interjecting anything is almost not possible?
>> Sasa Junuzovic: So that's a good question. And, like I said, as you pointed out, I
focused on the push-based communication, not quite the streaming-based
communication.
Now, saying the delay is not possible is not quite right, because I can delay every -- if I
just offset all of the frames by a little bit, okay, I'll still get the frames at 30 hertz on my
local machine, right? I can perhaps have an initial delay.
Now, how can I use that initial delay? Well, not all frames require the same processing
power. Not the same amount of processing time. Some frames are simple to process;
some are not. At least that's my understanding. So perhaps sometimes, when I suddenly go into less expensive frames -- let's call them that -- less expensive
frames in terms of processing power, perhaps there's some sort of delay I can interpose
there and say, look, they're not that important anyway, so maybe I can render them a little
bit later.
Now, I don't know enough about signal processing and I don't know -- I don't have any
experience to tell you whether it would be useful, but the delay --
>> The changes could be done in the time frame of a frame?
>> Sasa Junuzovic: Changes like as in reconfiguring the system?
>> Right.
>> Sasa Junuzovic: No, no. No, no. No way. Especially if it's a distributed system. I
mean, just think of latencies that it has to go out. There's no way. All I can say is
eventually it will switch.
>> Yes, I have a question. So with this research were -- was motivated from [inaudible].
>> Sasa Junuzovic: The part I focused on, yeah.
>> Right. So that motivation's really nice. [inaudible] did experiments to show that
people indeed don't really notice the things.
>> Sasa Junuzovic: So that's another good question. And I've relied on previous work to
get those noticeable thresholds. It would be very interesting to see if people -- what I'm
banking on with this system, for example, or these ideas, is that all else equal, if
there's a noticeable performance difference, people will prefer systems that perform better.
Now, whether they'll notice -- whether they'll notice an improvement of performance
during the session, that would be an interesting study. For example, what if you tell the
users you're optimizing performance but you're really not, would they be willing to
tolerate lower performance, right? Or what if you tell them, look, we're working on it
actively, would they be willing to tolerate initially poor performance to improve it in a
little bit.
Now, I haven't done any of these studies -- simply, I didn't have time -- to validate
whether people notice these 50-millisecond differences in my scenarios, but that's an
excellent idea and I'd love to work on that in the future. Yes.
>> So the applications you chose were sort of dominated, I think, not only are they sort of
[inaudible] but they tend to be dominated by human delays as opposed to [inaudible]
delays? Because all those things you have to wait for a person to type, you have to wait
for a person to go [inaudible] talk about the slide first.
I guess -- so my question goes to the question that was talked about before, which is
sort of considering shorter latency type of applications or maybe one where you have to
drag something across the screen so that there's actually like a video sort of component to
that.
And then the other question I guess that kind of goes along with it -- that really wasn't a
question, but was really like do people notice performance shear? So like if some people
in the -- if some people are getting a 20-millisecond delay and other people are getting a
100-millisecond delay, do they notice amongst themselves that something's wrong and
might that influence the perception of the delay?
>> Sasa Junuzovic: Okay. So --
>> [inaudible] absolute delay for everyone?
>> Sasa Junuzovic: Okay. So two questions. Let's go to the first one. Suppose that
there's a dragging motion. So telepointer. Okay. The issue that's going to come up there,
and usually comes up as soon as you think about it, is jitter. How can you
handle jitter?
Well, from my understanding of the networking guys at my school, they said
there's no real analytical model for jitter; we don't know how big it can get and when
it's going to get big and how long it's going to stay. So there's -- in some sense there's not
much I can do when it comes to jitter. But now when it comes to -- and nobody really
can. So it's a constant across all applications.
But now let's think of something like a drag motion. The model I haven't presented can
tell you which architecture to switch to when think times are low. Okay. So
we can say when think times are low, maybe I should fan out my tree a little bit less so
that each computer has to transmit to fewer people.
I should make my tree deeper, I'm sorry, so that the
transmission task is minimal because even though a message comes every 30
milliseconds, processing a telepointer action on Pentium 4 takes a millisecond. So I have
about 32 milliseconds worth of time to do stuff. If I can bound my transmission time to
32 milliseconds, let's say, then everybody will be able to chug along and meet all of the
requirements. Right? That's -- okay.
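A worked version of that budget arithmetic -- the per-child transmission cost is an illustrative assumption -- caps each node's fan-out so its transmission work fits in the leftover time and makes the tree deeper instead.

```python
# Worked version of the 32 ms budget above: limit each node's children so its
# per-message transmission fits in the time left after processing, then grow
# the tree in depth. Numbers are illustrative assumptions.

def max_fanout(frame_interval_ms, process_ms, per_child_transmit_ms):
    budget = frame_interval_ms - process_ms          # e.g. 33 - 1 = 32 ms
    return max(1, int(budget // per_child_transmit_ms))

def tree_depth(num_receivers, fanout):
    # Levels of forwarding needed to reach everyone with this fan-out
    # (the root is the sender; each node forwards to `fanout` children).
    depth, reached, frontier = 0, 0, 1
    while reached < num_receivers:
        frontier *= fanout
        reached += frontier
        depth += 1
    return depth

f = max_fanout(33, 1, per_child_transmit_ms=8)       # 4 children per node
print(f, tree_depth(num_receivers=20, fanout=f))     # reaches 20 receivers in 2 hops
```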
Now, the second question was -- oh. Absolute response time improvement versus -- so
there are several metrics, right? There's local response time, there's remote response
time, then there's the difference in response times. That's another metric. And I'm not familiar with much work that's looked at that.
Although people have speculated that it's important that everybody gets everything at
approximately the same time in order to stay coordinated, this system doesn't do it. But it would be interesting to add algorithms that
optimize configurations based on that requirement.
I know some people do it -- [inaudible] and some of his students did it at Saskatchewan
where they said, I'm going to send a command to a remote person, but I'm going to
delay processing that command locally and see how much I can delay it. Like as
latencies increase, see how much I can delay it so that we don't start making mistakes.
And we don't complain too much. But basically I click something and it doesn't happen
for 200 milliseconds.
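A rough sketch of that delay-locally idea, with an assumed 200-millisecond tolerance (the threshold and names are illustrative, not the Saskatchewan group's implementation), might look like this.

```python
# Sketch of the "delay local processing" idea described above: hold a locally
# entered command for roughly the remote latency, but never longer than what
# users will tolerate. The 200 ms tolerance is an illustrative assumption.

import threading

def local_lag_ms(remote_latency_ms, tolerance_ms=200):
    # Delay local execution by about the remote latency, capped where users
    # would start noticing and making mistakes.
    return min(remote_latency_ms, tolerance_ms)

def issue_command(command, remote_latency_ms, execute_locally, send_remote):
    send_remote(command)                      # remote copies are on their way
    lag = local_lag_ms(remote_latency_ms)
    # Execute locally only after the lag, so local and remote effects appear
    # at roughly the same time.
    threading.Timer(lag / 1000.0, execute_locally, args=(command,)).start()
```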
>> So, sorry, just --
>> Sasa Junuzovic: It was a little --
>> Just to consider like a [inaudible] application some people shouldn't get faster access
to the [inaudible] should be -- try to be even across everybody?
>> Sasa Junuzovic: Right. Very useful scenario. The system is not designed for that at
the moment, although it's not designed against it either. You could plug in a multicast
tree or plug in an analytical model that tries to minimize the difference in absolute
response times.
>> I was really intrigued by sort of using the human cognition threshold. And I
realize it's not the contribution, but it is important for a lot of the decisions that were
made. And kind of building on that question a little bit, you know, could you leverage it
even more? For example, is it linear such that a hundred milliseconds is twice as
noticeable and twice as bad as 50? And if it's not, you might be able to leverage that.
So let's say 80 is not really that noticeable from 60. Well, actually, then maybe you
don't need to be as precise with the clock synchronization. That would be a good way in
which you could leverage that. Do you know, is it linear?
>> Sasa Junuzovic: Again, I can only say what I've seen in other people's results. And
they've measured basically first kind of tolerance levels of various skews and response
times. And I don't think it's quite linear. It's more like -- so from your side, it's more like
that.
And eventually it reaches a threshold where people say I can't tolerate this, I'm not going
to use this system anymore. And somewhere far before that there's something called, oh,
I can notice it, but I may not care.
Now, what the slope of that curve is exactly, I don't know. I think it approximately looks
like that first half of a normal curve of some sort. But you could leverage it probably.
You could say, look, again, it depends on what the user's requirements are. Maybe the
user's function says I'm willing to tolerate response time increases as much as 200
milliseconds as long as some of the guys don't get response times that are better than 100
milliseconds.
So, again, that's going to be incorporated into that function.
>> [inaudible]
>> Sasa Junuzovic: Yes.
>> If you're just sharing text, if you're doing text messages, I suspect a second delay is
not going to really matter. You're still -- the cognitive processing of reading the
[inaudible] is going to power -- overpower your slight delay [inaudible] somebody
presses enter and sends the text.
But if it's -- if you're shooting somebody in a video game, then I think, you know, then
you're talking about 50 to 100 millisecond response time, which is really going to
matter.
>> Yeah, that is. I agree. Although, even with text, if it's a conversation, if it suddenly
feels funky or weird or whatever because -- yeah. But -- yeah.
>> Sasa Junuzovic: Definitely is [inaudible] --
[multiple people speaking at once]
>> So around slide 114 or something, you talked about in your future work about the
kind of novelty of looking at these user-driven performance goals. And yet, you know, if
you talk to the search guys, they are all -- and most kind of server [inaudible] applications
they're driven by quality of service contracts. And those quality of service contracts are
largely driven by the user requirement. So Google or Bing or whatever is talking about
we need to service a query within a millisecond.
>> Or Amazon, right, like they know sales go down by this much more every 10
milliseconds.
>> Exactly.
>> Sasa Junuzovic: So I'm not -- yes, you're right. And I'm not saying this is a new field.
This is not a groundbreaking approach to study in the future. Everything's kind of been
[inaudible] to some degree, and this is no exception, you're right. To a large degree
people have considered, let me see what the users want, and then I'm going to try to
build a system that meets it.
But in some cases that's not really kind of done, especially if you assume that some users
are more important than others. How many systems actually try to optimize the
performance of the important users versus the less important users? I don't know many.
Maybe there are, but I don't know.
You know, you could use maybe the idea of -- I was talking to Kori earlier about this
where in virtual worlds there's this idea of interest. So if I'm interested in something, I
should have good quality of it. And the stuff I'm not interested in I should have -- I don't
care if it's not as -- the best quality in the world.
And it's kind of done with the ideas of foci and nimbi and auras around people. So if my
aura and your aura intersect, then perhaps we're interested in each other. And if we're
not -- or if you're in front of me, then I'm interested in you but less [inaudible] even less
in Kori type thing.
That idea has been kind of bounced around. But, again, it was never kind of -- it was
based on user behavior. And this is a requirement. If the users say what's in front of me,
give me the good quality, that's exactly this kind of approach. But there are not many
systems that have been able to kind of take these arbitrary requirements that the
users have. At least none that I know -- not many that I know of.
>> And also to the point of doing tradeoffs, I mean, you know, any Skype or
communicator, any of them is trying to fit within -- trying to fit the audio-video compression
[inaudible] compression, network streaming and all that stuff within both the network pipe
and the CPU, and so is doing those kinds of delicate tradeoff balancing kind of things.
>> Sasa Junuzovic: They are, yes. That's a good thing.
>> Yes [inaudible].
>> Sasa Junuzovic: Right. Yes. But just to extend that, let's look at that scenario, right?
Suppose that all of us are in an audio -- in a conference. And I'm doing something else
and I don't really maybe care about seeing everybody. In fact, I'm happy with just getting
the text from everybody.
So instead of sending audio and video to me, you can save all that by sending me just the
transcript, auto generated transcript. Or maybe I'm interested in the audio, so send me the
audio, or maybe I have a PDA and all I can get is audio. Maybe I can't get anything else.
How does -- I don't know if systems today automatically decide what to do. They might
say, look, everybody gets poor audio, or everybody gets this much quality video based on
the current set. But, again, this kind of -- on a 1:1 basis, I don't know many that make
these tradeoffs.
And the cool thing is that if you do have somebody important and somebody less
important, you can sacrifice the performance of the less important person to improve the performance of the more important person, right? Nothing else changing. You
can just say, oh, I'll give you a better quality video and Sasa, he doesn't really work here,
give him poor quality video, you know, type thing.
>> [inaudible] how do you know that this guy, this person is not -- doesn't care about
[inaudible] detect that [inaudible]?
>> Sasa Junuzovic: You can use some hints. Yes, it's a problem. Which is why I kind of
[inaudible] to that user response time function. I said you guys tell me what you want,
and I won't try to guess what you want. But I kind of hinted at it here. You can maybe
use attention level or facial gestures to automatically decide if the user needs better or can
afford to have poorer performance.
>> Kori Quinn: Okay. Thank you.
[applause]
>> Sasa Junuzovic: Thank you, everybody.