>> Sharad Agarwal: All right. Thank you for coming. We will get started. It's my pleasure to
have George Porter here. George is from UC San Diego. He got his PhD back in 2005-
>> George Porter: 2008.
>> Sharad Agarwal: 2008, sorry, from Berkeley; and he's spent the majority of his time since
then at UCSD. You know George from his data center work. He's been doing a bunch of work
with a bunch of fabulous students at UC San Diego. He's going to tell us about that. George
also took over pretty much all of the data center research from Amin Vahdat when he
left and went to Google. And George is also helping manage and run the Center for Networked
Systems at UCSD. Awesome. Thank you.
>> George Porter: All right. Thanks Sharad. And thank you all for the opportunity to come and
give you a talk today. It's a real pleasure to be here. I don't think I need to tell this audience
this, but please ask questions during my talk if you have any questions or comments.
All right. So my name is George Porter. I'm going to be talking today about building
resource-efficient data-intensive applications. So it's incredible how much of our lives has
moved online from information gathering, entertainment, collaborative tools, healthcare,
medicine, government, effectively everything we do in some way has moved online; and each
of these applications is fundamentally driven by data. The quality of your user experience using
all of these sites depends on, in some way, the quantity of data that each of these applications
can process. And it's not just that these things are driven by data. It's sort of like if you
think of the way Amazon uses data to build, say product recommendations, Spotify uses data to
build custom radio stations, Bing uses data to build personalized search. And it isn't just that
they're data driven but they’re data driven on a per user basis. So, for example, if we look at
the way Amazon works, when you visit the main sort of landing page on Amazon.com, the page
that gets generated is customized to you. And, in fact, each time you access this page there's
over 100 underlying applications that are doing things like collaborative filtering, consulting
with ad networks, your previous purchase histories, preferences, etc. And each of these
applications is driven by data. And so there's an enormous amount of data processing and IO
that goes on ahead of time, behind the scenes, before you arrive at these pages in order to
generate all the data that's needed to consult during that request so that this content can be
customized to you.
And so all of these different applications and data processing requirements have driven the
need for very large data center deployments. So in order to scale up to meet the needs of all of
these different users companies like Microsoft, Google, Facebook, and others have developed
these large data centers which are warehouse scale buildings housing sort of tens to hundreds
of thousands of servers, storage, networking gear, cooling power, etc.
And the result of these data centers has been an incredible amount of scalability. And so, for
example, Google served out 100 billion searches per month, Facebook has 1 billion active users,
Amazon has, actually is closer to about 200 million users today. And this incredible scalability
result has had to, has been arrived at in an incredibly short amount of time. So if you, actually
last month was the 25th anniversary of the web; and about 20 years ago the first sort of
mainstream web browser was released. And back then really everything was basically small-scale. So Google's first data center fit on a folding card table, and supposedly the first Facebook
server was a sort of run out of a dorm room type environment. And so those numbers that I
just quoted to you, 100 billion searches a month and 1 billion active users, have had to be
developed in about 15 years and 10 years respectively. And I think even at established
organizations and established companies you're going to see similar scaling results for anything
that’s kind of user-facing.
And so these organizations have really had to be driven by a relentless focus on scalability. And
so in order to close the gap between no users and effectively the Internet connected world's
population in on the order of 10 years, everything that's been developed has had to focus on scalability.
So the data centers I described, the applications that run in those data centers, the
infrastructure that underlie all of those applications, the storage infrastructure, all of that is
really driven to be able to grow as fast as possible and effectively grow at any cost. And the
costs are incredible. I don't have to tell you that there's enormous capital expenses in terms of
building each of these data centers. And so the way you can think of that is of course that
every time you want to roll out a new application or every time you want to grow to a new set
of users you have to stamp out effectively one of these billion-dollar buildings.
But they’re also incredibly expensive to operate as well with sort of industry estimates at tens
of billions of kilowatt-hours, kind of industry-wide. And yet, underlying all of these impressive
scalability results lies an enormous amount of inefficiency. So again, industry estimates about 6
to 12 percent of power actually gets translated into productive work. And the question
becomes a sort of why does that gap exist? Why is there all of this inefficiency in terms of
these data intensive applications? And one of the main sources of inefficiency really comes
down to IO. It’s really about input, output.
You can think of this in terms of IO bottlenecks between distributed applications and the
underlying data that lives on the storage layer underneath them, and there's also enormous
amounts of bottlenecks in between nodes sort of in a distributed cluster shuffling between
each other. And these bottlenecks result in servers that end up waiting for data. So this is
referred to as like a pipeline bubble or a pipeline stall wherein one node is waiting for another
node to complete or before it can make progress it has to wait till that data arrives from that
other node. And so this can cause these cascading performance failures wherein large-scale
systems end up spending a lot of time waiting on data.
This can also kind of manifest in terms of requiring a much larger compute and storage
footprint than you would otherwise need if you just looked at the amount of processing needed
to make these applications work. And so what we really need to do is to focus on recapturing
IO efficiency, and this boils down to kind of a very simple application of Amdahl’s Law which is
if we look at data-intensive applications, IO is really the bottleneck, so we need to
eliminate any unnecessary IOs that we can, and for those IOs that are necessary we want to
make sure they are as efficient as possible. So the rest of the talk I'm going to talk about this in
two different domains which I'll get to in a second.
So stepping back for a second, if we kind of look at the last 25 years we've really been focusing
on this goal of scale and achieving systems that are able to scale, and as we kind of pivot and
look towards the future it's important that we develop systems that are able to scale efficiently
in order to deal with growing user populations and growing data set sizes. And so the work that
I'm going to describe falls into these two domains. The first is on IO efficient data processing;
and I'm going to talk about some work that my students, myself, and my colleagues have
worked on in terms of building very large scale efficient sorting systems and using those to
build large scale data processing systems. And the second domain is on the node to node,
getting rid of node to node bottlenecks in the system, so focusing on IO efficient data-intensive
networking. And we've been looking at data center interconnect designs that rely on circuit
switching in addition to sort of more traditional packet switching models and combining those;
and we've been able to show that this approach will allow you to sort of cut the cost of your
network infrastructure by 2.8X and the power by 6X. And I'll go into these in a little bit in a
second. Yes.
>>: Are there some numbers to show IO is kind of a big problem in [inaudible]?
>> George Porter: So are there numbers, is there a quantitative way to show that IO is
important?
>>: The IO is, let’s say the big problem, one of the big problems in the [inaudible]. Because you
can imagine, for example, [inaudible] rapidly-
>> George Porter: Yeah. So that's a great question. There are. So when we sort of, I guess you
could look at this in two different ways, and one of them, like you said, is to make systems
somehow power proportional so the amount of power they draw matches the kind of resource
utilization that they are in. And there’s definitely work in that, and that's an important line of
work to do; but it seems like, at least in today's systems even if you, either things aren't very
power proportional, so there’s a lot of overhead, a lot of costs associated with keeping all the
machines running, and so you're really better off trying to drive as much throughput through
your system as you can and sort of max out all of your hardware. But that's a sort of a more
efficient I guess point in this space. And I hope that in kind of describing this work, I'm going to
go through some quantitative analysis of where some of these bottlenecks are, and I think I'll
answer your question that there is in fact, you can actually see where those bottlenecks are.
Okay. I want to talk first about IO efficient data processing, and I want to start by kind of
defining what I mean when I say data-intensive. I've used that term a couple of times already.
Over the last 30 years or so the definition of what makes something data-intensive has changed
quite a bit. So sort of the mid-80s it might be, let’s say 100 megabytes is a data-intensive job,
and today it's maybe 100 terabytes or even a petabyte. And so over this 30 year time span
there's been effectively a million-fold increase in what we mean by data-intensive. And so over
that time period the types of applications that have been used to solve these jobs has changed
quite a bit. And so today if you talk about data-intensive computing a lot of times what you
mean is, for example, MapReduce. This is a representative example of the sort of a data-intensive framework for doing processing. And MapReduce is actually not a new idea. If
you're a Lisp programmer you've been using it for some time, but if you're not familiar with it I'll
really briefly describe it right now.
So in the MapReduce program you're given this input set of key value pairs and you start by
applying a user-supplied Map function to each of these pairs. You then group the results of that
function application by key, you sort each group, and finally you apply a user-supplied Reduce
function to each of these groups. So if we kind of zoom into what's going on in the
implementation of this programming model we see that the application of the Map function
and the application of the Reduce function, which I'm going to call Map tasks and Reduce tasks,
this is what's called embarrassingly parallel meaning that we can execute these functions
entirely node local without any network communication. So what we are really left with in
terms of building a MapReduce framework is exactly this group-by and this sort operation. And
in a lot of ways this is really the hard part about building these large scale systems because it's
almost exactly opposite of embarrassingly parallel. Generally speaking, data from each of these
nodes has to be shuffled, conveyed, and delivered to each of these destinations, and the
application of these functions really is bottlenecked based on all of these IO’s completing. And
so managing and dealing with all of this IO is an enormous challenge.
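To make the programming model concrete, here is a minimal, sequential sketch of the dataflow just described: apply the Map function, group by key, sort each group, then apply the Reduce function. The word-count Map and Reduce functions are only illustrative placeholders; a real framework runs the map and reduce steps in parallel across nodes and the group-by step becomes the network shuffle discussed above.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Apply the user-supplied Map function to every input key/value pair
    # (embarrassingly parallel in a real framework).
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)

    # Group by key and sort each group -- the step that forces the
    # all-to-all shuffle in a distributed implementation.
    results = []
    for out_key in sorted(intermediate):
        results.append(reduce_fn(out_key, sorted(intermediate[out_key])))
    return results

# Illustrative word count: map emits (word, 1), reduce sums the counts.
def wc_map(_, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return (word, sum(counts))

print(run_mapreduce([(0, "a b a"), (1, "b c")], wc_map, wc_reduce))
# [('a', 2), ('b', 2), ('c', 1)]
```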
And we're not the first people to really identify that it's a huge challenge. Actually, in the mid-80s, the now late Jim Gray wanted to focus people's attention on the importance of the IO
subsystem when building data processing systems. So looking beyond sort of how many
floating point operations per second and starting to look at a holistic view of the system where
you see kind of IO's as a part of this. And so the way that he did this was actually really cool. So
he proposed a contest, a sorting contest, in which the idea was to see who could sort 100
megabytes of data the fastest. And this was great for two reasons. One of them is that we all
learn about sorting sort of freshman year of college, and so everyone involved in these industry
and academic efforts kind of had a good sense of what it meant to sort data. The second
reason this was really cool is that across a variety of different data processing applications we
now have some representative benchmark or stand-in for what the IO performance, what the
resource efficiency of these applications is. So we can kind of do an apples to apples
comparison.
So obviously things have changed quite a bit since the 80s. So by the late 90s this had grown to
a terabyte, a TeraSort record or contest which was won by the Berkeley NOW Project. And then
when we started our project in 2009 the idea was to sort 100 terabytes of data. And so
this was held by the Hadoop open source MapReduce Project that was being hosted at Yahoo.
So now we've got this really great benchmark of IO efficiency and I was talking about how
systems in practice are not efficient. And so now that we’ve got this benchmark let's see how
deployed systems do in practice. Yeah.
>>: So the thing about the [inaudible] is true with the sort, but aren't there many other jobs
where [inaudible] be part of the data since that is the bottleneck?
>> George Porter: Yeah. So that's a great question too. The kind of selectivity of that Map
operation, it can vary quite a bit. Obviously if you’re searching for data in a very large data set
and you're looking for kind of a needle in a haystack the output of that Map will be a small
data set. I guess the thing about sort is that the selectivity of that is one to one. Every input
record becomes an output record, so it's kind of a worst case from the point of view of IO performance. And so for
a lot of jobs that are low CPU to data item ratio it looks a lot like sort. And so that's why we've
been kind of focusing on it is because often times what you're doing is comparing items,
ranking items, things like that that don't require a lot of CPU per item but you do end up having
to convey all this information somewhere else.
>>: [inaudible] some modification for which how much is 1 to 1 [inaudible]?
>> George Porter: So that's a good point too. I'm trying to think, so there are, you can find
evidence of this in various published pieces of work and it varies depending on the
organization. I think the audience in this room would have a much better sense of what exactly
that CDF looks like than I would. So I would love to chat with you about it to the extent you can
talk about it.
Okay. So, great. Now we've got this benchmark of what we mean by resource efficiency and
let's see how well the deployed systems do in practice. So in 2010 these two researchers from
HP labs looked at the results of that GraySort contest and they looked at all the winners, and the
way they analyze that data was as follows, what they did was they took the delivered
performance of each of those sorting systems and they compared it to the inherent capabilities
of the underlying hardware platform and they looked at that difference. And the results were
surprising and incredibly discouraging.
So on average, 94 percent of disc IO was idle, and about a third of CPU capacity was idle. And
this is among the winners. So this is definitely not good. And if we kind of look more
specifically at this 2009 Yahoo result they were able to sort 100 terabytes of data with 3452 nodes
in about three hours, which is quite impressive, but if you actually kind of look at what each
node is contributing to that overall result what you see is that each of the discs in the system is
in some sense running at approximately 1 percent efficiency. So we are in this situation where
we are able to achieve these very impressive results via scalability but we are running at sort of
low efficiency. And remember, these data centers are incredibly expensive to build and
operate, and so the goal that we kind of like to set out is to be able to achieve the same data
set size result in the same amount of time but with effectively [inaudible] magnitude, fewer
resources.
>>: This measurement seems strange to me in the sense like I'm not sure [inaudible] your
hardware, especially in the workload, it seems like you always have some resource that is not a
bottleneck and [inaudible] utilized. So in that sense what is this measure really showing us?
>> George Porter: So what we'd like to get is a fully, perfectly balanced system, right, where
all our resources are balanced with each other so that no one resource is, or I should say if we
were to reduce the amount of any one resource the entire system should slow down. That's
some goal that's implicit in the work that I'm describing. Now, as you point out, there's a lot of
heterogeneous jobs. And so in one set of workloads or one set of jobs you may end up with
one resource as the bottleneck and then some other particular type of job you might end up
with a different resource that’s the bottleneck. I do think though that focusing on storage as
the bottleneck is the tack that we've taken because a lot of systems really are storage IO
limited, and so using this as a way to start solving some of the system’s problems of getting the
storage IO up is one of the goals that we've had. I think that can be good.
>>: When you use storage IO limited you mean that they are a major source of inefficiency in
the sense that it is idling? Is that what you mean by that?
>> George Porter: So, yeah. Either they’re idling or they’re fully being utilized and there's not
enough discs to actually keep the workload, to keep all the CPU’s busy, or third, they're fully
being utilized but there are extra IO's being issued that aren't necessary. That’s another way
you can look at it. So I hope that in some way addresses your question.
>>: Another metric could be [inaudible] sort of [inaudible] storage or CPUs [inaudible] standard
servers scheme for your 100 terabytes for X [inaudible]. Would you still have similar problems?
Because that's [inaudible] hundred percent as long as I optimize for the metric under some
standard to keep them [inaudible].
>> George Porter: So I think what you're saying is sort of that you’ve sort of settled in a certain
sense on a binding of compute, memory, network, and storage, and then you're sort of
replicating this unit to the data set size that you need, and I think that's the way people build
real systems, right? You sort of provision a server model, you kind of scale that out and that's a
cluster that you build, and what I would say is that inherent in making that binding of compute,
storage, networking, and memory you already have an idea of what that balance is between
CPU and IO. Either IO to the network or IO to the storage. So to a certain extent you've already
kind of had this sense that there's some ratio and that's why you build a platform in a certain
way. And so when we started this work, and I didn't describe it in the slides here, but we knew
that the types of jobs we were going to be working on were low CPU to IO and so we wanted
to use servers that had as many discs as possible in them. So at the time we were able to get
machines that had 16 discs, but if we could have had 25 discs that would have been even better. Absolutely.
>>: [inaudible] you said that your goal is to build a systems [inaudible] balance [inaudible]. Any
one of these [inaudible] impacts the same performance. I don’t feel that's a good [inaudible]
because resources don’t cost equal. So suppose this IO was much cheaper than CPU, I would
be happy [inaudible] as long as CPU [inaudible]. So, for example, PennySort was essentially
geared towards students [inaudible]. So why don’t use that as a goal frame [inaudible] data at
the least possible price?
>> George Porter: We actually get to that. I think I'm going to revisit this because one of the
things we were focused on was per node efficiency and this sort contest, you mentioned
PennySort, there's also like a JouleSort contest that we entered as well, and I guess what we
found is that the reason that we've been focused so much on disc IO is that otherwise you end
up with all of this CPU in memory that's sort of waiting on basically on discs. And you need a lot
of discs whenever you have data-intensive applications just because of the capacity issue, and
so what we found was that by kind of driving up the efficiency of that resource we end up
getting energy efficiency. And I'll describe that in just a minute.
>>: [inaudible] define efficiency[inaudible].
>> George Porter: Yeah. So, again, when I talk about the evaluation I’ll mention this is a little
bit, the sorting contest is not just an absolute performance contest. There are these different
categories. And I get to this issue so there is work done per watt and there’s work done per
dollar or for penny and there's simply who can do the work the fastest. And those aren't
always, they don't always lead you to the same system design. A case in point of this that’s kind
of interesting is that for this eco-sort, for this JouleSort contest there’s kind of like two solutions
to this equation, and we've got one of the solutions and Dave Anderson’s group at CMU has the
other solution. And so we've been focusing on if we can build a system that could just handle
raw throughput what we end up with is even though our servers are 300 watts each and we've
got 10 gig networking and stuff you end up with a very highly efficient system. On the other
hand, you can focus on atoms and things like that and get a different solution. It just depends
on your assumptions. Absolutely. So I'll expose that in just a second.
And we've been talking a little bit about this balance issue, and I just want to mention right
now that to a certain extent this project is in the context of a larger set of work on
looking at trying to come up with the right balance of these different resources in terms of
addressing different types of jobs. And one aspect of balance has to do with the data itself, so
in an ideal world we'd like to divide our cluster up into these groups and we would like to
distribute our data in a very uniform way across each of these nodes, and then we've got a
variety of processing elements on each node, which I'm going to represent with these funnels,
which represent in some sense a CPU or disc type resource. And if we are able to kind of very
uniformly distribute all this data on all these resources then all of the data can be processed
uniformly and we don't end up with a lot of pipeline stalls because data gets generated as it’s
needed. And that's a very efficient way to design systems. Of course the real world is not as
kind to us as we would like it to be and data can be incredibly non-uniform.
So if you look at, for example census data, Seattle is going to have a lot more entries in it than
Driftwood, Texas or something like that. And so this imbalance can end up causing some of the
nodes to become bottlenecks which causes this cascading ripple effect. But resources are also
highly heterogeneous as well. So some discs are faster than others, but even if you bought all
exactly identical discs all with the same part numbers, you’re going to end up seeing this very
wide variance in delivered performance based on just the fact that you have so many of these
resources put into a single cluster. And one of the things is that this imbalance, one of the
effects of this imbalance is you end up with not just inefficient IO but actually wasted IOs.
So the thing that's interesting about sorting is that any external, there's this well-known lower
bound which is that any external sorting algorithm requires that you read and write each data
item at least twice in the worst case, and what we say is that any system that actually meets
that lower bound has this two IO property; and that's one of the goals that we set for ourselves
when we started this work.
Now, that imbalance that I just showed on the previous slide can result in extra reads and
writes that aren’t necessary, and this is due to what’s called intermediate data materialization,
meaning that you don't have enough memory to, for example, keep your entire working set in
DRAM and so you end up issuing reads and writes to process that data iteratively. And that's
what we mentioned before about you can have your discs running at 100 percent sort of load
even though you end up with extra IO’s that are cutting that effective load down to something
like one percent. So it's not that in that Yahoo cluster the discs were only being operated at
one percent of the time, it’s just that one percent of their performance got delivered into the
aggregate performance of the system. And just like the data can cause imbalance the
imbalanced discs can lead to this exact same problem for the reason I just mentioned.
So we’d like to restore balance; and we do that in two ways, statically, before the job begins,
and then at runtime. So we are going to borrow techniques from the database community to
sample our data to get a sense of where these partition boundaries are going to be. So this is
research, these are things that databases do all the time, and that's how we figure out what
these partition boundaries are. The key thing is that at runtime we still need to impose bounds
because even if our data has been statically allocated correctly into these partitions, because
the on disc layout of the data can have non-uniform data in it we have to handle that at
runtime, and that's what I'm going to describe in this part of the talk right now.
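As a rough sketch of that static step, something like the following picks partition boundaries by sampling keys and taking quantiles. The sample size here is an arbitrary placeholder rather than anything TritonSort specifically uses, and the real system samples records from discs across the cluster rather than from an in-memory list.

```python
import bisect
import random

def choose_partition_boundaries(keys, num_partitions, sample_size=10_000):
    # Sample the keys and take quantiles of the sorted sample, so that each
    # partition should receive roughly the same number of records, assuming
    # the sample is representative of the full data set.
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]

def partition_of(key, boundaries):
    # Binary search over the boundaries to find which partition a key falls in.
    return bisect.bisect_right(boundaries, key)

keys = [random.randint(0, 1_000_000) for _ in range(100_000)]
boundaries = choose_partition_boundaries(keys, num_partitions=2500)
print(partition_of(42, boundaries), partition_of(999_999, boundaries))
```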
Okay. So we built a 2-IO sorting system called TritonSort that we presented at NSDI 2011, and
it is structured as follows. So instead of many fine-grained tasks processing the data in a divide
and conquer approach we have two phases of operation. So in the first phase, the distribution
phase, we divide our data up into these partitions based on those samples, and then we read all
of our data in parallel and we assign each data item to one of these partitions, and in phase 1
we send it over the network to the node it belongs to and we store it in one of these on-disc
partitions. So at the end of phase one all the data is on the right node and it’s in the right
partition, but each of these partitions isn't sorted. And so in phase 2, in parallel across the
cluster, we read in each of these partitions, sort it in memory, and write it back out again.
We've also sized our partitions so that we can ensure that each of these are going to fit into
memory.
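Purely as a single-node illustration of those two phases (not TritonSort's actual implementation, which is a pipelined, multi-node C++ system), the structure looks roughly like this, reusing the sampled boundaries from the sketch above:

```python
import bisect
import os
import pickle

def phase1_distribute(records, boundaries, out_dir):
    # Phase 1: append each record to the unsorted on-disc partition it maps
    # to. In the real system this step also sends the record over the
    # network to the node that owns that partition.
    os.makedirs(out_dir, exist_ok=True)
    files = {}
    for key, value in records:
        part = bisect.bisect_right(boundaries, key)
        if part not in files:
            files[part] = open(os.path.join(out_dir, f"part-{part:04d}"), "ab")
        pickle.dump((key, value), files[part])
    for f in files.values():
        f.close()

def phase2_sort(out_dir):
    # Phase 2: read each partition (sized to fit in memory), sort it in
    # memory, and write it back -- one read and one write per record.
    for name in sorted(os.listdir(out_dir)):
        path = os.path.join(out_dir, name)
        items = []
        with open(path, "rb") as f:
            while True:
                try:
                    items.append(pickle.load(f))
                except EOFError:
                    break
        items.sort()
        with open(path, "wb") as f:
            for item in items:
                pickle.dump(item, f)

# Usage, with hypothetical records and the boundaries computed earlier:
# phase1_distribute(my_records, boundaries, "/tmp/two-phase-sketch")
# phase2_sort("/tmp/two-phase-sketch")
```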
So to see that in action we start by reading a buffer of data off an input disc. We have a process
that's assigning it to these different partitions and copying it into in-memory buffers designed
for the different destination nodes, and then when these buffers get full we have some code
that sends them over the network to the node they belong to. Then on the receiving side as
data arrives we append it to a variety of these on-disc partitions. And just to give you a sense
of the numbers involved here, we've got eight output discs on our machine and each disc has
about 300-400 of these partitions on it. So you can think of it as 300-400 on-disc files that store
the data. In phase 2 we're going to read one of these unsorted partitions into memory, sort it,
and write it out.
So for the bulk of that pipeline, the details are in our paper, and it's actually pretty
straightforward to implement; the real complexity of the system is exactly this partition
appending module, because we have to ensure that we're writing out data to these discs in
large enough batches that the discs deliver good performance. And so I'll just describe very
briefly how we do that now.
So this module is given as input, on the left, a buffer of these key value pairs, and on the right
we've got a set of discs, each of them is holding a couple hundred of these partitions and
there's a thread for each one of these that's ready to write out the data to the disc.
So the first thing we did was we implemented the kind of most straightforward way of doing
this which is to sort of scan through this buffer of key value pairs and rely on the operating
system to deliver and manage the IO for us. So we just issued writes or scattered writes or sort
of the fancier writes. The result was that the system had low performance. And
the reason for that was really just due to the fact that there wasn't enough buffering handled
automatically by the OS to ensure that the writes getting delivered to these discs were
sufficiently large to run near their sequential speed.
So what we did was we sort of scrounged up as much of the memory as we possibly could,
about 80 percent of the memory on each node which was 20 gigabytes, and we managed all
the buffering ourselves. So we divided this memory up across all our different partitions, we
copied data into these partitions, and when they get full we write them out to disc. But I
mentioned that there's this non-uniformity of the input data. And so what ends up happening
is these partitions are either really hot or very cold. And so taken as a whole our memory was
not particularly well-utilized. So the result was that our writes were not particularly
large.
So what we ended up doing was building a load balancer that ran at runtime in front of our
discs, and it works as follows: we took that same 20 gigabytes of memory and now we divided it
up into 2 million little ten kilobyte buffers that we stick in a memory pool, and so as data starts
arriving from the network we basically are going to copy it into these little buffers and stick it
into a data structure here. And the way this data structure is organized is that we have a row
for each of our 2500 or so partitions, and each of these rows which corresponds to one
partition, what we're going to do is we grab a buffer, put our data in there, then we add it to a
list or a chain of these buffers per partition. And the nice thing about this data structure is that
as partition’s popularity varies during a run even on short timescales, we can extend some of
these chains to become longer and then some of the less popular partitions have shorter chains
but none of our memory is dedicated to data that isn't actively being used.
In parallel with this there is a process that's constantly scanning this data structure and it's
looking for the longest length chain which represents at that instant in time the largest write
that we can issue to the disc at a given time. So what we do is once we find this we pull it out
of this table, we send it off to this thread which is going to write it off to the appropriate on-disc
partition, and then it’s going to take all of these buffers, add them back into the pool, and this is
going to be the mechanism that we use to push back pressure to the
producing side of this pipeline.
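A simplified, single-threaded sketch of that chained-buffer structure follows. The buffer count and size here are placeholders (the talk mentions roughly 2 million 10 KB buffers), and the real system runs the scanning process and the per-disc writer threads concurrently rather than calling them inline like this.

```python
from collections import defaultdict, deque

class ChainedBufferWriter:
    def __init__(self, num_buffers, buffer_size):
        # A shared pool of small, fixed-size buffers.
        self.buffer_size = buffer_size
        self.pool = deque(bytearray(buffer_size) for _ in range(num_buffers))
        self.chains = defaultdict(list)  # partition id -> chain of filled buffers

    def append(self, partition, data):
        # Copy incoming data for a partition into pool buffers and link them
        # onto that partition's chain. An exhausted pool is the back-pressure
        # signal to the producing side of the pipeline.
        for off in range(0, len(data), self.buffer_size):
            if not self.pool:
                raise BufferError("pool exhausted: apply back pressure upstream")
            buf = self.pool.popleft()
            chunk = data[off:off + self.buffer_size]
            buf[:len(chunk)] = chunk
            self.chains[partition].append((buf, len(chunk)))

    def write_longest_chain(self, write_fn):
        # Pick the partition with the longest chain -- the largest write we
        # can issue at this instant -- write it out in one go, and recycle
        # the buffers back into the pool.
        if not self.chains:
            return None
        partition = max(self.chains, key=lambda p: len(self.chains[p]))
        chain = self.chains.pop(partition)
        write_fn(partition, b"".join(bytes(buf[:n]) for buf, n in chain))
        for buf, _ in chain:
            self.pool.append(buf)
        return partition

# Example: accumulate data for two partitions and flush the hotter one first.
w = ChainedBufferWriter(num_buffers=1024, buffer_size=10 * 1024)
w.append(7, b"x" * 100_000)
w.append(3, b"y" * 20_000)
w.append(7, b"x" * 50_000)
print(w.write_longest_chain(lambda part, blob: None))  # -> 7
```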
Now this handles non-uniformity in the actual input data, but I mentioned resources can be
non-uniform as well. And so imagine that we have a couple discs that are slower than other
discs. This same data structure actually handles the problem without requiring any
modifications. The process that is constantly scanning for these chains is only actually looking for
the subset of chains that could be issued at that given time. So what that means is if you have a
slower disc, the chains behind it are going to build up and you're going to end up issuing larger
writes to it, which is going to help mitigate this non-uniformity a little bit.
Okay. So we’ve looked a little bit of how we handle IO in TritonSort, and I want to talk about
our evaluation. So when we began the project, in terms of the hundred terabyte GraySort, the
Hadoop MapReduce project had been able to sort data at 0.578 terabytes per minute, and then
with TritonSort in 2010, 11, and 12 we were able to sort at 0.725 terabytes per minute using a
cluster of 52 nodes and it was just based on these issues that I just talked about. Now, as is the
case in any particular type of contest, eventually your record is taken back again. And at least
in terms of 100 terabyte GraySort, Hadoop was able to run at 1.42 terabytes per minute on
2200 nodes in 2013 with a much more recent version of Hadoop. There's some other
categories I didn't describe here like the Indy benchmark which you guys took back
from us. And so we're working I guess to see if we can retake that.
But I mentioned that it's not just raw performance that we were really interested in. The point
of this project was to focus on resource efficiency. And the community identified that as a
really important metric as well, and so in 2010 they added a JouleSort category which exactly
captures this eco-efficiency and we were able to capture that in 2011, 12, but also maintain it in
2013. And the reason for that is because even though we were beat out in terms of absolute
performance, if you just look at the quotient of the amount of work we did divided by the number of
nodes, we are able to push about two orders of magnitude more throughput through each of
our servers. So that's why we were able to keep that kind of performance.
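To see where that roughly-two-orders-of-magnitude figure comes from, here is the per-node arithmetic using only the throughput and node counts quoted above; this is just a sketch of the calculation, not numbers beyond what the talk states.

```python
# Per-node sort throughput implied by the figures quoted above
# (terabytes per minute, number of nodes).
results = {
    "Hadoop (Yahoo, 2009)": (0.578, 3452),
    "TritonSort (52 nodes)": (0.725, 52),
    "Hadoop (2013)": (1.42, 2200),
}
for name, (tb_per_min, nodes) in results.items():
    per_node_gb_per_min = tb_per_min * 1000 / nodes
    print(f"{name}: {per_node_gb_per_min:.2f} GB/min per node")
# TritonSort works out to roughly 14 GB/min per node versus about 0.17 for
# the 2009 result and 0.65 for the 2013 result, i.e. roughly 80x and 20x
# more throughput per server.
```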
Now I want to briefly mention fault tolerance at this point because any system you build has to
be fault-tolerant. I don't have a lot of time to go into the details here, but what I would say is
that the fault tolerance approach that you adopt really depends on the failure assumptions that
you have in the system. So if failures are really common you want a pessimistic approach to
fault tolerance, and if failures are rare you want an optimistic approach.
The [inaudible] pipeline I described relies on very aggressive pipelining. So we are not
materializing data at all whereas something like Hadoop requires materializing intermediate
data in the common case. And so what I would say is that we looked at some published results,
for example from Google in 2010, and kind of what you see is that at sort of a 10,000 node
cluster size you're seeing failures like every two minutes or something like that and so it's really
important that you actually want tasks, you want jobs, to be able to
survive individual faults. So materializing job state is probably a really good idea. We actually
talked to the people at Cloudera and the average Hadoop cluster sizes are order of something
like 30-200 nodes. And even adding an order of magnitude to this number of nodes you end up
with failure rates in this sort of double-digit hours to hundreds of hours timeframe.
And so here what we are going to argue is that it's actually okay to do job level fault tolerance
where you re-execute jobs on failure as long as the performance improvement you get by
running without fault tolerance is high enough to overcome the occasional job re-execution.
There's no hard and fast rule here, and I think there's sort of a dividing line depending on
exactly what your failure rate is, but this is something that we've been looking at. In the future
work that we've been, actually Alex Rasmussen, who is the lead student on the TritonSort
project, is we've been looking at taking trace data and selectively re-executing parts of
the pipeline that have failed or that parts of the pipeline that depend on a failure to mitigate
the cost of re-executing the sort. Maybe we can talk about that off-line. So, yeah.
>>: How much of your gains in terms of I think efficiency come from resource balance versus
fault tolerance? Like compared to your competitors?
>> George Porter: Well, if we look at Hadoop, for example, there is one extra materialization
that happens after the Map task which we get rid of. But actually there are three different
places in the pipeline where materialization happens, and two of those places are actually just
due to data skew. So what I would say is effectively a third of the IO we get rid of is due to fault
tolerance and two thirds is due to resource imbalance and data set imbalance.
>>: So why doesn’t Hadoop just do this? I mean are there practical reasons why they don't do it
because their developers are jerks?
>> George Porter: No, no, no, no, no.
>>: I guess what I’m wondering is, so you show like all these very nice results and so you would
think the Hadoop guys would say we're just going to nom, nom that, get in there and-
>> George Porter: The Hadoop people are great. I've worked, so I have a patch that's been in
there since 2008 for this and part of the problem is that, it’s actually concerning, in a sense they
are doing this. So there has been a huge move to these in-memory, completely in-memory
data processing sort of applications. And in these cases you're getting rid of effectively all the
data materialization at the expense of more cost. But I think that, so that's kind of one
extreme, sort of getting rid of all the data materialization. I think though that if you look at
actually where Hadoop’s going with projects like Tez, which is like this data flow thing, you're
seeing that they're giving users much more control over what materializations they do. So right
now you kind of fit into the MapReduce model, but if you're doing something like an iterative
job there is now a lot of support for being able to control exactly when those materializations
happen. So I think that's happening.
>>: But just to follow up on this, you take your own skew that you build, you run one job at a
time, so you're really optimized to [inaudible] disc. But if you now have a different cluster with
many jobs of different sizes, maybe not a large sort, how, could this directly be applied, would
you lose some efficiency, would we have to tweak things to make it really work?
>> George Porter: Yeah. So this is great because at a very high level this first part of the talk we
are giving up statistical multiplexing, and what we're doing is we are focusing on individual task
efficiency. So rather than sort of taking lots of different tasks that have heterogeneous
requirements, putting them on a system at a time, and then co-scheduling them, we are
dedicating resources without using [inaudible]. And one thing that I would say is that if you've
got a petabyte cluster and you have a bunch of 10 terabyte jobs you have a lot of opportunities
through doing [inaudible]. But if you're resource constrained, let's say you are a research
group, let's say you're a startup, you are working in biology or something like that, you may
want to solve petabyte scale jobs but you only have the resources to run terabyte scale
clusters. And so the question of is [inaudible] than not [inaudible] is an interesting question
when you’re not resource-constrained. But if you are resource-constrained you can't even ask
that question. So I think that it's good to focus on things like job and cluster scheduling,
obviously if you're doing [inaudible], but it doesn't hurt to also look at can you actually kind of
pull in as much efficiency as you can out of individual tasks when you don't have the
opportunity to do [inaudible].
>>: But it’s not clear how you're actually [inaudible].
>> George Porter: Yeah. I think-
>>: If you can get better-
>> George Porter: Yeah. So just to answer this real quick, if you assume that the compute and
the storage are co-located with each other you don't have a ton of choice in that matter. But if
you separate them, for example like with the Blizzard work from NSDI last week where you
actually get to sort of logically separate storage and compute, you could imagine dedicating a
very tightly connected set of machines to storage, getting full bandwidth of that storage, and
then when that job’s done now maybe an order of magnitude more computers can access that
same amount of storage. So you get late binding on that. I guess it depends.
I don't want to run too long, so I'm going to move a little bit forward. Sorting isn’t all we care
about, we also care about the data processing. So we built this system called Themis, I'm sorry,
so we implemented all of these different applications here and what I want to show, I don't
have time to go into the details, what you see on this graph is the performance of our
MapReduce system where the Y axis is throughput in terms of megabytes per second per disc in
each of our phases, the X axis is all the different jobs that we've done and different levels of
skew. What you see is for the vast majority of these jobs we've pushed our storage
performance to something similar to our record-setting sort performance. Now I said almost all, there is
this Cloudburst example where in the first phase of Cloudburst it is IO bound and so we see that
performance improvement, but the second phase isn't IO bound which kind of exposes this
point we talked about at the beginning of the talk which is that when you get rid of one
bottleneck you can oftentimes push it somewhere else. And a place that you typically push it is
the network.
Now for us this wasn’t a huge problem because Cisco donated one of these big data center
switches to our group and we only had 52 nodes and we had enough ports to give full bisection
bandwidth, but if you've got 1500, 2000 nodes that's not such an easy problem. And that kind of
leads to the second part of my talk which is focusing on the data center interconnect.
So just like applications have changed, the network has changed quite a bit as well, and we've
seen this enormous growth in terms of data rates. So the types of networks that people have
built to address this growth and performance has changed quite a bit. My first exposure to
these kind of networks was in 1994 when I worked at an ISP in Houston and the way
that a lot of networks were built then and even today is as these tree type structures. And
you've got nodes along the bottom and then you have the layers of switching getting
increasingly powerful as you move towards the root. And if you imagine let's say a one-hundred-
thousand-node data center at 10 gigabits a second, that's a petabit per second of aggregate bandwidth
demand.
Now the real problem is that you simply can't, from a technology point of view, buy core
switches that are fast enough to actually handle all of this bandwidth, and so researchers have
actually looked back to the 1940s and taken ideas from Bell Labs kind of in the 40s and 50s and
adapted them for data center designs. And so this is what is called a folded-Clos multi-rooted
tree, and some version of this is deployed in many types of data centers, and this was proposed
in [inaudible] 2008. The key thing here is that we don't have these really powerful switches in
the middle of the network. Instead, if we have is 10 gigabyte a second servers, all of the
switches on our network are 10 gigabits a second. And we get all of that bandwidth by relying
on multi-pathing to deliver an aggregate amount of bandwidth.
So if you have enough links in the network you can load balance and distribute traffic
appropriately to get a higher amount of aggregate bandwidth, and what you've done is traded
off impossible-to-buy switches with a very challenging but solvable-with-money problem of
adding lots of links into this network. So a 64,000 node data center has about 200,000 links in
it. And these links are incredibly expensive to kind of deal with installing them and managing
them. They’re also very expensive in terms of cost. And as we move from 10 to 40 to 100
gigabits a second of Ethernet they're going to get disproportionately more expensive. And the
real reason for that is that we can't rely on the copper cables that we know and love and we
have to move to fiber optics. The reason for that is because of a property of copper cables
called the copper skin effect, which roughly speaking, says that the faster the data rate of the
cable the shorter it has to be.
So at a gigabit you can buy spools of hundred meters worth of Ethernet. The second you go to
10 gigabits you're down to order 10 meters, and at 100 gigabits you’re talking about a couple of
meters in length. And remember, these are warehouse scale buildings and so we have to
overcome this length limitation in some way. And so the way people do that is to rely on optics
which don’t have this copper skin effect. So you can send very high-bandwidth, you can create
very high bandwidth links at very long lengths this way.
Now the problem with optics is that you have to have some way to convert between the
electrical signals that the switch understands and the optical signals inside the fiber and so you
need a transceiver at either end of this cable that has a laser, a photo receiver in it, which is
used to make this conversion. And these transceivers are sort of ballpark 100 dollars, maybe 10
watts at 100 gigabits a second. And I know that several of you would have much better precise
information about the pricing. This is sort of based on external information and papers and
stuff like that. But the point is that they are not trivial in terms of cost, and you need two of
these for each of these say 200,000 cables. So it adds up to a lot of money and a lot of power.
To look at the implications of that, if we imagine 100 gigabit a second multi-rooted tree here
and we look at the path from a given source to a given destination, what we see is that the
packets transiting this path are constantly being converted to and from optics at each of these
switch hops, at each layer of switching from the leaf up to the core and then from the core back
to the leaf. So the implication of this is that for every device attached to the network there’s
roughly speaking 4-8 of these transceivers in the network kind of conveying the traffic for that
device. And so at 100,000 nodes that's like a megawatt of power and tens of millions of dollars
or more.
And if we step back from that for a second I think it's worthwhile asking why are we doing all of
this packet switching? Why are we doing that? And what I would say is that these folded-Clos
networks, the service model they actually provide is that they allow you to make a different and
a unique forwarding decision for each packet that you send in the network. But that service model
is, I'm going to argue, too strong for many data centers; and as a result there’s a gap between
the service model we are providing and the service model we could potentially provide, and
this gap is how we are going to get resource efficiency.
And to say what I mean in more specificity, there's a lot of locality in data centers. And actually
Microsoft has been great about publishing actual results from your networks. And this is a
picture that's reproduced from one of these papers. And it’s a little bit dated at the moment,
but it does show kind of the rack to rack traffic at an instant in time. And although the details
change over time, what you can see is that a bulk of the traffic is going to a relatively small
number of output ports. So there's a certain amount of spatial locality in these systems.
But if we kind of look bottom up as well, we see that there's a lot of temporal locality as well.
And so my student, Rishi Kapoor, published a paper at CoNEXT last year where he
looked at a 10 gigabit server, and he deployed a variety of kind of representative applications
on top of it, and he measured the packets leaving that server at micro-second timescales. What
you saw was that because of all the batching that happens in applications, in system calls, in the
operating system kernel, in the NIC hardware, all of that sort of buffering and batching ends up
translating into tens to hundreds of packets that are correlated in nature. So you tend to see
when servers send large amounts of data from one place to another they tend to do it in these
kind of correlated bursts. And so the key idea behind this second part of the work is to use this
temporal and spatial locality to build cost-effective networks by adopting circuit switching in
addition to packet switching.
So if you're not familiar with circuit switching I'll give you a very brief example. This shows a
one input port, two output port circuit switch; and you can think of this as just an empty box
that has some mirrors inside of it. And light enters the input port, it reflects off these mirrors,
and it leaves an output port. If you want to make a circuit switching decision there are tiny
motors underneath these mirrors that can move them and this changes the angle of reflection
and causes the light to leave out of a different port.
Now this is great. You don't need any transceivers. We're not doing this conversion. And it
supports effectively unlimited bandwidth, meaning as we go from 10 to 40 to 100 gigabits a
second this technology doesn't have to be changed. But circuit switching is an incredibly
different service model than packet switching. And so this isn't just a drop in replacement for
your packet switches, you really have to kind of rethink the entire network stack. And to give
you kind of an example of that I'm going to talk about one aspect of the service model that has
changed which is called the reconfiguration delay.
So the reconfiguration delay Delta is the amount of time it takes to change the input to output
mapping of that circuit switch. And it's, roughly speaking, the time to move those little mirrors.
And this determines how much locality you need for these circuit switches to be applicable in
your network. If Delta is really large it means that it's incredibly expensive to change the circuit
mapping, there's a very high overhead, and so you only ever really want to support very large,
highly stable, long-lived connections that in the networking world we call elephant flows.
Now on the other hand, if Delta is really small, you could very rapidly reassign circuits on short
timescales and so you can support highly bursty, unpredictable traffic called mice flows.
Now I want to point out Delta is not fundamental, it’s a technology-dependent parameter. It
depends on how you build these mirrors and some other aspects of the technology. But it is a
very important parameter that determines this mixture of circuits to packets. Yeah.
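One way to see why Delta matters: if a circuit is held for time T after a reconfiguration that takes Delta, the fraction of time the link actually carries data is T / (T + Delta), so the hold time has to be a healthy multiple of Delta. A quick back-of-the-envelope sketch, using the two Delta values that come up later in the talk; the 90 percent duty-cycle target is an arbitrary illustrative threshold, not a number from the talk.

```python
def duty_cycle(hold_time_s, delta_s):
    # Fraction of time a circuit actually carries data when it is held for
    # hold_time_s between reconfigurations that each take delta_s.
    return hold_time_s / (hold_time_s + delta_s)

# ~30 ms for the 3-D MEMS switch, ~2 us for the binary MEMS switch.
for delta in (30e-3, 2e-6):
    needed = 9 * delta  # duty_cycle(9 * delta, delta) == 0.9
    print(f"delta = {delta:.0e} s -> hold circuits for >= {needed:.1e} s "
          f"({duty_cycle(needed, delta):.0%} duty cycle)")
```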
>>: Did you also account for the delay for measuring the traffic? You have to measure something
and make the decision to switch [inaudible]?
>> George Porter: Yes.
>>: The measuring time can be much longer than this.
>> George Porter: Yes, yes, yes. This is great. So this is talking about the actual data plane.
Now you're talking about a control plane issue about how do you figure out what signals to
send. I'll describe that in a minute, but we started with an observe, analyze, act approach. That
became too slow, and so we ended up with a proactive approach that I'll describe in the talk.
This is an exact problem we dealt with.
Okay. So we have the sense that the majority of the traffic has locality, but not all of it. So the
way you can think about that pictorially is as follows. Imagine that in a network
with N connected devices there are N squared possible connections. So we could rank order all of
those N squared connections by the amount of traffic per connection; and because there's
locality, the picture looks, roughly speaking, like this where a bulk of the traffic is in a relatively
small number of these connections. And so this leads to what we’re saying is a hybrid design
where we are going to rely on both circuit switching and packet switching. So what we like to
do is take the head of this distribution and send it over these circuit switches, and then this
relatively long tail that has a lot of different connections but not a lot of bandwidth we’re going
to send over a less expensive, lower speed packet switch network. And it’s exactly this Delta
value that determines this mixture of packets and circuits.
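As a toy sketch of that head/tail split (the demand numbers and port count below are made up purely for illustration, and this ignores the constraints a real circuit switch imposes):

```python
def split_traffic(demand, num_circuit_ports):
    # Rank the possible (src, dst) connections by offered demand: the head
    # of the distribution goes over circuits, the long tail stays on the
    # cheaper, lower-speed packet-switched network.
    ranked = sorted(demand.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:num_circuit_ports]), dict(ranked[num_circuit_ports:])

# Made-up demand (bytes queued per pod pair).
demand = {
    ("pod0", "pod3"): 9_000_000,
    ("pod1", "pod2"): 7_500_000,
    ("pod0", "pod1"): 120_000,
    ("pod2", "pod3"): 80_000,
    ("pod1", "pod3"): 5_000,
}
circuits, packets = split_traffic(demand, num_circuit_ports=2)
print("circuit-switched:", circuits)
print("packet-switched: ", packets)
```

In a real hybrid network the circuit assignment is also a matching problem, since each circuit port can only point at one destination at a time, so the scheduler computes something closer to a max-weight matching than this simple top-k cut; but the head-versus-tail split is the same idea.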
Okay. So I told you Delta is technology-dependent, and when we started the project we had to
get some sense of what that value was, so we obtained an optical circuit switch that was
developed in the late 90s for the Telecom industry and we characterized it in our lab. And what
we found was that the Delta value is about 30 milliseconds, and what this means is that you
need to keep circuits up for hundreds of milliseconds to seconds or longer to amortize that
overhead. And so it's really only appropriate for very highly stable, long-lived traffic, and the
place you see long-lived traffic in the network, generally speaking, is at the core, where we have a lot of
aggregation. And so this led to the development of the Helios Project which was presented in
[inaudible] 2010, and I showed you that multi-rooted tree before. Imagine we're just focusing
on the core switching layer only and we're going to get rid of most of the packet switches in
that switching layer and we're going to replace them with a smaller number of these circuit
switches; and let’s abstract the rest of the network away into these things we call pods, which
are roughly speaking, about 1000 servers or so. So the servers, the links, the switches, these
are in these pods.
The idea behind the Helios Project was to support this type of an environment. And so what we
do is we start by sending traffic over our packet switches and then there's a process that's
looking for these elephant flows, and whenever it finds one of these elephant flows it’s going to
add updated flow rules down here to move it over to the circuit switch. So this is what you
were talking about, about the time it takes to do that. Because we've got 30 milliseconds it’s
like all the time in the world. So it's not a particularly big deal. So we end up, there's details in
the paper about exactly how we do that, but finding those elephant flows, moving them over to
here, we can achieve all of that in this kind of tens of milliseconds time bound. Yeah.
>>: I feel like there's something of a tension between the two halves of your talk which is that
the first part of the talk is about efficiency, which by definition will try to use all things equally,
and it seems like it's exactly the opposite of what you want [inaudible].
>> George Porter: Yeah, yeah. So there's a commonality which is that this is also getting rid of
[inaudible], if you want to think of it that way; but to get to your specific point, I think that one
of the main things here is that what we're doing here is we are in a sense trying to build a
network that matches the average case utilization even though that average case is rapidly
changing and the set of nodes that need high bandwidth is also rapidly changing. Today, your
only real option is to effectively provision for the worst case. And so in this way we are able
to, as applications, as their communication patterns change we are able to migrate things. But
even in that TritonSort case, if you think about the two phases, in phase 1 we were fully utilizing
the network. But in phase two we actually had no network at all. This model would allow us to
take that resource away and move it to another instance that does need the network. So if we
were over to overlap phase ones and phase twos in two different clusters we could actually
share the network between the two.
All right. So the result of the Helios Project was that we were able to get rid of one of these
transceivers in the core which doesn't seem like a big deal, but it actually represents a very
large cost complexity and power savings because when we looked at this original network we
were looking at 10 gigabit networks and so all of these pods we could entirely interconnect
them internally with electrical cabling. You only really needed optics for these core switch
layers and that's where all the transceivers were. But as we want to start moving to 100
gigabits a second we’re not going to be able to make that assumption anymore because we’re
going to have to start putting optics inside of these pods just because of the length limitation.
So we need to start pushing circuit switching closer to the host into these pods, and that led us
to our second project, which is called Mordia. And the thing is, if we were able to aggregate over 1000 servers with this 3-D MEMS technology, which was relatively slow, then to put circuit switching down at the host we need a technology that's, roughly speaking, 1000 times faster. And so we identified such a technology, which is a different kind of circuit switch device called binary MEMS. It's a little bit different, and what I will say is that the advantage of this binary MEMS technology is that it's very fast. It's about two microseconds, three orders of magnitude faster, but the downside is that it's not scalable. You
can only buy switches that are maybe four, eight ports in size. Yeah.
>>: [inaudible]. Are you making a strong assumption where predictability [inaudible]. There is
[inaudible] keep informing [inaudible] mislabeled mice and elephant you're getting a
[inaudible]?
>> George Porter: Yeah. Implicit in this particular design is the idea that, because Delta is so high, we are only going to consider traffic that's stable for over a second. These little bursts are too fast, because it's 30 milliseconds just to assign a circuit. And so for this design the only kind of traffic that we can actually support is traffic that's going to be stable for order a second or longer. Now with this technology, because we can reconfigure it in two microseconds, we actually only need traffic that's stable for about 100 microseconds to assign a circuit to. And so one of the major points here is that what we mean by circuit traffic or locality traffic depends on the Delta value. And so now we can actually support the bursts of a given server using circuits. But, in getting to this point, we can't do things like measure demand and install flow rules, etc., because we only have a couple of microseconds to do that. So we have to be proactive, and that's what this project deals with. Yeah.
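The arithmetic behind those two operating points is simple; here is a minimal sketch, where the efficiency figures are just the ratios implied by the numbers in the talk:

def circuit_efficiency(stable_duration_s, reconfig_delay_s):
    """Fraction of time the circuit carries data rather than reconfiguring."""
    return stable_duration_s / (stable_duration_s + reconfig_delay_s)

# 3-D MEMS: 30 ms reconfiguration needs ~1 s of stable traffic
print(circuit_efficiency(1.0, 30e-3))      # ~0.97
# Binary MEMS: 2 us reconfiguration needs only ~100 us of stability
print(circuit_efficiency(100e-6, 2e-6))    # ~0.98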
>>: [inaudible] of the flow is very, like a few microseconds. That's going to [inaudible] overhead
on the switches.
>> George Porter: So we don't, so instead of estimating traffic based on looking at packet counters in switches, what we actually do is measure demand by looking at hosts. So we actually look at the hosts, at what their send buffers hold, to see what their demand is going to be in the future. And we can-
>>: And you can collect the statistics from the host and that could also take longer than
[inaudible]?
>> George Porter: So it takes order microseconds, tens of microseconds let's say, to have the
host send this data out. And I don't have time to get to it in the talk, and I actually don't have
slides to this, but it turns out that in this Mordia design you can think of a pair of ToRs, two
ToRs connected to each other, so they only actually have to exchange information on a pairwise
basis. So you don't have to collect this globally, do a decision, and send it back out again.
Maybe we can talk about that off-line.
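As a rough illustration of host-side demand estimation, the sketch below reads the send-buffer backlog of open sockets; the Linux SIOCOUTQ ioctl is one way to do that, and the per-ToR reporting structure is an assumption, not the Mordia implementation.

# Sketch: estimate future demand from how much data is queued in send buffers.
import struct, fcntl

SIOCOUTQ = 0x5411  # Linux ioctl: bytes queued in a socket's send buffer

def send_buffer_backlog(sock):
    """Bytes sitting in this TCP socket's send buffer, i.e. near-future demand."""
    buf = fcntl.ioctl(sock.fileno(), SIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", buf)[0]

def demand_report(open_socks_by_dst_tor):
    """Aggregate backlog per destination ToR; in the real system this report
    is sent out on the order of tens of microseconds."""
    return {tor: sum(send_buffer_backlog(s) for s in socks)
            for tor, socks in open_socks_by_dst_tor.items()}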
>>: Sure. So [inaudible]-
>> George Porter: I have a couple more minutes to go and then we can get to all these
questions. And in fact, there's details in the paper about how we built Mordia, and I'm actually
going to skip over how we did it, but the key idea is that these switches are able to support
multiple wavelengths of light. And by making a copy of the light by tapping some of the light
out of a fiber we can actually replicate the signals across multiple of these stations and each of
these switches can make orthogonal switching decisions. And this is the key idea that we used
to scale up our design. And by adding a variety of these switches into one or more of these ring
networks you're able to support order 600 ToRs or so with this design.
Okay. Now we built Mordia over at UCSD using these switches that we got from this startup, and we connected them to our servers, and we measured the switching time of the composed system, and it is in fact this two-microsecond result. We were able to keep that fast switching time even though we scaled up to, in this case, 24 ports. The key idea here is that with a two-microsecond switch time we only need 100 microseconds of stability for something to be circuit-friendly, and that means we can actually support the traffic of a single server. And this led to our most recent project, which we just presented last Wednesday at NSDI over across the water there, which is exactly building a top-of-rack switch, a hybrid switch that speaks both circuits and packets at the ToR layer.
And this is the premise of that project. So it's very simple. If we've got a 10 gigabit packet
switch network and we overlay a 100 gigabit circuit-switch network into our data center, these
effectively can be put together to build a 100 gigabit packet switch for data center workloads,
meaning that if there is sufficient temporal and spatial locality defined by 100 microseconds
worth of bursts, then you can deliver a service model akin to that of this extremely expensive network using two much less expensive network technologies. And we built REACToR using this as our premise; we built an eight-port REACToR prototype and we hooked it up to our Mordia network, and I just want to show you one of the graphs from that paper and then I'll sort of
conclude.
So the idea behind REACToR is to give the performance akin to a 100 gigabit packet switch but
using circuit switching. So what we did is we deployed eight nodes and we have seven of the
nodes sending data to the eighth node, and this is the view from the eighth node. So what it's
seeing is that the x-axis is time in seconds and the y-axis is throughput, and what you're seeing is
each of the incoming flows is relatively stable, very nicely fair, very uniform, looking like kind of
a very smooth packet switch network, but in reality if we zoom into this at microsecond
timescales what we see is we are actually rapidly multiplexing small bursts of data from all of
these different hosts and delivering them to the end host. The key idea is that we are able to
rapidly multiplex that link fast enough that the transport protocol and the OS don't realize anything's going on. And the analogy here is to process scheduling, where if you just schedule things fast enough nobody notices that they don't have the resource all the time.
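As an analogy-level sketch, with an assumed round-robin policy and slot length rather than the actual REACToR scheduler, time-slicing the receiver's link looks like this:

# Toy round-robin "circuit" schedule: seven senders share one 100 Gb/s
# receiver link in ~100 microsecond slots (illustrative numbers).
LINK_GBPS = 100
SLOT_US = 100                      # burst granularity, per the talk
SENDERS = [f"host{i}" for i in range(1, 8)]

def average_rate_over(window_us):
    """Per-sender throughput averaged over a window much longer than a slot."""
    slots = window_us // SLOT_US
    per_sender_slots = slots / len(SENDERS)        # equal round-robin share
    return LINK_GBPS * per_sender_slots * SLOT_US / window_us

print(average_rate_over(1_000_000))  # ~14.3 Gb/s per sender: looks smooth at
                                     # coarse timescales even though the link
                                     # bursts to one sender at a time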
So the key idea behind this line of work has been to focus on the predominant sources of cost and power in these networks, which are, very surprisingly and sort of counter-intuitively, cabling costs and transceiver costs. And so we built a variety of projects that have dropped that number down to close to one transceiver per host at 100 gigabits a second, and what's nice is that as we go even beyond 100 gigabits a second this same approach should apply as well. And I just briefly want to mention that what I've described thus far has been taking existing building blocks and prototyping with them to build new types of networks, but we are
also wanting to complete that loop to build new building blocks as well. And so we started with
commercial technology, we built some prototype technology, and now we are interested in
building novel devices designed for data center environments. So we are doing that inside of an NSF engineering research center called CIAN, which is about 12 institutions and about 30 PIs, most of whom are photonics people and physicists. And the idea is that, across all of these organizations, we are going back to the fact that a lot of the building blocks we used were designed for the Telecom industry. So what we're doing is kind of unwinding that decision tree back to the assumptions that were made in building optical devices, and instead of targeting them towards the Telecom world, what we are now doing is targeting them towards the data center world, which has a very different set of assumptions. So
there are people in the center that are building new devices, and this is one example that sort
of has come together in the last month and then I'll conclude.
So I mentioned in the Mordia design that we have these binary MEMS switches in the network
that are making switching decisions, and they're basing those decisions on the wavelength of
light that is entering them. Now that's kind of an expensive design actually because these
switches are very expensive. Now what you can do instead is this: there are researchers in the center who have been working on silicon photonic tunable lasers, which sounds very sci-fi but is actually not much more expensive than building regular transceivers today, and the cool thing about this is that by changing the frequency at the source you actually don't need those switches. You can build an entirely passive interconnect network and simply have the sources change the frequencies that they transmit on. And so, through the center, we were able to take that research, send it to a fab and build it onto a chip, and then partner with this company in Berkeley to package that in an SFP module that we could then reinsert into the Mordia switch.
And so that happened about three weeks ago, and so that's kind of what the future work is for
this project which is to lower the cost of that network.
Okay. So in summary, it's important to sort of pivot away from the question of scaling to scaling in a resource-efficient way, and I've talked a little bit about IO-efficient data processing and data-intensive networking. And before I conclude I want to acknowledge the incredible
students that I’ve had the opportunity to work with that have been driving much of this
research. And with that I’d like to thank you for your time and open the floor to questions.
Yeah.
>>: It's good that you got the switching [inaudible] so small, but we’ve also found that if you're
willing to deal with extra traffic hops you could kind of get away without the need for a lot of
switching by [inaudible] a lot of effort.
>> George Porter: Yeah.
>>: Have you used any of that for-
>> George Porter: Yeah, yeah, yeah. So this is things like OSA and other projects that rely on
kind of overlay or multi-hop. It's another degree of freedom. So all of the designs that I’ve
talked about have been effectively either zero hop or one hop depending on how you look at it.
And the second that you can start forwarding traffic through intermediaries, maybe using
things like RDMA or something, it gives you this additional degree of freedom where now you
have a scheduling decision which is, do I wait until I get a circuit assigned to me, or do I sort of
send data to some intermediate point which can then send it on my behalf? And it's actually
quite interesting because nothing that we’ve talked about precludes that, it's just that that
hasn’t been something that we've been focusing on.
>>: You mentioned a few microseconds. That doesn't make sense to really use it for them?
>> George Porter: So this is an interesting point, because what you're touching on is in some sense fundamental here. I think that there's an interesting sweet spot at the kind of 0.1 microsecond timescale, because at 10 gigabits a packet is, you know, 1.2 microseconds or so. And in order to use circuit switching, I should say, we haven't looked at optical packet switching at all. We've really been focusing on circuit switching because it's something that's practical and that can be deployed. And so what you really need is a certain burst size. And if you're talking
about 10 or 40 or even 100 gigabits a second, one microsecond gives you reasonable burst sizes that seem to match well with what servers are able to generate. If we were to push that switching speed much lower we would end up hitting the packet boundary and building effectively a packet switch, which isn't really what we want to do; and if we pushed it to a higher value we would end up needing so much burstiness that it would be very difficult to get servers to generate that. So it happens to be a particularly attractive spot at sort of 0.1 microsecond or so. So we haven't looked at trying to make switching faster.
There are technologies that could do that. There are SOA switches and things that operate in nanoseconds, but from our point of view we'd lose our circuit switching benefit by adopting those techniques.
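To put rough numbers behind that sweet-spot argument, with an assumed 1500-byte packet:

def serialization_time_us(num_bytes, gbps):
    """Time to put one packet on the wire, in microseconds."""
    return num_bytes * 8 / (gbps * 1e3)

def burst_bytes(stability_us, gbps):
    """How much data a circuit-friendly burst must contain."""
    return stability_us * gbps * 1e3 / 8

print(serialization_time_us(1500, 10))   # ~1.2 us: one full-size packet at 10G
print(burst_bytes(100, 10) / 1e3)        # 125 KB burst for 100 us at 10G
print(burst_bytes(100, 100) / 1e6)       # 1.25 MB burst for 100 us at 100G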
>>: So why aren't people building just like, almost like RF-inspired optical switches. Like RF
doesn't need any switching. You could just have the transceiver side just try things.
>> George Porter: So you mean like wireless?
>>: Yeah.
>> George Porter: Well, I would say, I mean the work on say 60 gigahertz wireless in the data
center in a sense needs switching because you're either physically moving things or you're
somehow choosing a different target to transmit to I guess.
>>: So that’s [inaudible]. I was thinking like an optical switch that instead of like, I think part of
the thing it seems to me like this scheduling control plane idea is [inaudible]. It has complexity.
>> George Porter: Yes.
>>: And that complexity is coming from the fact that you have this kind of switching schedule
like no matter how fast it still [inaudible].
>> George Porter: Yes. That's right.
>>: That doesn't show up like in cellular domain or whatnot. Like my phone can't [inaudible] at
any time without actually having like a global schedule across all transceivers-
>> George Porter: Well, you need channel mitigation, so that's one thing. Imagine this network, so
with optics it's like a wireless network where every terminal’s a hidden terminal. So think of it
that way. Imagine a wireless network where every node was hidden. The tunable laser idea
that I just talked about, one of the things that's interesting is that you end up in a situation
where if two ToRs, let's say, choose the same frequency to transmit on, that interference will cause data loss. And so what we're going to do to solve that is a very simple, kind of brute-force approach, which is to create a registry service wherein you opportunistically acquire a channel and then you register with the service that you got the channel; and if two devices end up conflicting on that, one of them will win and the other will stop sending, and we can bound the amount of time involved, like the slot time, so the contention time can be small. These are just techniques from the wireless domain, and then we can code over that so we don't lose data. So this is very much like a carrier sense approach, just in the optical domain. My understanding, and I'm not a physicist or an optics person really, I'm a computer scientist, but my understanding from talking to the optics people, because I asked a lot about things like CDMA, for example, could we do OFDMA or whatever, is that it's not a very promising approach. But that's a little bit out of my domain.
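A minimal sketch of that kind of registry service, with an assumed first-writer-wins rule and made-up API names:

# Carrier-sense-like channel registry: senders opportunistically claim a
# wavelength, then register it; on a conflict the first registrant wins and
# the loser backs off and retries, with coding covering the brief overlap.
class LambdaRegistry:
    """First-writer-wins registry of who owns each wavelength."""
    def __init__(self, num_lambdas=100):          # O(100) channels per fiber
        self.all_lambdas = set(range(num_lambdas))
        self.owner = {}                            # wavelength -> ToR id

    def claim(self, tor_id, preferred):
        """preferred: wavelengths in preference order; return the one granted,
        or None if all of them are taken (caller backs off for a slot time)."""
        for lam in preferred:
            if lam in self.all_lambdas and self.owner.get(lam) in (None, tor_id):
                self.owner[lam] = tor_id
                return lam
        return None

    def release(self, tor_id, lam):
        if self.owner.get(lam) == tor_id:
            del self.owner[lam]

registry = LambdaRegistry()
print(registry.claim("tor3", [5, 7, 9]))   # -> 5
print(registry.claim("tor8", [5, 7, 9]))   # -> 7 (5 is already owned by tor3)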
>>: Just to be clear what he's saying, couldn’t you choose a different lambda? That's what he's
talking about.
>> George Porter: Oh yeah, that's what we do. We pick-
>>: [inaudible] do it that way?
>> George Porter: Yeah. So our approach is basically that you give it a list, in preference order, of the lambdas you want, and it tells you back that you can have three and five, or whatever.
>>: But presumably, there are fewer lambdas than N squared.
>> George Porter: Oh yeah, yeah. Absolutely. So the number of lambdas-
>>: [inaudible].
>> George Porter: Current technology, the number of lambdas is O(100). Think of it that way.
>>: [inaudible]?
>>: Not for N squared but just [inaudible].
>> George Porter: If you want to support, say, 600 ToRs, you'd need to add space switching into that as well. That's what we're going to do. Because you can't really fit more lambdas into a fiber, but you can have multiple fibers effectively, and what you can do is choose which fiber you're going to send data on and then which frequency you're going to send on within that fiber. And that's where I was mentioning that point about how we don't need a fully global scheduling decision: we do need a global scheduler that decides which ToRs are connected to which other ToRs, but once those two ToRs are connected to each other, the specific frequencies that are being used can be negotiated on a pairwise basis, which is a much simpler problem to deal with. Yeah.
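To make that two-level split concrete, here is a rough sketch; the greedy matching and the data structures are illustrative assumptions, not the actual scheduler.

# Two-level scheduling sketch: a global scheduler only decides which ToR
# talks to which ToR; the pair then negotiates fiber and wavelength locally.
def global_pairing(demand):
    """demand: {(src_tor, dst_tor): bytes}. Greedily match the heaviest pairs."""
    used, pairs = set(), []
    for (src, dst), _ in sorted(demand.items(), key=lambda kv: -kv[1]):
        if src not in used and dst not in used:
            pairs.append((src, dst))
            used.update((src, dst))
    return pairs

def pairwise_negotiate(src_prefs, dst_free):
    """Pick a (fiber, wavelength) from the sender's preference list that the
    receiver can also tune to; no global frequency plan is needed."""
    for fiber, lam in src_prefs:
        if (fiber, lam) in dst_free:
            return fiber, lam
    return None

pairs = global_pairing({("torA", "torB"): 9e9, ("torC", "torB"): 4e9})
print(pairs)                                                   # [('torA', 'torB')]
print(pairwise_negotiate([(0, 5), (1, 7)], {(1, 7), (1, 9)}))  # (1, 7)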
>>: [inaudible] hybrid [inaudible]? [inaudible]
>> George Porter: Yeah. So this is actually a great point. There are two things I can say about that. One is that this stuff is super reliable because it's built by the Telecom industry; it's also very expensive. They had these goals of like 10 to the minus 18th bit error rates. That's the reason this stuff is expensive, and I mentioned before that we're winding back that decision tree to build devices that are [inaudible] to data centers; we just don't need a 10 to the negative 18th bit error rate. We have other ways of dealing with errors, so if relaxing that made the network substrate significantly cheaper and more integrated it might be a good trade-off.
So what I would say is the Telecom stuff we are using has been incredibly reliable. And then there was another point that I was going to make, so, yeah. The technology we've used has been fine so far. But in terms of handling failures in this model, because of this 100 channels per fiber, if we moved to a multi-fiber or multi-ring type network, the way you can think of that is that I'm now spreading traffic over multiple rings; and so if one of these N rings were to fail in some way, that proportionally reduces my bandwidth to N minus one Nths of what it was. And so there's a kind of nice failure recovery model there as well. So it's not all or nothing. You can imagine degrading the service based on the number of failures that you get. So if a laser fails or something like that you might lose a fifth of your bandwidth or something like that, but you don't lose 100 percent of your bandwidth.
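That graceful degradation is just proportional striping across rings; a tiny sketch with assumed numbers:

def bandwidth_after_failures(total_gbps, num_rings, failed_rings):
    """Traffic is striped across rings, so losing a ring degrades gracefully."""
    return total_gbps * (num_rings - failed_rings) / num_rings

print(bandwidth_after_failures(100, 5, 1))   # 80.0: lose a fifth, not everything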
>>: Have you also seen [inaudible] same thing. [inaudible]?
>> George Porter: I don't have any concrete quantitative data on failure rates between the two
of them but what I would say is that the optical stuff, especially the stuff in the Telecom
industry, is really reliable. So there are failures that occur, of course, but it hasn't been a big
problem for us. Now one of the things that I will say is that these tunable lasers, the way they
actually tune is that they have very small heaters next to the laser that change the temperature
because the frequency that they transmit on depends on the temperature. Now the flip side of that is that in the data center, if you have changes in temperature, you have to have some way to stabilize
that temperature. And so one of the areas that's really, that a lot of the optics people are
working on is on devices that don't require active cooling and stabilization of temperature. But,
yeah. We didn’t run into that problem at all. So we have a sort of a data center like the size of
this room at UCSD and it’s got some chillers and stuff in it, and we've not experienced any
failures in three years. But we are very small scale.
>>: For example, like vibration effects [inaudible]?
>> George Porter: No. Nothing like that. We also, just as a side point, for the sorting record we
have 1000 spinning discs. We never saw any vibration effects at all in the time we used them.
Yeah.
>>: You measure the scale effect copper?
>> George Porter: Yes.
>>: At lower frequencies they’re fixed or making error. So replacing a single cable by a grade of
[inaudible] cable [inaudible] or does that get too ridiculous as the frequency?
>> George Porter: My understanding is that it does. And again, you all are the experts here,
but the labor involved in plugging all these cables in is very nontrivial. It's gotten to the
point where organizations like Google are building these robots to build cable assemblies so
they can plug one layer of switching into another. And the second you say I want to take 10 big
fat cables and put 500 of them together into a bundle this big, I think people don't like that.
>>: All right. Let's thank George once again.
>> George Porter: Thank you.