>> Jim: Okay. It's my pleasure to introduce Zhangxi Tan who is a PhD [laughter] from UC
Berkeley who is actually pretty well-known. I think you've been here several times and you've
got a Microsoft Fellowship.
>> Zhangxi Tan: No.
>> Doug Berger: No? You didn't? Well, you should have [laughter]. And has done a lot of very
interesting work using FPGAs to simulate data centers and he'll talk about that work today.
>> Zhangxi Tan: Thanks, Jim, thanks for the introduction. So the talk runs about 50-ish minutes, and feel free to ask questions along the way. The topic I am going to present is my dissertation work done at Berkeley with Professors Krste Asanović and David Patterson. The talk is called Diablo: Using FPGAs to Simulate Novel Datacenter Network Architectures at Scale. So this is the outline, the agenda of my talk. I'm going to briefly talk about the motivation for this work. Since we are building an emulation testbed and treating the data center as a whole computer system, during the course of the project we systematically looked at how computer systems should be evaluated, and we have our own methodology. Then I'm going to talk about the main part of the talk, which is Diablo; Diablo stands for Datacenter-In-A-Box at LOw cost. I'll talk about the implementation and the architecture, and then I'll show you three very exciting case studies we've done, some of which you wouldn't be able to do with current tools. Finally, since this was a really long system-building project, there are lots of experiences and lessons we learned; I'll talk about those and then conclude my talk. So the data center network is a very interesting topic. The first thing I want to show is a traditional
data center network architecture; this slide is from Cisco. As you can see, a data center network is usually a tiered architecture. There are multiple tiers: down at the bottom are racks of servers, each rack containing roughly 40 machines with a top-of-rack switch. These top-of-rack switches then feed into what people call aggregation switches, and there are multiple levels of those. Just to give you an idea of the size of the whole network, in a typical data center you will see about 10,000 or more machines plus thousands of switches, so this is a really, really large network, and there is no doubt about it. Data center
networking infrastructure is very, very important. This slide I stole from Jim Hamilton at Amazon. Basically, at a high level, his point is that the data center network infrastructure is the SUV of the data center. The analogy is to cars: an SUV is large and heavy and gets poor gas mileage. First of all, the network infrastructure is expensive; it accounts for the third-largest cost of your whole data center. And more importantly, it's not just the cost; it's about supporting applications and supporting data-intensive jobs like MapReduce. So it is very important. Therefore, there are lots of people looking at this
networking design space: some people on the research side, including I believe Microsoft, and some people on the product side, like Google's efforts and now Facebook. They are looking at 40 gigabit, even 100 gigabit switch designs. If you look at all of these designs, you will find lots of distinct design features. For example, in terms of switch design, people have different packet buffer micro-architectures, and even the sizing of the buffers is completely different. Also, some of the switches have programmable flow tables and some don't. Even in terms of the applications and protocols they support, they are quite different: some support different versions of TCP, some have congestion-notification bit support, and so on. However, if you look at all these designs, they are all pretty nice. Everybody claims they have
the best design compared to the others. But if you look at all of the work done so far, there is one problem that has been largely ignored, which is the evaluation methodology. If we look at how people evaluate their novel designs, there are a couple of issues. First of all, the scale: the scale people have right now is way smaller than the real data center networks I showed you in the first slides. If you look at the experiments people have done so far, typically you will see a testbed of fewer than 100 nodes. For academia things are even worse because they don't have much money; they just use fewer than 20 machines, put some virtual machines on them, and pretend they have lots of nodes.
The second issue is the programs people use. People tend to use synthetic programs or micro-benchmarks. Some people even claim micro-benchmarks are enough because they can drive their switches to full load, and that's a really bad assumption. If you look at real data center applications, what they are running is really web search, Gmail, and MapReduce types of things. The last issue is the switch design itself. Networking researchers today are trying to use off-the-shelf switches, and unfortunately those switches are really proprietary; there is almost nothing you can change. For example, you cannot actually change the link performance or delays, or even the sizing of the buffers. You want to change a commercial switch? Okay, good luck: have a talk with [indiscernible] and see if they will allow you to do that. It's not very easy. So here I raise a
really valid question: how do we enable network architecture innovation at large scale, on the order of thousands of nodes rather than hundreds, without spending lots of money building a real large-scale data center infrastructure? Before I started the project I visited a couple of places, including here, and also talked to people at large companies like Microsoft, Google, and Amazon, and asked them what kind of features they would want for networking evaluations. One common conclusion is that networking evaluation is at least as hard as evaluating other large-scale system designs. First of all, you need decent scale. Data centers are on the order of tens of thousands of nodes, and you need at least a couple of thousand nodes to see really interesting phenomena. Secondly, the real things you want to look at, the switches, are massively parallel. If you want to simulate something at this design scale, you really need decent performance from your simulators. Think about a large high-radix switch: usually 48, sometimes close to 100 ports. Think about the number of flows it supports and the virtual queues [indiscernible]; there are lots of concurrent events every clock cycle in the thing you are trying to look at.
Also, the network itself really operates with nanosecond time accuracy. Think about the typical case for 10 gigabit Ethernet: transmitting a 64-byte packet on a 10 gigabit link takes about 50 ns, which is really comparable to a DRAM access. That means if you build a simulator for this, you will see lots of fine-grained synchronization during the simulation.
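To make that concrete, here is a back-of-the-envelope check of the serialization time quoted above; it is just arithmetic on the line rate, ignoring the Ethernet preamble and inter-frame gap.

```python
# Serialization time of a minimum-size frame on 10 Gb Ethernet: about 51 ns,
# i.e. on the same order as a DRAM access, as noted above.
frame_bytes = 64
link_bits_per_second = 10e9
serialization_ns = frame_bytes * 8 / link_bits_per_second * 1e9
print(f"{serialization_ns:.1f} ns")   # ~51.2 ns
```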
Last but not least, people really want to run their production software; they need extensive application logic. People are not content with micro-benchmarks at all, especially on the product side. So here is my proposal.
I'm going to use FPGAs. This sounds crazy. For those of you who are not familiar with FPGAs, let me give a very brief overview. This is from the digital circuit design class at Berkeley. An FPGA is a reconfigurable logic architecture: you can see there is an interconnect and there are configurable functional blocks. If you zoom into one of these blocks, you see a structure like this: typically a lookup table with a fixed number of inputs, which can be used to implement arbitrary [indiscernible] logic, and a flip-flop to store state in the circuit. People use FPGAs for many things; for example, people use FPGAs to build initial hardware prototypes and to do design validation. You can use this highly structured device to build a target hardware system very easily. My idea is to use FPGAs a little differently from people who use FPGAs for compute and people who use FPGAs for verification. What I did here is build abstracted, execution-driven performance models. The performance model is realistic enough to run the full software stack. And overall, when we calculate the cost per simulated data center node for the system we built, it is really low, around $12. The alternatives are really expensive: even if you spend, say, 25 bucks apiece to buy lots of cheap processors, you still have to buy the interconnect, and there is nothing you can change there; it is just one implementation. Also, I believe there are similar efforts in academia to build hundred-node testbeds using Atom processors. I mentioned that, as Luiz Barroso from Google says, the data center is really a warehouse-scale computer, so we are going to treat the data center problem as a computer systems problem and look at how people deal with computer system evaluation. One common approach is what computer architects do: running simulations. Before I go further, let's get familiar with the jargon and terminology that I would like to use. In the simulation world, people often
like to mention host versus target. The target is the system that is actually being simulated; in my case I'm looking at a data center design, so that means the real servers, switches, the network, and the network interface cards. The host, when talking about simulation, means the platform on which the simulator itself runs; in our case it is actually FPGAs. We use FPGAs as the architecture simulators. For a software simulator, the host would actually be x86 machines. We looked at many computer architecture and system simulators out there. Primarily, we think there are two major types of
simulators. The first, the most popular one, is based purely on software; we call it Software Architecture Model Execution, in short SAME. The second, like our approach, uses FPGAs, and we call that FPGA Architecture Model Execution, in short FAME. I actually wrote a paper about this a couple of years ago talking about the differences and why you have to do FAME-style simulation to explore this design space, and why software alone falls short. So first let's look at the current state of the art in evaluation, [indiscernible] software architecture model execution, SAME. Almost everyone uses it, but there are lots of issues with software simulators. The number
one issue is performance. To show the problem, we actually looked at how computer architects run simulations in their research. We looked at the papers published at the premier architecture conference, ISCA, a decade ago, and also all of the papers published a couple of years ago. The data is slightly old, but I don't think there has been a significant change in the way people simulate their benchmarks. We looked at the number of instructions people simulated per benchmark across all the published papers. We also looked at the complexity of the target design in terms of the number of CPU cores it supports, and calculated the number of instructions
actually studied per core. Here is an interesting thing we found. A decade ago, back in the last millennium, everybody was looking at single-core designs, and the median number of instructions simulated per benchmark was around 267 million. Moving to a few years ago, in the multicore era, everybody is looking at multicore designs, and the median number of cores people study is around 16. Given that the complexity of the systems has grown over the past decade, people do simulate more instructions, close to a billion, 825 million, but if you divide by the number of cores being studied, the per-core number, around a hundred million instructions, is not going up but actually going down. Sadly, think about the clock frequencies people run today: processors a decade ago ran at a couple of hundred megahertz, while recent processors run at a couple of gigahertz. If you convert those 100 million instructions to raw wall-clock time, it is really sad: it translates to less than 10 milliseconds. What does that mean? It means you couldn't even simulate a typical operating system [indiscernible], which is typically around a millisecond. And if you look at data center applications, with I/O and all sorts of things, you have to simulate at least a couple hundred seconds, or even a few minutes, to see really interesting phenomena. Another big problem with software simulators is that people tend to build unrealistic models; for example, infinitely fast CPUs where everything consumes just one cycle, which is very dangerous.
And so therefore we are really betting on this FPGA Architecture Model Execution. Just so you know, these are simulators built on FPGAs; they are not to be confused with the standard FPGA computers you see regularly, and not to be confused with the FPGA-based accelerators people are interested in. These are actually architecture simulators. When you design an architecture simulator in the FAME style, there are lots of design features and lots of dimensions, and we summarize them along three basic dimensions. The first is direct versus decoupled. Decoupled means you can decouple the host cycle from the target cycle; for example, you can use multiple host cycles to simulate one target cycle. The second is full RTL versus abstracted RTL: for verification you tend to use the unmodified [indiscernible] RTL, but for FPGA simulation, in the interest of performance, you can use abstracted performance models that have many FPGA-friendly structures to simplify your design and increase density, and along with the decoupled timing [indiscernible] you get a much simpler design that just takes multiple FPGA cycles to simulate one target cycle. The last feature is a single-threaded versus multithreaded host, and this is really crucial for me to be able to simulate a couple of thousand
server nodes with switches. Let me show you what I mean by a multithreaded host; think about this case, what we call host multithreading. Say the target you are trying to simulate is four independent CPUs. One of the easiest ways to build your model is to build four CPU pipelines, but that consumes more resources. A simpler thing you can do is build a single processor pipeline and put four hardware threads on it. Each hardware thread is dedicated to one of the CPUs in the target, and you have a tiny timing model that synchronizes all the hardware threads so that, in terms of simulation, they act as if they run in parallel. A minimal software sketch of this idea is below.
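This sketch is only illustrative: a real FAME host does the interleaving with hardware thread contexts in the pipeline, while here one plain software loop stands in for the time-multiplexed host pipeline. All class and function names are made up.

```python
# Host multithreading in miniature: a single host loop (the "pipeline") is
# time-multiplexed over several simulated target CPUs, advancing each one by
# exactly one target cycle per round so they stay cycle-synchronized.
class TargetCPU:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.target_cycle = 0
        # architectural state (PC, registers, ...) would live here

    def step_one_target_cycle(self):
        # fetch/decode/execute one target cycle for this CPU
        self.target_cycle += 1

def run_host_multithreaded(num_cpus=4, target_cycles=1000):
    cpus = [TargetCPU(i) for i in range(num_cpus)]
    for _ in range(target_cycles):
        for cpu in cpus:          # one host "thread slot" per target CPU
            cpu.step_one_target_cycle()
    return cpus

if __name__ == "__main__":
    run_host_multithreaded()
```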
As a proof of concept, I built a FAME simulator we call RAMP Gold, whose goal is to simulate 64 SPARC v8 cores with a shared memory hierarchy. It is really low cost, running on a single $750 [indiscernible] board, and it has sufficient hardware units: a cache hierarchy, FPUs, and MMUs, and it is Linux compatible. We
actually compared the simulator with the state-of-the-art software simulator which is Simics.
We found that it is actually orders of magnitude faster and more efficient. This is also the basic building block for my data center project, Diablo. So next let's look at the data center project, the Diablo simulator; I'll give the highlights of Diablo. Diablo is really a wind tunnel for
data center networks, built of course with FPGAs. The goals of Diablo were very aggressive. We are trying to simulate tens of thousands of nodes; in our prototype we have a couple of thousand nodes running, and each node in the system is running a real software stack, not micro-benchmarks. We are also focusing on the switch interconnect, so we are able to simulate hundreds or even thousands of switches with Diablo, at all levels, with enough architectural detail and accurate timing. I want to point out that in Diablo there is real instruction execution; we are really moving real bytes through the network, not sending fake pointers. Diablo is also fast enough that, running overnight, it is capable of simulating on the order of 100 seconds, or even a few minutes, of target time. And since we know FPGAs are really hard to program, we prebuilt many runtime-configurable architecture parameters that users can modify with control software; there is no need to [indiscernible] for this. We allow users to change the link performance, link bandwidth and link latency, and even the switch buffer layout and the buffer drop policy, things like that.
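As a purely hypothetical illustration of what such runtime-configurable parameters might look like, here is a sketch; the parameter names, values, and the register-write mechanism are all made up and are not Diablo's actual interface.

```python
# Example of the kind of knobs the control software could rewrite at runtime
# without rebuilding an FPGA bit file. Names and values are illustrative only.
switch_config = {
    "link_bandwidth_gbps": 10,          # simulated link throughput
    "link_latency_ns": 500,             # simulated link latency
    "buffer_bytes_per_port": 64 * 1024, # switch buffer sizing
    "drop_policy": "tail_drop",         # buffer drop policy
}

def apply_config(write_register, config):
    # write_register(name, value) stands in for whatever mechanism the
    # control software uses to poke simulator registers over the control links.
    for name, value in config.items():
        write_register(name, value)
```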
Of course, this is all built with the FAME technology I mentioned. As I said, Diablo is all about performance models, and there are three basic models in Diablo. The first one is the server model.
It is built with the RAMP Gold simulator. The server model right now uses a very simple timing model, assuming fixed CPI: all non-memory instructions take one cycle, and memory instructions take a fixed number of cycles. It supports the SPARC v8 ISA; as I mentioned, this is actually a SPARC-certified processor design, so we ran their verification suite. And it runs full Linux: we ran the full Linux 2.6 there, and right now Linux 3.5, which is very recent.
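Here is a tiny sketch of what a fixed-CPI timing model like that amounts to; the memory-latency constant is a placeholder, not the value Diablo actually uses.

```python
# Fixed-CPI timing: every non-memory instruction costs one target cycle and
# every memory instruction costs a fixed number of cycles (placeholder value).
MEMORY_OPS = {"load", "store"}
MEMORY_LATENCY_CYCLES = 10

def target_cycles(instruction_trace):
    cycles = 0
    for op in instruction_trace:
        cycles += MEMORY_LATENCY_CYCLES if op in MEMORY_OPS else 1
    return cycles

# Example: 3 ALU ops and 2 loads -> 3*1 + 2*10 = 23 target cycles
print(target_cycles(["add", "load", "sub", "load", "xor"]))
```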
The second type of model in Diablo is the switch model. If you look at data center networks, there are really two types of switches: the first is the circuit switch, which comes more from the research side, and the second, more conventional one, is the packet switch that production data centers run. We built both models as abstracted models, but our focus is more on the switch buffer configuration than on switch routing. The packet switch model is modeled after a Cisco Nexus switch, the Nexus 5000, plus a Broadcom patent pointed out to me by Microsoft Research. The last model in Diablo is the
network interface card model. We realized that the network interface card model is really crucial, so we have a very sophisticated NIC model. It supports scatter-gather DMA and zero-copy drivers, many of the performance optimization features you find in production NICs. It also supports the polling APIs from recent kernels, so this is not trivial. So now we have three models; let's see how I use
the three models to model a real data center. Recall from the first slide this abstracted view of your data center; I'm using something like a fat-tree (Clos) topology just for illustration. You have thousands of servers, thousands of switches, and [indiscernible]. Instead of building tens of thousands of different FPGA bit files, I only have two kinds of bit files; this is a very modularized design. The first type of bit file is used to simulate a rack of servers plus a top-of-rack switch. Given the current setup, which allows for [indiscernible], I was able to simulate about four racks of servers plus a top-of-rack switch per FPGA of this type. The second type of FPGA design is dedicated to simulating a switch by itself: we use a whole FPGA to simulate the monster switches, for the aggregation switches, array switches, and datacenter-level switches, so it can really do detailed architectural modeling there. We connect these two types of FPGAs using high-speed SERDES links, and the connections follow the physical topology: say you want a hypercube, then you just connect the FPGAs in a hypercube pattern. There is no restriction, just wiring.
So we actually built this Diablo prototype using multiple BEE3 boards, which were developed by Berkeley and Microsoft Research; you have probably heard of them. Each BEE3 board has four Xilinx Virtex-5 FPGAs, so with six boards I use 24 FPGAs in total. This is a photo of our cluster. We populated all of the boards with the maximum memory, so it has decent memory capacity: 384 gigabytes of DRAM in total. But considering the number of nodes we are trying to simulate, that's still not that much; it is roughly 128 megabytes per simulated node, which we think is still enough to run some interesting network applications. We are actually working on a second generation to solve this problem.
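As a quick sanity check on those numbers, 384 GB of DRAM at 128 MB per simulated node works out to roughly three thousand nodes, consistent with the simulation capacity quoted later; the arithmetic below is just that division.

```python
# 384 GB total DRAM / 128 MB per simulated node ~= 3,000 nodes.
total_dram_mib = 384 * 1024
per_node_mib = 128
print(total_dram_mib // per_node_mib)   # 3072 simulated nodes
```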
Also, if you look at the memory bandwidth, it is really great: there are 48 [indiscernible] DRAM channels, so you have roughly 180 gigabytes per second of bandwidth. Each of the FPGAs also has a one-gigabit link; you see these yellow cables? Those are the one-gigabit links, which we use just for control, so you have 24 one-gigabit links. We have a couple of servers, actually six, just to drive all of the boards; they are used for things like the Linux serial consoles and also to provide some disk functionality. We connect these boards using -- sure.
>>: I have two questions. One is I notice on the first bullet that you use 128 megabytes per
node. Do the workloads have to be scaled down to map to that?
>> Zhangxi Tan: I think yes. For the memcached workload we're looking at, it doesn't actually matter in terms of the workload size.
>>: Okay.
>> Zhangxi Tan: I think this is limited by the board design itself.
>>: It seems really small [indiscernible] that size [indiscernible]
>> Zhangxi Tan: Yeah. When I talked to some networking [indiscernible] people, they thought that when you have 100 megabytes you can run most of the interesting protocol studies. But I know that for some other applications you definitely need gigabytes, because the caches alone are very large.
>>: So if you scale it down to that size, then the software simulate a bit faster?
>> Zhangxi Tan: No. [indiscernible]
>>: Okay. And the other question I had is on a previous slide you showed how you partitioned, so is this a balanced system, and if I'm a user do I have to manually make these partitioning decisions to get it through [indiscernible]
>> Zhangxi Tan: You don't have to. We just use the naïve partitions like this. You don't have to
worry about that too much. And I can tell you the performance simulator is very scalable.
When we try to simulate 1000 nodes versus 2000 nodes, we saw zero performance loss.
>>: I'm thinking, if I'm trying to evaluate different types of network architectures, I might have a different topology altogether, and how I'm drawing these boxes [indiscernible] some places.
>> Zhangxi Tan: I don't think so. We don't play tricks with fancy partitioning; we mostly just use the naïve partitioning, and that will still give you good performance. And I can tell you the simulation bottleneck is not actually simulating the switches; it is simulating the server computation, so it is the server computation simulation we have to optimize in that case. Okay, so back to this. We connect all the boards in the prototype using these SERDES cables, running our own custom protocol; there are lots of them, and each link runs at about 2.5 gigabits per second. We are designing a second version which will have a passive backplane to handle all of the connections that currently use these cables. Overall, our whole system is pretty nice. It consumes only a little over 1 kW, about 1.2 kW, but in terms of simulation capacity, what you see here is capable of simulating slightly over 3,000 servers in 96 racks, and there are also 96 simulated switches. So think about it: there are 3,000 copies of Linux running there, running real applications. It's very interesting. In terms of instruction simulation throughput, the whole system delivers roughly 8.4 billion instructions per second. Just a little bit about the limitations, because there are lots of things
going on there. What I'm showing here is a die photo of the FPGA used to simulate a rack of servers. At a high level, this is a full custom FPGA design; almost everything is our own design, and we use minimal third-party IP. The only things we didn't design ourselves are the FPUs. It is a pretty packed, full design: we stretch the chip to the limit, with about 90 percent BRAM utilization and close to 95 percent of the lookup tables in use. Keep in mind that the FPGAs we use here are about five years old, 2007 parts; with this year's FPGAs you could potentially fit more pipelines. We run the whole circuit at 90 MHz, with the BRAMs at 180 MHz; it is a double-pumped data path to make sure we have enough ports on the BRAMs. If you look at the design closely, there are basically two partitions. Each partition has one host DRAM controller with 16 GB of DRAM, and it has two RAMP Gold server pipelines plus two switch pipelines. In terms of the sizes of these models, the NIC model, the switch model, and the processor [indiscernible] pipeline are roughly the same size, and the [indiscernible] is double that size, so you can estimate how much I could put on a larger FPGA. So this FPGA has four server pipelines and four switch/NIC pipelines, and it also has a couple, sometimes up to five, 2.5 Gbps transceivers; it depends on which type of FPGA and which
type of topology you want to use. So I showed you the prototype with six BEE3 boards; what if you want to go to a 10,000-node system? Here is a back-of-the-envelope calculation. If we stick with the current technology, which is the very old 2007, five-year-old generation, it would end up taking about 22 BEE3 boards, meaning 88 FPGAs. But if you look at this year's technology, it is really promising: with the recent 28 nm generation, simulating a 10,000-CPU system would probably only need 12 boards. You can calculate the cost of the boards assuming each board costs around $5,000; we are actually doing the board design right now. The total cost is about $120k, including both the board cost and the DRAM cost, which split roughly half-and-half.
switching network. This is actually work I've done with MSR in Mountain View with Chuck
Thacker. I was trying to use Diablo to simulate the initial version of his circuit switching
network. Probably you are familiar with Chuck Thacker's work. To bring everybody back up to
speed, and this is a circuit switching network works more like an ATM. It has multiple levels of
the switching design. During the period of three months of my internship there, we actually
modeled all levels of the switches. And also we tried to run hand coded Dryad Terasort kernels
with some device drivers, so we think this is very interesting because Diablo can actually
capture some performance bottlenecks in the early designs. For example, we found some
problems in the circuit switching set in the tear down circuit. Also, we think there are lots of
interesting things from a software perspective. Chuck's design is a last leaf network and the
network doesn't draw any packets but we found actually that because everything is handled by
the processor eventually, and the software processing speed can actually lead to packet loss.
The second case study goes back to the traditional packet-switching world; I looked at a classic problem people call TCP Incast throughput collapse. This problem was originally reported by CMU, who found it in real production gigabit network-attached storage. The setup of the problem is very simple: this is the topology. You have a shared switch, a shared [indiscernible] file server, and a number of senders connected to the same switch. You measure the throughput on the bottleneck link, which is the link between the shared server and the shared switch, and you plot the throughput curve versus the number of senders, the number of clients, and you see a curve like this: this axis is throughput and this is the number of senders. Ideally, as you increase the number of senders, the throughput would gradually go up toward the peak, but in practice this is the curve you see. People ran network simulations using NS-2 and tried experiments on small-scale clusters, and researchers concluded that this is caused by small switch buffers and by the TCP retransmission timeout in the kernel being too long;
and so that's their conclusion. Okay. When I looked at this problem, I talked to a couple of people, especially on the industry side, and asked them whether it is a real problem. Some of them told me it is pretty artificial and only happens in this specific setup, so we looked at how people did their experiments, and we found this problem really only happens under a very specific setup: it happens with small block sizes, usually less than 256 kB of request size, and it happens only on a switch with shallow buffers. So the first thing we wanted to do with Diablo was to re-create this critical setup and see whether we can reproduce the throughput collapse. This is the result you get from Diablo in simulation. We modeled a 1 Gbps switch with shallow buffers, and we ran TCP Incast code; just to be fair, we are not using Berkeley code, we are using code from a Stanford networking research group. This is the throughput curve you see, and we also tried changing various things in the system, scaling up the CPU performance to see whether that has an effect, and changing the OS syscalls the applications use. We actually didn't see a lot of difference at the gigabit level, so this is throughput collapse, and that's what you see on a comparable physical setup, so this is
great: we can reproduce the problem. Since Diablo is really a simulator, the question to ask is: what if we upgrade the whole interconnect to 10 gigabit, change the switch, and even change the server performance? How will the system perform if we have a faster server? There are also lots of other things you can play with. Recall this program: the original program is written using standard pthreads plus blocking [indiscernible] syscalls, but if you look at recent server applications, they actually use epoll, a more responsive style of I/O syscall. Many current servers like memcached use epoll rather than one blocking thread per connection. So what if we change this and run the same experiment; what throughput collapse do we get?
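To make the contrast concrete, here is a minimal sketch of an epoll-style event loop (Python's selectors module uses epoll on Linux); it is illustrative only and is not the Stanford benchmark code, whose alternative style is one blocking recv() thread per connection.

```python
# One thread services many senders by waiting on all sockets with a single
# epoll-backed select() call, instead of one blocking thread per connection.
import selectors
import socket

def serve(port=9000):
    sel = selectors.DefaultSelector()
    listener = socket.socket()
    listener.bind(("", port))
    listener.listen()
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ)

    while True:
        for key, _ in sel.select():        # wakes up only when data is ready
            sock = key.fileobj
            if sock is listener:
                conn, _ = sock.accept()    # a new sender joins the fan-in
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
            else:
                data = sock.recv(65536)    # drain whatever has arrived
                if not data:               # sender finished its block
                    sel.unregister(sock)
                    sock.close()
```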
Here I show a simulated 10 gigabit switch with a bunch of the changes I mentioned; for example, I'm simulating 4 GHz servers versus 2 GHz servers. If you look at the orange
color, that is using the unmodified Stanford code and just changing the switch; this is what you get. You still see a throughput collapse at a very early stage, and there is not a significant difference between the 4 GHz and 2 GHz servers, with 4 GHz slightly better, but not by much. But once you start looking at the OS and change the OS code that is used, changing the syscall to epoll, you get the drastically different curve on the left-hand side. You see a great difference between 4 GHz and 2 GHz, and moreover, you see the collapse onset point actually shift to the right. So this was very interesting. First, we were happy because we can reproduce this simple problem, even if people call it artificial. Second, we found that once you scale up the whole system, say change your switch, the system's performance shifts dramatically to a very different regime, and we think looking at the OS implementation and even the application [indiscernible] is extremely important. In much of the current work on this problem, people just run NS-2; there is no computation model and no OS at all. Also, we think fixing the retransmission timeout timer is not really fixing the real problem. You have to look at the root of the problem, where the throughput collapse really comes from, and then start developing some methodology, or probably a new system design, to solve this
problem. The last case study we think is very interesting: we ran a real memcached service on a really large-scale setup. We think this is exciting because it reproduced some of the phenomena that people see in real large-scale production environments, and also because this is probably the first time academia has been able to look at a problem at the scale of thousands of nodes. Previously, when I talked to people, academia could only afford to look at a couple of hundred nodes, but now we're moving a whole order of magnitude up: thousands versus hundreds. Memcached, as you are probably familiar with, is a very popular distributed key-value store used by many large websites, say Facebook, Twitter, and just about any large website; it is basically a memory cache, as its name suggests. The code we use is unmodified memcached. This is not memcached-tiny or -mini or whatever; this is unmodified code, downloaded from their site and compiled for SPARC on our platform. The clients we use are actually coded [indiscernible] on top of libmemcached; this is the application-level library used to handle application-level [indiscernible] and things like that. In terms of the workload, ideally we would want a real production workload, but, you know, as a university we have only limited access to such logs. Facebook, however, published workload statistics in last year's SIGMETRICS paper, so we studied the paper, built our own workload generator, and verified it against the real Facebook data. We think this is probably the best we can get.
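As a hedged illustration of what such a generator does, here is a sketch; every constant and distribution shape below is a placeholder, not the statistics from the Facebook SIGMETRICS paper, and a real generator would plug in the published key-size, value-size, and inter-arrival distributions and a measured GET/SET ratio.

```python
# Toy memcached workload generator: skewed key popularity, mostly GETs,
# log-normal-ish value sizes. All parameters are made-up placeholders.
import random

def next_request(num_keys=1_000_000, get_fraction=0.9):
    key_id = int(random.paretovariate(1.2)) % num_keys   # skewed popularity
    if random.random() < get_fraction:
        return ("GET", f"key{key_id}", None)
    value_size = max(1, int(random.lognormvariate(5.0, 1.5)))  # bytes
    return ("SET", f"key{key_id}", b"x" * value_size)

if __name__ == "__main__":
    for _ in range(5):
        print(next_request())
```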
The first thing we did with this memcached setup was to validate our results at small scale. Validating at large scale is pretty hard [indiscernible], but at small scale we can build a roughly equivalent physical
system set up and run the same experiment, the same programs on both systems and see if we
can get the shape of the curve right. In our validation we built a 16-node cluster of pretty regular 3 GHz Xeon machines with a 16-port Asante IntraCore switch. In terms of configuration, of the 16 servers we used two as the memcached servers and the rest as clients, scaling the number of clients from 1 to 14 and plotting the throughput and latency curves and so on. Memcached itself [indiscernible] has lots of parameters you can change on the server side: you can change the protocol, TCP versus UDP, to see which is better, and change the number of worker threads in the server, say from four threads to eight. And keep in mind the simulated server is a single-core model running at 4 GHz with a fixed-CPI timing model, versus this real machine; it is a different ISA with different CPU performance. We don't expect to get the absolute numbers right; what we are looking at is the shape of the curve, the trend and the scale, and that is what we care about most. The first thing we
looked at in the validation, from the server's perspective, is the server application throughput. This slide shows the application throughput measured in Diablo, which is simulated, versus what we really measured on the real cluster. The x-axis of this graph is the number of clients, from 1 to 14; the y-axis is the application throughput measured at the memcached [indiscernible] interface, in kilobytes per second. You can see that under the different setups, Diablo successfully reproduces the trend of the curve, and to our surprise, because this is more I/O bound, the absolute values are also fairly close to the real thing, so we're happy. Now let's look at the client side. For each of the clients we ran, we logged all the queries and measured the client query latencies, and we made a comparison graph similar to the previous slide; again, the right-hand side is the real cluster and the left-hand side is Diablo. As you can see, we again get most of the trend right: most of the queries finish reasonably fast, in less than 100 microseconds. The only difference is that in the Diablo setup the latency is probably slightly better than the real setup, because we didn't actually calibrate our models. If you want absolute values, you can definitely calibrate your models and get a closer curve, but in terms of the trend of the whole system, this is very
good. The next experiment is the most exciting thing we've seen so far. This was taking Diablo and really trying to study a problem at large scale, scaling out to 2,000 nodes. Since this is a simulator we can do many tricks; you can play with many hardware setups and simulate different interconnects. In our case we look at a gigabit interconnect with the usual microsecond-level port-to-port latency, and we also look at what happens to the application if we change the whole interconnect to 10 gigabit and improve the switch by a factor of 10 in everything, throughput and latency. I know that setup is a little aggressive, but it shows some of what we can do with Diablo at this scale. When we scaled up this application and this experiment, we tried to be moderate: we maintained the server-to-client ratio, so we still have two memcached servers per simulated rack with the rest as clients, and we wanted to make sure the server loads stay moderate; we don't want to push everything to the cliff, because that's not how people run things today. The server utilization we measured in the simulations is roughly 35 percent, and there is actually no packet loss in this experiment. One problem at large scale is what people call the request latency long tail. Luiz Barroso
actually talked about this problem about two years ago at the FCRC keynote. So we are actually
trying to use Diablo to see whether we observe a similar phenomenon at large scale. This slide shows the client request latency distributions. These are PMF graphs; the x-axis is log-scale, in microseconds, and the y-axis is the frequency. This graph actually tells you many things. The dashed lines are the results for the 10 gigabit interconnect and the solid lines are for the gigabit interconnect. The first thing is that we do see the latency long tail: most of the queries finish very quickly, on the order of 100 microseconds, but there is still maybe 1 percent, 1.5 percent or so, that finish a lot slower than the rest, for some reason two orders of magnitude slower. Also, for the gigabit switch we classify the queries into three categories based on how far they travel in a system of this size: some just hit the local memcached server within the rack, some have to traverse one aggregation switch, and some two. We [indiscernible] the latency distribution of each type of query, and once you have a sizable system, most of the queries actually go off-rack, so that case dominates. You will also find that if a query traverses more switches, like this one, this peak gets lower, [indiscernible]; it means if you traverse more switches, the variation in your query latency will be
larger. I think Luiz wrote a paper this year talking about the cost of this latency long tail; one of the realizations is that once you have more switches you have more nondeterminism in your system, and you can actually capture that here. Another great thing is that you can compare the results for the 10 gigabit switch versus the 1 gigabit switch. Keep in mind everything is 10x better in the 10 gigabit setup. It does help with the latency long tail, and fewer requests finish slowly, but if you look at the actual values for the requests on the 10 gigabit interconnect, they are not 10x better in terms of application latency; it is no more than about 2x. Again, in Luiz's talk he mentions that if you look at where the time in a query goes, the time spent in the switch hardware only accounts for a very small portion of the overall latency; a lot of the time is spent in the kernel stack. Diablo again helps identify this, and again the OS, the network stack, and the computation are very, very important. So since we have this vehicle, we can see problems at large scale. One big question we asked is the impact of system scale on
this long tail; this is what makes Diablo unique. In the previous slide you saw the PMF distributions; now let's look at the CDF. This takes one curve from the previous slide. If you are Google, you are not actually interested in the part where most of the queries finish reasonably fast; what you are actually interested in is this part, the tail of the whole request distribution. This is what makes the whole thing interesting in terms of revenue and everything.
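As a small illustration of how one reads that tail off a set of measured request latencies, here is a sketch; the sample values are made up and the simple index-based percentile is only approximate.

```python
# Sort per-request latencies and look at high percentiles instead of the mean;
# the long tail shows up as requests orders of magnitude slower than the median.
def percentile(sorted_samples, p):
    idx = min(len(sorted_samples) - 1, int(p / 100 * len(sorted_samples)))
    return sorted_samples[idx]

latencies_us = sorted([120, 95, 140, 110, 30_000, 105, 98, 125, 12_000, 101])
for p in (50, 96, 99):
    print(f"p{p}: {percentile(latencies_us, p)} us")
# p50 looks fine; the 96th-100th percentile window is where the tail lives.
```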
So let's zoom in on this CDF curve a little bit and answer the question: if we have a larger system, does it hurt or help the latency long tail? Here I am zooming into the tail part of the curve; this axis runs from about 96 to 100 percent. There are lots of data points, so I only show one configuration, because the other configurations look similar; the one here is the 10 gigabit interconnect running the UDP protocol. Ideally you want the curve to be closer to the top, which means fewer requests fall into the tail. But if you plot this curve at 10 gigabit, you see a clear trend: this is 500 nodes, this is 1,000 nodes, this is 2,000 nodes. With more nodes, more servers in your system, apparently this tail
problem is more obvious. So that's the first thing. Okay. Next: academia especially tends to look at systems of just hundreds of nodes because of physical limitations, and then tries to draw conclusions; I think many people draw conclusions from just hundreds of nodes. So let me ask this question: would you draw the same conclusions if I gave you a platform capable of looking at the problem with thousands of nodes? Let's say, in this case, looking at the tail: do we get a different conclusion, different answers, with a larger-scale system? One very simple experiment we can perform here is to change the software protocol we use for the servers, say TCP versus UDP. Which is better, given a specific hardware setup, in terms of minimizing the long tail? This is a
very simple experiment we can do. So let's first look at the experiment we have done with the
gigabit, the traditional gigabit interconnect. Again, I'll plot this, like zoom in on the curve of the
CDF, and I will keep the coloring scheme consistent: red represents the UDP protocol and blue represents the TCP protocol. So clearly, with 500 nodes on a gigabit interconnect setup, UDP is definitely the better choice, because its curve is closer to the top. Now let's move a little further and look at 1000 nodes. Well, apparently,
very interesting, right? Going up to 1000 nodes you don't see significant difference between
TCP and UDP and with TCP slightly better. Now let's move a little bit further to 2000 nodes.
This is the result for 2000 nodes. In this case TCP for sure outperforms UDP. So I can draw the conclusion now that TCP is better at this scale. But what if we change the
interconnect to 10 gigabit, right? So will I still have the same conclusion like TCP is a much
better choice in terms of the latency long tail? Let's redo the same experiment. First, again, 500 nodes: here you can see TCP already outperforms UDP for the 500-node setup at 10 gigabit. Now
let's go up to 1000 nodes. This time the conclusion is reversed: apparently UDP is better. Now let's move to 2000 nodes and see whether UDP is still the better choice. Okay. This is the curve
we got at 2000 nodes: there is not much difference; they are almost the same. So this is really interesting, because we think the long tail really is a complicated problem. Multiple factors contribute to the issue, and it is not a linear function, so you cannot definitively say which protocol is better. I believe there are many other things that happen at scale, and Diablo is definitely capable of helping with that; when we did these large-scale experiments we also found many other interesting things at large scale.
>>: In these experiments you see these effects. Do you know why?
>> Zhangxi Tan: I have some rough ideas. I think the problem, as I mentioned, is that there are lots of software queues in the networking stack, and the way the current networking stack handles them is not really deterministic. So look at the drivers; don't just look at the switch itself. The switch may be a factor, but the software in the networking stack is the big problem. The way it is designed, it is actually optimized for the internet; it is not deterministic. It works by polling [indiscernible] for most of the queues, in the way the OS drains the queues. So it's not as if you have a packet and it goes straight to the software; it is really up to the OS to decide when to notify the application. So I can't tell you exactly why this happens, but I believe there are more things you can do; this is not caused by a single problem, it is probably a system design issue. So let's look at the other issues. Sure?
>>: How much observability do you have?
>> Zhangxi Tan: How much? Anything you can observe with software, you can do, and we also use hardware…
>>: So anything you can see in software?
>> Zhangxi Tan: Yes. So this is standard Linux, whatever you can do with the Linux, you can do
it.
>>: Right. Can I put the FPGA [indiscernible] switch and all that?
>> Zhangxi Tan: Yes, you can actually do that. I actually didn't use [indiscernible], but I can give you distributions of how the queues are drained over time, and that's done through the hardware performance counters. We have this debugging infrastructure there, but for this experiment I just wanted to spend my time finishing up, so I didn't look at those. But yes, the answer is it is definitely possible; you can see more things than you could see in a real switch design. This is an architecture simulator, so you can probe whatever you want, and the nice thing about FPGAs is that the debugging infrastructure runs in parallel with the main logic, so it adds zero performance impact, which is good. Okay. So let's look at some
other small issues. The first issue: if you look at what Facebook has published, a blog about their memcached [indiscernible] choices, Facebook says they are running UDP, and the reason is that TCP consumes more memory than UDP. But we actually didn't see this: we looked at the server memory utilization at scale and it was pretty much identical. We think what really matters is the dynamic pattern of TCP versus UDP; the memory consumption could be caused by the servers not being evenly balanced, with more in-flight transactions to handle for TCP versus UDP, but you can't simply draw the conclusion that one protocol is definitely better. It is really about load balancing. Another thing we found is that when you look at how people, especially systems and networking folks, do similar experiments, they tend to focus on the transport-level protocols. They are not happy with vanilla TCP, so they do lots of hacks on the transport protocol, control-theory-based transport protocols, and they also try to hack the kernel, say tweaking the
timeout values in the protocol. But we found some cases where vanilla TCP might do just fine. What you really should focus on is not just the protocol itself: you have to focus on the CPU processing, as in the TCP Incast case; you have to focus on your NIC design; and even the OS and application logic are crucial. Also, to answer Eric's question, we think there are lots of software queues: when you start building the system, even just writing the driver, you figure out that the way the current networking stack processes packets is very, very complicated. There are lots of queues and buffers just on the software side; it's not as if there is just one buffer in a switch. There are many software buffers, which makes the software stack more nondeterministic. So why not just get rid of those buffers [indiscernible] and build a supercomputing-like interconnect directly on the [indiscernible] fabric? We also saw some other interesting things, for example when we changed the interconnect hierarchy, sometimes one layer and sometimes two layers. We found that adding another layer, a datacenter-level switch, affects the servers' DRAM usage, and I think that affects your flow dynamics. This is also very interesting to see at large scale.
So in conclusion, we believe that when evaluating data center networking architectures, looking at the OS, the application logic, and the computation is definitely crucial; we need that. The second thing is that you can't generalize results you got from 100 nodes to a couple of thousand nodes; you see completely different things at a couple-of-thousand-node scale. And we believe Diablo is really good at generating these relative numbers, as opposed to absolute numbers, because we are figuring out trends; we think it is a great design-space exploration tool for large scale before you really invest, shell out a couple of million dollars, to build a real system. I've learned a lot during
building this; this was a multi-year effort, by the way. Diablo, as you can see, is designed for large-scale systems, capable of simulating thousands of instances. Some people ask: what if I am only interested in things happening within a rack, will Diablo still help me, given that FPGA simulation is supposed to be slow? The answer is yes. The reason is that we have about 3,000 instances on this prototype, so you can potentially run your experiments in parallel; all the experiments we ran so far finish overnight. Think about a real physical design: sometimes you cannot afford the cost of building multiple racks, so you have to run your experiments sequentially, and overall that is not necessarily faster, or at least not significantly faster, than running in the simulator. Once you have 3,000 simulated instances, that massive simulation bandwidth probably changes the whole story. The second thing is that because the goal of Diablo is very aggressive, we are running real software, not the microkernels researchers often use; we are trying to bring up the whole stack, the shell and the kernel, so we found there are lots of issues in the software, because real software has bugs, right? We saw very interesting things when we brought up Linux. A couple of times we had to change the hardware just to work around Linux issues, because when we changed the software the problem showed up again, and then we decided, okay, we have to change the hardware to make sure we can run this legacy Linux.
Also, we found that Linux is really a product of hackers: the software hackers don't actually follow the hardware spec. When we designed the processor -- what's wrong? -- when we designed Diablo, we actually followed the spec from [indiscernible]; we followed the specifications. One of the things we found involves the [indiscernible] processor state register: part of the [indiscernible] register space is reserved, I mean for future use, and software shouldn't use it. But current SPARC processor implementations [indiscernible] leave some of those reserved bits writable and readable, and Linux currently uses one of those bits to store really crucial information, say to indicate whether the processor is currently in OS or application mode, which is used when unwinding the whole OS kernel stack. We found this purely by luck, because we followed the hardware spec, but the hackers, no -- they have a machine, they just code; if it works, it works.
Also, Diablo itself, a six-board FPGA system filling half of a server rack, is a massive-scale system in its own right, like a real machine. Think about the number of DRAM chips: if you plug in the numbers from Google's published [indiscernible] error rates for DRAM, you expect to see real soft errors in the system, and we actually do see DRAM errors. We also saw errors from the SERDES links; we have lots of those links, they are not 100 percent reliable, and you run into errors every day for sure. Also, because we end up using the memory in different
ways, some of the nodes just die in the simulation. We think the FAME methodology, however, is great: it helped us solve a problem at our scale, and it got us closer to the hardware side. But we really think we need better tools to develop with. Right now I code everything in Verilog/SystemVerilog [indiscernible], and I have to be frank in saying this: it is not a very productive way to build. We are actually working on a second version; maybe some DSLs can help us. Like I say, I am a hard-core Verilog guy, and even I will say this is hard. So you have to think about whether you want to build a system relying just on Verilog. So again, Diablo is available from this website and you
can go look for more information. We plan to publish all the source code out there and there
will be an upcoming paper. I am currently building the second generation of Diablo hardware, and I will talk about how we scale to 10,000 CPUs with enough memory in the next version. So
with that, I will be happy to take questions.
>>: Any questions?
>>: So you said you tried to avoid all the IP from the FPGA vendors. Is there a particular reason
you…
>> Zhangxi Tan: Yes -- so that it actually works. [laughter] That's my frankest answer. The vendor DRAM controller was not reliable enough; it has lots of clocks, and the clocking [indiscernible] for the [indiscernible] controller is not reliable enough. Besides, I don't like their controller because there are lots of hard-coded routing constraints in those things. Also, the reason we designed our own [indiscernible], for example the SERDES, is that you are not really using all of the features in the [indiscernible] drivers, and you want to make sure the whole thing is reliable enough, so we built our own retransmission and reliability schemes there; basically the hardware is like TCP with a window size equal to one. And the control protocols are similar [indiscernible]. Once you hook up lots of Ethernet cables to the servers, you will see the software running on the actual servers drop [indiscernible] packets, so there must also be some reliability guarantee, and I believe that guarantee belongs in your hardware design. So you need to redesign the whole thing; it's not just one piece. So we decided not to grab off-the-shelf [indiscernible]; it would never work -- I mean, it works for some people, but not for me. Okay.
>>: You said that you had 128 megabytes per [indiscernible]. Does your target OS support
virtual memory?
>> Zhangxi Tan: It's full Linux. It will support anything that Linux supports.
>>: So it supports the demand [indiscernible]?
>> Zhangxi Tan: [indiscernible] turn it off, in this case. Yeah.
>>: Okay. But as you turn it on you -- so is it possible that you could virtualize at the host level
to allow you to have [indiscernible]
>> Zhangxi Tan: Yes. Actually, we are now doing the second version, which will give you [indiscernible] for the simulated servers. The new server is not going to be a single-core model; it's going to be multicore, and the idea, as you mention, is to use external flash as the main storage to provide multiple gigabytes per simulated node, and use the DRAM as a cache.
>>: Okay. And then in order to ensure accuracy would you solve your [indiscernible]
>> Zhangxi Tan: [indiscernible] using the DRAM as a cache. This fits the host-multithreaded design, so the latency will hopefully be hidden. And based on our current setup -- say, as I mentioned, 1000 nodes on three boards versus six boards -- I don't see the raw simulation performance going down, so that's great. As for the synchronization between models that run on different FPGAs: since we control the protocol design ourselves, the round-trip time is about 1.6 microseconds including the payload. I don't think any of the existing parallel software simulators will beat that, and to the best of my knowledge there is no equivalent software simulator that lets you do the same thing.
>> Jim: I think you are going to talk to most of these people. So why don't we thank our
speaker. [applause]