>> Jim: Okay. It's my pleasure to introduce Zhangxi Tan, who is a PhD [laughter] from UC Berkeley and is actually pretty well known. I think you've been here several times and you've got a Microsoft Fellowship. >> Zhangxi Tan: No. >> Doug Burger: No? You didn't? Well, you should have [laughter]. And he has done a lot of very interesting work using FPGAs to simulate data centers, and he'll talk about that work today. >> Zhangxi Tan: Thanks, Jim, thanks for the introduction. The talk runs about 50-ish minutes, so feel free to ask questions along the way and I'll answer what I can. The topic I'm going to present is my dissertation work, done at Berkeley with Professors Krste Asanović and David Patterson: DIABLO, Using FPGAs to Simulate Novel Datacenter Network Architectures at Scale. Here is the agenda for my talk. I'm going to briefly cover the motivation for this work. Since we are building an emulation testbed and we treat the data center as a whole computer system, during the course of the project we systematically looked at how computer systems should be evaluated, and we have our own methodology. Then I'll get to the main part of the talk, DIABLO, which stands for Datacenter-In-A-Box at LOw cost; I'll cover the implementation and the architecture, and then I'll show you three very exciting case studies we've done, some of which you couldn't do with current tools. Finally, since this was a really long system-building project, there are lots of experiences and life lessons we learned; I'll talk about those and then conclude. So the data center network is a very interesting topic. The first thing I want to show is a traditional data center network architecture, from a Cisco perspective; this slide is from Cisco. As you can see, a data center network is usually a tiered architecture with multiple tiers. Down at the bottom are racks of servers; each rack contains roughly 40 machines and has a top-of-rack switch. The top-of-rack switches feed into what people call aggregation switches, and there are multiple levels of those. Just to give you an idea of the size of the whole network, in a typical data center you will see 10,000 or more machines plus thousands of switches, so this is a really, really large network, no doubt about that. The data center networking infrastructure is clearly very important. This slide I borrowed from Jim Hamilton at Amazon. At a high level, his point is that the data center network infrastructure is the SUV of the data center, an analogy to cars: an SUV is large and heavy and gets poor gas mileage. First of all, the network infrastructure is expensive; it accounts for the third-largest cost of the whole data center. More importantly, it's not just about cost; it's about supporting applications and data-intensive jobs like MapReduce. So it is very important, and therefore lots of people are looking at this networking design space: some on the research side, including Microsoft I believe, and some on the product side, like the efforts at Google and now Facebook. They are looking at 40-gigabit and even 100-gigabit switch designs. If you look at all of these designs, you will find lots of distinct design features, for example in the switch designs themselves.
People have different packet buffer micro-architectures, and even the sizing of the buffers is completely different. Some of the switches have programmable flow tables and some don't. Even the applications and protocols people support are quite different: some support different versions of TCP, congestion notification bits, and so on and so forth. However, all of the designs look pretty nice; everybody claims they have the best design compared to the others. The problem is that if you look at all of the work done so far, there is one issue that has been largely ignored, which is the evaluation methodology. If we look at how people evaluate their novel designs, there are a couple of issues. The first is scale. The scale people work at today is way smaller than the real data center network I showed you in the first slides. If you look at the experiments people have done so far, you typically see a testbed of fewer than 100 nodes. For academia things are even worse, because they don't have much money: they use fewer than 20 machines, run some virtual machines on them, and pretend they have lots of nodes. The second issue is the programs people use. People tend to use synthetic programs or micro-benchmarks; some people even claim micro-benchmarks are enough because they can drive the switches to full load, and that's a really bad assumption. If you look at real data center applications, what they are running is web search, Gmail, and MapReduce types of jobs. Last is the switch design itself. Networking researchers today try to use off-the-shelf switches, and unfortunately those switches are really proprietary and there is almost nothing you can change. For example, you can't change the link performance or delays, or even the sizing of the buffers. You want to change a commercial switch? Okay, good luck; go talk to [indiscernible] and see if they allow you to do that. It's not easy. So here I raise what I think is a really valid question: how do we enable network architecture innovation at large scale, on the order of thousands of nodes, not hundreds, without spending lots of money building a real large-scale data center infrastructure? Before I started the project I visited a couple of places, including here, and also talked to people at large companies like Microsoft, Google, and Amazon, and asked them what features they would want for networking evaluation. One common conclusion is that networking evaluation is at least as hard as evaluating other large-scale system designs. First of all, you need decent scale: data centers are on the order of tens of thousands of nodes, and you need at least a couple of thousand nodes to see the really interesting phenomena. Secondly, the real things you want to look at are the switches, and they are massively parallel. If you want to simulate designs at this scale, you really need decent performance in your simulators. Think about a large, high-radix switch: usually 48, sometimes close to 100 ports. Think about the number of flows it supports, the virtual queues [indiscernible], the concurrent events happening every clock cycle that you are trying to look at. And if you look at the network itself, it really operates at nanosecond time scales.
It really is that fine-grained. Think about the typical case for 10-gigabit Ethernet: transmitting a 64-byte packet on a 10-gigabit link takes about 50 ns, which is comparable to a DRAM access. That means if you build a simulator, you will see lots of very fine-grained synchronization during the simulation. Last but not least, people really want their production software; they need extensive application logic. People are not content with micro-benchmarks at all, especially on the product side. So here is my proposal: I'm going to use FPGAs. This may sound crazy. For those of you who are not familiar with FPGAs, let me give you a very brief overview. This is from the digital circuit design class at Berkeley. An FPGA is a reconfigurable architecture: there is an interconnect and an array of functional blocks. If you zoom into a block, you see a structure like this: typically a lookup table with a fixed number of inputs, which can implement any [indiscernible] logic, and a flip-flop to store state in the circuit. People use FPGAs for many things: to build initial hardware prototypes and to do design validation. You can use this highly structured device to build a target hardware system very easily today. My idea is to use FPGAs a little bit differently from people who use FPGAs for compute and people who use FPGAs for verification. What I build are abstract, execution-driven performance models, realistic enough to run the full software stack. Overall, the cost per simulated data center node of the system we built is really low, about $12 per node, whereas the alternatives are really expensive: even if you spend, say, $25 per node to buy lots of cheap machines, you still have to buy the interconnect, and there is nothing you can change there; it is just one implementation. I believe there are also similar efforts in academia to build hundred-node testbeds using Atom processors. As Luiz Barroso from Google put it, the data center is really a warehouse-scale computer, so we treat the data center problem as a computer system problem, and we look at how people evaluate computer systems. The common approach computer architects use is simulation. Before I get into the details, let's get familiar with the jargon and terminology I'll use. In the simulation world people often talk about host versus target. The target is the system actually being simulated; in my case that is a data center design, meaning real servers, switches, the network, and the network interface cards. The host is the platform on which the simulator itself runs; in our case that is the FPGAs, since we use FPGAs as the architecture simulators. If you were using software simulators, it would be x86 machines. We actually looked at many computer architecture and system simulators out there, and we think there are two major types. The first one, the most popular, is purely based on software; we call it Software Architecture Model Execution, or SAME for short. The second one, our approach, uses FPGAs; we call that FAME, FPGA Architecture Model Execution.
Actually, I wrote a paper about this a couple of years ago discussing the differences and why you have to use FPGA-based, FAME-style simulation to explore this design space, and why software alone falls short. First, let's look at the current state of the art, software architecture model execution, SAME. Almost everyone uses it, but there are lots of issues with software simulators. The number one issue is performance. To show the problem, we looked at how computer architects use simulation in their research. We surveyed the papers published at the premier architecture conference, ISCA, a decade ago, and all of the papers published a couple of years ago. The data is slightly old, but I don't think the way people simulate their benchmarks has changed significantly. We looked at the number of instructions people simulated per benchmark across all of the published papers, at the complexity of the target design in terms of the number of CPU cores, and at the number of instructions simulated per core. Here is the interesting thing we found. A decade ago, back in the last millennium, everybody was looking at single-core designs, and the median number of instructions simulated per benchmark was around 267 million. A few years ago, with the move to multicore, everybody was looking at multicore designs, and the median number of cores studied was around 16. Given how the complexity of the systems has grown over the past decade, people do simulate more instructions, close to a billion, 825 million, but if you divide by the number of cores being studied, the per-core number, around a hundred million instructions, is not going up but actually going down. Sadly, think about the clock frequencies people run today: processors a decade ago ran at a couple of hundred megahertz, and recent processors run at a couple of gigahertz. If you convert those 100 million instructions to raw wall-clock time, it is really sad: it translates to less than 10 milliseconds. What does that mean? It means you couldn't even simulate a typical operating system [indiscernible], which is typically around a millisecond. And if you look at data center applications, with I/O and all sorts of things going on, you have to simulate at least a couple hundred seconds, even a few minutes, to see the really interesting phenomena. Another big problem with software simulators is that people tend to build unrealistic models, for example infinitely fast CPUs where every instruction consumes one cycle, which is very dangerous. Therefore we are really betting on FPGA architecture emulation. Just so you know, these are simulators built on FPGAs; they should not be confused with the standard FPGA computers you see regularly, nor with the FPGA-based accelerators people are interested in. These are architecture simulators. When you design a FAME architecture simulator, there are lots of design features and dimensions, and we summarize them along three basic dimensions. The first is direct versus decoupled: decoupled means you can decouple the host clock cycle from the target clock cycle.
For example, you can use multiple host cycles to simulate one target cycle. The second is full RTL versus abstracted RTL: for [indiscernible] verification you tend to use the full, minimally modified RTL, but for FPGA simulation, in the interest of performance, you can use abstract performance models with many FPGA-friendly structures that simplify your design and increase density; combined with decoupled timing [indiscernible], you get a much simpler design than one that spends one FPGA cycle per target cycle. The last dimension is single-threaded versus multithreaded host, and this is really crucial for me to be able to simulate a couple of thousand server nodes with switches [indiscernible]. Let me show you what I mean by a multithreaded host, what we call host multithreading. Say your target is four independent CPUs. One of the easiest ways to build the model is to build four CPU pipelines, but that consumes more resources. A simpler thing you can do is build a single host pipeline with four hardware threads on it. Each hardware thread is dedicated to one of the target CPUs, and a timing model synchronizes all the hardware threads and makes sure that, in terms of the simulation, they behave as if they were running in parallel. As a proof of concept I built a FAME simulator called RAMP Gold, whose goal is to simulate a 64-core SPARC V8 system with a shared memory hierarchy. It is really low cost; it runs on a $750 [indiscernible] board. It also has sufficient hardware features: the memory hierarchy, FPUs, and MMUs, and it is Linux-compatible. We compared it with the state-of-the-art software simulator, Simics, and found it is about two orders of magnitude faster and more efficient. This is also the basic building block for my data center project, Diablo. So next let's look at the Diablo simulator itself. Let me highlight what Diablo is. Diablo is really a wind tunnel for data center networks, built, of course, with FPGAs, and its goals are very aggressive. We are trying to simulate tens of thousands of nodes; in our prototype we have a couple of thousand nodes running, and each node runs a real software stack, not micro-benchmarks. We also focus on the switch interconnect: we can simulate hundreds, even thousands, of switches at all levels, with enough architectural detail and accurate timing. I want to point out that in Diablo real instructions are executing and real bytes are moving through the network; we are not passing fake pointers around. Diablo is also fast enough that, running overnight, it can cover 100 seconds to a few minutes of simulated time. And since we know FPGAs are really hard to program, we prebuilt many runtime-configurable architecture parameters that users can modify from control software, with no need to [indiscernible]. We allow users to change the link performance, the link throughput and latency, and even the switch buffer layout and drop policies, things like that. Of course, all of this is built with the FAME technology I mentioned.
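To make the host-multithreading idea above concrete, here is a minimal sketch in C (purely illustrative; the real RAMP Gold host pipeline is SystemVerilog hardware, and all names and constants below are hypothetical). One host engine interleaves several target CPU contexts, advancing each by one target cycle per round-robin pass:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_TARGET_CPUS 4   /* target CPUs multiplexed on one host pipeline */

/* Per-target-CPU architectural state carried by one hardware thread. */
typedef struct {
    uint64_t pc;            /* target program counter  */
    uint64_t target_cycle;  /* target cycles simulated */
} cpu_ctx_t;

/* Functional + timing step for one target CPU: execute one instruction
 * and charge it one target cycle (stand-in for the real model). */
static void step_one_target_cycle(cpu_ctx_t *c) {
    c->pc += 4;             /* placeholder for real instruction execution */
    c->target_cycle += 1;
}

int main(void) {
    cpu_ctx_t cpus[NUM_TARGET_CPUS] = {{0}};

    /* Host loop: one pass means every target CPU advances by exactly one
     * target cycle, so the targets stay synchronized even though the host
     * serializes them. */
    for (uint64_t host_cycle = 0; host_cycle < 10; host_cycle++) {
        for (int t = 0; t < NUM_TARGET_CPUS; t++) {
            step_one_target_cycle(&cpus[t]);
        }
    }
    printf("target cycles per CPU: %llu\n",
           (unsigned long long)cpus[0].target_cycle);
    return 0;
}
```

The key property is that all target CPUs advance in lockstep, one target cycle per pass, so the host's serialization is invisible to the simulated machines.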
So, as I mentioned, Diablo is all about performance models, and there are three basic models in Diablo. The first is the server model, built with the RAMP Gold simulator. For the server model we currently use a very simple timing model, a fixed-CPI model: every instruction takes one cycle, and memory instructions take a fixed number of cycles. It supports the SPARC V8 ISA; as I mentioned, this is a SPARC-certified design, so we ran the SPARC verification suite. And it runs full Linux: we ran full Linux 2.6 on it, and right now Linux 3.5, which is very recent. The second type of model in Diablo is the switch models. If you look at actual data centers, there are really two types of switches: circuit switches, which come more from the research side, and conventional packet switches, which is what production data centers run. We built both models, as abstracted models, but our focus is more on the switch buffer configurations than on switch routing. The packet switch model is modeled after a Cisco Nexus 5000 switch, plus a Broadcom patent pointed out to me by Microsoft Research. The last model in Diablo is the network interface card model. We realized that the NIC model is really crucial, so we have a very sophisticated NIC model. It supports scatter-gather DMA and zero-copy drivers, many of the performance optimization features you find in production NICs, and it also supports the polling APIs from recent kernels, so this is not trivial. Now that we have these three models, let's see how I map them onto a real data center. Recall from the first slide the abstracted view of a data center; I'm using a fat-tree, Clos-like topology just for illustration. You have thousands of servers, thousands of switches, and [indiscernible]. Instead of building tens of thousands of different FPGA bit files, I only have two kinds of bit files; this is a very modular design. The first type of bit file is used to simulate a rack of servers plus a top-of-rack switch. Given the current setup, which allows for [indiscernible], I was able to simulate about four racks of servers plus their top-of-rack switches with this design. The second type of FPGA design is dedicated to simulating the switches themselves: we use a whole FPGA to simulate one of the monster switches, the really large switch designs, for the aggregation switches, array switches, and datacenter-level switches, so we can really do detailed architectural modeling there. We connect these two types of FPGAs using high-speed SERDES links, and the connections follow the physical topology: say you want a hypercube, then you just connect the FPGAs in a hypercube; there is no restriction, it's just wiring. We actually built this Diablo prototype using multiple BEE3 boards, which were developed by Berkeley and Microsoft Research; you have probably heard of them. Each BEE3 board has four Xilinx Virtex-5 FPGAs, so with six boards I have 24 FPGAs in total. Here are photos of our cluster. We populated all of the boards with the maximum memory, so it has decent memory capacity.
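Going back to the fixed-CPI server timing model described above for a moment, here is a minimal sketch of how that cost accounting works (a sketch under stated assumptions, not RAMP Gold's actual code; the memory-operation latency below is a placeholder, not Diablo's configured value):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical latency parameters for a fixed-CPI timing model:
 * every non-memory instruction costs one target cycle, and every
 * memory instruction costs a fixed number of cycles. */
#define CYCLES_PER_INSTRUCTION 1
#define CYCLES_PER_MEMORY_OP   20   /* placeholder value */

typedef struct {
    uint64_t target_cycles;  /* simulated target time, in cycles */
    uint64_t instructions;   /* instructions retired             */
} timing_state_t;

/* Charge the timing model for one retired instruction. */
static inline void charge(timing_state_t *t, bool is_memory_op) {
    t->instructions += 1;
    t->target_cycles += is_memory_op ? CYCLES_PER_MEMORY_OP
                                     : CYCLES_PER_INSTRUCTION;
}

int main(void) {
    timing_state_t t = {0, 0};
    charge(&t, false);   /* an ALU instruction */
    charge(&t, true);    /* a load or store    */
    return (int)t.target_cycles;   /* 21 with the placeholder constants */
}
```

In Diablo itself this accounting happens in hardware; the sketch just shows the arithmetic.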
It's 384 gigabytes of DRAM in total, but given the number of nodes we are trying to simulate, that's still not that much: roughly 128 megabytes per simulated node. We think that is still enough to run some interesting network applications, and we are working on a second generation to solve this problem. The memory bandwidth, on the other hand, is really good: there are 48 [indiscernible] DRAM channels, so you have roughly 180 gigabytes per second of bandwidth. Each of the FPGAs also has a one-gigabit link; you see these yellow cables? Those are the gigabit links, which we use just for control, so there are 24 one-gigabit links. We also have a few servers, six actually, just to drive all of the boards; they are used for things like the consoles of the simulated Linux instances and to provide some disk functionality. We connect these boards using -- sure. >>: I have two questions. One is, I notice on the first bullet that you use 128 megabytes per node. Do the workloads have to be scaled down to map to that? >> Zhangxi Tan: I think yes. For the memcached workload we're looking at, though, it doesn't actually matter in terms of workload size. >>: Okay. >> Zhangxi Tan: I think this is limited by the board design itself. >>: It seems really small [indiscernible] that size [indiscernible] >> Zhangxi Tan: Yeah. When I talked to some networking [indiscernible] people, they thought that with 100 megabytes you can run most of the interesting protocol studies. But I know that for something like Spark applications you definitely need gigabytes, because the caches alone are very large. >>: So if you scale it down to that size, does the software simulate a bit faster? >> Zhangxi Tan: No. [indiscernible] >>: Okay. And the other question I had is, on a previous slide you showed how you partitioned the system. Is this a balanced system, and if I'm a user do I have to manually make these partitioning decisions to get it through [indiscernible] >> Zhangxi Tan: You don't have to. We just use naïve partitions like this; you don't have to worry about it too much. And I can tell you the simulator's performance is very scalable: when we simulated 1,000 nodes versus 2,000 nodes, we saw zero performance loss. >>: I'm thinking that if I'm trying to evaluate different kinds of network architectures, I might have a different topology altogether, and then how I draw these boxes [indiscernible] some places. >> Zhangxi Tan: I don't think so. We don't play tricks with fancy partitioning; we mostly just use the default one, and that still gives you good performance. Also, I can tell you the simulation performance bottleneck is not simulating the switches; it's simulating the server computation, so that is what you would have to optimize in that case. Okay, so back to this. We connect all of the boards using these cables, running our own custom SERDES protocol; there are lots of them, and each link runs at 2.5 gigabits per second. We are designing a second version that will have a passive backplane to handle all of these connections instead of cables. And overall the whole system is pretty nice.
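As a quick sanity check on the capacity and cost figures in this part of the talk (the 384 GB of DRAM and roughly 128 MB per node mentioned above, plus the ~3,000-node count and per-board cost estimate that come up shortly), the arithmetic is simply the following; only numbers stated in the talk are used, and the code itself is purely illustrative:

```c
#include <stdio.h>

int main(void) {
    /* Figures quoted in the talk; this just re-derives the per-node and
     * cost arithmetic, nothing more. */
    const double total_dram_gb   = 384.0;  /* six fully populated BEE3 boards   */
    const int    simulated_nodes = 3000;   /* slightly over 3,000 servers       */
    const int    boards_28nm     = 12;     /* estimate for a 10,000-node system */
    const int    cost_per_board  = 5000;   /* USD, rough board cost             */

    double mb_per_node = total_dram_gb * 1024.0 / simulated_nodes;  /* ~131 MB */
    int board_cost_k   = boards_28nm * cost_per_board / 1000;       /* $60k    */

    printf("host DRAM per simulated node: ~%.0f MB\n", mb_per_node);
    printf("10k-node estimate: $%dk boards + roughly the same in DRAM ~= $%dk total\n",
           board_cost_k, 2 * board_cost_k);
    return 0;
}
```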
It consumes only a little over 1 kW, about 1.2 kW, but in terms of simulation capacity, what you see here is capable of simulating slightly over 3,000 servers in 96 racks, along with 96 simulated switches. Think about that: there are 3,000 copies of Linux running there, running real applications. It's very interesting. In terms of instruction throughput, the whole system delivers roughly 8.4 billion instructions per second. Now a little bit about the implementation, because there is a lot going on there. What I'm showing here is a die photo of the FPGA design I use to simulate a rack of servers. At a high level, this is a fully custom FPGA design; almost everything on it is our own design, with minimal third-party IP. The only thing we didn't design ourselves is the FPUs. It is a very densely packed design; we stretched the chip to the limit, with about 90 percent of the BRAMs utilized and close to 95 percent of the lookup tables in use. And keep in mind that the FPGAs we use are five years old, 2007 parts; with this year's FPGAs you could potentially fit more pipelines. We run the whole circuit at 90 MHz, with the data path double-pumped at 180 MHz to make sure we have enough ports on the BRAMs. If you look at the design closely, there are basically two partitions. Each partition has one host DRAM controller with 16 GB of DRAM, two RAMP Gold server pipelines, and two switch pipelines. In terms of model sizes, as you can probably see, the NIC model, the switch model, and the processor [indiscernible] pipeline are roughly the same size, and the [indiscernible] is double that size, so you can guess how much more could fit on a larger FPGA. So this FPGA has four server pipelines and four switch/NIC pipelines, plus between one and five 2.5 Gbps transceivers, depending on which FPGA it is and which topology you want to use. So I showed you a prototype with six BEE boards; what if you want to go to a 10,000-node system? Here is a back-of-the-envelope calculation. If we stick with the current technology, which is very old, five-year-old 2007 technology, it probably ends up taking about 22 BEE boards, meaning 88 FPGAs. But if you look at this year's technology, it is really promising: with the recent 28nm generation, simulating a 10,000-CPU system probably needs only 12 boards. If you calculate the cost, assuming each board costs around $5,000 (we are actually doing the board design right now), the total is about $120k, including the board cost and the DRAM cost, which split roughly half and half. Next I'm going to show you three exciting case studies; this is very interesting. The first one is that we used Diablo to simulate a novel circuit-switching network. This is work I did with MSR in Mountain View, with Chuck Thacker; I used Diablo to simulate the initial version of his circuit-switching network. You are probably familiar with Chuck Thacker's work; to bring everybody up to speed, this is a circuit-switching network that works more like ATM, with multiple levels of switches. During my three-month internship there we modeled all levels of the switches.
And we also ran hand-coded Dryad TeraSort kernels with some device drivers. We think this is very interesting because Diablo was able to capture some performance bottlenecks in the early design; for example, we found problems in the circuit setup and tear-down. We also learned interesting things from the software perspective. Chuck's design is a lossless network, so the network itself doesn't drop any packets, but we found that because everything is ultimately handled by the processor, the software processing speed can actually lead to packet loss. The second case study goes back to the traditional packet-switching world: a classic problem people call TCP Incast throughput collapse. This problem was reported by CMU; they claim they found it in real production gigabit network-attached storage. The setup of the problem is very simple; this is the topology: you have a shared switch, a shared [indiscernible] file server, and a number of senders connected to the same switch. You measure the throughput of the bottleneck link, the link between the shared server and the shared switch, and plot the throughput curve against the number of senders, or clients, and you see a curve like this: this axis is throughput and this is the number of senders. Ideally, as you increase the number of senders, the throughput should gradually go up to this peak, but in reality and in practice, this collapsing curve is what you see. People ran network simulations using ns-2 and experiments on small-scale clusters, and researchers concluded that this is caused by small switch buffers together with a TCP retransmission timeout in the kernel that is too long; that was their conclusion. Okay. When I looked at this problem I talked to a couple of people, especially on the industry side, and asked them whether this is a real problem. Some of them told me it is pretty artificial and only happens in this specific setup, so we looked at how people did their experiments, and we found the problem indeed only shows up under a very specific setup: it happens with small block sizes, usually request sizes of less than 256 kB, and it happens only on a switch with shallow buffers. So the first thing we wanted to do with Diablo was to recreate this critical setup and see whether we could reproduce the throughput collapse. This is the result you get from the Diablo simulation. We model a gigabit switch with shallow buffers, and, just to be fair, we are not using Berkeley code: we run the TCP Incast measurement code from the Stanford networking research group. This is the throughput curve you see. We also tried changing various points in the system, scaling up the CPU performance to see whether that has an effect, and changing the OS syscalls the application uses. We didn't see much difference at the gigabit level; this is the throughput collapse, and it matches what you see on a comparable real physical setup. So this is great: we can reproduce the problem. Okay.
Since Diablo is really a simulator, the next question to ask is: what if we upgrade the whole interconnect to 10 gigabit, change the switch, and even change the server performance? How will the system behave if we have faster servers? There are also lots of things you can play with in software. Recall this program: the original is written using standard pthreads plus [indiscernible] blocking syscalls, but if you look at most recent server applications, they use epoll, a more responsive event-notification syscall. So what if we use epoll? Many current servers, like memcached, actually use epoll rather than a thread per connection. What if we make that change and run the same experiment; what throughput collapse do we get? Here I show a simulated 10-gigabit switch with the variations I mentioned, for example simulating 4 GHz servers versus 2 GHz servers. If you look at the orange curves, using the unmodified Stanford code and just changing the switch, this is what you get: you still see a throughput collapse at a very early stage, and there is no significant difference between the 4 GHz and 2 GHz servers, with 4 GHz slightly better but not by much. But once you start looking at the OS, and change the syscall to epoll, you get a drastically different curve on the left-hand side: you see a big difference between 4 GHz and 2 GHz, and moreover the collapse onset point shifts to the right. This was very interesting. First, we're happy because we can reproduce this problem, even if people call it artificial. Second, we found that once you scale up the whole system, say by changing the switch, the behavior shifts dramatically, and looking at the OS implementation and even the application [indiscernible] is extremely important. Much of the existing work on this problem just runs ns-2: there is no computation model and no OS at all. We also think that tweaking the retransmission timeout timer is not really fixing the problem; you have to find the root cause of the throughput collapse, and then develop a methodology, or perhaps a new system design, to solve it. The last case study is the one we think is most interesting: we ran a memcached service on a really large-scale setup. We think this is exciting because it reproduced some of the phenomena that people see in real large-scale production environments, and it is probably the first time academia has been able to look at a problem at the scale of thousands of nodes. Previously, when I talked to people, academia could only afford to look at a couple of hundred nodes; now we are moving a whole order of magnitude up, thousands versus hundreds. Memcached, as you probably know, is a very popular distributed key-value store used by many large websites, say Facebook, Twitter, and pretty much any large website; it is basically what its name says, a memory cache. The code we use is unmodified memcached. This is not some memcached-tiny or mini; this is the unmodified code, downloaded from their site and compiled for SPARC on our platform, and the clients we use are coded with [indiscernible] on top of libmemcached.
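To illustrate the difference discussed above between a thread-per-connection blocking receive path and an epoll-based event loop, here is a minimal sketch (this is not the Stanford incast code or memcached's actual event loop; error handling is omitted, and handle_data is a hypothetical application callback):

```c
/* Minimal epoll-based receive loop, in the spirit of the epoll variant
 * discussed above. Assumes listen_fd is a non-blocking, bound, listening
 * TCP socket. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd, void (*handle_data)(int fd)) {
    struct epoll_event ev, events[MAX_EVENTS];
    int epfd = epoll_create1(0);

    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        /* One thread waits on all sockets at once, instead of one blocked
         * thread (and one kernel wakeup path) per connection. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);   /* new sender */
                ev.events = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                handle_data(fd);    /* read whatever this sender has queued */
            }
        }
    }
}
```

The contrast with the pthread-plus-blocking-syscall version is that a single thread services all senders as data arrives, rather than one blocked thread per sender waiting to be scheduled.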
That is an application-level library used to handle application-level [indiscernible] and things like that. In terms of the workload, ideally we would want a real workload, but as a university we have only limited access to logs. Facebook, however, published workload statistics in last year's SIGMETRICS paper, so we studied the paper, built our own workload generator, and validated it against the published Facebook data. We think this is probably the best we can get. The first thing we did with this memcached setup was to try to validate our results at small scale. Validating at large scale is pretty hard [indiscernible], but at small scale we can build a roughly equivalent physical system, run the same experiments and the same programs on both systems, and see whether we get the shape of the curves right. For the validation we built a 16-node cluster with 3 GHz Xeon processors, pretty ordinary, and a 16-port Asante IntraCore switch. In terms of configuration, out of the 16 servers we used two as the memcached servers and the rest as clients, scaled the number of clients from 1 to 14, and plotted the throughput and latency curves and so on. Memcached [indiscernible] application [indiscernible] also has lots of configuration parameters you can change on the server side: you can change the protocol, TCP versus UDP, to see which is better, and you can change the number of worker threads in the server, say from four threads to eight. Keep in mind that the simulated server is a single-core model running at 4 GHz with a fixed-CPI timing model, versus this real machine; it is a different ISA and different CPU performance. We don't expect to get the absolute numbers right; what we are looking at is the shape of the curves, the trends, and the scaling, and that is what we care about most. The first thing we looked at for validation, from the server's perspective, is the server application throughput. This slide shows the application throughput measured in Diablo, which is simulated, versus what we measured on the real cluster we set up. The x-axis is the number of clients, from 1 to 14; the y-axis is the application throughput measured at the memcached [indiscernible] interface, in kilobytes per second. You can see that under the different setups Diablo successfully reproduces the trend of the curve, and to our surprise, because this workload is more I/O-bound, the absolute values are also close to the real thing, so we're happy. Now let's look at the client side. For each client we logged all of the requested queries and measured the client query latencies, and we made a similar comparison to the previous slide: again, the right-hand side is the real cluster and the left-hand side is Diablo. As you can see, we again get most of the trend right; most of the queries finish reasonably fast, in less than 100 microseconds. The only difference is that the latency in the Diablo setup is probably slightly better than the real setup, because we didn't calibrate our models. If you want absolute values you can definitely calibrate the models and get a closer curve, but in terms of the trend of the whole system, this is very good.
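For a sense of what the workload generator mentioned above might look like structurally, here is a toy sketch (the distribution shapes, constants, and helper names are all placeholders, not the parameters from the Facebook SIGMETRICS paper or the actual generator):

```c
/* Toy memcached workload generator in the spirit of the one described above:
 * draw key and value sizes and the GET/SET mix from simple distributions.
 * The distribution shapes and constants are placeholders, NOT the published
 * Facebook parameters. */
#include <stdlib.h>
#include <math.h>

typedef struct { int key_len; int val_len; int is_get; } request_t;

/* Sample from a crude heavy-tailed (Pareto-like) size distribution. */
static int sample_size(double scale, double shape, int max) {
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* in (0,1) */
    int s = (int)(scale / pow(u, 1.0 / shape));
    return s > max ? max : (s < 1 ? 1 : s);
}

static request_t next_request(double get_ratio) {
    request_t r;
    r.is_get  = ((double)rand() / RAND_MAX) < get_ratio;   /* e.g. ~0.9 GETs */
    r.key_len = sample_size(16.0, 2.0, 250);                /* bytes, placeholder */
    r.val_len = sample_size(100.0, 1.5, 1 << 20);           /* bytes, placeholder */
    return r;
}

int main(void) {
    request_t r = next_request(0.9);   /* draw one request, just to exercise it */
    return r.is_get;
}
```

A real generator would also reproduce the published inter-arrival times and key-popularity skew; the point here is just the overall structure.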
The next experiment is the most exciting thing we have seen so far. We took Diablo and really tried to study a problem at large scale, scaling out to 2,000 nodes. Since this is a simulator we can do many tricks and play with many hardware setups: you can simulate different interconnects. In our case we look at a gigabit interconnect with a regular microsecond-level port-to-port latency, and also at what happens to the application if we change the whole interconnect to 10 gigabit and improve the switch by a factor of 10 in everything, throughput and latency. I know that setup is a little aggressive, but it gives you an idea of what we can do with Diablo at this scale. When we scaled up this experiment we tried to be moderate: we kept the server-to-client ratio, so we still have two memcached servers per simulated rack with the rest as clients, and we wanted all of the server loads to be moderate; we didn't want to push everything off a cliff, because that's not how people run things today. The server utilization we measured in the simulations is roughly 35 percent, and there is actually no packet loss in this experiment. One problem at large scale is what people call the request latency long tail. It is a known problem; Luiz Barroso talked about it about two years ago in his FCRC keynote. We wanted to use Diablo to see whether we could observe a similar pattern at large scale. This slide shows the client request latency distributions. These are PMF graphs; the x-axis is log scale, in microseconds, and the y-axis is the frequency. This graph tells you many things. There are results for the 10-gigabit interconnect, the dashed lines, and for the gigabit interconnect, the solid lines. The first thing is that we do see the latency long-tail problem: most queries finish very quickly, on the order of 100 microseconds, but there is still maybe 1 or 1.5 percent of them that, for some reason, finish much slower than the rest, two orders of magnitude slower. Also, for the gigabit switch we classify the queries into three categories based on the layout of the system: some go to a memcached server in the local rack, some have to traverse one aggregation switch, and some traverse two. We [indiscernible] the latency distribution for each type of query. Once you have a sizable system, most of the queries actually go off-rack, so that is the common case and that is what dominates. You also find that if a query traverses more switches, like this one, its peak gets lower [indiscernible]; it means that the more switches you traverse, the more variation your queries have. I think Luiz wrote a paper this year talking about the cost of this latency long tail; one of the realizations is that once you have more switches you have more non-determinism in your system, and you need techniques to get that back under control.
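For readers who want to reproduce this kind of tail analysis from their own query logs, here is a minimal sketch of the percentile computation (illustrative only; the percentile choices and data are placeholders, not the analysis behind the figures in the talk):

```c
/* Compute tail percentiles from a log of per-query latencies (microseconds),
 * the kind of summary behind the PMF/CDF tail discussion above. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double percentile(const double *sorted, size_t n, double p) {
    size_t idx = (size_t)(p * (n - 1));     /* nearest-rank, good enough here */
    return sorted[idx];
}

int main(void) {
    /* Placeholder data; in practice, read one latency per line from a log. */
    double lat_us[] = { 85, 90, 92, 95, 99, 110, 130, 150, 400, 12000 };
    size_t n = sizeof(lat_us) / sizeof(lat_us[0]);

    qsort(lat_us, n, sizeof(double), cmp_double);
    printf("p50 = %.0f us, p99 = %.0f us, max = %.0f us\n",
           percentile(lat_us, n, 0.50),
           percentile(lat_us, n, 0.99),
           lat_us[n - 1]);
    return 0;
}
```

The same nearest-rank idea extends to the 96th-to-100th-percentile zoom discussed next.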
Another great thing is that you can compare the results with the 10-gigabit switch versus the 1-gigabit switch. Keep in mind that everything is 10x better in the 10-gigabit setup. It does help with the latency long tail, and fewer requests finish slowly, but if you look at the actual values, the requests served by the 10-gigabit interconnect are not 10x better in terms of application latency; it is no more than about 2x. Again, look at Luiz's talk: he mentions that if you break down where time goes in a query, the time spent in the switch hardware accounts for only a very small portion of the overall latency; a lot of the time is spent in the kernel stack. Diablo again helps identify this, and once again, the OS networking stack and the computation are very, very important. So, since we have this vehicle, we can look at the problem at large scale. One big question we asked is the impact of system scale on this long tail; this is what makes Diablo unique. In the previous slide you saw the PMF of the distribution; now let's look at the CDF. This takes one curve from the previous slides. If you are Google, you are not actually interested in the part where most queries finish reasonably fast; what you are interested in is this part, the tail of the request distribution. This is what makes the whole thing interesting in terms of revenue and everything else. So let's zoom in on this CDF curve a little and answer the question: does a larger system hurt or help the latency long-tail problem? Here I zoom in on the tail part of the curve, from the 96th percentile up to 100. There are lots of data points, and I only show one configuration because the other configurations look similar; this is just to show the idea. The configuration here is the 10-gigabit interconnect running the UDP protocol. Ideally you want the curve closer to the blue one, which means fewer requests fall into the tail. But if you plot the curves at 10 gigabit, you see a clear trend: this is 500 nodes, this is 1,000 nodes, this is 2,000 nodes. The more nodes, the more servers you have in your system, the more pronounced the tail problem becomes. So that's the first observation. Okay. Next: academia in particular tends to look at systems of just hundreds of nodes because of physical limitations, and then tries to draw conclusions; I think many people draw conclusions from just a few hundred nodes. So let me ask this question: would you draw the same conclusions if I gave you a platform capable of looking at the problem with thousands of nodes, in this case looking at the tail? Do we get different answers with a larger-scale system? One very simple experiment we can perform is to change the software protocol used by the servers, say TCP versus UDP: given a specific hardware setup, which is better in terms of minimizing the long tail? This is a very simple experiment. Let's first look at the experiment we did with the traditional gigabit interconnect.
Again, I plot the zoomed-in tail of the CDF, and I keep the coloring scheme consistent: red represents the UDP protocol and blue represents TCP. Clearly, with 500 nodes on a gigabit interconnect, UDP is the better choice, because its curve is closer to the top. Now let's go a little further, to 1,000 nodes. Very interesting: at 1,000 nodes you don't see a significant difference between TCP and UDP, with TCP slightly better. Now let's move further, to 2,000 nodes. This is the result for 2,000 nodes, and in this case TCP for sure outperforms UDP. So I could draw the conclusion that TCP is better at scale. But what if we change the interconnect to 10 gigabit? Do I still reach the same conclusion, that TCP is the much better choice for the latency long tail? Let's redo the same experiment. First, again, 500 nodes: here you can see TCP already outperforms UDP for the 500-node setup at 10 gigabit. Now let's go up to 1,000 nodes. This time the conclusion is reversed: apparently UDP is better. Now let's move to 2,000 nodes and see whether UDP is still the better choice. Okay, this is the curve we get at 2,000 nodes: there is not much difference; they are almost the same. This is really interesting, because we think the long-tail problem is a complicated problem: multiple factors contribute to it, and it is not a linear function, so you can't definitively say which protocol is better. I believe many other things happen at scale, and Diablo is definitely capable of helping with that; when we did this large-scale experiment we also found many other interesting things. >>: In these experiments you see these effects. Do you know why? >> Zhangxi Tan: I have some rough ideas. The problem, as I mentioned, is that there are lots of software queues in the networking stack, and the way the current networking stack handles them is not really deterministic. Look at the drivers, not just the switch itself: the switch model matters to a degree, but the software in the networking stack is the big problem. The way it is designed, it is optimized for the internet; it is not deterministic. It uses polling [indiscernible] for most of the queues, and it is really up to the OS to decide when to drain the queues and notify the application; it is not the case that a packet arrives and goes straight to the software. So I can't tell you exactly why this happens, but I believe there is more you can dig into; it is not caused by a single thing, it is probably a system design issue. So let's look at the other issues. Sure? >>: How much observability do you have? >> Zhangxi Tan: Anything you can observe with software, you can do, and we also use hardware... >>: So anything you can see in software? >> Zhangxi Tan: Yes. This is standard Linux; whatever you can do with Linux, you can do here. >>: Right. Can I put the FPGA [indiscernible] switch and all that? >> Zhangxi Tan: Yes, you can actually do that. I didn't use [indiscernible] for this, but I can give you a distribution of how the queues drain over time, and that is done through the hardware performance counters. So you can, yeah.
We have that debugging infrastructure there, but for these experiments I wanted to spend my time finishing up, so I didn't look at those. But yes, it is definitely possible; you can see more than you could on a real switch design. This is an architecture simulator, so you can probe whatever you want, and the nice thing about FPGAs is that the debugging infrastructure runs in parallel with the main logic, so it adds zero performance impact, which is good. Okay, let's look at some other, smaller issues. The first one: if you look at the blog Facebook published about their memcached choices, Facebook says they run UDP, and the reason given is that TCP consumes more memory than UDP. We actually didn't see this: we looked at the server memory utilization at scale, and TCP and UDP are pretty much identical. We think what really matters is the dynamic behavior of TCP versus UDP; the memory consumption could be caused by the servers not being evenly balanced, since there are more in-flight transactions to handle, but you can't simply conclude that one protocol is definitely better. It is really about load balancing. Another thing we found: when you look at how people, especially systems and networking folks, do these kinds of experiments, they tend to focus on the transport-level protocols. They are not happy with vanilla TCP, so they do lots of hacks, control-theory-based transport protocols, and they try to hack the kernel, say tweaking the timeout values in the protocol. But we found cases where vanilla TCP might do just fine. What you really should focus on is not just the protocol itself: you have to focus on the CPU processing, as in the TCP Incast case, on your NIC design, and even the OS and application logic are crucial. Also, to come back to Eric's question: when you start building the system, even just writing the driver, you figure out that the way the current networking stack processes packets is very, very complicated. There are lots of queues and buffers just on the software side; it is not just a buffer in a switch. There are more software buffers, and they are more nondeterministic, in the software stack. So why not just get rid of those buffers [indiscernible] and build a supercomputer-like interconnect directly on the [indiscernible] fabric? We also saw some other interesting things, for example when we changed the interconnect hierarchy: sometimes we have one layer and sometimes two. We found that adding another layer, a datacenter-level switch, affects the server DRAM usage, and I think that reflects changes in the flow dynamics. This is also very interesting to see at large scale. So in conclusion, we believe that when evaluating data center network architectures, looking at the OS, the application logic, and the computation is definitely crucial; we need that. Second, you can't generalize results from 100 nodes to a couple of thousand nodes; you see completely different things at the couple-of-thousand scale. And we believe Diablo is good at generating these relative numbers, as opposed to absolute numbers, because what we capture is the trend.
And we think this is a really good way to explore the design space at large scale before you really invest, shelling out a couple of million dollars to build a real system. I've obviously learned a lot while building this; it was a multi-year effort, by the way. Diablo, as you can see, is designed for large-scale systems, capable of simulating thousands of instances. Some people ask: what if I'm only interested in things happening within a rack; will Diablo still help me, since FPGAs are supposed to be slow? The answer is yes. The reason is that we have about 3,000 instances on this prototype, so you can potentially run many experiments in parallel; all of the experiments we ran so far finish overnight. Think about having a real physical testbed: sometimes you can't afford the cost of building multiple racks, so you have to run your experiments sequentially, and overall that is not necessarily faster, certainly not significantly faster, than running in the simulator. Once you have 3,000 simulated instances, you have massive simulation bandwidth, and that changes the whole story. The second thing is that because the goal of Diablo is very aggressive, we are running real software, not the microkernels researchers often use. We are bringing up the whole stack, the shell and the kernel, and we found lots of issues in the software; real software has bugs, right? We saw very interesting things when we brought up Linux. A couple of times we had to change the hardware just to work around Linux issues, because when we changed the software the problem would just show up again, so we decided, okay, we have to change the hardware to make sure we can run this legacy Linux code. We also found that Linux is really a product of hackers: the software hackers don't actually follow the hardware spec. When we designed the processor for Diablo we followed the specification from [indiscernible]; we followed the spec. One of the things we found involves the [indiscernible] processor state register: some of the bits in that register are reserved, meaning they are for future use and software shouldn't touch them, but current SPARC processor implementations [indiscernible] leave some of those reserved bits readable and writable, and Linux actually uses one of those bits to store really crucial information, for example to indicate whether the processor is currently in OS or application mode, which is used in handling the kernel stack. We found this purely by luck, because we followed the hardware spec; the hackers, no, they have a machine, they just code, and if it works, it works. Also, Diablo itself, the six-board FPGA system, fills half a server rack, so this is a massive-scale simulation on real machines. Think about the number of DRAM chips: if you plug in Google's published [indiscernible] error rates for DRAM, you expect real soft errors in the system, and we actually see DRAM errors. We also see errors on the SERDES links; we have lots of SERDES links, they are not 100 percent reliable, and you run into errors every day for sure.
Also, because we use the memory in different ways, some of the nodes just die during the simulation. We think the FAME methodology, however, is great: it helps us solve a problem at this scale and gets us closer to the hardware. But we really think we need better tools to develop it. Right now I code everything in Verilog/SystemVerilog [indiscernible], and I have to be frank: it is not a very productive way to build this. We are working on a second version, and maybe some DSLs can help us. Like I say, I am a hard-core Verilog guy, and even I will say this is hard, so you have to think about whether you want to build a system relying on Verilog alone. Again, Diablo is available from this website, where you can look for more information. We plan to publish all of the source code, there is a paper upcoming, and I am currently building a second generation of the Diablo hardware; I will talk about how we scale to 10,000 CPUs with enough memory in the next version. With that, I'm happy to take questions. >>: Any questions? >>: So you said you tried to avoid all the IP from the FPGA vendors. Is there a particular reason you… >> Zhangxi Tan: Yes: ours actually works [laughter]; that's my frank answer. For example, the vendor DRAM controller was not reliable enough; it has lots of clocks, and the clocking [indiscernible] for the [indiscernible] controller is not reliable enough. Besides, I don't like their controller because there are lots of hardcoded routing constraints in it. The other reason we designed our own [indiscernible], for example the SERDES, is that you are not really using all of the features in the [indiscernible] drivers, and you want to make sure the whole thing is reliable enough, so we built our own retransmission scheme, our own reliability scheme; basically the hardware link is like TCP with a window size equal to one. The same goes for the control protocols [indiscernible]: once you hook up lots of Ethernet cables to the servers, you see that the software running on the actual servers will drop [indiscernible] packets, so there must be some reliability guarantee, and I believe that guarantee belongs in the hardware design. So you need to redesign the whole thing; it is not just one piece. Don't just grab the off-the-shelf [indiscernible]; it will never work. I mean, it works for some people, but not for me. Okay. >>: You said that you had 128 megabytes per [indiscernible]. Does your target OS support virtual memory? >> Zhangxi Tan: It is full Linux; it supports anything Linux supports. >>: So it supports demand [indiscernible]? >> Zhangxi Tan: [indiscernible] turned off, in this case. Yeah. >>: Okay. But if you turn it on -- so is it possible that you could virtualize at the host level to allow you to have [indiscernible] >> Zhangxi Tan: Yes. Actually, we are now doing the second version, which will give you [indiscernible] for the simulated servers. The new server model is not going to be single-core; it is going to be multicore, and the idea, as you mention, is to use external flash as the main store, to provide multiple gigabytes of memory per simulated server, and to use the DRAM as a cache. >>: Okay.
And then in order to ensure accuracy would you solve your [indiscernible] >> Zhangxi Tan: [indiscernible] using the DRAM as a cache. The host is a heavily multithreaded design, so the latency will hopefully be hidden. And based on our current setup, as I mentioned, 1,000 nodes on three boards versus six boards, I don't see the raw simulation performance going down, so that is very encouraging. As for the synchronization between models running on different FPGAs, since we control the protocol design ourselves, the round-trip time is about 1.6 microseconds including the payload. I don't think any of the existing parallel software simulators can beat that, and to the best of my knowledge there is no equivalent software simulator that lets you do the same thing. >> Jim: I think you are going to talk to most of these people, so why don't we thank our speaker. [applause]