>> Madan Musuvathi: Hi, everyone. Thanks for coming. I am Madan Musuvathi from Research and Software Engineering, and it's my pleasure to introduce Milind Chabbi today. He joins us from Rice. He is graduating in a minute, and then after this, he's going to be joining HP Labs. He is very interested in all aspects of high-performance computing, from parallel programming, compilers, and performance, and today he'll be talking about software support for efficient use of modern parallel systems. Milind. >> Milind Chabbi: Thank you. Thanks for the introduction, and welcome to this talk. I'm sure there are more people joining online. The title of this talk, as you said, is Software Support for Efficient Use of Modern Parallel Systems. In some simple terms, probably, this is performance analysis. I used to be in Microsoft, and back in 2010, one day, I felt more adventurous. I used to work in Microsoft .NET, CLR, JIT compiler, and when I felt adventurous, drove to Glacier National Park -- actually just packed up everything and drove my car. From there, went to Rocky Mountain National Park. Further down, I drove and ended up at Rice. That took me nine days; I drove 3,000 miles. And it was supposed to be a nice place to start some research in compilers, and we'll see how that has turned out so far. And I'm quite proud of all the Ship Its that Microsoft gave me. I don't know if you can see it -- yes, quite proud of those. All right, so talks in high-performance computing, computer architecture, compilers, always begin with paying some homage to Gordon Moore and his prediction about growth of transistors, but you're all aware of Gordon Moore's prediction of growth of transistors, and so I will skip that and go into something that's more interesting, somewhat controversial. This is from Proebsting, who used to be a researcher at MSR. And he made this corollary to Moore's law, which states that improvements to compiler technology double the performance of typical programs every 18 years. In about that much time, you would expect 500 to 1,000X speedup in hardware, whereas in an application, you will not see so much. What this is saying is if you took your modern hardware and your favorite application and compiled it with today's compiler and an 18-year-old compiler, the difference in performance would be only about 2X. In the meantime, the typical shelf life of hardware is very short. Top 500 supercomputers remain at that level for a very short time. The typical lifespan of hardware is about four years. As a result, typical scientific applications achieve only 5% to 15% of peak performance on today's microprocessor-based supercomputers, and that should be true for many applications, not HPC applications alone. So there is a wide chasm between what hardware can offer and how much software makes use of it. Modern -- yes. >>: So you have -- if I change the architecture and I come up with some fancy new architecture that has these things called registers, you as a compiler say, hey, I've got this really good idea called register allocation. Then, my compiler technique for that particular architecture will clearly outweigh Proebsting's law. >> Milind Chabbi: Your register allocator will be done in four years? >>: Sure. I'll make a heuristic version of it. >> Milind Chabbi: And will it work for the next architecture? >>: No. It'll work for the architecture that I'm running on right now. >> Milind Chabbi: If you do it within four years, go for it. >>: But I want to go back to your slide, previous slide.
So I think the gap -- the scientific applications are a place where people care most about performance, and that's the best gap you can get. And most applications, like JavaScript and Java -- >> Milind Chabbi: The gap is there. >>: And they don't even get 5% to 15%. They get like -- >> Milind Chabbi: They don't even get -- like 2% to 3%. >>: Like .001% of what the peak of the machine can be. So how come people are still using computers and they're happy with that? >> Milind Chabbi: Compared to not having anything, having something is better. And for a long time, we all relied on clock speeds going up and putting more transistors on, and how long is that going to carry forward? And there are efforts being made to put JavaScript on accelerators these days, to make better use of JavaScript. >>: You can always use performance, but I would say when the clock stops, this gap can be closed and other gaps can be closed, and that's the only way to get -- >> Milind Chabbi: Can be closed -- well, closed is a strong term. It may not continue to widen as much as we are seeing for a single thread of execution. But if we are adding more cores, are we making use of those for these JavaScript kinds of applications? >>: Well, not for JavaScript, because it's -- >> Milind Chabbi: So in those terms, if you count it as the amount of FLOPs available on a processor, there is still a widening gap, because the FLOPs available are increasing. >>: Well, if the hardware changes less frequently -- like the typical shelf life of a top-500 supercomputer is four years -- it's going to get longer, because we're running out of things to cram onto it to make them go faster. >> Milind Chabbi: But the architecture is still changing. How do you lay it out? Do you put accelerators? Are accelerators discrete? Are they on chip? So in these ways, architecture is still changing. Are they heterogeneous? Are they homogeneous? So these things are still continuing to change, and they are changing at a rate faster than software can catch up with them. >>: That's right. I think that's the most important aspect, this hardware changes faster than compilers or applications can keep up. >> Milind Chabbi: Right. >>: And that seems to be getting worse, because the way it's changing is not clock speed. It's other stuff that's even harder to program. >> Milind Chabbi: Yes, that is the thing -- architectures are changing. We use heterogeneity more frequently than we did before. There are a few heavyweight cores in modern processors, which are latency optimized, and many lightweight cores that are throughput optimized. Modern architectures have a deep memory hierarchy. NUMA is very common, and sometimes memory is even explicitly managed. For example, Intel's Knights Landing processor that will be deployed at NERSC will have special memory that you can manage however you want to use it. Power is a limiting factor. Dramatic clock rate increases are a thing of the past, and DOE expects -- the Department of Energy expects more than 25X power efficiency to reach exaflop. Similarly, software is also complicated. To deal with complications in software, we use modularity, which is a hierarchy of abstractions, made of libraries and functions. They have the advantage that they give reusability and ease of understanding a complex system. For example, this is an architect's view of how abstraction should look, but reality is different from that. In reality, the abstractions are much deeper. They are delicately balanced, and they interact in much more complicated ways.
Abstractions, and the interactions between them, introduce redundancy. By redundancy, I do not mean resiliency. Resiliency is good. Redundancy is the overhead that comes with abstractions, and abstractions are written insensitive to the context in which they execute. Abstractions are context and input insensitive, so they are not heavily optimized for a particular workload. Inefficiencies in software arise due to various causes. Some of these are hierarchies of abstractions, insufficient parallelism, load imbalance, developers' inattention to performance, poor choice of algorithms and data structures, ineffective compiler optimizations -- there may be times when a compiler may do more harm than good -- and sometimes excessive data movement. If I want to classify them, I tend to see them in these three top classes. One is resource idleness -- for example, insufficient parallelism or load imbalance. Wasteful resource consumption, where the resources are not idle -- you are making use of them, but that's not making progress towards the computation. It's not getting closer to the end result. And hardware-software mismatch. For example, hardware may provide a NUMA architecture; software assumes a flat memory hierarchy. That's a mismatch between software and hardware. So my research interest is in achieving top performance on modern architectures. To do so, I detect performance problems. I am interested in knowing why we are losing performance, where and by how much. I build effective performance analysis tools that pinpoint performance losses and aid developers in identifying problems in the source code so that they can easily fix them. Having found these problems, I have provided solutions for alleviating performance bottlenecks by building adaptive runtimes and designing architecture-aware algorithms to ultimately bridge the divide between hardware and software. Some of the contributions are novel techniques to pinpoint opportunities for tuning in hybrid applications -- I have built some lightweight performance analysis tools to do so; efficiently attributing execution characteristics to source and data in context -- I have built some fine-grained instruction-level monitoring tools to do so; on-the-fly redundancy detection and elimination, for example, detecting and eliminating redundant synchronization; and a scalable synchronization algorithm that is better suited for NUMA architectures. I know Catherine is smiling, because Rice has some tradition of doing scalable synchronization algorithms. So in building performance analysis tools, I usually apply these principles. Measure accurately; introduce no blind spots. If there is dynamically loaded code in the program, measure it as well. Do not dilate what you are measuring. For example, if you are measuring cache misses, do not introduce cache misses of the tool itself. Introduce low measurement overhead, so that the tool can scale to large parallel systems. Observe real production runs, so that you observe what is happening in fully optimized code. And attribute contextually -- this is a recurring theme in this talk. That deserves a slide. Context is important in large applications. Modern software is made of layers and layers and layers of libraries. You have math libraries, communication, application frameworks and so on. And we often use generic programming, such as C++ templates, where the instantiation context is important. So if a tool says that you are spending a lot of time in a wait, you want to know exactly how you reached that wait state.
For example, in this climate code, which has several components, you want to know exactly which wait is the one that is causing problems. Say, if main, via an ocean component, calls wait, you want to know that that's where you need to focus attention for improvement. That's why performance is highly context dependent. So I have organized the rest of this talk -- yes. >>: So this context, there can be like static code context and data context. >> Milind Chabbi: Yes, both. >>: And then if the same function were called on different data structures, it behaves differently. >> Milind Chabbi: Differently. I have tried to address both, where our data-centric attribution capabilities tell you that this code, when called from main, blah, blah, blah, and touching data that was allocated in this particular context, is when you are seeing certain performance anomalies. So the rest of the talk I will revolve around these themes: where you have resource idleness and how we can pinpoint and quantify it, how we identify wasteful resource consumption, and how we can tailor for an architecture. Modern architectures are increasingly employing accelerators. This is Titan, which is the second-largest supercomputer. It is made of AMD Opteron processors and GPUs. So some of the codes that we looked at that were using accelerators were LAMMPS and NAMD, which are molecular dynamics codes, LULESH, which is a hydrodynamics code, and Chroma, which is a lattice field theory code. There are many, many more codes being rewritten to make use of accelerators. This is just a few that we looked at. >>: So are these running on this architecture? >> Milind Chabbi: Yes. >>: These are a set of very important -- >> Milind Chabbi: Important, yes. For example, LULESH is a proxy application that many, many people have looked at. It's a hydrodynamics code -- how a body changes when you hit it with a bullet or some material. >>: A chicken. It's for flight simulation, when you hit something on the airplane, so the airplane doesn't crash. >> Milind Chabbi: Right, and many other similar kinds of applications that the National Labs care about. >>: Like one of my relatives did. They threw frozen chickens at the airplane until they had this code, and then they did it in simulation. Really. But it's really -- it changes how you do science, if you have good simulation tools, right? >> Milind Chabbi: So the challenge is tuning these scientific applications to make use of heterogeneous architectures, and an observation is that these codes have a well-tuned CPU portion that has been developed over the last many years and an emerging GPU portion. So performance analysis tools play a vital role in tuning these applications for these architectures. This is the execution trace of LAMMPS, which is a molecular dynamics code. On the X-axis is time; on the Y-axis are various resources, the resources being a CPU thread and two concurrent threads running on the GPU. And the gray color indicates idleness, which means on the GPU nothing is running; on the CPU, it means it's waiting for something to finish on the GPU. What you can notice is there is a lot of idleness. They are being used in a mutually exclusive fashion. If the GPU is running, the CPU is idle. If the CPU is doing work, the GPU is not doing anything. >>: So just to understand the process here, you're looking at a single machine? >> Milind Chabbi: Yes. >>: This code executes in a larger context? >> Milind Chabbi: In a larger context. It has exactly the same behavior if I unfold it for all the concurrent MPI processes.
>>: Right, so is there any -- but when you are going to talk about performance, you're talking mostly about a single machine and then extrapolating to the entire cluster? >> Milind Chabbi: No, both ways. Sometimes, if the -- >>: So obviously there are situations -- >> Milind Chabbi: Situations where you need to look at all of them, situations where the behavior is symmetric and you can look at one of them. So idleness wastes compute cycles. Offloading an entire execution to the GPU wastes CPU cycles, and by only making use of the CPU and not the GPU, you are not making good use of your throughput-optimized cores. A better way is to overlap CPU and GPU computation. There are a couple of ways to do it. You could divide the principal computation itself between the CPU and the GPU, or if the code is not amenable to that, you could use the CPU to prepare the next piece of work while the GPU is busy doing the current piece of work. This is pipelining the execution. So having identified that idleness is an important source of performance loss in these kinds of systems, the performance analysis that we do for heterogeneous systems focuses on idleness and attributes idleness to its causes. The insight is that when a resource is idle, you can recognize that. That's a symptom of performance loss. The cause of such loss is another resource that is at work, so you can blame the cause. A simple analogy: if the GPU is working and the CPU is idle, the idleness that you notice on the CPU is because the GPU has not finished the work yet, and vice versa. Another analogy is lock contention in a multithreaded code. If many threads are waiting to acquire a lock, you can recognize them spin waiting or something; that's a symptom. The cause is the thread that is holding the lock, and you can push all this waiting, as blame, to the critical section or the thread that is holding the lock. Is that analogy -- okay. So one can tune code regions that accumulate a lot of blame, and typically, tuning such code reduces your critical path in a parallel program. We call this CPU-GPU blame shifting, because you shift the blame from a symptom to its cause. So in CPU-GPU systems, if this X-axis is time and I have two resources, the CPU may offload a kernel -- Kernel A -- to the GPU and go on to do some work, and they are well overlapped. And only towards the end, the CPU waits for about 5% of the time for Kernel A to finish. And then there is another piece of code where the CPU offloads another piece of work, Kernel B, and after a little bit of work, it now waits for Kernel B to finish, and that wait is about 40%. And as I said, in many of these codes, the CPU is already well tuned. The GPU is not so well tuned. A classical hotspot analysis, which tells you which resource is consuming a lot of time, will tell you that Kernel A, being the longest running, is the one that you might want to tune, but because it is well overlapped, tuning it can give you only about 5% performance improvement, whereas Kernel B, which the blame shifting identifies because there is a lot of idleness -- if you tune this, you gain a lot of benefit in your end-to-end performance, because it reduces the critical path. So the insight is that the top GPU kernel may not be the best candidate for tuning. The reverse is also true. If you are interested in tuning a CPU portion, the one that has less overlap with the other component -- the GPU here -- is the one to tune, and that reduces the critical path.
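To make the pipelining idea concrete, here is a minimal sketch in CUDA-style C of overlapping CPU preparation with GPU work. This is not the LAMMPS or EpiSimdemics code: the kernel process_on_gpu and the helper prepare_chunk are hypothetical placeholders, and double buffering keeps the CPU from overwriting a host buffer the GPU may still be copying.

```cuda
/* Minimal sketch of CPU-GPU pipelining (hypothetical kernel and helper names).
   While the GPU works on chunk c, the CPU prepares chunk c+1, so neither side
   sits idle waiting for the other. */
#include <cuda_runtime.h>

__global__ void process_on_gpu(const float *in, float *out, int n); /* hypothetical kernel */
void prepare_chunk(float *host_buf, int chunk, int n);              /* hypothetical CPU work */

void pipelined_run(int num_chunks, int n) {
  float *h_in[2], *d_in[2], *d_out;
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  for (int i = 0; i < 2; ++i) {                        /* double buffering: the CPU fills   */
    cudaMallocHost((void **)&h_in[i], n * sizeof(float)); /* one pinned buffer while the    */
    cudaMalloc((void **)&d_in[i], n * sizeof(float));     /* GPU consumes the other         */
  }
  cudaMalloc((void **)&d_out, n * sizeof(float));

  prepare_chunk(h_in[0], 0, n);                        /* CPU prepares the first chunk */
  for (int c = 0; c < num_chunks; ++c) {
    int cur = c & 1;
    cudaMemcpyAsync(d_in[cur], h_in[cur], n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    process_on_gpu<<<(n + 255) / 256, 256, 0, stream>>>(d_in[cur], d_out, n);
    if (c + 1 < num_chunks)
      prepare_chunk(h_in[(c + 1) & 1], c + 1, n);      /* CPU works while the GPU is busy */
    cudaStreamSynchronize(stream);                     /* only now wait for the GPU       */
  }
  for (int i = 0; i < 2; ++i) { cudaFreeHost(h_in[i]); cudaFree(d_in[i]); }
  cudaFree(d_out);
  cudaStreamDestroy(stream);
}
```

One could equally split the work itself between CPU and GPU; the point is simply that the resource that would otherwise sit idle is handed the next piece of the critical path.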
We have implemented this in HPCToolkit, which is a performance analysis tool being developed at Rice University. It supports multilingual, fully optimized, statically or dynamically linked applications. We do not do any source code modification. We take the fully compiled, optimized code. We measure performance using asynchronous sampling of timers and hardware performance counters. Because it is sampling based, it incurs very low overhead. We attribute performance to the call stack, the full call path. We can do so in many programming models. It could be Pthreads, OpenMP, MPI, CUDA or any combination of these. It is decentralized. There are no centralized bottlenecks, and it scales to thousands of processes. >>: So you just walk the stack to get the call path, or do you insert instrumentation? >> Milind Chabbi: We do not insert any instrumentation. We do an on-the-fly binary analysis and recognize, if this is the PC, where will my return address be, and so on, to unwind your call stack. >>: So that works generally, even if I, for instance, remove frame pointers? >> Milind Chabbi: Yes, it works without frame pointers. It is full of heuristics that work very well, and in the worst case, if you haven't found it, you can trawl the stack a little bit to recognize, oh, this looks like a return address, and let me use that. We applied blame shifting on some important codes. For example, LULESH, which has already been looked at by many researchers. It's a shock dynamics code, and blame shifting identified that the GPU was idle 30% of the time due to GPU memory allocation routines that were called in each time step. In each time step, you would allocate memory on the GPU and deallocate it, and during that time, the CPU was doing the allocation and the GPU was idle. And when we recognized it, we noticed that the amount of memory allocated in each time step is exactly the same, so instead of doing it each time step, one could just hoist it, allocate it once and reuse it each time. That gained us 30% performance improvement in this code. In another code, which is EpiSimdemics, which is an agent-based simulation for the spread of contagion, the CPU and GPU were being used mutually exclusively, like the one I showed in the earlier diagram, and we used pipeline parallelism, where as soon as a little bit of data was available, we started computation on the GPUs. And we used a particular runtime, called MPI-ACC, being developed at Argonne National Lab, and that gave us about 62% speedup in this application. One of the impacts is that this technique is being proposed for use in an important accelerator manufacturer's product roadmap. I cannot tell you who the manufacturer is. Moving on, that's about -- >>: [Indiscernible]. >> Milind Chabbi: Pardon me? >>: Have we looked at [indiscernible] HPC applications [indiscernible]? >> Milind Chabbi: No, I did not. That's an interesting one to try. Yes, thank you. >>: Would something built on strace that just looks at system calls be sufficient to understand this notion of blame? >> Milind Chabbi: Let me think. >>: So it's been a while since I've been in Linux, but if you strace something and you notice that the CPU makes a system call, way out to the GPU, and then doesn't do anything for a while until the response from that I/O call. >> Milind Chabbi: Could do. >>: It's sufficient to realize that the -- >> Milind Chabbi: It may have some important information. Thinking of it, the difference is, you have to collect the entire trace and then do a postmortem analysis.
Collecting traces is tricky, depending on the size of your parallel program. Traces can grow really, really large. This is a profiling technique where, as soon as you identify idleness, you instantaneously blame the code running on the other component. That's the difference in blame shifting. Anything that you can do with profiling you can do with tracing; it's just that it is postmortem and you have to collect a lot of data. >>: So one of your -- go ahead. >>: But on the other hand, does this blame really attribute it to this previous execution of the program? >> Milind Chabbi: Yes. >>: In other words, because the scheduler can be dynamic, we will have a different execution order of the same program, so this blame may not be valid anymore. >> Milind Chabbi: Yes -- in a DAG-based execution, we are thinking about how to do blame shifting, where the schedule is not the same all the time. This is more useful in statically partitioned executions. Thank you. You may still find useful results in dynamically scheduled ones, but iteration to iteration, if it changes, run to run, if it changes, then it is tricky. If the sampling is representative of what is happening over time, it may still give you good enough insights. >>: Yes, I think the tricky one is that you basically identify the impact of the critical path. >> Milind Chabbi: Critical path on a DAG. >>: When you reduce this critical path, then the blame shifts to, say, an alternative. >> Milind Chabbi: An alternative path, right. >>: But potentially, this could shift to another one, so when the DAG structure is complicated, it could be hard to analyze the entire impact. >> Milind Chabbi: That's true, that's true. There could be many closely related critical paths, and you may identify one, you optimize this, and then the next one becomes the critical path. That's the issue any time you are trying to optimize the critical path. >>: Exactly. >>: So when you -- >> Milind Chabbi: Then you have to go for hard critical paths. >>: So when you showed the example, and if you go back, one of your solutions was to do better pipelining. >> Milind Chabbi: This one? >>: Yes, but can we go back to the example where you have the 40%? Okay, so in this one, it could be that Kernel B is actually highly optimized, and this is an example where you want to do pipelining. Or there could be some work that's independent of Kernel B that you could shove into here. How do you know which thing you should do? >> Milind Chabbi: Case by case. At that point, an automated tool will not help you. The automated tool tells you what's going on, where the potential problems are. The solution is in the mind of the application developer. >>: Okay. So here, you say you can either -- but you could give advice. You could say Kernel B is your bottleneck. You could shift work, independent work, in here, or you could optimize Kernel B. >> Milind Chabbi: Yes, either of them is a good option. >>: Are there other options? >> Milind Chabbi: Not off the top of my head. I'll keep thinking. >>: I can't think of any off the top of my head, either. >> Milind Chabbi: Yes, usually, pipelining is something that people prefer, because bringing in some other piece of work can change your critical path. That may take longer than this one. >>: Right. Pipelining is more directly going -- more localized than some of the other options. >> Milind Chabbi: Right. >>: Thank you. >> Milind Chabbi: You're welcome.
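As an aside, the LULESH fix described above -- hoisting the per-time-step GPU allocation out of the loop -- boils down to a pattern like the following. This is a hedged sketch with a hypothetical kernel and sizes, not the actual LULESH code; it only illustrates why allocating once and reusing removes the idleness that blame shifting attributed to the allocator.

```cuda
#include <cuda_runtime.h>

__global__ void time_step_kernel(double *scratch, int n); /* hypothetical kernel */

/* Before: allocate and free GPU scratch space in every time step. The CPU spends
   part of each step inside cudaMalloc/cudaFree while the GPU sits idle. */
void simulate_naive(int steps, int n) {
  for (int t = 0; t < steps; ++t) {
    double *scratch;
    cudaMalloc((void **)&scratch, n * sizeof(double));
    time_step_kernel<<<(n + 255) / 256, 256>>>(scratch, n);
    cudaDeviceSynchronize();
    cudaFree(scratch);
  }
}

/* After: the size never changes, so allocate once, reuse every step, free at the end. */
void simulate_hoisted(int steps, int n) {
  double *scratch;
  cudaMalloc((void **)&scratch, n * sizeof(double));
  for (int t = 0; t < steps; ++t) {
    time_step_kernel<<<(n + 255) / 256, 256>>>(scratch, n);
    cudaDeviceSynchronize();
  }
  cudaFree(scratch);
}
```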
So going on to wasteful resource consumption, another top cause of performance losses, I have split that into internode and intranode: how wastage happens across nodes and how it happens within a node. First, across nodes. NWChem is this computational chemistry code from DOE. It is a flagship quantum chemistry code. It uses this thing called the Global Arrays Toolkit for programming, where the data is physically distributed onto several nodes, but each process gets the illusion of being able to access the entire data structure using get-put style one-sided semantics. And the software is written in SPMD-style PGAS, the Partitioned Global Address Space programming model, and it's a large source code, with more than 6 million lines of code and about 25,000 files. It scales quite well. It has been shown to work on more than 100,000 processes. It is widely used. There are more than 60,000 downloads worldwide, and about 200 to 250 scientific publications are produced per year based on this framework. This is something impactful. So in terms of the software organization, there is the NWChem chemistry code, written usually in FORTRAN, and then it calls into this Global Arrays framework, which in turn calls into ARMCI, which is the Aggregate Remote Memory Copy Interface, which is being reorganized into communication [indiscernible] scale these days, and all of these can go on different substrates. For example, it can use MPI, the Message Passing Interface, or GASNet, which in turn can go on InfiniBand; DMAPP on Cray machines; PAMI on IBM machines; and so on. Basically, it's a layered software. In PGAS programs, typically, people use a lot of barriers, because that's how you can enforce consistency of data between asynchronous updates. But communication libraries for HPC are typically written agnostic to an execution context, so developers make conservative assumptions about the underlying layers, and they enforce barriers on entry to and exit from each API. For example, each time this layer calls into this layer, there will be a barrier here and one barrier here, and a call into here, a barrier and a barrier. This layering and use of APIs leads to redundancies, as I'll show in the next slide. This is four lines of code written by an application programmer that sits in the chemistry component. If you track down what happens underneath -- this is the call graph -- you will see that there are nine barriers in four lines of code. And in the Global Arrays layer, there is a sync at entry, a sync at exit, a sync at entry, a sync at exit, and so on. Now, of those nine barriers, at least three are redundant. For example, the one that is called at the exit and the one that is called at the next entry -- there is absolutely no data update in between these two. But conservative assumptions put barriers in, and they lead to redundant barriers. Yes? >>: So would a smart inlining -- if I just inlined this, would the compiler figure this out? >> Milind Chabbi: Multilingual, multi-component -- as of now, I don't know of an analysis that does barrier elimination, and safety is the other thing. Is it impossible? No. One can do static analysis, just -- >>: Yes, but even if I'm working at the binary, right? If I'm looking at x86, at link time, if I did the inlining? >>: Then you can't identify the barriers very well. >> Milind Chabbi: Then you have to understand the data access before a barrier and after a barrier in a binary analysis. >>: Right, right. >>: So can you explain why, if I'm doing a GA copy -- let's say I'm copying two arrays.
Why does GA copy need to synchronize the whole program? It just needs to synchronize those threads that access the GA, right? >>: Yes, but it's -- >> Milind Chabbi: But data A and B are spread all over, on all processes. >>: Okay, so I guess I haven't understood the programming model yet. But there's one -- so when I say GA copy, then there's one thread that's copying. >> Milind Chabbi: Yes. It's SPMD. Each thread says, let's copy. So your role is: I copy the data that's present on my local portion to my portion of B. That's what it is doing. >>: So everybody is operating on GA copy. Okay, great. Thank you. >> Milind Chabbi: Okay, nine barriers in four lines of code, three redundant. The program spent 20% of its time doing barriers. Composition of APIs leads to redundancies. Redundancy is contextual. You cannot go and delete this thing, because there may be another path where it is called and it's not followed by this, which means it is necessary there. >>: But in a sense, I would expect that if threads have already come to a barrier, adding another barrier is probably not -- >> Milind Chabbi: There is no load imbalance problem, but the cost of the barrier itself is high. A barrier incurs latency, and it can lead to a little load imbalance, because one process may get stuck, may get stalled, may have a slow CPU. The CPU may get throttled, so the arrival is not perfect. So to answer your question, it is multilingual, multiple programming models, MPI, OpenMP, so that's why elimination using compiler-driven techniques is tricky. So our idea is to detect redundant barriers at runtime by intercepting certain communication calls, and speculatively elide redundant barriers, but when we do so, we are conservative and lightweight. So to give some background on when barriers are needed and when they are redundant, we have these two processes, P1 and P2. If there is a put of data X on P1 and a get of X on P2, then a barrier is needed, because there is a data dependence across processes here. Over here, two processes touch data, and they touch different data after a barrier, so there is no data dependence here. Hence, this is a redundant barrier. Over here, P1 does a put of X and then does a get of X. There is a data dependence, but it does not cross process boundaries, so even here, the barrier is redundant. If there is no data being touched after a barrier, those barriers are redundant, and if there is no data being touched -- by data, I mean the shared data -- before a barrier, then even that is redundant. So this led us into thinking about how barriers can be eliminated -- when they are needed and when they are not. So if you think of two consecutive barriers and observe all types of accesses in between the two barriers, you can classify them into three categories: N, which means the process did not do any remote memory access -- remote as in any shared-data memory access; L, which means it accessed data, but the data was resident on the same process, local; and R, which means the process accessed data that was remote -- it was not on the same process. These L, N and R form a lattice. For example, if one process did not access shared data and another process accessed remote data, the lattice result is as if you accessed remote data. If you take triplets of an access before a barrier and an access after the barrier -- if you have an execution trace of all processes, the entire execution trace of the program -- you can look at these accesses and say, no access before, no access after.
That barrier is redundant, whereas if there was a remote access before a barrier and a remote access after a barrier, then it is not safe to eliminate such barriers. Then you can say, well, that barrier is needed. You could do this if you had the entire execution trace for all processes, but if I am at a barrier and I want to decide at runtime whether I should participate in the barrier or not, I cannot, because I don't know what is coming next. So a priori knowledge of the upcoming access is unavailable, but there is a silver lining here. One can make safe approximations: if you notice, no matter what comes after a barrier, if there was no remote memory access before the barrier, you can always remove such a barrier. So that says, if there is no remote memory access and then a barrier, remove such a barrier. If there was only a local access, there is at least one case where it is not safe, so you cannot eliminate such a barrier. If there was a remote access before a barrier, there are two cases where you cannot eliminate the barrier, so conservatively, keep the barrier. Now, you might be thinking, fine, I as an individual process know my kind of accesses. How will I know what's happening in some other process to make a global decision of whether to participate in a barrier or skip a barrier? For a second, assume an oracle tells you that if you know your local status, you know the status of the entire system. I will tell you what that oracle is. Just assume there is an oracle, and let's try to apply these three rules to the code I had shown you. Over here, there is a barrier and a barrier, with no remote memory access in between. My N-and-B rule says, skip such a barrier. Don't participate. Then there is L and B. The LB rule says participate in that barrier. Then there is N and B. My NB rule again says skip such a barrier, so that's how one can apply these rules on the fly. So let's see what that oracle is. The oracle is simple. It is actually a learning and elision mechanism, where we identify local redundancy of a barrier instance in its calling context. You arrive at a barrier. You note the kind of accesses you made from the last barrier to this barrier, and you have learned something about yourself. And while learning, we replace the barrier with a reduction, and each process contributes its local state to the reduction, and the result of the reduction tells you back, was it needed in some process, or was it not needed in any process? If it is system-wide redundant, then such a barrier becomes an elision candidate in that context. If it is needed in some process, then we say we cannot elide such barriers. And the cost is low while learning, because passing one additional piece of information in the reduction has very little overhead. Once we have learned something, we speculatively mark a barrier as redundant for a calling context if all instances are globally redundant, and we elide future episodes that are marked as elision candidates. >>: So this is the full calling context. >> Milind Chabbi: Full calling context, yes. So each time you arrive at a barrier, you do a call stack unwind and record it. So if I am process P1, I arrive at a barrier. I would not have, say, done any remote operation. My instrumentation would unwind the calling context. P2 would arrive, do the same thing, and then, while learning, they participate in a reduction and they report what they have found about themselves, and they learn something from that. In this case, they learn that the barrier is not needed, and they locally make this decision. In that context, it can be elided, and they do this for a certain number of iterations, which is tunable. Once they have learned enough, when a process arrives, it says, I am in this context -- what is the learned decision about it? Elide. So it does not participate in the barrier and goes forward. The same thing for another process, as well. It arrives, checks its local information, and skips.
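To make the learn-then-elide mechanism concrete, here is a rough sketch in C with MPI. It is a simplification, not the NWChem/ARMCI implementation: the calling context is reduced to an opaque integer, the lattice reduction is an MPI_Allreduce with MPI_MAX, only the globally-no-remote-access case is elided, and misspeculation detection with checkpoint-based recovery is omitted.

```c
#include <mpi.h>

/* Per-process access state since the last barrier, ordered as a small lattice. */
enum AccessState { ACC_NONE = 0, ACC_LOCAL = 1, ACC_REMOTE = 2 };

#define MAX_CTX        4096
#define TRAIN_EPISODES 8

static int my_state = ACC_NONE;       /* updated by interposed put/get wrappers  */
static int training_left[MAX_CTX];    /* remaining learning episodes per context */
static int elidable[MAX_CTX];         /* learned verdict per calling context     */

void init_elision(void) {
  for (int i = 0; i < MAX_CTX; ++i) { training_left[i] = TRAIN_EPISODES; elidable[i] = 1; }
}

/* Called from interposed communication wrappers (hypothetical hook). */
void note_access(int is_remote) {
  int s = is_remote ? ACC_REMOTE : ACC_LOCAL;
  if (s > my_state) my_state = s;
}

/* Interposed barrier: ctx_id identifies the calling context, e.g. a hash of the
   unwound call stack -- here just an opaque integer supplied by the caller.
   Assumes SPMD: all processes reach the same contexts in the same order. */
void guarded_barrier(int ctx_id, MPI_Comm comm) {
  if (training_left[ctx_id] > 0) {
    /* Learning phase: replace the barrier with a max-reduction over the lattice. */
    int global_state;
    MPI_Allreduce(&my_state, &global_state, 1, MPI_INT, MPI_MAX, comm);
    if (global_state != ACC_NONE) elidable[ctx_id] = 0;   /* needed somewhere: keep it */
    --training_left[ctx_id];
  } else if (elidable[ctx_id]) {
    if (my_state != ACC_NONE) {
      /* This process broke the learned consensus: misspeculation. The real system
         detects this and recovers via NWChem's checkpoint/restart; omitted here. */
    }
    /* Otherwise, skip the barrier entirely in this calling context. */
  } else {
    MPI_Barrier(comm);   /* conservative default */
  }
  my_state = ACC_NONE;   /* start a new epoch after the (possibly elided) barrier */
}
```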
So can there be misspeculation? Yes. Misspeculation can happen if, when you arrive at a barrier, you are not following the consensus that you have made previously, and we can always detect misspeculation, because you would have broken the consensus as a process. There are times when you can recover. For example, if all processes made the misspeculation, they all go into a reduction and they understand. Otherwise, we rely on the checkpoint-restart facility that is available in NWChem and start from the last checkpoint. In NWChem, luckily, there were no misspeculations. Some amount of training made sure that once a barrier is redundant, it is always redundant, because of the SPMD-style programming. To implement this, we instrumented barriers, some remote memory calls, and certain application-specific features that were bypassing the underlying layers -- whenever they were accessing local data, sometimes they were bypassing the well-known calls. And all this cost only 1% instrumentation overhead and scales perfectly, because most of it is local operations. Now, how do we gain developer confidence if we start doing this kind of elision on the fly? To gain more developer confidence, we have a multipass approach, where in one pass we only identify which barriers are redundant and present them to the user in a summarized way: here, in this context, we think this barrier is redundant -- do you want us to elide it? And if the developer says, yes, go ahead, elide, then in the actual production run, only if the context matches will we elide. So in a production run of NWChem on a Cray XC30 machine with about 2,000 processes, we were doing a simulation to understand catalytic destruction of ozone. There were 138K instances of barriers for a code running for about 30 minutes, and they were spread across 8,000 unique calling contexts, and 63% of the barriers were redundant. 63% is this number here, so as you can imagine, this layering of software causes a lot of redundancy. And by eliminating them, we gained about 14% running time improvement. >>: So the checkpoint in NWChem was an application checkpoint? >> Milind Chabbi: No, the library does the checkpointing for you, periodically. >>: For other reasons. >> Milind Chabbi: Not for this. >>: For failure, because the code has a lot of state, and if it fails -- it runs for hours. >> Milind Chabbi: Yes. Next is the kind of wasteful resource consumption that happens within a node. Memory access is expensive within a node because of multiple levels of hierarchy, cores sharing a cache, and limited bandwidth per core. A dead write happens when two writes happen to the same memory location without an intervening read. So here is a dead write. We have int x equal to zero and then x equal to 20. The first, x equal to zero, is a useless operation. That's a dead write, and for this discussion, we will call the second one the killing write. What is not a dead write is over here: x equal to zero, then we print the value of x, and then x equal to 20. Now, compilers can eliminate something like this.
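Restating the slide's example as plain C (the variable names follow the slide; the printf stands in for any use of the value):

```c
#include <stdio.h>

void dead_write(void) {
  int x = 0;        /* dead write: this 0 is never read before being overwritten */
  x = 20;           /* killing write: overwrites x with no intervening read      */
  printf("%d\n", x);
}

void not_dead(void) {
  int x = 0;        /* not dead: the 0 is read by the printf below      */
  printf("%d\n", x);
  x = 20;           /* fine: a read happened between the two writes     */
  printf("%d\n", x);
}
```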
Dead code elimination can get rid of dead writes, but what is new is this piece of code that we came across in Chombo, which is an Adaptive Mesh Refinement framework for solving partial differential equations. There is a three-level nested loop in the Riemann solver, and the program spent 30% of its time doing this. So over here, the first three lines are an assignment to a four-dimensional array, and then a check is made. If that check is true, the same array is overwritten with a different set of values, so this write is killing the earlier one if this condition is true. Another condition is then checked. If that is true, the array is overwritten again, with some more values. Now, compilers cannot eliminate all dead writes, because of aliasing and ambiguity; because of aggregate variables such as arrays and structures; because compilers apply optimizations within function boundaries or within some limitations, whereas late-binding code can always be loaded on the fly; and because of partial dead writes. For example, a write may be dead on one path and not dead on some other path. So over here, this code lost performance. A very simple way to fix this is using else-if structuring, where you do this thing first; if that condition is not true, then you do that; and if that is not true, you fall through to the default one, and that eliminates all the dead writes for you. Doing so sped up the loop by 35% and the program by about 7%, so I was motivated to see how frequent these dead writes are in any program, and I ended up writing this tool called DeadSpy, which monitors every load and store and maintains state information about every memory byte, and it detects dead writes in an execution using a very simple automaton. Each memory byte starts in a virgin state. A read operation takes it to an R state, a write operation takes it to W, and a read following a write takes it to R. If several reads follow, it will remain in the same state. A write transitions it back to W. If there is a write that follows a write, the automaton will detect it and report it. That's a dead write. Now, to make it precise, we are doing this at byte level. There are no false positives or false negatives. Yes. >>: So there's also an implication here that anything that leads to a dead write is itself dead, so actually, it's that -- I want to make sure that that's true. I guess, if you do a backward slice from a dead write, that implies -- >> Milind Chabbi: By dead, you mean the killing or the dead? >>: The dead. >> Milind Chabbi: Yes, everything that led to the dead write is also useless. I'll take that back. Those computations may be needed later. >>: Yes. >> Milind Chabbi: But there is some transitivity there. >>: Did you look into that? >> Milind Chabbi: This is something that someone else also suggested to me: okay, if you have, like, 10% dead writes, how much of the previous computation is causing that? That transitive relationship is interesting to explore. Thank you. It's insufficient if I just told you that there is 30% dead writes. You want to know where it happened, what the two parties involved are, and how you reached those positions, so that's why you need calling contexts that pinpoint where the dead write happened and where the killing write happened. And we do the source-level attribution along with calling context.
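As a rough illustration, here is what that per-byte automaton looks like in C. This is a simplification, not DeadSpy itself: the real tool shadows every byte of memory via binary instrumentation and records the calling contexts of both the dead and the killing writes, which are omitted here.

```c
#include <stdint.h>
#include <stdio.h>

/* Per-byte state machine: Virgin -> W on a write, W -> R on a read, R -> W on a
   write; a write while already in W (write-after-write with no intervening read)
   is reported as a dead write. */
typedef enum { STATE_VIRGIN, STATE_READ, STATE_WRITTEN } ByteState;

typedef struct {
  ByteState state;  /* the real tool also stores the calling context of the last write */
} ShadowByte;

static void on_read(ShadowByte *s) { s->state = STATE_READ; }

static void on_write(ShadowByte *s, uintptr_t addr) {
  if (s->state == STATE_WRITTEN)                 /* previous write was never read */
    printf("dead write at byte 0x%lx\n", (unsigned long)addr);
  s->state = STATE_WRITTEN;
}

int main(void) {
  ShadowByte shadow = { STATE_VIRGIN };
  on_write(&shadow, 0x1000);   /* x = 0                                          */
  on_write(&shadow, 0x1000);   /* x = 20 -> the first write is reported as dead  */
  on_read(&shadow);            /* printf("%d", x)                                */
  return 0;
}
```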
And I ran this on several codes. I'm showing you only the results on the SPEC integer reference benchmarks. In general, dead writes are very frequent. Lower is better here; a higher dead-write fraction is bad. There were about 20% dead writes in the SPEC integer reference benchmarks, and GCC had astonishingly high dead writes -- 60% on average, and for one particular input, 76% dead writes. So I was interested in knowing what's going on in there, and it happened in this piece of code called loop_regs_scan, where GCC is walking through the basic blocks belonging to a loop and identifying which registers are being used. So the code begins here by allocating an array of 16,000 elements to this last_set array, and then it walks through each instruction of a basic block, and when a particular pattern matches, it says, okay, that register is being used. And then once a basic block finishes, it says, okay, my work is done for this block -- let me reset the whole array to zero and start over with the next block. So that reset is killing the earlier assignments, because typically, basic blocks are short. The median number of elements used in this array is only two, and you have a 16,000-element array, so a dense array is a poor choice of data structure here, and it keeps happening over and over again as you use it for several basic blocks. So we replaced this array with a sparse data structure. I used splay trees, because they have this property of the frequently accessed positions being in your cache and easy to access -- faster access. That led to about 28% running time improvement in GCC. We found many other codes where it was useful. For example, NWChem had a redundant initialization of a large memory buffer, and when we eliminated that redundant initialization, we gained 50% speedup when running on 48 cores. BZip had an overly aggressive compiler optimization where code was hoisted from a non-executed code path onto a hot code path -- and yes, that's a bad optimization. That was causing a lot of dead writes, because of stack spilling happening over and over again. When we eliminated that, we gained 14% running time improvement. HMMER, which is a code for DNA sequencing, had dead writes not being eliminated because of aliasing, and once we made the compiler aware that two arrays were not aliased, it removed the dead writes, and it also did vectorization for us, and we gained 40% improvement in running time. So dead writes, DeadSpy -- there were lessons to learn from it. Dead writes are common symptoms of performance inefficiencies. Choose your data structures prudently. A suggestion to compiler writers is to pay attention to your code-generation algorithms; know when they may do more harm than good. Okay, that's one. And profiling for wasteful resource consumption opens a new avenue for application tuning, and context is really important in understanding feedback from a tool. You want to know where the dead writes happened and in what context, so you can optimize. Actually, these context-sensitive tools are useful in many other situations. There are tools for correctness -- for example, data race tools, where you want to know where a race happened and what the other access was that led to that race, and you want to know the full context that led to it. There are other performance analysis tools -- for example, reuse-distance analysis, where you want to know where the previous use was and where the reuse is, and if there is a common ancestor to which you can hoist your use and reuse, then you can make better use of the memory hierarchy. We have written a tool, an open-source library that anybody can use for their fine-grained instrumentation. We call it CCTLib.
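The GCC pattern above generalizes: a large dense array is reset wholesale after each basic block even though only a couple of entries were touched. Below is a minimal before/after sketch; the names (last_set, regs_used) only loosely follow the GCC code, and the actual fix used a splay tree rather than the simple touched-list shown here.

```c
#include <string.h>

#define NREGS 16384

/* Before: reset the whole dense array after every basic block, even though a
   typical block touches only a handful of entries. Almost every word written by
   the memset is dead -- it is re-zeroed or overwritten without ever being read. */
void scan_block_dense(int *last_set, const int *regs_used, int nused) {
  for (int i = 0; i < nused; ++i)
    last_set[regs_used[i]] = 1;
  /* ... consume last_set for this block ... */
  memset(last_set, 0, NREGS * sizeof(int));   /* 16K writes, almost all dead */
}

/* After: remember which entries were touched and clear only those. */
void scan_block_sparse(int *last_set, const int *regs_used, int nused,
                       int *touched, int *ntouched) {
  for (int i = 0; i < nused; ++i) {
    if (!last_set[regs_used[i]])
      touched[(*ntouched)++] = regs_used[i];
    last_set[regs_used[i]] = 1;
  }
  /* ... consume last_set for this block ... */
  for (int i = 0; i < *ntouched; ++i)
    last_set[touched[i]] = 0;                 /* clear only what was written */
  *ntouched = 0;
}
```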
In the interest of time, I will skip this portion, but you can stop me later if you are interested. So the last portion of this talk is the mismatch that happens between architectures and applications or algorithms, and how one can tailor for an architecture. Modern architectures use NUMA. For example, this is the IBM Power 755, which has four POWER7 processors. Each core is four-way SMT, sharing the L1 and L2 cache. That forms your first level of the NUMA hierarchy. There are eight cores sharing the L3 as a victim cache. That forms the second level of the NUMA hierarchy, and they are connected through a fast network, forming the third level in the NUMA hierarchy. If you thought three is the number of levels in a NUMA hierarchy, you will be surprised. The SGI UV 1000 is the world's largest shared-memory machine, with 4,096 cores, and each node, which is also called a blade, is composed of two Intel Nehalem processors, and every node can access the memory of every other node using loads and stores -- not gets and puts -- loads and stores. So two-way SMT within a core forms the first level of the hierarchy. Eight cores per socket form the second level of the NUMA hierarchy. Two sockets on the same node form the third level of the NUMA hierarchy, and this is a logical diagram of a rack of these nodes, and there can be up to three hops from one node to another. That forms the fourth, fifth and sixth levels in the NUMA hierarchy, and several such racks are joined together to form this half of the machine, and there are five hops from one node to the other. That forms eight levels in the NUMA hierarchy. Now, locks are one of the fundamental features in shared-memory architectures. >>: Why are you distinguishing between loads and stores and gets and puts? They should be the same, right? >> Milind Chabbi: Yes, gets and puts are also problematic. Actually, they are even more problematic than loads and stores. Just because this is a shared-memory machine, people tend to write OpenMP, Pthreaded-style programs. The little difference with gets and puts is that you do not have a cache problem. You take the data from somewhere and put it into your big, large buffer. So the issue is more the communication between node and node than anything going through your memory hierarchy. Make sense? That's why. Okay, so locks are one of the fundamental features in parallel programs, and they have been studied for a long time. Centralized locks are bad, and one of the good scalable locks is the MCS lock, the Mellor-Crummey and Scott lock, and I will explain what the MCS lock is, how it works, and what the deficiencies of that lock are when you are using it on a NUMA machine. So the MCS lock has a lock variable. Each thread arrives with a record, which has a status field and a next field. Here, thread T1 arrives, and let's say it arrives in NUMA domain one. The protocol involves swapping the lock's tail pointer to point to itself, and if you had no predecessor, then you are the lock holder, and you enter the critical section. In the meantime, let's say another thread arrives in a different domain with its own record. Now, it swaps the tail pointer to itself. Now, this was cached in the first domain, so that was a cache miss for this one. Now, it is waiting, and it wants to tell its predecessor that it is the successor, and it knows who the predecessor is from the swap of the tail pointer. It goes and pokes the predecessor's next pointer and says, here's the successor. While poking that, it incurred a cache miss, because this was present in a different NUMA domain.
And now, it goes into a local spin. This thread eventually finishes the critical section, and it wants to tell its successor that it is the lock holder, so first, it accesses its next pointer, but that was cached in the other domain, because that thread was the last accessor. And now, it goes and touches the status field and says, you are the holder. That is another cache miss. At some point, this thread realizes that the status field has changed, and because it was previously accessed by a thread in a different domain, it also incurs another cache miss. And then, the data that T1 accessed is most likely to be accessed by T2 -- that's why we are holding a lock -- so that whole data moves from one domain to another domain. So the lock and the data accessed in the critical section keep ping-ponging between one NUMA domain and another NUMA domain indiscriminately. For example, in this MCS lock, you have a chain of waiters. They may belong to very different NUMA domains, where I have indicated each node with its domain name, so data can just keep ping-ponging between various NUMA domains. How bad is this? It is pretty bad, actually. Here, I have shown the time it takes for the lock to be passed from one level to another level. If you pass it within SMT threads, what happens if you pass it between two cores sharing the same socket? What if you pass it from one core to another, and so on, on the SGI UV 1000? This is the time taken, which means lower is better. When you are passing within the SMT, the time is small, but as you keep passing it farther and farther away, it can be as much as 2,500 times slower to pass the lock. So to address this, we have built a NUMA-aware locking algorithm. We call it the hierarchical MCS lock, HMCS. The insight is that passing the lock to a thread in a nearer NUMA domain incurs less overhead, so always pass the lock to a waiting thread in the nearest NUMA domain when possible. To do so, we organize the lock into a hierarchy of queuing MCS locks, mimicking the underlying NUMA hierarchy, and orchestrate the protocol such that threads wait in their locality domains. If this is how threads were waiting in the original MCS lock, in the hierarchical MCS lock they will wait in their own locality domains. For example, if two threads belong to the same innermost domain, they will be lined up in the same queue. Here, each level in the figure is one MCS lock. So every thread arrives. If you are the first one, you have more overhead, to go further up and acquire the lock at a higher level. Having acquired that, under high contention you typically have waiters either in your domain or somewhere in the nearest domain, and you keep passing the lock within that domain for a bounded number of times and make use of the locality. And once you have done that, the last thread, which hits the threshold, passes it to someone in the next-nearest domain and makes use of locality. It makes better use of data locality on NUMA architectures and gives high lock throughput for highly contended locks. A given lock acquisition may incur higher latency, because if there is no contention, you incur the cost of going up in the tree. We ensure starvation freedom by bounding the number of times we pass, and we have a relaxed notion of fairness. And one can tailor the design of the lock either for high throughput or for high fairness. If you are interested in fairness, your passing bound is shorter. If you are interested in throughput, you pass it many more times. And the bound on the passing is not a random number. It is based on the access latencies between two levels in the memory hierarchy and an analytical model of the queuing lock, to take advantage of the difference in latency between two levels of the hierarchy.
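For reference, here is a minimal C11-atomics sketch of the plain MCS protocol just described (sequentially consistent operations throughout for brevity; a tuned version would relax the memory orders). The HMCS lock arranges one such queue per NUMA level and, within a level, hands the lock to the next local waiter up to the threshold before releasing to the parent level.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Classic MCS queuing lock: each thread spins only on its own record's status
   field, and the tail pointer is the only globally contended location. */
typedef struct mcs_node {
  _Atomic(struct mcs_node *) next;
  atomic_bool locked;                 /* true while this thread must keep waiting */
} mcs_node_t;

typedef struct {
  _Atomic(mcs_node_t *) tail;
} mcs_lock_t;

void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
  atomic_store(&me->next, NULL);
  atomic_store(&me->locked, true);
  mcs_node_t *pred = atomic_exchange(&lock->tail, me);   /* swap the tail to myself */
  if (pred != NULL) {                                    /* someone is ahead of me  */
    atomic_store(&pred->next, me);                       /* tell the predecessor I am its successor */
    while (atomic_load(&me->locked)) { /* local spin on my own record */ }
  }
}

void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
  mcs_node_t *succ = atomic_load(&me->next);
  if (succ == NULL) {
    mcs_node_t *expected = me;
    if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
      return;                                            /* no successor: lock is now free */
    while ((succ = atomic_load(&me->next)) == NULL) { }  /* successor is mid-enqueue; wait */
  }
  atomic_store(&succ->locked, false);                    /* hand the lock to the successor */
}
```

The cross-domain cache misses in the talk come from exactly these fields: the tail swap, the write to the predecessor's next pointer, and the write to the successor's locked flag, each of which may live in a different NUMA domain.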
On this SGI UV 1000 machine that I showed you, on the X-axis is the number of threads, and the Y-axis is throughput -- that is, how many times I can acquire the lock per second -- so higher is better. The MCS lock's throughput keeps falling each time a new NUMA domain is added, whereas the hierarchical MCS lock reaches a stable point and retains very high throughput throughout. This is K-Means, which is an application I took from a data-mining benchmark, and on the X-axis is time. The Y-axis is how much improvement we saw compared to using the MCS lock. So the black color is the MCS lock's throughput -- I mean, end-to-end application time. And the red color is how much time the hierarchical MCS lock took in this application. When the contention is low, using the hierarchical MCS lock has more overhead. It's within about 80% of the MCS lock, but as the contention rises, the HMCS lock can do about 2.1 or 2.2 times better than using an MCS lock. You might be wondering how to deal with this case of no contention. We used a trick, Lamport's fast-path mechanism, where, if a thread arrives and there is no contention, instead of going through the entire tree of locks, we can directly enqueue at the root and make the case of no contention very fast. I did not show the numbers with this optimization. And if that's not the case, you take a slow path, and even within the slow path, there is a hysteresis mechanism which says, when I came last time, if there wasn't enough contention at this lowest leaf level, let me directly go and enqueue at a level where I had noticed [indiscernible] of contention. And that makes our no-contention case about 95% of the throughput of the MCS lock, and the high-contention case about 90% of the best HMCS for a given configuration. To conclude, production software systems suffer from performance losses. >>: I wouldn't think of HPC applications as using locks -- maybe it's just my ignorance of these applications. So, for example, the payments, I would expect them to be embarrassingly [indiscernible], so where are the locks used? >> Milind Chabbi: So what happened in K-Means was they were using atomic operations. They were not using locks, and you would expect atomic operations to do well, better than locks, right? So what happens is -- this is on Power 7. The way atomic operations are implemented is you take a reservation, you make your update, and then you write the location. And when you have a lot of threads, you take a reservation, some other guy takes a reservation, you lose your reservation, so they keep ping-ponging the data and no one makes progress. So the atomic operations were terrible, and what I did was replace the atomic operations with an MCS lock. That gave about 11X improvement. There was this loop, and inside that, there were three atomic operations. So instead, I put a lock around the loop. I said, take the lock, and once you have taken it, make changes to all the locations. That worked out much better. >>: So the conventional wisdom I still hear, talking to folks like the [indiscernible] folks, is that, hey, lock-free data structures are better than locked data structures in performance. What you're saying is that's not necessarily the case. >> Milind Chabbi: It depends. Is it a small update or a large update? For short sections? >>: For short critical sections.
>> Milind Chabbi: For short critical sections, you may or may not benefit from lock-free. But if you have large data to update, you will end up taking locks, right? >>: Yes, in that case, it's very clear to use locks. >> Milind Chabbi: So this is a case where one would think that fine-grained atomic operations are better than using locks. It turned out it was worse. Yes, this may work okay on an x86 architecture, where you're guaranteed some progress, but if you are using load-linked/store-conditional, you are not guaranteed progress. And the other thing is, with the MCS lock, the contention is only for this tail pointer. With atomic operations, everybody is trying to bang on them. There is a lot of communication going on. In the MCS lock, what happens is a queue gets formed under high contention; one thread updates the data -- say, increments a variable -- and passes it to the next one, and that's the one which gets enqueued, so the contention is only for these two threads trying to enqueue. That works out better than atomic operations. >>: So coming not from this area, whenever people talk about these lock-free data structures, they always use as a comparison a locked version, [indiscernible], but what you're claiming is if you just use MCS as the baseline, maybe it will not be as bad, and it might be better. >> Milind Chabbi: I don't know what they have compared it with. They might have compared with MCS. There may be cases where it might do well. It depends on the architecture somewhat at this point. It may do well on x86 because of the progress guarantees. And then I am showing you a highly, highly contended case. If the contention is not so high, the lock overhead may be more than the benefits you may get, because if you have no one to pass the lock to, then you have only the overhead of acquiring the lock, which is what happens in lock-free. Two threads simultaneously try to do all this, and one guy wins and all the others fail. >>: So they have different variations of the data structure, where things get more and more complicated to avoid everybody contending on a single memory location. >> Milind Chabbi: Yes. >>: Okay, but so -- I think an interesting thing is, if you're using MCS, which automatically buys you that benefit -- >> Milind Chabbi: It buys you some benefit, yes. >>: Then do we need lock-free data structures, or -- precisely figuring out when we need lock-free data structures might be interesting. >> Milind Chabbi: Yes, there are tradeoffs. Actually, I do not recommend using the hierarchical MCS lock left and right, because it takes some memory. Your tree takes memory, because the internal tree nodes are pre-allocated. You might not have those memory requirements in lock-free data structures and also in simple locks. Okay. Production software suffers from performance losses, as I mentioned, due to resource idleness, wasteful resource consumption and the mismatch between hardware and software. The contributions of my work are novel and efficient tools to pinpoint and quantify performance losses, adaptive runtimes to eliminate wasteful resource consumption, and architecture-aware algorithms. The impact is a new perspective on application performance and insights and improvements in important software. Going forward, I'm interested in doing hardware-software co-design, tools to detect energy and power wastage, adaptive runtimes -- runtimes that are aware of contention and redundancy -- as well as doing data analytics on large execution traces or collected profiles.
I'm interested in tools for parallel programming -- tools in the space of performance, correctness, or debugging -- compilers and runtimes, and parallel algorithms, including verification and performance modeling. This is collaborative work. I'm not the only person to have done all of this. My adviser, Professor John Mellor-Crummey, Michael Fagan, who is a researcher, and several others at Rice University have been of great help. My colleagues at Lawrence Berkeley National Lab, Koushik Sen at UC Berkeley. Xu Liu is a professor at the College of William and Mary. We had a collaboration with Virginia Tech and Argonne National Lab. Nathan Tallent is a former student at Rice University. He is currently at Pacific Northwest National Lab, and we had a collaboration with NVIDIA. That is my talk. I am happy to take more questions if you have any. >>: Thank you for the great talk. These tools look very interesting. Do any of them work on Windows? >> Milind Chabbi: Not yet. There is work to be done. >>: Thank you. >> Milind Chabbi: The idea should work. >>: I realize that. >> Milind Chabbi: The locking should be easy to do on Windows, right? Our sampling-based tools rely heavily on the Linux interfaces for signals, signal handling and things like that. NWChem works both on Windows and on Linux, but my studies were on Linux. Okay. >>: Thank you. >> Milind Chabbi: Thank you.