>> Madan Musuvathi: Hi, everyone. Thanks for coming. I am Madan Musuvathi from
Research in Software Engineering, and it's my pleasure to introduce Milind Chabbi today. He
joins us from Rice. He is graduating in a minute, and then after this, he's going to be joining HP
Labs. He is very interested in all aspects of high-performance computing, from parallel
programming and compilers to performance, and today he'll be talking about software support for
efficient use of modern parallel systems. Milind.
>> Milind Chabbi: Thank you. Thanks for the introduction, and welcome to this talk. I'm sure
there are more people joining online. The title of this talk, as you said, is Software Support for
Efficient Use of Modern Parallel Systems. In some simple terms, probably, this is performance
analysis. I used to be in Microsoft, and back in 2010, one day, I felt more adventurous. I used to
work in Microsoft .NET, CLR, the JIT compiler, and when I felt adventurous, I drove to Glacier
National Park -- actually just packed up everything and drove my car. From there, I went to Rocky
Mountain National Park. Further down, I drove and ended up at Rice. That took me nine days;
I drove for 3,000 miles. And it was supposed to be a nice place to start some research in
compilers, and we'll see how that has turned out so far. And I'm quite proud of all the Ship Its
that Microsoft gave me. I don't know if you can see it -- yes, quite proud of those. All right, so
talks in high-performance computing, computer architecture, compilers, always begin with
paying some homage to Gordon Moore and his prediction about growth of transistors, but you're
all aware of Gordon Moore's prediction of growth of transistors, and so I will skip that and go
into something that's more interesting, somewhat controversial. This is from Proebsting, who
used to be a researcher at MSR. And he made this corollary to Moore's law, which states
improvements to compiler technology double the performance of typical programs every 18
years. In about that much time, you would expect 500 to 1,000X speedup in hardware, whereas
in an application, you will not see so much. What this is saying is if you took your modern
hardware and your favorite application and compiled it with today's compiler and an 18-year-old
compiler, the difference in performance would be only about 2X. In the meantime, the typical
shelf life of hardware is very short. Top 500 supercomputers remain in that level for a very short
time. Typical lifespan of hardware is about four years. As a result, typical scientific applications
achieve only 5% to 15% of peak performance on today's microprocessor-based supercomputers,
and that should be true for many applications, not HPC applications alone. So there is a wide
chasm between what hardware can offer and how much software makes use of it. Modern -- yes.
>>: So you have -- if I change the architecture and I come up with some fancy new architecture
that has these things called registers, you as a compiler say, hey, I've got this really good idea
called register allocation. Then, my compiler technique for that particular architecture clearly
will outweigh Proebsting's law.
>> Milind Chabbi: Your register allocator will be done in four years?
>>: Sure. I'll make a heuristic version of it.
>> Milind Chabbi: And will it work for next architecture?
>>: No. It'll work for the architecture that I'm running on right now.
>> Milind Chabbi: If you do it within four years, go for it.
>>: But I want to go back to your slide, previous slide. So I think the gap, the scientific
applications are a place where people care most about performance, and that's the best gap you
can get. And most applications, like JavaScript and Java --
>> Milind Chabbi: The gap is there.
>>: And they don't even get 5% to 15%. They get like --
>> Milind Chabbi: They don't even get -- like 2% to 3%.
>>: Like .001% of what the peak of the machine can be. So how come people are still using
computers and they're happy with that?
>> Milind Chabbi: Compared to not having anything, having something is better. And for a
long time, we all relied on clock speeds going up and putting more transistors and how long is
that going to carry forward? And there are efforts being done to put JavaScript on accelerators
these days, to make better use of JavaScript.
>>: You can always use performance, but I would say when the clock stops, this gap can be
closed and other gaps can be closed, and that's the only way to get --
>> Milind Chabbi: Can be closed, well, closed is a strong term. May not continue to widen as
much as we are seeing for a single thread of execution. But if we are adding more cores, are we
making use of those for these JavaScript kind of applications?
>>: Well, not for JavaScript, because it's --
>> Milind Chabbi: So in those terms, if you count it as the amount of FLOPs available on a
processor, there is still a widening gap, because FLOPs available is increasing.
>>: Well, if the hardware changes less frequently, like the typical shelf life of a top-500
supercomputer is four years, it's going to get longer, because we're running out of things to cram
onto it to make them go faster.
>> Milind Chabbi: But the architecture is still changing. How do you lay it out? Do you put
accelerators? Are accelerators discrete? Are they on chip? So in these ways, architecture is still
changing. Are they heterogeneous? Are they homogeneous? So these things are still continuing
to change, and they are changing at a rate faster than software can catch up with them.
>>: That's right. I think that's the most important aspect, this hardware changes faster than
compilers or applications can keep up.
>> Milind Chabbi: Right.
>>: And that seems to be getting worse, because the way it's changing is not clock speed. It's
other stuff that's even harder to program.
>> Milind Chabbi: Yes, that is the thing -- architectures are
changing. We use heterogeneity more frequently than we did before. There are a few
heavyweight cores in modern processors, which are latency optimized, and many lightweight
cores that are throughput optimized. Modern architectures have a deep memory hierarchy.
NUMA is very common, and even sometimes memory is explicitly managed. For example,
Intel's Knights Landing processor that will be deployed at NERSC will have its special memory
that you can program on however you want to use it. Power is a limiting factor. Dramatic clock
rate increase is a thing of past, and DOE expects -- Department of Energy expects more than 25X
power efficiency to reach exaflop. Similarly, software is also complicated. To deal with
complications in software, we use modularity, which is hierarchy of abstractions, made of
libraries and functions. They have an advantage that they give reusability and ease of
understanding a complex system. For example, this is an architect's view of how abstraction
should look, but reality is different from that. In reality, the abstractions are much deeper. They
are delicately balanced, and they interact in much more complicated ways. Abstractions hinder optimizations; they introduce redundancy. By redundancy, I do not mean resiliency. Resiliency
is good. Redundancy is the overhead that comes with abstractions, and abstractions are written
insensitive to the context in which they execute. Abstractions are context and input insensitive,
so they are not heavily optimized for a particular workload. Inefficiencies in software arise due
to various causes. Some of them are: hierarchy of abstractions, insufficient parallelism, load
imbalance, developer's inattention to performance, poor choice of algorithms, data structures,
ineffective compiler optimizations. There may be times when a compiler may do more harm
than good, and excessive data movement sometimes. If I want to classify them, I tend to see
them in these three top classes. One is resource idleness -- for example, insufficient parallelism
or load imbalance. Wasteful resource consumption, where you are making -- the resources are
not idle. You are making use of them, but that's not making progress towards the computation.
It's not getting closer to the end result. And hardware-software mismatch. For example,
hardware may provide NUMA architecture. Software assumes flat memory hierarchy. That's a
mismatch in software and hardware. So my research interest is in achieving top performance on
modern architectures. To do so, I detect performance problems. I am interested in knowing why
we are losing performance, where and by how much. I build effective performance analysis
tools that pinpoint performance losses and aid developer to identify problems in the source code
so that they can easily fix them. Having found these problems, I have provided solutions for
alleviating performance bottlenecks by building adaptive runtimes and designing architecture-aware algorithms to ultimately bridge the divide between hardware and software. Some of the
contributions are novel techniques to pinpoint opportunities for tuning in hybrid applications. I
have built some lightweight performance analysis tools to do so. Efficiently attributing
execution characteristics to source and data in context. I have built some fine-grained
instruction-level monitoring tools to do so. On-the-fly redundancy detection and elimination, for
example, detecting and eliminating redundant synchronization and a scalable synchronization
algorithm that is better suited for NUMA architectures. I know Catherine is smiling, because
Rice has some tradition of doing scalable synchronization algorithms. So in building
performance analysis tools, I usually apply these principles. Measure accurately, introduce no
blind spots. If there is dynamically loaded code in the program, measure it as well. Do not dilate
what you are measuring. For example, if you are measuring cache misses, do not introduce
cache misses of the tool itself. Introduce low measurement overhead, so that the tool can scale to
large parallel systems. Observe on real production runs, so that you observe what is happening
in fully optimized code, and attribute contextually -- this is a recurring theme in this talk. That
deserves a slide. Context is important in large applications. Modern software is made of layers
and layers and layers of libraries. You have math libraries, communication, application
framework and so on. And we often use generic programming, such as C++ templates, where
instantiation is important. So if a tool says that you are spending a lot of time in a wait, you
exactly want to know how you reach that wait state. For example, in this climate code which has
several components, you exactly want to know what wait is the one that is causing problems.
Say, if main, through an ocean component, calls wait, you want to know that that's where you need to
focus attention for improving. That's why performance is highly context dependent. So I have
organized the rest of this talk -- yes.
>>: So this context, there can be like static code context and data context.
>> Milind Chabbi: Yes, both.
>>: And then the same function were called on different data structures, it behaves differently.
>> Milind Chabbi: Differently. I have tried to address both, where our data-centric attribution
capabilities tell you that, in this code, when called from main code, blah, blah, blah, and it was
touching a data that was allocated in this particular context, is when you are seeing certain
performance anomalies. So rest of the talk, I have -- I will revolve it around these themes, where
you have resource idleness and how we can pinpoint and quantify how we identify wasteful
resource consumption and how we can tailor for an architecture. Modern architectures are
increasingly employing accelerators. This is Titan, which is the second-largest supercomputer. It is
made of AMD Opteron processors and GPUs. So some of the codes that we looked at that were
using accelerators were LAMMPS and NAMD, which are molecular dynamics codes, LULESH,
which is a hydrodynamics code, and Chroma, which is a lattice field theory. There are many,
many more codes being rewritten to make use of accelerators. This is just a few that we looked
at.
>>: So are these running on this architecture.
>> Milind Chabbi: Yes.
>>: These are a set of very important --
>> Milind Chabbi: Important, yes. For example, LULESH is a proxy application that many,
many people have looked at. It's a hydrodynamics code, how a body changes when you hit it
with a bullet or some material.
>>: A chicken. It's for flight simulation, when you hit something on the airplane, so the airplane
doesn't crash.
>> Milind Chabbi: Right, and many other similar other kinds of applications that the National
Labs care about.
>>: Like one of my relatives did. They fired frozen chickens at the airplane until they had this
code, and then they did it in simulation. Really. But it's really -- it changes how you do science,
if you have good simulation tools, right?
>> Milind Chabbi: So challenge is tuning these scientific applications to make use of
heterogeneous architectures, and an observation is that these codes have a well-tuned CPU
portion that has been developed for the last many years and an emerging GPU portion. So
performance analysis tools play a vital role in tuning these applications for these architectures.
This is the execution trace of LAMMPS, which is a molecular dynamics code. On X-axis is
time, Y-axis is various resources, resources being a CPU thread and two concurrent threads
running on GPU. And gray color indicates idleness, which means on GPU either nothing is
running. On CPU, it means it's waiting for something to finish on GPU. What you can notice is
there is a lot of idleness. They are being used in a mutually exclusive fashion. If GPU is
running, CPU is idle. If CPU is doing work, GPU is not doing anything.
>>: So just to understand the process here, you're looking at a single machine?
>> Milind Chabbi: Yes.
>>: This code executes in a larger context?
>> Milind Chabbi: In a larger context. It has exactly the same behavior if I unfold it for all the
concurrent MPI processes.
>>: Right, so is there any -- but when you are going to talk about performance, you're talking
mostly about a single machine and then extrapolating to the entire cluster?
>> Milind Chabbi: No, both ways. Sometimes, if the --
>>: So obviously there are situations --
>> Milind Chabbi: Situations where you need to look at all of them, situations where the
behavior is symmetric and you can look at one of them. So idleness wastes compute cycles.
Offloading an entire execution to GPU wastes CPU cycles, and only making use of CPU and not
making use of GPU, you are not making good use of your throughput-optimized cores. A better way is to
overlap CPU and GPU computation. There are a couple of ways to do it. You could divide the
principal execution itself between both CPU and GPU, or if the code is not amenable for that,
you could be using CPU to prepare next piece of work when GPU is busy doing current piece of
work. This is pipelining the execution. So having identified that idleness is an important source
of performance loss in these kinds of systems, the performance analysis that we do for
heterogeneous systems focuses on idleness and attributes idleness to its causes. The insight is
that when a resource is idle, you can recognize that. That's a symptom of performance loss. The
cause of such loss is another resource that is at work. So you can blame the cause. Simple
analogy, if GPU is working and CPU is idle, the idleness that you notice in CPU is because of
GPU having not finished the work yet and vice versa. Another analogy is if you have lock
contention going on in a multithreaded code, if many threads are waiting to acquire a lock, you
can recognize they being spin waiting or something, that's a symptom. The cause is the thread
that is holding the lock, and you can push all this waiting, as a blame, to the critical section or the
thread that is holding the lock. Is that analogy -- okay. So one can tune code regions that
accumulate a lot of blame, and typically, tuning such code reduces your critical path in a parallel
program. We call this as CPU-GPU blame shifting, because you shift the blame from a symptom
to its cause. So in CPU-GPU systems, if this X-axis is time and I have two resources, CPU may
offload a kernel -- Kernel A, to GPU and go on to do some work, and they are well overlapped.
And only towards the end, CPU waits for about 5% time for Kernel A to finish. And then there
is another piece of code where CPU offloads another piece of work, Kernel B, and after a little
bit of work, now it waits for Kernel B to finish, and that wait is about 40%. And as I said, in
many of these codes, CPU is already well tuned. GPU is not so well tuned. A classical hotspot
analysis that tells you which resource is consuming a lot of time will tell you that Kernel A, being
the longest-running, is the one that you might want to tune, but because it is well
overlapped, tuning this can give you only about 5% performance improvement, whereas Kernel
B, which the blame shifting identifies because there is a lot of idleness, if you tune this, you gain
a lot of benefit in your end-to-end performance, because it reduces the critical path. So the
insight is the top GPU kernel may not be the best candidate for tuning. The vice versa is also
true. If you are interested in tuning a CPU portion, the one that has less overlap with another
component, GPU being here, is the one to tune and that reduces the critical path. We have
implemented this in HPCToolkit, which is a performance analysis tool being developed at Rice
University. It supports multilingual fully optimized, statically or dynamically linked
applications. We do not do any source code modification. We take the fully compiled,
optimized code. We measure performance using asynchronous sampling of timers and hardware
performance counters. Because it is sampling based, it incurs very low overhead. We attribute
performance to call stack, full call path. We can do so in many programming models. It could
be Pthreads, OpenMP, MPI, CUDA or any combinations of these. It is decentralized. There are
no centralized bottlenecks, and it scales to thousands of processes.
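To make the blame-shifting accounting concrete, here is a minimal, self-contained C++ sketch of the idea only -- it is not HPCToolkit code, and the trace, kernel names, and tick granularity are invented for illustration. Each sampling tick that finds the CPU waiting on the GPU is charged as blame to whichever kernel is currently running, so the kernel responsible for the idleness, not the longest-running kernel, accumulates the blame.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Toy model of CPU-GPU blame shifting (illustrative only).  Each sampling tick
// observes the state of both resources; if the CPU is idle waiting on the GPU,
// the tick is charged as "blame" to the kernel that is currently running,
// because that kernel is the cause of the CPU's idleness.

enum class CpuState { Working, WaitingOnGpu };

struct Sample {
    CpuState cpu;
    std::string active_gpu_kernel;  // empty string means the GPU is idle
};

int main() {
    // A synthetic trace: Kernel A is well overlapped with CPU work; Kernel B
    // leaves the CPU waiting for most of its execution.
    Sample trace[] = {
        {CpuState::Working,      "A"}, {CpuState::Working,      "A"},
        {CpuState::Working,      "A"}, {CpuState::WaitingOnGpu, "A"},
        {CpuState::Working,      "B"}, {CpuState::WaitingOnGpu, "B"},
        {CpuState::WaitingOnGpu, "B"}, {CpuState::WaitingOnGpu, "B"},
    };

    std::map<std::string, int> gpu_time;   // classic hotspot profile
    std::map<std::string, int> cpu_blame;  // blame-shifting profile

    for (const Sample& s : trace) {
        if (!s.active_gpu_kernel.empty())
            ++gpu_time[s.active_gpu_kernel];
        // Blame shifting: an idle CPU charges the kernel it is waiting on.
        if (s.cpu == CpuState::WaitingOnGpu && !s.active_gpu_kernel.empty())
            ++cpu_blame[s.active_gpu_kernel];
    }

    for (const auto& [kernel, ticks] : gpu_time)
        std::printf("kernel %s: GPU time %d ticks, CPU-idleness blame %d ticks\n",
                    kernel.c_str(), ticks, cpu_blame[kernel]);
    // Kernel A dominates GPU time, but Kernel B accumulates the blame:
    // tuning B, not A, shortens the critical path.
    return 0;
}
```

In the real tool this accounting happens inside an asynchronous sample handler and is attributed to the full calling context rather than just a kernel name.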
>>: So you just walk the stack to get the call path, or do you insert instrumentation?
>> Milind Chabbi: We do not insert any instrumentation. We do an on-the-fly binary analysis
and recognize, if this is the PC, where will my return address be and so on, to unwind your call
stack.
>>: So that works generally, even if I, for instance, remove frame pointers?
>> Milind Chabbi: Yes, works without frame pointers. It is full of heuristics that work very
well, and in the worst case, if you haven't found it, you can trawl the stack a little bit to recognize,
oh, this looks like a return address, and let me use that. We applied blame shifting on some
important codes. For example, LULESH, which has already been looked at by many researchers.
It's a shock dynamics code, and blame shifting identified that 30% of GPU time was idle due to GPU
memory allocation routines that were happening in each time step. In each time step, you would
allocate memory on GPU and deallocate it; during that time, CPU was doing the allocation, and
GPU was idle. And when we recognized it, we noticed that the amount of memory
allocated in each time step is exactly the same, so instead of doing it each time step, one could
just hoist it, allocate it once and reuse it each time. That gained us 30% performance
improvement in this code. In another code, which is EpiSimdemics, which is an agent-based
simulation for spread of contagion, the CPU and GPU were being used mutually exclusively, like
the one I showed in the earlier diagram, and we used the pipeline parallelism, where as soon as a
little bit of data was available, we started computation on GPUs. And that -- and we used a
particular runtime, called as MPI-ACC being developed at Argonne National Lab, and that gave
us about 62% speedup in this application. One of the impacts is this technique is being proposed
as to be used in an important accelerator manufacturer's product roadmap. I cannot tell you who
the manufacturer is. Moving on, that's about --
>>: [Indiscernible].
>> Milind Chabbi: Pardon me?
>>: Have we looked at [indiscernible] HPC applications [indiscernible]?
>> Milind Chabbi: No, I did not. That's an interesting one to try. Yes, thank you.
>>: Would something built on strace that just looks at system calls be sufficient to understand
this notion of blame?
>> Milind Chabbi: Let me think.
>>: So it's been a while since I've been in Linux, but if you strace something and you notice that
the CPU makes a system call, way out to the GPU, and then doesn't do anything for a while until
the response from that I/O call.
>> Milind Chabbi: Could do.
>>: It's sufficient to realize that the --
>> Milind Chabbi: It may have some important information. Thinking of it, the difference is,
you have to collect the entire trace and then do a postmortem analysis. Collecting trace is tricky,
depending on your size of parallel program. Traces can grow really, really large. This is a
profiling technique where, as soon as you identify idleness, you instantaneously blame the code
running on another component. That's the difference in blame shifting. Anything that you can
do with profiling you can do with tracing. Just it is postmortem and you have to collect a lot of
data.
>>: So one of your -- go ahead.
>>: But on the other hand, does this blame really attribute it to this previous execution of
the program?
>> Milind Chabbi: Yes.
>>: In other words, because scheduler can be dynamic, we will have a different execution order
of the same program that this blame may not be valid anymore.
>> Milind Chabbi: Yes -- in a DAG-based execution, we are thinking how to do blame shifting in a
DAG-based execution, where the schedule is not the same all the time. This is more useful in
statically partitioned executions. Thank you. You may still find useful results in
dynamically scheduled ones, but iteration to iteration, if it changes, run to run, if it changes, then
it is tricky. If the sampling is representative of what is happening over time, it may still give you
good enough insights.
>>: Yes, I think the tricky one is that you basically identify the impact of the critical path.
>> Milind Chabbi: Critical path on a DAG.
>>: When you reduce this critical path, then the blame shifts to, say, an alternative.
>> Milind Chabbi: An alternative, right.
>>: But potentially, this could shift to another one, so when the DAG structure is
complicated, it could be hard to analyze the entire impact.
>> Milind Chabbi: That's true, that's true. There could be many closely related critical paths,
and you may identify one, you optimize this, and then the next one becomes the critical path.
That's the issue any time you are trying to optimize critical path.
>>: Exactly.
>>: So when you --
>> Milind Chabbi: Then you have to go for hard critical paths.
>>: So when you showed the example, and if you go back, one of your solutions was to do
better pipelining.
>> Milind Chabbi: This one?
>>: Yes, but can we go back to the example where you have the 40%? Okay, so in this one, it
could be that Kernel B is actually highly optimized, and this is an example where you want to do
pipelining. Or there could be some work that's independent of Kernel B that you could shove
into here. How do you know which thing you should do?
>> Milind Chabbi: Case by case. At that point, an automated tool will not help you. Automated
tool tells you what's going on, where are the potential problems. Solution is in the mind of the
application developer.
>>: Okay. So here, you say you can either -- but you could give advice. You could say Kernel
B is your bottleneck. You could shift work, independent work in here, or you could optimize
Kernel B.
>> Milind Chabbi: Yes, either of them is a good option.
>>: Are there other options?
>> Milind Chabbi: Not off the top of my head. I'll keep thinking.
>>: I can't think of any off the top of my head, either.
>> Milind Chabbi: Yes, usually, pipelining is something that people prefer, because bringing
some other piece of work can change your critical path. That may take longer than this one.
>>: Right. Pipelining is more directly going -- more localized than some of the other options.
>> Milind Chabbi: Right.
>>: Thank you.
>> Milind Chabbi: You're welcome. So going on to wasteful resource consumption, another top
cause of performance losses, I have split that into internode and intranode, how wastage happens
across nodes and how that happens within a node. First, across nodes. NWChem is this
computational chemistry code from DOE. It is a flagship quantum chemistry code. It uses this
thing called the Global Arrays Toolkit for programming, where the data is physically distributed onto
several nodes, but each process gets this illusion of being able to access the entire data structure
using get-put style one-sided semantics. And the software is written in SPMD-style PGAS,
Partitioned Global Address Space programming model, and it's a large source code, with more
than 6 million lines of code and about 25,000 files. It scales quite well. It has been shown to
work on more than 100,000 processes. It is widely used. There are more than 60,000 downloads
worldwide and about 200 to 250 scientific publications are produced per year based on this
framework. This is something impactful. So in terms of the software organization, there is
NWChem chemistry code written, usually in FORTRAN, and then it calls into this Global Array
Framework, which in turn calls into ARMCI, which is an Aggregate Remote Memory Copy
Interface, which is being reorganized into communication [indiscernible] scale these days, and
all these can go on different substrates. For example, it can use MPI, Message Passing Interface,
or GASNet, which in turn can go on InfiniBand, DMAPP, on Cray machines, PAMI, on IBM
machines and so on. Basically, it's a layered software. In PGAS programs, typically, people use
a lot of barriers, because that's how you can enforce consistency of data between asynchronous
updates. But communication libraries for HPC are typically written agnostic to an execution
context, so developers use -- make conservative assumptions about underlying layers, and they
enforce barriers on entry and exit to each API. For example, each time this layer calls into this
layer, there will be a barrier to here and one barrier to here, and call into here, a barrier and a
barrier. This layering and use of APIs leads to redundancies, as I'll show in the next slide. This
is four lines of code written by an application programmer that sits in the chemistry component.
If you track down what happens underneath -- this is the call graph -- you will see that there are
nine barriers in four lines of code. And in the Global Array layer, there is a sync at entry, sync at
exit, sync at entry, sync at exit, and so on. Now, of those nine barriers, at least three are
redundant. For example, the one that is called at the exit and the one that is called at entry, there
is absolutely no data update in between these two. But conservative assumptions put
barriers in, and they lead to redundant barriers. Yes?
>>: So would a smart inlining -- if I just inlined this, would the compiler figure this out?
>> Milind Chabbi: Multilingual, multi-component, there is -- as of now, I don't know of analysis
that does barrier elimination, and safety is the other thing. Is it impossible? No. One can do
static analysis, just --
>>: Yes, but even if I'm working at the binary, right? If I'm looking at x86, at link time, if I did
the inlining?
>>: Then you can't identify the barriers very well.
>> Milind Chabbi: Then you have to understand the data access before a barrier and after a
barrier in a binary analysis.
>>: Right, right.
>>: So can you explain why, if I'm doing a GA copy -- let's say I'm copying two arrays. Why
does GA copy need to synchronize the whole program? It just needs to synchronize those
threads that access GA, right?
>>: Yes, but it's --
>> Milind Chabbi: But data A and B are spread all over, on all processes.
>>: Okay, so I guess I haven't understood the programming model yet. But there's one -- so
when I say GA copy, then there's one thread that's copying.
>> Milind Chabbi: Yes. It's SPMD. Each process says, let's copy. So your role is, I copy
the data that's present in my local portion of A to my portion of B. That's what it is doing.
>>: So everybody is operating on the global array. Okay, great. Thank you.
>> Milind Chabbi: Okay, nine barriers in four lines of code, three redundant. Program spent
20% time doing barriers. Composition of APIs leads to redundancies. Redundancy is contextual.
You cannot go and delete this thing, because there may be another path where it is called, and
that's not followed by this, which means then it is necessary.
>>: But in a sense, I would expect that if threads have already come to a barrier, adding another
barrier is probably not --
>> Milind Chabbi: There is no load imbalance problem, but the cost of barrier itself is high.
Barrier incurs latency, and it can lead to a little load imbalance, because one process may get
stuck, may get stalled, may have slow CPU. CPU may get throttled, so the arrival is not perfect.
So to answer your question, it is multilingual, multiple programming models, MPI, OpenMP, so
that's why elimination using compiler-driven techniques is tricky. So our idea is to detect
redundant barriers at runtime by intercepting certain communication calls, and speculatively
elide redundant barriers, but when we do so, we are being conservative and lightweight. So to give
some background on when barriers are needed and when they are redundant, we have these two
processes, P1, P2. If there is a put of data X and a get of X on P2, then barrier is needed, because
there is a data dependence across the process here. Over here, two processes touch data, and
they touch a different data after a barrier, so there is no data dependence here. Hence, this is a
redundant barrier. Over here, P1 does a put of X and then does a get of X. There is data
dependence, but it does not cross process boundaries, so even here, barrier is redundant. If there
is no data being touched after a barrier, those barriers are redundant, and if there is no data being
touched -- by data, I mean the shared data -- not being touched before a barrier, then even that is
redundant. So this led us into thinking about how barriers can be eliminated when they are
needed, when they are not. So if you can think of two barriers, consecutive barriers, and observe
all types of accesses in between two barriers, you can classify them into three categories: N,
which means the process did not do any remote memory access -- remote as in any shared-data
memory access; L means it accessed data, but the data was resident on the same process, local;
R means the process accessed data that was remote -- it was not on the same process. These
L, N and R form a lattice. For example, if one process did not access shared data and another
process accessed remote data, the lattice join is as if you accessed remote data. If you take triplets of
an access before a barrier and what is an access after the barrier, if you have an execution trace
of all processes, the entire execution trace of the program, you can look at these accesses and say
no access before, no access after. That barrier is redundant, whereas if there was a remote access
before a barrier and a remote access after a barrier, then it is not safe to eliminate such barriers.
Then you can say, well, that barrier is needed. You could do this if you have the entire execution
trace for all processes, but if I am on a barrier and I want to decide at runtime should I participate
in the barrier or not, I cannot, because I don't know what is coming next. So a priori knowledge
of an access is unknown, but there is a silver lining here. One can make safe approximations, if
you notice -- no matter what comes after a barrier, if there was no remote memory access
before a barrier, you can always remove such a barrier. So that says if no remote memory access
and a barrier, remove such barrier. If there was only local access, there is at least one case where
it is not safe, so you cannot eliminate such barrier. If there was a remote access before a barrier,
there are two cases where you cannot eliminate a barrier, so conservatively, keep the barrier.
Now, you might be thinking, fine, I as an individual process know my kind of accesses. How
will I know what's happening in some other process to make a global decision of whether to
participate in a barrier or skip a barrier? For a second, assume an oracle tells you, if you know
your local status, you know the status of the entire system. I will tell you what that oracle is.
Just assume there is an oracle, and let's try to apply these three rules to the core I had shown you.
Over here, there is a barrier and a barrier, no remote memory access in between. My N and B
rule says, skip such barrier. Don't participate. Then there is L and B. LB rule says participate in
that barrier. Then there is N and B. My NB rule again says skip such barrier, so that's how one
can apply these rules on the fly. So let's see what is that oracle. The oracle is simple. It is
actually a learning and elision mechanism, where we identify local redundancy of a barrier
instance in its calling context. You arrive at a barrier. You note the kind of access you made
from last barrier to this barrier, and you have learned something about yourself. And while
learning, we replace barrier with reduction, and each process tells its local state in the reduction
and the result of reduction tells you back, was it needed in some process or was it not needed in
any process? If it is system wide redundant, then such barrier becomes an elision candidate in
that context. If it is needed in some process, then we will say we cannot elide such barriers, and
the cost is low while learning, because passing one additional piece of information in the reduction has
very little overhead. Once we have learned something, we speculatively mark a barrier as
redundant for a calling context if all instances are globally redundant, and we elide future episodes
that are marked as elision candidates.
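A minimal C++ sketch of the access classification and the conservative decision rule just described -- the N/L/R lattice and the "skip only if nothing preceded the barrier" rule. The names and the three-process example are illustrative, not the NWChem/Global Arrays implementation.

```cpp
#include <algorithm>
#include <cstdio>

// Access kind observed by one process between two consecutive barriers.
// N: no shared-data access, L: only locally resident shared data, R: remote data.
enum Access { N = 0, L = 1, R = 2 };

// Lattice join: the system-wide classification is the "worst" access any
// process performed (N < L < R).
Access join(Access a, Access b) { return std::max(a, b); }

// Conservative decision at a barrier, knowing only what happened *before* it
// (the future is unknown):
//   N before the barrier  -> safe to skip, no matter what follows.
//   L or R before         -> some continuation makes elision unsafe, so keep it.
bool safe_to_skip(Access before_barrier) { return before_barrier == N; }

int main() {
    // Example: three processes report what they did since the last barrier.
    Access local_state[3] = {N, N, N};

    // During learning, the barrier is replaced by a reduction that computes
    // the lattice join of everyone's local state.
    Access global = N;
    for (Access a : local_state) global = join(global, a);

    std::printf("barrier is %s in this calling context\n",
                safe_to_skip(global) ? "redundant (elision candidate)"
                                     : "required");
    return 0;
}
```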
>>: So this is the full calling context.
>> Milind Chabbi: Full calling context, yes. So each time you arrive at a barrier, you do a call
stack unwind and record it. So if I am a P1 process, I arrive at a barrier. I wouldn't have, say,
done any remote operation. My instrumentation would unwind the calling context. P2 would
arrive, do the same thing, and then while learning, they participate in a reduction and they inform
what they have found about themselves, and they learn something about that. In this case, they
learn that the barrier is not needed, and they locally make this decision. In that context, it can
be elided, and they do it for a certain number of iterations, which is tunable, and once they have
learned enough, when a process arrives, it says, I am in this context -- what have I learned
about it? Elide. So it does not participate in the barrier and goes forward. The same thing
for another process, as well. It arrives, checks its local information, and skips. So can there be
misspeculation? Yes. Misspeculation can happen if, when you arrive at a barrier, you are not
following the consensus that you have made previously, and we can always detect
misspeculation, because you would have broken the consensus as a process. There are times
when you can recover. For example, if all processes made misspeculation, they all go into a
reduction and they understand. Otherwise, we rely on checkpoint restart facility that is available
in NWChem and start from last checkpoint. In NWChem, luckily, there were no misspeculation.
Some amount of training made sure that once it is redundant, it is always redundant because of
the SPMD style programming. To implement, we instrumented barriers, some remote memory
calls, certain application-specific features that were bypassing underlying -- whenever they were
accessing local data, sometimes, they were bypassing well-known calls. And all this cost only
1% instrumentation overhead and scales perfectly, because most of it is all local operations.
Now, how do we gain developer confidence if we start doing this kind of elision on the fly? So
to gain more developer confidence, we have a multipass approach, where in one pass, we only
identify what are redundant barriers and present it to the user in a summarized way. Here, in this
context, we think this barrier is redundant, do you want to elide us? And if the developer says,
yes, go ahead, elide, in the actual production run, only if the context matches we will elide. So in
a production of NWChem running on Cray XC30 machine with about 2,000 processes, we were
doing a simulation to understand catalytic destruction of ozone. There were 138K instances of
barriers for a code running for about 30 minutes, and they were spread across 8,000 unique
calling contexts, and there were 63% barriers that were redundant. 63% is that of this one, so as
you can imagine, this layering of software causes a lot of redundancy. And by eliminating them,
we gained about 14% running time improvement.
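To make the learning-and-elision oracle concrete, here is a rough C++ sketch with MPI_Allreduce standing in for whatever reduction the underlying runtime provides. The real implementation intercepts Global Arrays/ARMCI calls, keys on full calling contexts, and handles misspeculation and recovery, none of which is shown; the context id here is just a placeholder integer.

```cpp
#include <mpi.h>
#include <unordered_map>

// While learning, a barrier at a given calling context is replaced by an
// allreduce (MAX) over each process's local access kind since the previous
// barrier (0 = none, 1 = local, 2 = remote).  If the maximum is 0, no process
// needed the barrier in this context, so future instances become elision
// candidates.  Misspeculation detection is omitted in this sketch.

static std::unordered_map<long, bool> elidable;  // calling-context id -> decision

void learned_barrier(long context_id, int my_access_kind, MPI_Comm comm) {
    auto it = elidable.find(context_id);
    if (it != elidable.end() && it->second)
        return;                       // learned as globally redundant: skip it

    int global_kind = 0;
    // The allreduce doubles as the synchronization point while we learn.
    MPI_Allreduce(&my_access_kind, &global_kind, 1, MPI_INT, MPI_MAX, comm);
    elidable[context_id] = (global_kind == 0);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    // Every process reports "no shared-data access" (0) in context 42, so the
    // first call learns the barrier is redundant and the second call skips it.
    learned_barrier(42, /*my_access_kind=*/0, MPI_COMM_WORLD);
    learned_barrier(42, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```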
>>: So the checkpoint in NWChem was an application checkpoint?
>> Milind Chabbi: No, the library does the checkpoint for you, periodically.
>>: For other reasons.
>> Milind Chabbi: Not for this.
>>: For failure, because if the code has a lot of state and they fail, it lasts for hours.
>> Milind Chabbi: Yes. Next is the kind of wasteful resource consumption that happens within
a node. Memory access is expensive within a node because of multiple levels of hierarchy, cores
sharing a cache and having limited bandwidth per core. That drive happens when two writes
happen to the same memory location without an intervening read. So here is a dead write. We
have int x equal to zero and x equal to 20. The first x equal to zero is a useless operation. That's
a dead write, and for this discussion, we will call this one as a killing write. What is not a dead
write is over here, x equal to zero, and then we print the value of x, and then x equal to 20. Now,
compiler scan eliminates something like this. Dead code elimination can get rid of dead writes,
but what is new is this piece of code that we came across in Chombo, which is Adaptive Mesh
Refinement framework for solving partial differential equations. There is a three-level nested
loop in Riemann solver, and the program spent 30% time doing this. So over here, the first three
lines are an assignment to a four-dimensional array, and then a check is made. If that check is
true, the same array is overwritten with a different set of values, so this is killing that if this
condition is true. Another condition is being checked. If that is true, these three or maybe that is
overwritten again, with some more values. Now, compilers cannot eliminate all dead writes
because of aliasing and ambiguity, aggregate variables such as arrays and structures and
compilers apply optimization within function boundaries or within some limitations, whereas
there is always some late binding code can be loaded on the fly, and there is partial dead write.
For example, it may be dead on one path. It may not be dead on some other path. So over here,
this code lacked for performance. A very simple way to fix this is using else-if structuring,
where you do this thing first. If that condition is not true, then you do that. If that is not true,
you fall to the default one, where it will eliminate all dead writes for you. Doing so sped up the
loop by 35% and program by about 7%, so I was motivated to see how frequent these dead
writes are in any program, so I ended up writing this tool called as DeadSpy, which monitors
every load and store and maintains state information about every memory byte, and it detects
dead writes in an execution using a very simple automaton. Each memory allocation starts
within a virgin state. A read operation takes it to an R state, write operation takes it to W, a read
following a write takes it to R. If several reads follow, it will remain in the same state. A
write will transition back to W. If there is a write that follows a write, the automaton will detect
it and report it. That's a dead write. Now, to make it precise, we are doing this at byte level.
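Here is a minimal C++ sketch of that per-byte automaton. The real DeadSpy sits on top of binary instrumentation of every load and store and also records the calling contexts of the dead and killing writes; this toy version only counts write-after-write transitions, and the address and sizes are invented for the example.

```cpp
#include <cstdio>
#include <unordered_map>

// Per-byte state machine for dead-write detection: a write followed by another
// write to the same byte, with no intervening read, makes the first write dead.

enum class State { Virgin, ReadLast, WriteLast };

struct DeadWriteDetector {
    std::unordered_map<unsigned long, State> byte_state;
    unsigned long dead_writes = 0, total_writes = 0;

    void on_read(unsigned long addr) { byte_state[addr] = State::ReadLast; }

    void on_write(unsigned long addr) {
        ++total_writes;
        State& s = byte_state[addr];               // defaults to Virgin
        if (s == State::WriteLast) ++dead_writes;  // write-after-write: dead
        s = State::WriteLast;
    }
};

int main() {
    DeadWriteDetector d;
    // Mimics "x = 0; x = 20;" on a 4-byte int at address 0x1000: the second
    // write kills the first, so 4 of the 8 byte-writes are reported dead.
    for (int b = 0; b < 4; ++b) d.on_write(0x1000 + b);  // x = 0
    for (int b = 0; b < 4; ++b) d.on_write(0x1000 + b);  // x = 20 kills it
    std::printf("%lu of %lu writes were dead\n", d.dead_writes, d.total_writes);
    return 0;
}
```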
There are no false positives or false negatives. Yes.
>>: So there's also an implication here that anything that leads to a dead write is itself dead, so
actually, it's that -- I want to make sure that that's true. I guess, if you do a backward slice from
dead write, that implies --
>> Milind Chabbi: By dead, you mean the killing or the dead.
>>: The dead.
>> Milind Chabbi: Yes, everything that led to the dead write is also useless. I'll take that back.
Those computations may be needed later.
>>: Yes.
>> Milind Chabbi: But there is some transitivity there.
>>: Did you look into that?
>> Milind Chabbi: This is something that someone else also suggested to me: okay, if you
have like 10% dead writes, how much of the previous computation is causing that one?
That is a transitive relationship which is interesting to explore. Thank you. It's insufficient if I just
told you that there is 30% dead write. You want to know where it happened, what are the two
parties involved and how you reached those positions, so that's why you need calling contexts that
pinpoint where the dead write happened and where the killing write happened. And we do the source-level attribution along with calling context. And I ran this on several codes. I'm showing you
only the ones on SPEC integer reference benchmarks. In general, dead writes are very frequent.
Lower is better here, higher dead write is bad. About 20% dead writes in SPEC integer reference
benchmark, and GCC had particularly astonishingly high dead writes, 60% on average, and for
one particular input, there was 76% dead writes. So I was interested in knowing what's going on
in here, and it happened in this piece of code called loop_regs_scan, where GCC is
walking through basic blocks belonging to a loop and identifying what are the registers being
used. So the code begins here by allocating an array of 16,000 elements to this last_set array,
and then it is walking through each instruction of a basic block and a particular pattern matches.
It says, okay, that register is being used. And then once a basic block finishes, it says, okay, my
work is done for this block. Let me reset the whole array with zero and start doing it with next
block. So that is killing this assignment, because typically, basic blocks are short. Median use
of number of elements in this array is only two, and you have a 16,000-element array, so a dense
array is a poor choice of data structure here, and it keeps happening over and over again, as you
use it for several basic blocks. So we replaced this array with sparse data structures. I used splay
trees, because they have this property of the frequently accessed position being in your cache and easy
to access, faster access. That led to about 28% running time improvement in GCC. Found many
other codes where it was useful. For example, NWChem had a redundant initialization of a large
memory buffer, and when we eliminated that redundant initialization, we gained 50% speed up
when running on 48 cores. BZip2 had a particularly overly aggressive compiler optimization that
hoisted code from a non-executed code path onto a hot code path. And yes, that's a bad
optimization. That was causing a lot of dead writes, because of stack spilling happening over
and over again. When we eliminated that, we gained 14% running time improvement. HMMER,
which is a code for DNA sequencing, had dead writes not being eliminated because of aliasing,
and once we made compiler aware that two arrays were not aliased, it removed dead writes, as
well as it did vectorization for us, and we gained 40% improvement in running time. So dead
writes, DeadSpy, there were lessons to learn from it. Dead writes are common symptoms of
performance inefficiencies. Choose your data structures prudently. Suggestions to compiler
writers is pay attention to your code-generation algorithms. Know when they may do more harm
than good. Okay, that's one. And profiling for wasteful resource consumption opens a new
avenue for application tuning, and context is really important in understanding feedback from a
tool. You want to know where dead writes happened, in what context can you optimize.
Actually, these context-sensitive tools are useful in many other situations. There are tools for
correctness -- for example, data race tools, where you want to know where a race happened and
what was the other access that led to that race and you want to know the full context that led to it.
There are other performance analysis tools. For example, reuse-distance analysis, where you
want to know where was the previous use and where is the reuse, and if there is a common
ancestor to which you can hoist your use and reuse, then you can retain the data in the memory hierarchy.
We have written a tool, an open-source library that anybody can use for their fine-grained
instrumentation. We call it as CCTLib. In the interest of time, I will skip this portion, but you
can stop me later if you are interested. So the last portion of this talk is the mismatch that
happens between architectures and applications or algorithms, and how one can tailor for an
architecture. Modern architectures use NUMA. For example, this is IBM Power 755, which has
four IBM 7 processors. Each processor is four-way SMT sharing, L1 and L2 cache. That forms
your first level of NUMA hierarchy. There are eight cores sharing the L3 as a victim cache.
That forms the second level of NUMA hierarchy, and they are connected through a fast network,
forming the third level in the NUMA hierarchy. If you thought three is the number of levels in a
NUMA hierarchy, you will be surprised. SGI UV 1000 is the world's largest shared-memory
machine, with 4096 cores, and each node, which is also called as a blade, is composed of two
Intel Nehalem processors, and every node can access the memory of every other node using
loads and stores -- not gets and puts -- loads and stores. So two way SMT within a core forms
first level of hierarchy. Eight cores per socket form the second level of NUMA hierarchy. Two
sockets on the same node form the third level of NUMA hierarchy, and this is logical diagram of
a rack of these nodes, and there can be up to three hops from one node to other. That forms
fourth, fifth and sixth level in the NUMA hierarchy, and several such racks are joined together to
form this half of the machine, and there are five hops from one node to the other. That forms eight
levels in the NUMA hierarchy. Now, locks are one of the fundamental features in shared
memory architectures.
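As a rough illustration of why these levels matter, the following C++ sketch (assuming Linux/glibc and a compiler with -pthread; g++ defines _GNU_SOURCE by default) pins two threads to chosen hardware threads and measures how fast a single cache line can be handed back and forth. Moving the second thread to a farther NUMA level inflates the hand-off time, which is essentially the cost a lock pays when ownership crosses domains, as the measurements discussed below quantify. The CPU ids are placeholders for whatever the machine's topology dictates.

```cpp
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Bounce a flag between two pinned threads and report the average round-trip
// time.  Re-running with cpu_b on an SMT sibling, another core, another socket,
// or another blade exposes each NUMA level as a jump in hand-off latency.

static std::atomic<int> turn{0};

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void bounce(int cpu, int me, int iters) {
    pin_to_cpu(cpu);
    for (int i = 0; i < iters; ++i) {
        while (turn.load(std::memory_order_acquire) != me) { }  // wait my turn
        turn.store(1 - me, std::memory_order_release);          // hand off
    }
}

int main() {
    const int iters = 200000;
    const int cpu_a = 0, cpu_b = 1;  // vary cpu_b to cross NUMA levels

    auto start = std::chrono::steady_clock::now();
    std::thread a(bounce, cpu_a, 0, iters);
    std::thread b(bounce, cpu_b, 1, iters);
    a.join();
    b.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("avg hand-off: %.1f ns\n", double(ns) / (2.0 * iters));
    return 0;
}
```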
>>: Where are you distinguishing between loads and stores and gets and puts? They should be
the same, right?
>> Milind Chabbi: Yes, gets and puts are also as problematic. Actually, they are even more
problematic than loads and stores. Just because this is a shared memory machine, people tend to
write OpenMP, Pthreaded-style programs. A little difference with gets and puts is that you do not
have a cache problem. They are -- you take the data from somewhere, put it into your big, large
buffer. So the issue is more of the communication between node to node than anything going
through your memory hierarchy. Make sense? That's why. Okay, so locks are one of the
fundamental features in parallel programs, and they have been studied for a long time, and
centralized locks are bad, and one of the good scalable locks is the MCS lock, the Mellor-Crummey and Scott lock, and I will explain what is MCS lock, how it works and what are the
deficiencies of that lock when you are using it on a NUMA machine. So MCS lock has a lock
variable. Each thread arrives with a record, which has a status field and a next field. Here,
thread T1 arrives, and let's say it arrives in NUMA domain one. The protocol involves
swapping this lock pointer to itself, and if you had no predecessor, then you are the lock holder,
and you enter the critical section. In the meantime, let's say another thread arrives in a different
domain with its own record. Now, it swaps the tail pointer to itself. Now, this was cached in
this domain, so that was a cache miss for this one. Now, it is waiting, and it wants to tell its
predecessor that it is the successor, and it knows who its predecessor is by swapping the tail
pointer. It goes and pokes the next pointer and says he's the successor. While poking that, it
incurred a cache miss, because this was present in a different NUMA domain. And now, it goes
into a local spin. This thread eventually finishes the critical section, and it wants to tell its
successor that it is the lock holder, so first, it accesses its next pointer, but that was cached in
the other domain, because that thread was the last accessor. And now, it goes and touches the status field and
says, you are the holder. That is another cache miss. At some point, this thread realizes that the
status field has changed, and because it was previously accessed by a thread in a
different domain, it also incurs another cache miss. And then, the data that T1 accessed is most
likely to be accessed by T2. That's why we are holding a lock, so that whole data moves from
one domain to another domain, so the lock and the data accessed in the critical section keep ping-ponging from one NUMA domain to another NUMA domain indiscriminately. For example,
in this MCS lock, you have a chain of waiters. They may belong to very different NUMA
domains, where I have indicated each node with the name, so data can just keep ping-ponging
between various NUMA domains. How bad is this? It is pretty bad, actually. Here, I have
shown what is the time it takes the lock to be passed from one level to another level. If you pass
it within SMT tiers, what happens if you pass it to two cores sharing the same socket? What if
you pass it from one core to another and so on SGI UV 1000. This is the time taken, which
means lower is better. When you are passing within the SMT, the time is small, but as you keep
passing it farther and farther away, it can take as much as 2,500 times longer when
passing the lock. So to address this, we have built a NUMA-aware locking algorithm. We call it
as hierarchical MCS lock, HMCS. The insight is that passing the lock to a thread in a nearer
NUMA domain incurs less overhead. So always pass the lock to a waiting thread in the nearest
NUMA domain when possible. So to do so, we organize the lock into a hierarchy of queuing
MCS lock, mimicking the underlying NUMA hierarchy and orchestrate the protocol such that
threads wait in locality domains. If here is how threads were waiting in the original MCS lock,
in the hierarchical MCS lock, they will wait in their own locality domains. For example, if two
threads belong to same innermost domain, they will be lined up in the same queue. Here, each
of these is one level of MCS lock. So every thread arrives. If you are the first one, you have
more overhead to go further up and acquire the lock at a higher level. Having acquired that in
high contention, typically you have waiters either in your domain or somewhere in the nearest
domain, and you keep passing it within for a bounded number of times and make use of the
locality. And once you have done that, the last thread, which hits a threshold, passes it to
someone in the nearest domain and makes use of locality. It makes better use of data locality on
NUMA architectures and gives high lock throughput for highly contended locks. A given lock acquisition
may incur higher latency, because if there is no contention, you incur the cost of going up in the
tree. We ensure starvation freedom by bounding the number of times we pass, and we have a
relaxed notion of fairness. And one can tailor the design of the lock either for high throughput or
for high fairness. If you are interested in fairness, your passing bound is shorter. If you are
interested in throughput, you pass it many more times, and the bound on the passing is
not a random number. It is based on the access latencies between two levels in the memory
hierarchy and an analytical model of the queuing lock to take advantage of the difference in
latency between two levels of hierarchy. On this SGI UV 1000 machine that I showed you, on
X-axis is number of threads. Y-axis is throughput. That is, how many times can I acquire a lock
per second, so higher is better. MCS lock's throughput keeps falling each time a new NUMA
domain is added, whereas the hierarchical MCS lock reaches a stable point and retains very high
throughput throughout. This is K-Means, which is an application I took from a mining
benchmark, and on X-axis is time. Y-axis is how much improvement did we see relative to using
MCS lock? So black color is MCS lock's throughput -- I mean, end-to-end application time.
And red color is how much time did hierarchical MCS lock take in this application? When the
contention is low, using hierarchical MCS lock has more overhead. It's about within 80% of the
MCS lock, but as the contention rises, HMCS lock can do about 2.2 or 2.1 times better than
using an MCS lock. You might be wondering how to deal with this case of no contention. We
used a trick of Lamport's fast-path/slow-path mechanism where, if a thread arrives and there is no
contention, instead of going through the entire tree of locks, we can directly enqueue at the root
and make the case of no contention very fast. I did not show the numbers with this optimization.
And if that's not the case, you take a slow path, and even within slow path, there is a hysteresis
mechanism which says, if, when I came last time, there wasn't enough contention at this lowest
leaf level, let me directly go in and queue at a level where I had noticed [indiscernible] of
contention. And that makes our no contention case about 95% of the throughput of MCS lock,
and high contention case about 90% of the best HMCS for a given configuration. To conclude,
production software systems suffer from performance losses.
>>: I would think of HPC applications as using a lock -- maybe it's just my ignorance of these
applications. So, for example, K-Means, I would expect it to be embarrassingly
parallel, so where are the locks used?
>> Milind Chabbi: So what happened in K-Means was, they were using atomic operations. They
were not using locks, and you would expect atomic operations to do well, better than locks,
right? So what happens is -- this is on Power 7. The way atomic operations are implemented is
you take reservation, you want to make an update, and then you write the location. And when
you have a lot of threads, you take a reservation, some other guy takes a reservation, you lose your
reservation, so they keep ping-ponging data and no one makes progress. So atomic operation
was terrible, and what I did was, I replaced atomic operation with MCS lock. That gave about
11X improvement, so there was this loop, and inside that, there were three atomic operations. So
instead, I put a lock around the loop. I said take the lock, and once you have taken the
reservation, make changes to all the locations. That worked out much better.
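A schematic C++ sketch of the transformation just described: instead of several atomic read-modify-writes per loop iteration (which under heavy contention on an LL/SC machine like POWER7 can make little forward progress), take one lock around the whole update. The data layout and names are invented for illustration, and std::mutex stands in for the MCS lock that was actually used.

```cpp
#include <atomic>
#include <mutex>

// K-Means-style shared accumulators, simplified to integers.
struct Stats {
    std::atomic<long> count{0};
    std::atomic<long> sum_x{0};
    std::atomic<long> sum_y{0};
};

// Before: three independent atomic read-modify-writes per point.  On an LL/SC
// architecture under heavy contention, reservations keep getting stolen.
void accumulate_atomic(Stats& s, const long* xs, const long* ys, int n) {
    for (int i = 0; i < n; ++i) {
        s.count.fetch_add(1, std::memory_order_relaxed);
        s.sum_x.fetch_add(xs[i], std::memory_order_relaxed);
        s.sum_y.fetch_add(ys[i], std::memory_order_relaxed);
    }
}

// After: acquire one lock around the batch and do plain updates.
struct PlainStats { long count = 0, sum_x = 0, sum_y = 0; };

void accumulate_locked(PlainStats& s, std::mutex& m,
                       const long* xs, const long* ys, int n) {
    std::lock_guard<std::mutex> guard(m);
    for (int i = 0; i < n; ++i) {
        s.count += 1;
        s.sum_x += xs[i];
        s.sum_y += ys[i];
    }
}

int main() {
    const long xs[] = {1, 2, 3}, ys[] = {4, 5, 6};
    Stats s;
    accumulate_atomic(s, xs, ys, 3);
    PlainStats p;
    std::mutex m;
    accumulate_locked(p, m, xs, ys, 3);
    return 0;
}
```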
>>: So the argument I always hear, talking to folks like the [indiscernible] folks, is that,
hey, lock-free data structures are better than locked data structures in performance. What you're
saying is that's not necessarily the case.
>> Milind Chabbi: It depends. Is it a small update or a large update? For short sections?
>>: For short critical sections.
>> Milind Chabbi: For short critical sections, you may, may not benefit from lock free. But if
you have a large data to update, you will end up taking locks, right?
>>: Yes, in that case, it's very clear to use locks.
>> Milind Chabbi: So this is a case where one would think that fine-grained atomic operations are better
than using locks. It turned out it was worse. Yes, this may work okay on an x86 architecture,
where you're guaranteed of some progress, but if you are using load-linked/store-conditional, you
are not guaranteed of progress. And the other thing is, with MCS lock, the contention is only for
this tail pointer. With atomic operations, everybody is trying to bang on them. There is a lot of
communication going on. In MCS lock, what happens, queue gets formed in high contention,
one thread updates data, say, increments a variable, and passes it to the next one, and that's the
one which gets enqueued, so the contention is only for these two threads trying to enqueue. That
works out better than atomic operations.
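Since the discussion keeps coming back to it, here is a compact C++11 sketch of the classic Mellor-Crummey/Scott queue lock: arriving threads contend only on the single tail word, and each waiter spins locally on its own node until its predecessor hands over. This is the textbook algorithm, not Rice's HMCS implementation; the hierarchical version essentially stacks one such lock per NUMA level and bounds how many local hand-offs happen before the lock is passed up.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

struct McsLock {
    std::atomic<McsNode*> tail{nullptr};

    void acquire(McsNode* me) {
        me->next.store(nullptr, std::memory_order_relaxed);
        me->locked.store(true, std::memory_order_relaxed);
        McsNode* pred = tail.exchange(me, std::memory_order_acq_rel);
        if (pred != nullptr) {                         // someone holds or waits
            pred->next.store(me, std::memory_order_release);
            while (me->locked.load(std::memory_order_acquire)) { }  // local spin
        }
    }

    void release(McsNode* me) {
        McsNode* succ = me->next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No known successor: try to swing the tail back to empty.
            McsNode* expected = me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel))
                return;
            // A successor is arriving; wait until it links itself in.
            while ((succ = me->next.load(std::memory_order_acquire)) == nullptr) { }
        }
        succ->locked.store(false, std::memory_order_release);  // hand over
    }
};

int main() {
    McsLock lock;
    long counter = 0;
    auto worker = [&] {
        McsNode node;                      // one node per thread, on the stack
        for (int i = 0; i < 100000; ++i) {
            lock.acquire(&node);
            ++counter;                     // critical section
            lock.release(&node);
        }
    };
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    std::printf("counter = %ld\n", counter);  // 200000
    return 0;
}
```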
>>: So coming not from this area, whenever people talk about these lock-free data structures,
they always use as comparison a locked version, [indiscernible], but what you're claiming is if
you just use MCS as the baseline, maybe it will not be as bad, and it might be better.
>> Milind Chabbi: I don't know what they have compared it with. They might have compared
with MCS. There may be cases where it might do well. It depends on architecture somewhat at
this point. It may do well on x86 because of progress guarantees. And then I am showing you a
highly, highly contended case. If the contention is not so high, lock overhead may be more than
the benefits you may get, because if you have no one to pass the lock to, then you have only
overhead of acquiring the lock, which is what happens in lock free. Two threads simultaneously
try to do all this, and one guy wins and all others fail.
>>: So they have different variations of the data structure, where things get more and more
complicated to avoid everybody contending on a single memory location.
>> Milind Chabbi: Yes.
>>: Okay, but so -- I think an interesting thing is, if you're using MCS, which automatically
buys you that benefit --
>> Milind Chabbi: It buys you some benefit, yes.
>>: Then do we need lock-free data structure, or precisely figuring out when we need lock-free
data structures might be interesting.
>> Milind Chabbi: Yes, there are tradeoffs. Actually, I do not recommend using hierarchical
MCS lock left and right, because it takes some memory. Your tree takes memory, because
internal tree nodes are pre-allocated. You might not have those memory requirements in lock-
free data structures and also in simple locks. Okay. Production software suffers from losses,
as I mentioned, by resource idleness, wasteful resource consumption and mismatch between
hardware and software. Contributions of my work are novel and efficient tools to pinpoint and
quantify performance losses, adaptive runtime to eliminate wasteful resource consumption and
architecture-aware algorithms. The impact is a new perspective on application performance and
insights and improvements in important software. Going forward, I'm interested in doing
hardware-software co-design, tools to detect energy and power wastage, adaptive runtimes -- runtimes that are aware of contention and redundancy, as well as doing data analytics on large
execution traces or profiles collected. I'm interested in tools for parallel programming, tools that
are in the space of performance correctness or debugging, compilers and runtimes, parallel
algorithms, including verification and performance modeling. This is a collaborative work. I'm
not the only person to have done all of this. My adviser, Professor John Mellor-Crummey,
Michael Fagan who is a researcher and several others at Rice University have been of great help.
My colleagues at Lawrence Berkeley National Lab, Koushik Sen at UC Berkeley. Xu Liu is a
professor at the College of William and Mary. We had a collaboration with Virginia Tech and
Argonne National Lab. Nathan Tallent was an ex-student at Rice University. He is currently at
Pacific Northwest National Lab, and we had a collaboration with NVIDIA. That is my talk. I
am happy to take more questions if you have.
>>: Thank you for the great talk. These tools look very interesting. Do any of them work in
Windows?
>> Milind Chabbi: Not yet. There is work to be done.
>>: Thank you.
>> Milind Chabbi: The idea should work.
>>: I realize that.
>> Milind Chabbi: Locking should be easy to do it, right, on Windows. Our sampling-based
tools rely heavily on the Linux interface for signaling, signal handling and things like that. NWChem
works both on Windows and on Linux, but my studies were on Linux. Okay.
>>: Thank you.
>> Milind Chabbi: Thank you.