Jesus Labarta: So this is the second part of the presentation from the Barcelona
Supercomputer Center. Osman Unsal is going to do a brief presentation of the
relationships that exist between the center in Barcelona and Microsoft. Then
we'll have a five-minute video or maybe 10-minute video, I don't know how long.
Five-minute video. And then we'll continue with the presentation that Jesus
Labarta is doing on StarSs. So Osman. Or Mateo.
>>: Okay. You can do that, but we thought we would talk a little about Barcelona.
You probably know Barcelona because of the Olympics. You may not know the circuit, but
you know Barcelona because we organized the Olympics in '92.
So this here, this is Barcelona. Okay? This is the very famous La Rambla, the
Christopher Columbus monument -- it's going too fast, I don't know. So just a few figures about
Barcelona. This is [inaudible]. You know Gaudi, the architect who built many,
many buildings, and this is the Sagrada Familia, one of the most important
monuments in Barcelona. It's going too fast. I'm sorry. Okay.
So this is a very nice city, okay, where we are located. This is the university,
and we are located here, and here is the stadium of the Barcelona soccer
team -- 120,000 people there. And I gave a lecture, a keynote at ICS in New York,
comparing it a little with computer architecture: Manchester was the pioneer there, okay,
and the same in England for football. But probably now we beat these guys in the
final of the European league. As you probably know, we have very good players.
So I'm sorry, I tried to show you Barcelona a little better. And what is the
object of this? The object of this is the second part of the movie, as you know,
because initially Fabrizio Gagliardi, Tony Hey and many other people
[inaudible] met here, and since many years ago we established our
collaboration, our research center in collaboration with Microsoft.
So this is the Barcelona Supercomputing Center, and the director is -- we are 300
people, 250 of them in research. Jesus Labarta is the director of computer
science.
But associated to the BSC we have a joint center with Microsoft, okay? And
Osman is the person leading the center. After
the movie I would like to show you, he will give you three or four minutes on the
work we do there. And the object of this, again, is to show you Barcelona, to show you the
center, a very nice video of the supercomputer we have and the work we do
here, the work we do with you, and to invite you to come to Barcelona and spend some
time with us doing research, even on superscalar or StarSs or whatever,
computer architecture or the topics we are working on with you.
So I don't know how it works, but now we have a video, this one. And the audio?
Oh, okay. So [inaudible]. Okay. We are the Barcelona Supercomputing Center,
the Spanish national lab on supercomputing.
>>: You can see we had the supercomputing labs up there, okay. Very nice.
>>: The main supercomputer of the BSC, of the Barcelona Supercomputing
Center, is Marenostrum, which you can see behind me. This supercomputer has
been, for two and a half years, the number one in Europe, and we were number
four and number five in the world. It has more than 10,000 processors working
together, and the point of using it is that we can do in one hour the work that
any powerful laptop would need one year to do.
Our partners are the Ministry of Education of Spain in Madrid, the Government of Catalonia, and
our Technical University of Catalonia, UPC. Recently the [inaudible] joined us.
This is the building where many of our people are.
We also do research inside the BSC. We have four groups: computer science,
computer applications, life sciences and earth sciences.
>>: Supercomputing is a key tool for the progress of society.
>>: The computer science department at BSC cooperates with industry in
the computer business. We have cooperation with Microsoft on transactional
memory, and with IBM on the [inaudible], which will be the next supercomputer
installed after Marenostrum, targeted to be around a hundred times faster, executing
at around 10 petaflops. Our idea is that this computer will be based on
the Cell processor. So our cooperation with IBM is to work with them on things like
the design of the next Cell processor, including aspects such as memory
bandwidth optimization, tolerance to latency, and scalability.
So the project has six activities on applications, programming models,
performance analysis tools, load balancing, interconnection network design and
processor design.
The project covers all aspects of the computer design and integrates the
experience and know-how of the different research groups at BSC in all these areas.
This is one of the strengths of BSC. And by focusing all this research towards a
unified objective, we intend to have important impact in the design of future
supercomputers.
>>: Fluid communication at the research level between the public and private
sectors is key for the competitiveness of the country. High-performance
computing is a competitive tool. If we don't compute, we won't be able to
compete.
>>: The computer applications in science and engineering department develops parallel
software that exploits the whole capability of the supercomputer. The software
that we develop consists of numerical simulations of complex physical phenomena.
Our main software development is Alya. The fundamental characteristic of Alya
is its high scalability when using thousands of processors. Alya is able to
manage meshes larger than 100 million elements. With Alya we have
simulated important engineering problems like the noise produced by a
high-speed train, the external aerodynamics of an airplane, or a little wind tunnel.
In the near future, the most interesting simulations that we are going to do with
Alya are biomechanical simulations. The objective of these simulations is to provide
a tool to experiment with different surgical possibilities. At present we are
developing a simulation of the complex human airways that [inaudible] in the brain
and the interior of the nose.
Also we develop software with external scientific groups. For example, we are
developing the Siesta code, which allows one to perform ab initio molecular dynamics
with extremely large molecular systems. Moreover, we develop software with
industry. For example, we are in a joint project with Repsol, the Spanish oil
company, developing the most powerful subsalt seismic imaging tool.
This kind of tool is mandatory when you are looking for oil in areas like the Gulf of
Mexico, where the oil is hidden by a salt layer. As a conclusion, we can say that
supercomputing is a critical tool to maintain the competitiveness of our industry.
>>: Thanks to the supercomputers, we can model reality. And we can avoid
some experiments when those experiments are too expensive, too risky, or
just impossible.
>>: In the earth sciences department here at the BSC we have [inaudible] main lines.
One of them is about air quality forecast systems. Presently, we work with
a [inaudible] system for Europe, the Iberian Peninsula and the [inaudible] islands.
This forecast system has a spatial resolution of four kilometers in grid
size. This is important because presently it is the operational air quality
forecast system with the highest spatial resolution in the world.
Also we work on [inaudible]. We validate our operational modeling system every day
to evaluate [inaudible] -- they track, from the satellite, the mineral dust
[inaudible] to America, also [inaudible]. At this moment, the [inaudible]
with [inaudible] has been nominated by the [inaudible] as a regional center for the
sand and dust storm warning system. This is important to evaluate, at a global level,
the problem of [inaudible].
Also we work in climate modeling, with the idea of improving these applications on
the supercomputer Marenostrum and preparing the next IPCC modeling exercises.
>>: The research results affect people's lives by finding cures for serious
illnesses, by fighting against climate change, or by looking for new sources of
energy.
>>: In the life sciences department at the Barcelona Supercomputing Center, we
work on the theoretical study of living organisms, so we are trying to really
understand them, we are trying to simulate them, trying to predict their behavior.
And we have a very broad range of interests within this field, starting from the
very detailed analysis of protein interactions and protein function up to the
analysis of entire genomes or even metagenomic systems.
In particular, we work on aspects like the study of protein and nucleic acid flexibility,
the study of protein docking systems, [inaudible] the analysis of genome
information, and also on aspects of drug design.
So living organisms are extremely complex -- probably the most difficult thing to
represent from the point of view of theory and simulation. And this is why we are
such heavy users of Marenostrum and of general supercomputing
resources. We need to use very large computers to run the simulations of, for
instance, the flexibility of chromosomes or, for instance, the dynamics of
[inaudible] pathways.
The Barcelona Supercomputing Center is not an end; it's a means toward helping
to convert Barcelona into a globally recognized innovation and technology hub.
>>: The Barcelona Supercomputing Center is leading the supercomputing
infrastructure in Spain. It is managing the Spanish supercomputing network,
with supercomputers in Madrid, Cantabria, Malaga, Valencia, the island of La
Palma and Zaragoza.
Barcelona is hosting Marenostrum, which is the largest one in this Spanish
supercomputing network. We have hundreds of users of these supercomputing
facilities, and they do research in different areas, such as solar eruptions,
earthquake analysis, analysis of the market economy, and some do
analysis of new materials. And also, and this is very important for us, we try to
do research with companies.
Let me tell you that 40 percent of our budget comes from companies, 30 percent
of our budget comes from [inaudible] projects, 20 percent from our partners
and 10 percent from the Spanish [inaudible].
We have to provide services to all of the research in Spain, so we have a national
committee of 44 people receiving all the projects. They evaluate the quality of the
projects, and they give us a list with all the projects that will get access to
Marenostrum. We are more than 200 people working at BSC, more than 150
people doing research; and let me tell you that more than 60 of the researchers
were not born in Spain. We attract talent from outside, and this is a very
good value of BSC.
[music played].
>>: So thank you. I think, Osman, you should continue. This is a two-year-old
video, so some of the data has changed now, but it more or less reflects the place
where Marenostrum is installed, the people, and the things we do there. Okay.
>> Osman Unsal: I would like to talk about our collaboration with Microsoft
Research. It started as a project in April 2006, with an initial duration of two years,
and the initial topic was transactional memory. Then we established the
BSC-Microsoft Research Centre in January of 2008.
We have a very heterogeneous team, mostly from Eastern Europe. And many people are
young. So the list is long, but most of them are engineering PhD students.
And on the technical side, we are collaborating with Tim Harris and Satnam
Singh from Microsoft Research Cambridge and Doug Burger from here.
And Fabrizio Gagliardi has been our mentor and has been the most important
person in the forming stages of the lab.
>>: [inaudible].
>> Osman Unsal: Yes. Those are implicitly stated.
>>: [inaudible].
>> Osman Unsal: We have worked for about a couple of years on transactional
memory, and we are just starting on vector support for low-power hand-held
processors and hardware acceleration for language runtime systems. So I will
talk a little bit about those two as well. And this is a pet project of ours that I don't
have time to go into in detail.
So if you look at the matrix, this is where Microsoft Research Cambridge is.
They are more on the functional language and software transactional memory,
and we are on the other end of the matrix working more on hardware
transactional memory and imperative languages.
And we have developed transactional memory applications. We have the Haskell
Transactional Benchmark. We worked with recognition-mining-synthesis type
applications and transactified them. We transactified the game server Quake; we
have two versions of that, one parallelized from the serial version using
OpenMP and transactional memory, and the other one where we took a parallel version
based on locks and transactified it. And we have a configurable
transactional memory application to stress TM designs.
We also worked on the programming model a little bit. We had some
optimizations with using Haskell STM. We proposed some extensions to
OpenMP for transactional memory. And we are currently working with TM
supporting system libraries.
On the architecture side, of course, we have our architectural simulators. We are
in the computer architecture area, so that's our main forte. And we have worked on
a hardware transactional memory proposal that was in last year's MICRO
conference.
We are working on power-efficient transactional memory. It's a contradiction in
terms, but however, we are working on that still. And we are also working on
hardware acceleration of software transactional memory.
We have had quite a lot of nice publications on this. So we are right now on the
wrapping up stage of transactional memory research. And hopefully a couple of
students will graduate this year, depending on how hard they work.
And the second topic that we just started with Doug Burger here is
vector support for emerging processors. The idea is to provide for future
applications for the palmtop, and we see that many of those applications
require some kind of vector support. So those are the types of applications that
we are looking at.
So on one end of the spectrum we are looking at those applications and how we
can give low-power vector support for them, and on the other hand we are
leveraging the work done here by Doug Burger's group on the EDGE architecture,
which is basically an architecture that is composable: you can
form larger, higher-performance cores out of the basic building block of
smaller cores, and this is good for scalar performance. I will not get into a
lot of detail, but we want to extend this EDGE architecture with vector support,
looking into things like low-power pipelined floating point units, vector prefetching,
and mapping strategies for vectors, either SIMD-like or vector-chaining approaches, or
using a hybrid approach.
And the idea is to work both on the scalar code and the vector code and to
have a high-performance, lower-power hand-held processor of the future. That's
the theme of this collaboration that we have just recently started.
And because bandwidth is an issue, we are also looking into how, again in a
power-aware way, to make use of stacking, dynamic bank
allocation, and 3D memory. Those are more on the hardware side.
And on accelerating language runtime systems, we have begun collaborating with
Tim Harris, and we have an upcoming ASPLOS paper on this. The basic
idea there -- I'll just go to the end -- is a new instruction called the
dynamic filter.
The idea is that for a lot of the types of operations where there are
read/write barriers, like software transactional memory or garbage collection, we
are doing a lot of checks that might be unnecessary. For example, in the case of
software transactional memory, we add an element to the read set, and when
we do the read again later on, we do the same operation again.
What we propose is to have a small hardware structure that will associatively
check if it has seen an address before, and if it has, then it does not perform
the check any further. We have seen that, across a couple of application
domains, this simple ISA extension gives us performance benefits.
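A rough software model of the kind of check being described might look like the following; it is only a sketch of the concept, since the real proposal is a hardware ISA extension, and the table size and names here are illustrative assumptions, not the published design:

    #include <stdbool.h>
    #include <stdint.h>

    #define FILTER_ENTRIES 16          /* illustrative size, not the real design */

    typedef struct {
        uintptr_t seen[FILTER_ENTRIES];
        int next;                      /* simple round-robin replacement */
    } dyn_filter;

    /* Returns true only the first time an address is presented, so the
     * expensive barrier work (e.g. adding to an STM read set) can be
     * skipped on later, redundant checks of the same location. */
    static bool filter_first_seen(dyn_filter *f, const void *addr)
    {
        uintptr_t a = (uintptr_t)addr;
        for (int i = 0; i < FILTER_ENTRIES; i++)
            if (f->seen[i] == a)
                return false;          /* already seen: skip the barrier */
        f->seen[f->next] = a;
        f->next = (f->next + 1) % FILTER_ENTRIES;
        return true;                   /* new address: run the full barrier */
    }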
And that's all, really. You know, those are the main three things that we are working on.
We also want to look at hardware support for synchronization for
many-cores, but that is something that has not started yet. So in a nutshell, that's it
-- I ran way over my three minutes. But yes?
>>: [inaudible].
>> Osman Unsal: Sure.
>>: I mainly want [inaudible] projects [inaudible] where we have reached some
[inaudible].
>> Osman Unsal: We are actually continuing. But the feedback we get from the
Microsoft side, from Tim Harris, who wrote the book on
transactional memory, was that this topic has kind of reached
a plateau. So we are continuing to work on transactional memory, it's just
that we are not going to open new lines -- we are just trying to graduate the students
that we put on the project. But we are looking at things that could be
interesting, like using the transactional memory programming model for doing
things other than synchronization. I think that area is quite interesting. And one
thing in particular that we were looking at was how to combine fault tolerance
with transactional memory, or how to combine data flow with transactional
memory. So those are the kinds of things that we are looking into now.
So we are really not done, but we are trying not to open up too many new lines -- the
new students will not work on transactional memory.
>> Jesus Labarta: Okay. So the idea will be to continue. We had already
seen this example; the idea was to continue here with the case where you support strided
and partially aliased references. So this will be the equivalent of the previous slides,
but now we support, for example, one task writing this part of the space and another
task writing a subset of it, or one task accessing just this part and another just that part.
And the idea again is how to compute the dependencies here -- probably even
before that, how to specify these different regions from the programming model point
of view. And essentially, what we want to be able to do is to
specify, out of a two-dimensional matrix, a set of rows, a
set of columns, a square block inside it, or a rectangular block inside it.
So the syntax, just with an example, would look something like this: on one side
you have the argument definition and the definition
of the size of the arguments, and on the other side you have some
type of region descriptors where you specify the subsets of those domains that you
access. In this case I'm only accessing from A of 5 with size BS, that is, up to A of
5 plus BS, and from A of K to A of K plus BS. Okay?
So in the notation you can specify the ranges by giving the whole
dimension, just a single value, a range from here to here, or a start from here with this length.
So you can specify, out of a larger data structure, which inner parts
we access. And additionally, you can apply that to libraries which
have not been built by yourself but follow a more regular type of programming, class
libraries.
And essentially, by applying this syntax you can taskify and leverage existing
codes. You only need to apply the pragma in a .h file. You don't even need
to recompile or modify the actual library. The same binaries that we have been
using on sequential machines can be reused.
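As a rough illustration of the kind of annotation being described -- placed in a .h file so the library code itself stays untouched -- something like the sketch below; the exact region syntax and the function name are assumptions for this sketch, not necessarily the real StarSs notation:

    /* Hypothetical header annotation: the whole array and its size are
     * declared, but the region descriptors say that only the two
     * sub-ranges starting at A[5] and A[K], each of length BS, are
     * actually read and written by the task. */
    #pragma css task input(N, K, BS) inout(A[5;BS], A[K;BS])
    void update_ranges(int N, int K, int BS, double A[N]);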
Well, to compute dependencies we have to manipulate regions. And yes, it's
complicated, okay -- a region in two or in three dimensions, and finding out
whether two regions overlap or do not overlap is a potentially complex thing. Here is
just a very coarse idea of the approach that we follow. We
represent a region by a representation of the
addresses that belong to that region, and we use X for the bit positions where the address
can be either a zero or a one, okay. So a pattern like this, for example, represents all the
addresses you get by substituting zeros and ones for the Xs. If the Xs are in the low-order
bits it is a block of contiguous data; otherwise it is a block of non-contiguous data. Okay?
So with one of these vectors we represent one region. How do
we represent all the regions, all the references of the previous accesses? We
have a tree: let's say we put this pattern vertically in the tree and we traverse it, so
at every position we can have either a zero, a one, or an X. Okay?
And later we have to compare a new reference against that tree, searching it
through the tree. We make approximations both in the way of generating the
mapping from a region to this representation as well as in the tree: what do you
store in the tree, and what happens when you have a region of a given size
and then a new region that is a subset of that one -- what do you do with the previous
representation? Do you keep both regions in the tree, or do you partition
them in such a way that you can separate the parts that are common from the
parts that are not common? There are different options here.
The impact of these two decisions -- and of the decision of how to map regions to
this representation -- is that the results do depend on the
approximation and on the base address: whether we use aligned addresses or not,
whether the leading dimensions are multiples of two, powers of two, or not,
okay, and whether the block sizes are powers of two or not.
In general, because we use these approximations, the results are
conservative: our approach detects more dependencies than really exist in
the program, okay. And this is an example of the dependencies there are in, for
example, a 2D FFT, which is just the FFTs by rows, the transpositions, the FFTs
again by rows and the transpositions; and the graph that would result for the
same execution if, for example, the base address of the data is not
aligned.
Of course, there's much more parallelism here than here, but there is still some
parallelism. The question is, does it pay off? Is it still usable for some
applications, or for all applications? The overhead that this search takes, which
is large, as I will show you afterwards -- is it an overhead that pays off?
And I have some examples. This is just to show the FFT: for example, you
remember this is virtually the same code that we had in the FFT before. The
main difference is that now we specify, out of the arguments -- this is
the argument with its full size -- which parts of it we are actually using, in
the declaration, the definition of the functions.
And the result now is that we don't need those barriers that we had
before, which were there just to be able to handle complex dependencies; now the
system will handle these complex dependencies. And the result is an execution --
this is a trace. In blue we have the first set of FFTs. In this
purple we have the first set of transpositions, then the second set of FFTs and the
second set of transpositions. The result is that the barriers have disappeared; now
the execution goes out of order as long as the dependencies are satisfied. And we
have -- for example, here we had one instance
of a transposition that happened much later than the others, and we had already
started doing several of the FFTs.
The results are, we think, very interesting, but probably we will come back to that.
So this is a Gauss-Seidel [inaudible], and again, same thing. You just specify the
regions. And in this case you have only one argument,
but you specify the part of it that you touch, that you really read and write, which
is the yellow part, and you specify the parts that you only read, okay? By
specifying these different parts of the original argument, the system is able to
compute essentially what you would compute by hand: the wave-front. Okay? You
sweep through the matrix in the wave-front
approach.
So compared to other approaches that cannot do this wave-front, this is one of
the areas where computing dependencies is useful. You can be much faster than,
let's say, pure OpenMP, or than the previous version, which required
barriers.
We can do sorting, for example, okay? But let me go to this example,
which shows a little bit the fact that we can leverage existing
implementations of, for example, BLAS routines. So it's computing the Cholesky,
again a Cholesky, but now the matrix is stored in the traditional row-wise or
column-wise storage format.
The coarse code at the very first level is still the same. We are leveraging the
BLAS libraries, and for every task we are specifying the
size of the argument, of the original data structure; this gives the compiler
the information on the leading dimension, the size of the leading dimension. The
actual argument that we pass here is the original reference, the
original start of the sub-block. And this part here gives the compiler the
information of how much of the total data structure that we passed we are
using. Again, this is an example where the block size is an argument to
the call, so this is valid for any block size.
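As a hedged sketch of the kind of wrapper being described -- names and the region/leading-dimension syntax are assumptions, and only one of the Cholesky kernels is shown -- the declaration carries the full size of the stored matrix, while the clause states that only a BS x BS sub-block starting at the passed pointer is touched, so the unmodified BLAS/LAPACK binary can be called inside:

    /* Fortran-style LAPACK symbol (as found in MKL and other LAPACK
     * libraries); declared here so no particular header is assumed. */
    extern void dpotrf_(const char *uplo, const int *n, double *a,
                        const int *lda, int *info);

    /* A points at the start of a BS x BS sub-block inside an N x N matrix
     * stored column-wise, so the leading dimension is N; the clause tells
     * the runtime that only that sub-block is read and written. */
    #pragma css task input(N, BS) inout(A[0;BS][0;BS])
    void task_dpotrf(int N, int BS, double *A)
    {
        int info;
        dpotrf_("L", &BS, A, &N, &info);   /* unmodified library kernel */
    }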
So this is the performance -- this is on [inaudible] again, 32
-- up to 64 processors. This is an old version of MKL, and I think this
is a more up-to-date version of MKL. And this is what we get. What we are
actually doing here is just orchestrating, sequencing different
instantiations, invocations of individual MKL calls. As the actual
computation kernel we are still using MKL 10.1, but we are executing several of
those invocations concurrently, as long as the dependencies
are satisfied.
So this is about the runtime overheads. You'll see the numbers here, and the
numbers are large, okay. The time it takes to detect whether a task has
dependencies with the previous ones or not is sometimes very
large -- 270 microseconds is huge. Nevertheless, is that a lot compared to the
total computation time or not? It is typically a small
percentage of the total computation time.
This is another row, which shows that the overhead is actually paid by the master
thread only -- that's what we said before, it's paid by the master thread only.
Is the master thread just overloaded with that and has time for nothing else, or
does it have time for some additional work?
Well, when you run with only one thread, for example, in all the cases the amount
of work for the master thread, in terms of overhead, is minimal,
so the rest can be devoted to computing.
When you run with 32 threads, in this case, for example, the main
thread devotes 10 percent of its time to the generation and handling of the
task graph. That is something, but you still have 31 more threads that are
spending their time computing.
In some cases the time the master thread spends [inaudible] the
graph is more significant -- 25 percent. So you lose a quarter of a processor
just to handle the task graph. Is that a lot? Out of 32, it is not that much on one
side. And on the other side, it is the least of your problems, the smaller of your
problems, okay? What is the larger one? You see here the
size of the task graphs, okay?
And here you see the duration of the tasks. What is the largest of your
problems? The largest of your problems is that the machine is a [inaudible]
machine with limited bandwidth. So as you increase the number of
processors, what happens is that the individual tasks' memory
accesses take more time; there is more contention, more conflicts, and the
individual access times increase.
So in some cases you have average access times that
increase tremendously, okay? And this is the average per task, and it depends -- there
are very short tasks and very large tasks, so this ratio does not directly tell you how
it affects the total execution time. But it is true that in some cases
the difference is not huge, and in some cases the difference in execution time of
the tasks is very significant. What does it mean? It means that a significant part
of your problem is the locality issue, the access to memory.
So that was about the overhead. Our perception, our feeling, is that in many
of these cases you can really make the granularity
more than large enough -- as we said, just by changing this block size,
which can be changed just prior to invocation. And this is another thing. I mean,
in this algorithm we are not setting up a layout of the data, which is
what you do in typical message passing programming, and even in PGAS
languages, where you partition your data; and partitioning
the data implies that you are very statically determining the schedules. This is much
more dynamic.
But still, the type of problems are the same. The problem is how you handle
the locality. The question is, do we have in our approach the
mechanisms to address that locality? And our view is that
you do have such information: you have information on which data you access
and which data you use for every task. This can be used to improve locality.
And one example is again the Cholesky. This is
more or less again what I showed you before with 32 processors: this is the original
MKL version, this is the improved MKL version, and this is our version with our
StarSs approach as is, with no special handling except the dynamic
asynchronous execution of the tasks.
What is the difference in this other plot? In these other plots we have done two
things. One of them is reshaping: changing the association of the data.
Because here we have information on how the data is laid out in main memory and
what part of the data we are using, the compiler has the information to
determine, well, it might be good to take that data out, bring it and
compact it together in a local storage for a set of positions, and access it with unit stride,
for example.
So the compiler has the information for doing that. The compiler can handle that,
can make the copies, changing the association, invoking the task on the new
association. It's what normal programmers do by hand, using work
arrays and moving data from the global structure to the work arrays and sending it back.
The compiler can do that alone.
So this is a little bit the picture of what the different benefits of this type
of transformation are. And the picture is a histogram. We have executed Cholesky,
and this is a histogram of the IPCs of the 32 processors.
How do you read this histogram? There is one row per process, okay?
The different columns are the different bins of the histogram. This axis is IPC:
this end is probably 0.01 or 0.02 of IPC, and something around here is an IPC of five
or six. Okay.
So if I have an IPC of six and I have a hundred bins here, or however many
I have, each bin would represent increments of IPC of 0.06.
The value of the histogram entry is the color: dark blue is a high value,
light green is a low value. So how do we read this
thing then? What we see is that the first four threads -- one, two, three, and four; there's
an empty row here, which is an internal thread of the
system -- the first four user-level threads have a high value of IPC here,
close to six. So these four guys have very high IPC.
Where is everybody else? Everybody else has an IPC which is not half of that
but around two-thirds of that, okay -- a much worse IPC. So even if we achieve that
everybody is working, they are working at different IPCs.
Why? In this case, the data was allocated in the memory of the node where the
first four processors are. So the first four processors were accessing local
memory while everybody else was accessing that memory remotely. The result is
that these first four processors get much better IPC. One
advantage of the dynamic scheduling is that at least they will also execute
more tasks, okay? The scheduling is not static, so at least they will get busy
and they will execute more tasks.
And the others are going to have much worse IPC. Apart from the distance,
there is also the contention in the access to that memory module.
So one way of improving the behavior of the program
is just initializing the matrix scattered, distributed, interleaved across memory
modules, okay, and keeping the same dynamic mechanism, which is not locality
[inaudible]; just because the data is distributed, what will happen is
that there will be less contention for a given memory position.
If the same program is run like that, how does the IPC behave? What would we
expect? We would expect less contention, so better IPC, okay? The
result is this one.
So these guys improve their IPC very significantly. These other guys -- the first
four -- go a little bit worse; they were around here, and now everybody
converges, so they are a little bit worse but overall it's significantly better. So without
any real locality awareness, just by avoiding conflicts on a single memory bank,
memory module, node, we improve the IPC quite significantly.
Still, every task accesses its individual data with
a non-unit stride. So what happens if our runtime brings the data together into
contiguous regions? Just by bringing the data together in contiguous regions
we do get an improvement from here to here -- again, an improvement of
maybe 10, 15 percent in the IPC. Okay?
And by the way, these are some additional task activities
that we generate, which are just the copies of the data, the copies from the strided to
the contiguous storage. To do a little bit better: this is
doing the copy from the strided to the contiguous, but we could allocate the
contiguous storage in the same place where we are going to execute. And then we get this
IPC -- still a little bit better.
So, just to say that the importance of handling the [inaudible]
properly is really high. This is the same histogram in terms of cache misses.
With one of these transformations -- when you reshape, when you bring
the data together -- you go from very high cache miss ratios to much lower cache
miss ratios.
Okay. So that was about the locality. That was only on the Cholesky;
we are evaluating that now on other algorithms. But we
consider that this is a very important set of transformations: dynamically
changing the association of the data, dynamically generating and handling work
arrays with unit stride, as well as making them local to the actual execution.
So now what I'm going to do is go through different things, with a
few slides for each of them. The first one is how we
address heterogeneity, how we look at systems with different
types of cores.
Back again to Cholesky. This is the standard Cholesky that we have seen until
now, with its dependence graph. One of the things that we could do is say, well, for
these tasks that we have here, maybe this task is very appropriate, or not, to be
run on a GPU, okay? This task may be appropriate for another device.
So the idea is that for each task you should be able to say: this task is
appropriate for this device, this task is appropriate for that device, this task, for
example, performs well both on this and on that device. So what we have added is
this target device clause, which essentially tells the compiler which is the
appropriate device, an appropriate device, for that task. What the compiler essentially
does is that the code outlined from this
routine is actually fed into the specific compiler for the specific accelerator. Okay?
This is the idea. This is the directive that we have implemented.
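A minimal sketch of the kind of annotation being described; the clause spelling (target device) follows the talk, but the exact grammar and names are assumptions:

    #define BS 64

    /* The target clause marks the task as suitable for a given device, so
     * the outlined code can be handed to that device's own compiler. */
    #pragma css target device(cell)
    #pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
    void block_mmul(float A[BS][BS], float B[BS][BS], float C[BS][BS]);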
Let me first comment on nesting, and then I'll comment on the marriage of the
two things. Nesting: there would be two ways of looking at the hierarchical
generation of tasks. One of them is, inside a coarse grain task, generate fine
grain tasks. So you have a program which at the outermost
level generates coarse grain tasks, and then inside each of them the execution
actually instantiates a new StarSs engine that, within that environment,
generates a new task graph and executes that task graph locally.
The good thing is that the outer task is a container that hides the
external world from this local set of tasks. There is another thing: we
have an engine that generates coarse grain tasks -- very big tasks -- so its overhead
with respect to the total time is very
small. And when you start the execution of the individual tasks, you [inaudible] for
each of them a new engine. So what we said before about
there being only one process generating the whole work -- in this approach, in
this situation, that is not the case. Every process, every one of these tasks,
has an independent generator. So the overhead is
amortized much more -- you can do it in parallel.
So this is one of the potential approaches. There would be another approach,
which would be to let every single task generate additional tasks, but
I do think this is extremely complicated, because you are then having to handle a
single task graph with both the green and the red tasks adding tasks to the graph. And
the issue is that in the original case we at least have
somebody that sets up an order, okay. Here you lose the order.
Should this dependence come from here, or should the dependence come
from there? This is something that you cannot really control. In
this approach there is not really a good way to set up what is, let's say,
the program order. In the other case, you do have such a program order.
So we have followed this case on the left, this example on the left. And
essentially you can complement any program: for each of the tasks you can
complement it internally, sometimes with very trivial dependencies because
the algorithm is very simple, sometimes with not so trivial dependencies because
the algorithm is more complicated.
Or this could be a matrix multiply, where the outer program first calls a
level of matrix multiplies and this one calls a second level of matrix multiplies.
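A hedged sketch of that nested scheme on the matrix-multiply example: the outer routine is itself a (coarse) task, and inside it a second level of finer tasks is generated, so each coarse task runs its own local task graph. Names, sizes and clause syntax are illustrative assumptions.

    #define NB 4      /* blocks per dimension at the outer level (illustrative) */
    #define BS 64     /* inner block size (illustrative) */

    #pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
    void block_mmul(float A[BS][BS], float B[BS][BS], float C[BS][BS]);

    /* Coarse-grain task: when it runs, it instantiates its own fine-grain
     * tasks, which form a local task graph executed inside this task. */
    #pragma css task input(A[NB][NB][BS][BS], B[NB][NB][BS][BS]) \
                     inout(C[NB][NB][BS][BS])
    void tile_mmul(float A[NB][NB][BS][BS], float B[NB][NB][BS][BS],
                   float C[NB][NB][BS][BS])
    {
        for (int i = 0; i < NB; i++)
            for (int j = 0; j < NB; j++)
                for (int k = 0; k < NB; k++)
                    block_mmul(A[i][k], B[k][j], C[i][j]);
    }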
For example, this is what we have implemented, and this runs on a board with
two Cell chips. What happens is that we consider the two Cell chips
as each being able to execute one of the
coarse grain tasks, and for the execution of each coarse grain task the PPU
generates work for the SPUs, okay. So this is our current implementation of the
hierarchical nested model, which essentially means something like having SMP
Superscalar on top, across the two PPUs, the two main processors, and each of them
running Cell Superscalar with its local SPUs.
Okay. So now let me merge the previous thing and this one a little bit. This is
an example where you have again a matrix multiply. This matrix multiply at the outer
level calls this first level of matrix multiply, and this first level of matrix multiply
calls SDM-1, but for SDM-1 you have two implementations. One is the generic
implementation, which we say is good for an SMP and is also
possible for the Cell. And we have another implementation, for which we have the
'implements' clause, saying this is an implementation of that function, of that
task. And this function is another implementation of the multiplication.
So essentially here we have our original task, one implementation of the fine
grain task for the SMP, for the host processor, and one implementation for the Cell.
This implementation for the Cell will be generated by giving this source code to the
Cell compiler, so it will be very bad, very poor in performance -- but we will have
that implementation. And this other implementation, for
the SPU, will be good, because we are actually using here an
assembly version of that routine.
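A hedged sketch of the alternative-implementations idea: a generic task plus a device-specific version declared as an implementation of the same task, so the runtime can choose between them at run time. The implements clause spelling is an assumption based on the talk, not a verified grammar.

    #define BS 64

    /* Generic version: compiled for the host (and, badly, for the SPU). */
    #pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
    void mmul_kernel(float A[BS][BS], float B[BS][BS], float C[BS][BS]);

    /* Hand-tuned SPU version registered as an implementation of the same
     * task, e.g. wrapping an assembly routine. */
    #pragma css target device(spu) implements(mmul_kernel)
    #pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
    void mmul_kernel_spu(float A[BS][BS], float B[BS][BS], float C[BS][BS]);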
So what we have is that this task can be executed with
this code running on the PPU, or with this code or with that code
running on the SPU. And the thing is that the runtime should be able to find out which
of these is best. We had an example before where we said that sometimes a
version compiled with the plain C compiler is better than the version with OpenCL, and
sometimes it's the other way around.
So the idea is something like this: you put both versions, one compiled with
this compiler and one compiled with OpenCL, or one written in
hand-written assembly language, okay? And the runtime should be able to
identify which is most appropriate -- or even not just identify which is most
appropriate, but give some of the tasks to one of the devices and some of them to
the other, and keep everybody busy, everybody contributing within its
capabilities, within its reach.
This is an example of what happens in this situation for a sparse matrix vector
multiply. What we have done is this: there were routines for
gathering, which are only run by the PPU, because for this gathering we
considered that the SPU would not perform well, and then there were other
operations, the actual matrix vector multiplies, for which we provided an
implementation for the PPU and an implementation for the SPU. Then the
system essentially decided where to run the different tasks, okay,
and it actually came out that it executed this many on the PPU and
those many on the SPU. It's a case of very fine granularity, and even if the SPU is fast,
the overhead of sending the data to the SPU, executing, and sending it back, at these
granularities, meant that it was better to do a large part of those computations on
the PPU, okay?
But still, the SPUs did some of the executions. This is a very preliminary
implementation, and we think this has to be more extensively studied
and developed. Okay? That was about the heterogeneity and nesting.
Now I'm going to talk a very little bit about Nanos, about the OpenMP compiler
that is going behind, trying to catch up by integrating into OpenMP those
features of StarSs that we consider interesting.
What is the current proposal here? Let's look at this example. First,
we of course have the input/output clauses for tasks, but we now do not force all
arguments to be part of these input/output clauses. We don't need every argument
to specify its directionality. It means that the in/out
clauses are just used to compute dependencies, and you only have to put
the ones you consider necessary.
So this saves you the burden of having to specify
inputs/outputs for everything. If you know what you are doing, it just saves overhead; if
not, it might be a little bit risky. So the current proposal is:
leave this to the responsibility of the programmer and enable him
to not put some of these clauses for some of the arguments if
he is sure that they are not necessary.
About the heterogeneity, the same thing. We have [inaudible] this target device
clause -- target device [inaudible], or target device cell, or target device SMP, which is the
default -- for heterogeneity, or even if the device is the same but you have several
implementations: for a generic method you can have different
implementations, okay?
And finally, the inputs/outputs that we have specified here: in StarSs
the meaning was that they were always used for dependencies,
and whether they were used for data transfers or not depended on the actual
implementation. Here that is actually separated: the inputs/outputs are just
used for dependencies, and if the user says, for this device I need to copy in or
copy out, the transfer part is separated, okay? So if I need to copy in or
copy out the arguments, I have to specify that separately.
So I may have combinations here, for example, where I can say: for a given
argument I don't put anything about in/out, because I don't want to compute
dependencies on that parameter -- I know it's not the one that dominates
the dependencies --
and I do put the copies, because I know that the device has an explicit separate
local memory and I really have to transfer from the global space to this separate
local memory.
This is another example, which is the sparse LU. And there's one
thing: we have put the pragmas inline, okay? Until now in StarSs the
pragmas went just before a function declaration, okay? And in OpenMP,
pragmas have traditionally been put inline, so when comparing the two things we
thought about what the difference is and what the differences in use of
those things are. And it turns out that putting them inline saves you -- because the
compiler does it -- the outlining of the code. And that is nice;
that saves you some time.
But we have actually found also that if you keep the pragmas inline,
the tasks have no name. So you cannot apply to them all these
things that we have been mentioning about heterogeneity and multiple
implementations. You cannot provide multiple implementations and let the
runtime decide which is the best. And that may be an interesting feature.
So essentially, in this proposed extension to OpenMP, we support both
things, which traditionally in OpenMP was not supported --
OpenMP traditionally had to be always just inline. Our proposal is that you can also
put the pragma before a function declaration, meaning that every single
invocation of that function is going to constitute a task.
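A hedged sketch contrasting the two styles just discussed: the proposed function-level annotation (a named task, so alternative implementations could later be attached to it) versus the traditional inline OpenMP task. The clause syntax is illustrative.

    /* Function-level: every invocation of scale() becomes a task, and the
     * task has a name that later clauses could refer to. */
    #pragma omp task input(a[n]) output(b[n])
    void scale(int n, const double *a, double *b);

    void caller(int n, double *a, double *b, double *c)
    {
        scale(n, a, b);            /* task created at each call site */

        /* Inline: an anonymous task, the traditional OpenMP style. */
        #pragma omp task
        {
            for (int i = 0; i < n; i++)
                c[i] = b[i] * 2.0;
        }
    }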
A few more things: well, there's Cholesky again, with the
same issues -- specifying a different device. I think this
is not really much different from what I just mentioned
before: I've annotated it so the pragmas go before the declaration of the
function, which means every single invocation of this function becomes a task. This is the
typical [inaudible] way of doing it, and this is the typical OpenMP way of doing it;
we think that supporting both is interesting -- it has the benefits of both
worlds.
Finally, there are the array sections. This is not yet implemented. There is a
slightly different syntax between what we
saw in StarSs and what people are proposing for OpenMP; we still have to
discuss this more, or unify it, or decide which is better. The idea would be to
support what I've been explaining for StarSs, okay?
And also, they are proposing ways of handling cases where you don't know
the size of the arguments -- you need to know the size of the arguments
in order to be able to really handle the data -- so there is a syntax for specifying
the size of the arguments.
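A small sketch of that size issue: when the argument is a bare pointer, its extent is not known from the type, so the section (and with it the size) is stated in the directive. The [lower:upper] spelling is an assumption about the syntax under discussion, not the final proposal.

    /* The sections give the runtime both the dependences and the sizes of
     * the data behind the pointers. */
    #pragma omp task input(src[0:n-1]) output(dst[0:n-1])
    void copy_vec(int n, const double *src, double *dst);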
That was about the MPI -- sorry, about the basic programming model. As I
said, even though we are developing this cluster superscalar version, we should be able,
if the granularity of the tasks is large, to work on clusters. I don't
expect it to work on a thousand processors -- certainly not on a thousand.
Maybe not on a hundred. But on a few nodes it should work. It is not
something that we propose for the very large end.
What we think is most appropriate for the very large end is this hierarchical
hybrid MPI plus SMP Superscalar, or MPI plus StarSs in general.
How does it work? I'm going to show it on two examples. One of them is just a
very simple matrix multiply code. Essentially, in this situation, you
have your MPI program: every process has a set of rows and a set of columns of the
matrices, so this is a very specific distribution -- not
really realistic for a general purpose application, maybe, but it's easy. You have a
set of rows and a set of columns, and then you have to generate a set of blocks of the result.
In order to do that, you do your multiplication and you generate a small block;
you shift the matrix, do the multiplication again, generate another block, shift
the matrix, okay? And in order to do that shift, essentially you
have an intermediate buffer, and there are odd and even iterations,
because it is usually a double-buffering scheme.
In one iteration you multiply these two things, and at the same time
you are sending your data to the next processor and
receiving your previous processor's data into the other buffer.
Next iteration, you multiply this by that and receive here, and next iteration you
multiply this by that and receive here.
This is the idea, and this is the source code: you do
the multiplication -- the multiplication calls some [inaudible] here -- and after
that you do a sendrecv to send to the next processor and receive from the
previous one.
If you execute it, you get a trace something like this. These are eight processes.
White is computing the matrix multiply and blue is the communication. So in
white you compute, and then you rotate, compute, rotate, compute
and rotate.
So the first consideration is: what do you do in the typical way of doing hybrid
programming? You would say, well, let me take the expensive part, the
computation, and let me parallelize it. Okay. So essentially you take the expensive
part, which is the computation, and in some way or another, with
collapse or some type of mechanism, you parallelize it. Okay?
This is the trace for that case. We still have eight processes here, and each of
these eight processes has been run with three threads, okay? So these
are the three matrix multiply parts:
what previously we had here is now split among three threads here, for the
second process among three threads, for the third process among three
threads.
So you parallelize that part, and then you have a barrier, and you wait until you
have finished this matrix multiply code, and then you do your MPI point-to-point
exchanges.
So you are parallelizing the computation part but not the communication part.
You are serializing the communication part: from the local point of view of the
process, it is serialized.
So this is what you would expect. There would be another way of doing the
parallelism, of trying to parallelize that loop: because we have to
do the matrix multiply and the exchange, the question is, can we do both of them at
the same time? And you can write that with an OpenMP syntax, something like
this: you do, at the same time, the matrix multiplication and the send-receive.
What is the situation here? The situation is that you do at the same
time the matrix multiply and the send-receive, the matrix multiply and the
send-receive, and so on, but the send-receive is much
shorter in this case than the matrix multiply, so you are not gaining much
over the original program. You are just overlapping the blue and the white.
Okay?
So this would be what you would get with MPI plus OpenMP if you tried this
parallelization. You still have these phases that are dominated by the matrix
multiply time.
Nevertheless, what happens if you still think of the same
thing, of the same approach, okay, but instead of doing it with OpenMP directives,
which have this fork-join type of semantics, you do it by saying: this is a
task, the multiply, and this is a task, the send-receive. So the thing
here is to encapsulate in StarSs tasks the actual MPI calls.
And for the MPI code, for the send-receive, you specify, from the local view of the
process, what are inputs and what are outputs, okay?
For a send-receive, what this internal send-receive will transmit over
the wire is, from the point of view of this task, an input. What the send-receive
will get from the wire is, from the local point of view of this local address space,
an output, okay? So then you just specify: this is the input buffer, this is
the output buffer.
For the matrix multiply, these are the inputs -- the two input
blocks -- and the input/output block, the result block. What happens when you
execute it this way? You would expect to see the same thing as before, because what
I have done is just encapsulate this as a task and that as a task.
But if you look at it and you unroll the whole graph, your whole graph is
something like this, okay? You have the matrix multiply and the exchange, the
matrix multiply and the exchange, the matrix multiply and the
exchange, but you have these dependencies: from one exchange you need it to
do the next one, and the next one, and the next one. What
you get out of this exchange you need for the next matrix multiply, and what you
get out of this exchange you need for that matrix multiply, but you have an
anti-dependence here: while you are multiplying this thing you cannot reuse the
input buffer of this thing, so you have an anti-dependence to here.
You cannot receive what will go into that buffer.
So essentially you would expect the same type of behavior: these two things go
in parallel, then these two things go in parallel, then these two things go in
parallel, and so on.
When you execute it, the behavior is not like that. The behavior is that,
because we have implemented this renaming mechanism, this anti-dependence
is broken, so you can actually proceed through this path at full
speed because of the renaming, and then
you can also proceed through this path as long as you have received the
communications.
The result, if you [inaudible] with three threads -- here we have three threads for
one process, three for another, three for another -- is that you start the
communication at the same time that you start the multiplication, so
communication and multiplication. When you finish the communication, you can
start that multiplication immediately, but you can also start this
communication immediately.
Why? Because the system renames the memory addresses, and this
communication is going to receive into a newly allocated piece of memory. So
you can get this for a very coarse grained type of program: you don't
need to program at a very fine grain, which is where the previous approach was going,
toward a much finer granularity of parallel programming. Here you
still have coarse grain parallel programming, but the mechanisms of the node-level
programming model, of StarSs inside the node, let you overlap -- and that actually
propagates to the whole MPI application, because you see that all processes are
doing this at the same time: these blue things communicate between
them, these other blue things communicate between them, and the same for
these others. Okay?
What I'm trying to show on this other example, which is Linpack, is the same
idea. In Linpack the algorithm is essentially this:
if you are the owner of a given panel, you factor it and you send it to
start the broadcast of that panel. If you are not the owner of that panel, you
essentially receive it and retransmit it, if necessary. The retransmission
is typically on a ring -- there are different approaches, but say a one-
dimensional ring or two rings. Well, essentially, you factor and send,
and if you are not the owner you just receive and retransmit.
And then you have -- with the panel that you have received, you had to update
the training matrix. What is the essential idea in merging MPI plus StarSs is that
you taskify also the MPI calls. You encapsulate the MPI calls, you encapsulate
the MPI calls in tasks. If the task is send, if the [inaudible] is going to send things
from your point of view of StarSs, this is an input argument. If you are receiving
things, this is an output argument.
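A minimal sketch of what such taskified MPI calls could look like, reusing the SMPSs-style directionality clauses from before; the function names, tag, and count handling are illustrative, not the code shown in the talk:

```c
#include <mpi.h>

/* Send of a factored panel: the panel is an input of the task, so the send
 * naturally waits for the factorization that produced it. */
#pragma css task input(panel)
void send_panel(double *panel, int count, int dest)
{
    MPI_Send(panel, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

/* Receive of a panel: the buffer is an output, so the trailing-matrix updates
 * that read this panel depend on the receive having completed. */
#pragma css task output(panel)
void recv_panel(double *panel, int count, int src)
{
    MPI_Recv(panel, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}
```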
So what happens when I execute this thing? Each of the processes -- each of these
columns here -- totally unrolls the whole execution graph of your Linpack
code. So you start with the factorizations, and the dependencies come from
them: on this factorization depend the send -- you have to send it -- and the
update. Once you have done this update, you can actually receive this data
and update the next part. And you can receive then again the next data and
update the next part.
On the other processor, this one for example receives the data, and what has
been received has to be sent on. With what you have received, you have to update
your trailing matrix, and if you are the next owner, you have to factor in this
case and the whole thing repeats. Okay?
So essentially you are unrolling a data flow graph locally inside each process, but
there are links between some tasks in one node and some tasks in another,
whose relative order is maintained by the MPI point-to-point
semantics.
What is the result? The result is this guy will try to progress as fast as possible
through his task graph. This guy the same, and this guy the same. And actually
it would be nice to be able to progress as fast as possible through the critical
path of the application. Okay? And that you can do: you can hint the scheduler to
say, well, tasks like the factorization and the send, these tasks are high priority,
and the updates are not so high priority. You can use the updates to fill in holes.
What is the result of this? We ran this, let's say, three times. Here there are 16
processes. This is time. And the yellow lines are communications. Okay? By
luck, one of the times the communications were spread over the whole execution
of the Linpack run.
Another time, also by luck, it randomly came out that there were a lot of
communications in the very first phase, but there was a long phase here without
any communication at all. And then again communications. The color
represents the panel you are updating, okay?
So what happens in this case? In this case what happened is that you went
through the critical path so fast that -- because of renaming -- everybody got
a lot of panels from their predecessors, let's say, a lot of panels that they could use
to do all the updates that they had pending, okay?
So for a long time they were not actually able to generate further tasks. This is
another thing that we have to do, because if you are too greedy, too
greedy with the renaming, you essentially exhaust the memory
of your system. So you have to put a limit, saying: if I have
allocated this much memory for renaming, I stall my task-generation
engine until I free some of the space.
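A rough sketch of such a limit -- not the actual runtime code: the generation side blocks when a renaming budget would be exceeded and is woken up when workers free renamed instances. All names and the budget value are illustrative.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t rn_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  rn_freed = PTHREAD_COND_INITIALIZER;
static size_t rn_bytes = 0;                          /* currently renamed     */
static const size_t RN_BUDGET = (size_t)512 << 20;   /* e.g. 512 MB, made up  */

/* Called by the task-generation engine when it wants a renamed instance. */
void *rename_alloc(size_t bytes)
{
    pthread_mutex_lock(&rn_lock);
    while (rn_bytes + bytes > RN_BUDGET)   /* stall generation until freed */
        pthread_cond_wait(&rn_freed, &rn_lock);
    rn_bytes += bytes;
    pthread_mutex_unlock(&rn_lock);
    return malloc(bytes);
}

/* Called when the last reader of a renamed copy finishes. */
void rename_free(void *p, size_t bytes)
{
    free(p);
    pthread_mutex_lock(&rn_lock);
    rn_bytes -= bytes;
    pthread_cond_broadcast(&rn_freed);
    pthread_mutex_unlock(&rn_lock);
}
```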
So anyway, you see three different runs, very different time behaviors, all of them
essentially taking the same total elapsed time, because the problem is not
limited by communication. You can move communication around -- spread it uniformly,
compact it at the beginning, compact it at the end -- but your limit is computation.
So we ran this on 158 processors and we compared it with MPI plus OpenMP.
This is the MPI and this is the MPI plus SMP Superscalar, okay? We are looking
at small problem sizes. This is not where you run Linpack typically to report it to
the top 500; you run it much, much, much higher. When you run it there, these
lines will converge again, because what we are actually doing here, which
is overlapping communication and computation, is irrelevant when communication
is irrelevant compared to computation.
Okay. So we are looking at this situation, which we think is relevant again
because, as I mentioned before, it is the situation of strong scaling. It's a
situation where you have fine granularities compared to your communication
costs. And you see that at 128 processors the MPI
plus SMP Superscalar behaves significantly better than the other two
approaches.
The MPI plus OpenMP is just the opposite of MPI plus StarSs:
it's much too rigid. With the barriers in OpenMP, if any thread suffers
anything, everybody suffers it. Here, if one thread suffers some delay or some
wait, well, as long as it's not always the same guy suffering
those delays -- sometimes others will receive those penalizations -- the whole
thing will more or less progress.
Well, we have it with 512 processors, we have it with 1,000 processors. And
again, for the small problem sizes there is this difference. What I have in this
slide is two things. On top is the sensitivity to bandwidth. What we did was take
this case -- well, this is with 512, okay, so this
corresponds to this.
We ran this case for a given problem size, a large problem size, and we ran it
with nominal bandwidth -- so the standard Linpack and the standard MPI
plus SMP Superscalar.
Then we modified the application so that for every message
that we send, we send another fake message of a given size, mimicking the effect
that the available bandwidth is less than the original one of the machine. And
this is what we report going towards the left: the more towards the left, the more we are
simulating an environment with less and less bandwidth.
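The talk does not spell out how the fake messages were injected; one common way is the MPI profiling interface, sketched here under the assumption of blocking, explicitly addressed sends and receives. FAKE_BYTES and FAKE_TAG are made up for the illustration, not the values used in the experiment.

```c
#include <mpi.h>

/* Every real message is followed by a dummy message of FAKE_BYTES on the same
 * link, so the effective bandwidth seen by the application shrinks; the
 * receiver drains the dummy right after the real receive. */
enum { FAKE_BYTES = 1 << 20, FAKE_TAG = 31337 };

static char pad[FAKE_BYTES];

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    PMPI_Send(pad, FAKE_BYTES, MPI_BYTE, dest, FAKE_TAG, comm);
    return rc;
}

int MPI_Recv(void *buf, int count, MPI_Datatype type,
             int src, int tag, MPI_Comm comm, MPI_Status *status)
{
    int rc = PMPI_Recv(buf, count, type, src, tag, comm, status);
    PMPI_Recv(pad, FAKE_BYTES, MPI_BYTE, src, FAKE_TAG, comm,
              MPI_STATUS_IGNORE);
    return rc;
}
```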
What do we see? At the beginning none of them is sensitive to that, so the
original Linpack for this problem size is also tolerant to bandwidth
reductions. But there's a point where the original HPL version starts feeling the
pain of less available bandwidth, while the MPI plus SMP Superscalar version
feels that much later. And this is what we saw before: it doesn't matter so much that
communication bandwidth is low, because the
communication will be more or less always spread across the execution.
If memory bandwidth is enough, this situation may happen. But the system is
tolerant to that situation.
This other measurement was studying the impact of operating system noise. So
we took two configurations. In this case the MPI plus SMP Superscalar was
better than the plain MPI, but I'm not interested in the absolute
performance, rather in the effect of introducing more and more system noise. How
did we introduce more and more system noise? Simple: a process sleeps
for a while, wakes up for a while, sleeps, wakes up, sleeps, wakes up.
The more frequently we do the wake-ups, or the longer the computation this
guy does when he wakes up, the more we go towards the left, okay? So this axis is
the period of the preemption. So as we do the preemptions more and more
frequently, what happens? The original HPL
starts feeling the pain in a linear way.
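A rough sketch of such a noise injector (not the actual tool used in the experiment): it alternates sleeping and burning CPU, and shrinking the period makes the preemptions more frequent. The default values are illustrative.

```c
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    long period_us = argc > 1 ? atol(argv[1]) : 10000; /* preemption period */
    long busy_us   = argc > 2 ? atol(argv[2]) : 500;   /* work per wake-up  */
    volatile long sink = 0;

    for (;;) {
        if (period_us > busy_us)
            usleep((useconds_t)(period_us - busy_us)); /* application runs   */
        for (long i = 0; i < busy_us * 2000; i++)      /* crude busy loop:   */
            sink += i;                                 /* steal some cycles  */
    }
    return 0;
}
```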
In the MPI plus SMP Superscalar, what happens? As long as
not everybody suffers all the penalization --
so if I go slow now, but while you go slow I can catch up, and we distribute the
penalization uniformly across the different processes -- then the whole system is
tolerant to much more perturbation, to much more system noise. Of course
there's a point where the noise is so important that everybody feels it.
But I have the strong feeling that this asynchronous data flow type of execution is
the real way of fighting operating system noise. Well, it's not just operating
system noise; it's variance in general. I mean, fighting variance I think is a lost battle.
>>: There are other sources besides the operating --
>> Jesus Labarta: Yes, there are many, many, many other sources, the
hardware itself. There are many other sources. And it's a lost battle: you will
think you are gaining something here and then a new source
will appear somewhere else. So I think the important thing is to learn how to live
with it, how to, in a dynamic way, survive the
variance.
And I think asynchrony is a real alternative, a real way of doing that. So,
just another topic. On this one I just got a few slides from Rosa, and
Rosa Padilla will be coming in two months, I think, and she can explain it in
much more detail.
We have a version of this that works in Java. Essentially you can specify, with what
should be a normal Java mechanism, for the different
classes and methods -- for each
method -- what is the type and the direction, the directionality clauses, of the
arguments. So essentially you have this implemented.
There are some other things which are available here and which we have
not implemented in SMP Superscalar, such as resource
constraints: you can specifically say I need this type of resource for this
method.
And this has been implemented in Java, as I said, using ProActive and
Javassist for the instrumentation, developing a custom class loader. But, apart
from the fact that they have already been doing some significant runs on
Marenostrum with biological codes, I'm really not very familiar with this work,
and I prefer that you attend that presentation in a couple of months from
now.
>>: So [inaudible].
>> Jesus Labarta: There is another thing about load balance, just one very
simple comment about load balance. As I mentioned, the way we see it is: for
large systems you'll have MPI plus StarSs. But in large systems you will have
large shared memory nodes. And the question is how many processes you put
in every node, okay? If you have eight-way nodes, do you put eight
processes? Do you put four processes, each of them two threads?
Do you put two processes, each of them four threads? What do you put?
The way we see a node is as a domain within which it's easy to switch resources
from one entity to another. So you can easily switch cores from one process to
another within a node. So what have we done? We have implemented a runtime
that is actually able to do that. When one MPI process gets blocked, it actually
lends its cores to another of the MPI processes inside the same
node.
Because of the StarSs mechanism you have task queues, so
if you get an additional core it's an additional worker that goes and
tries to execute tasks from this pool. So that's the idea. We have actually
implemented that on MPI plus OpenMP, which is more restrictive from the point of
view that you cannot change dynamically the number of threads that are working
in a process at every point in time; it has to be done when you
enter the parallel clause -- the parallel directive, sorry.
But even then we have been able to get benefits like this one: 2.5
times speedup on a real application, in real production mode, on 800 cores. What
happened with that application? It happened that it was extremely
imbalanced in some parts of the computation, extremely imbalanced. So what
did we do by putting in OpenMP? You start the application
with as many MPI processes as cores -- essentially you start the original MPI
application. You set the number of threads equal to one and you keep going with the
original MPI application.
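A sketch of the idea on the MPI plus OpenMP side, with a hypothetical lb_cores_available() standing in for the balancing runtime's query (its own core plus any cores lent by blocked ranks in the same node); here it simply returns the cores of the node so the example compiles.

```c
#include <omp.h>

/* Placeholder for the balancer's query; the real runtime would answer with
 * the cores this rank currently owns inside the node. */
static int lb_cores_available(void)
{
    return omp_get_num_procs();
}

/* Before each parallel region -- the only point where the thread count can
 * change -- the rank sizes the region to the cores it currently owns. */
void update_trailing_matrix(double *a, int n)
{
    omp_set_num_threads(lb_cores_available());
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            a[i * n + j] *= 0.5;   /* stand-in for the real update */
    }
}
```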
If the original MPI application is fine, you can go on fine. If the original MPI
application is imbalanced and some MPI processes get blocked, then the
runtime, when one gets blocked, lends the processors to another. And this other
one can shrink, and this is what we see here. The yellow part, this is the
situation without load balance: one process takes very long, one takes a little
bit more, and the other takes nothing.
When you load balance, then one process gets
immediately -- well, after a small period here, but essentially immediately -- the four
threads (this is in Marenostrum, four threads per node) working for it.
The whole system is still imbalanced, okay, but there
are other processes in other nodes that are idling.
>>: This is -- what's the time scale up here? How long does it take to give a core
away?
>> Jesus Labarta: How long does it take? We are looking at things like
microseconds -- tens, hundreds of microseconds. It's fine -- I mean, it's not
extremely fine, but I think that's the level of granularity that we are looking
at. And the system -- we will still have to evolve that -- may become
unstable, because you give threads to one guy, but then the
message arrives to the one that lent the processor, and then for a
moment you have two guys that want to work at the same time.
So there are issues there; we are still developing that, and
I think we have to put in some [inaudible] mechanisms.
>>: If you only had one process per node as you could do with StarSs -- you
know, you could sort of --
>> Jesus Labarta: Yes?
>>: Then the load balancing inside the process is sort of automatic. And so the
only time you feel this few-microsecond granularity is when you're going even
further afield. I mean you have some other process running alongside your
[inaudible] something. So it's finer grained, it seems to me, StarSs is, than --
>> Jesus Labarta: Than this? Yes. The internal scheduling inside StarSs is much
finer grain, okay? This [inaudible] is just one or two microseconds, less than that,
okay, going from one task to another.
But changing -- the problem is if you start with only one StarSs process, you
have the load balancing there goes well but you cannot balance between MPI
processes. Okay?
>>: Right.
>> Jesus Labarta: And actually, the way I see it, and I look at this more and more, is
as a way of fighting Amdahl's law. The way people parallelize these MPI
applications -- so people have MPI applications, okay, which might be
relatively well balanced or may have some imbalance at some points in time. But they are
MPI applications where, if they want to put in OpenMP, what they do is they take this
part, which is long, and they put OpenMP here. They take another part and put in
OpenMP. And so essentially you end up needing to put OpenMP
everywhere -- you have to parallelize the whole application again to
be competitive with the original MPI code run with more MPI
processes.
If you look at this way of running, with the load balancing, you don't need
to parallelize everything. Typically you will need to parallelize just the parts that
are imbalanced, the big parts at the end that are imbalanced. So with an incremental
parallelization of only small parts, together with the load
balancing mechanism, you can achieve the benefit of really better usage of the
whole system. And I think that that is interesting.
Yeah, I don't know -- at the very end, for very large systems
running with many, many MPI processes, I think the
current situation with hundreds of thousands of MPI processes -- those
are too many processes from my point of view. I would prefer not so many,
and each of them multi-threaded with many threads.
>>: [inaudible] load balancing due to blocking is one thing. But suppose work
grows to a large extent, so the MPI process gets oversubscribed with work. You
would still like to borrow some workers from some neighbors
who don't have so much work to do. You have to wait for them to run out
of work before you get any workers. So --
>> Jesus Labarta: And the problem with that, if you have different nodes, is that in order
to borrow -- if you just establish a mechanism to borrow work from the others --
StarSs relies on shared memory. Okay? So we would have
to use, let's say, this Cluster Superscalar version we are looking at, which might --
I said I don't [inaudible] for a thousand processors, but maybe for 32 it might
work.
So one way might be to use Cluster
Superscalar across nodes: put every process across, say, four nodes, for example,
and four processes all across the same four nodes. I don't know if I'm explaining
myself. But using this Cluster Superscalar as a mechanism to get work across
nodes. Okay?
I think there are interesting things there. For the moment, our load balancing we
are trying to do within the node. I think for the moment, with the architectures we
are seeing, with a few tens of cores -- this was with four cores per
node, which is very little -- with a few tens I think it will be a very
reasonable, a very interesting situation. But it is still true that you have the boundary
of the node. So you might like processors in one node to
help processors in another node. And that is a little bit more
challenging, because the address spaces are different.
Okay. So that is what we do at the level of the load balancing. About
transactional memory, I have shown some examples; I'm not sure I went through
the whole detail of that. But in the example of the [inaudible] mixed
simulation code, which is a fine [inaudible] code, one of the tasks was doing
updates of three different positions. And you need to do that atomically.
And what we have there is that we actually leverage -- because our compiler is
source to source and then we use a back-end compiler -- the capability of the
OpenMP compiler. So essentially for this
atomic update, what we have put is an OpenMP directive, OpenMP atomic, okay,
in front of each of these lines.
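A minimal sketch of what that looks like inside the task code; the function and variable names are illustrative, not the actual application's:

```c
/* Three components of a shared accumulation array are updated by concurrent
 * tasks, so each single update is protected with OpenMP's atomic construct,
 * which the back-end compiler lowers to atomic instructions. */
void accumulate_force(double *fx, double *fy, double *fz,
                      double dx, double dy, double dz)
{
    #pragma omp atomic
    *fx += dx;
    #pragma omp atomic
    *fy += dy;
    #pragma omp atomic
    *fz += dz;
}
```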
Our compiler sees the OpenMP atomic and lets it pass through to the
OpenMP compiler of this [inaudible], and the OpenMP compiler is capable of
generating the instructions for doing that with a typical load-linked/store-conditional
type of mechanism. So we are leveraging the atomic support
that OpenMP has, which covers only one statement. For examples that are not based on a
single update of a variable, where you need atomicity for several updates, probably
we would benefit from some transactional memory support, if it exists, and we could
leverage that capability.
This is where we're just starting. We have [inaudible] starting and we are
considering it; this is very, very rudimentary. But we think this is an area where
transactional memory would help. There is another area, which would be
speculative dependencies. So, a situation where we have one task and
another, and we know that there is a dependence, or there most probably is, or
there can be a dependence between both of them, but we also know that this
dependence may not be there. So we know that sometimes F1 will touch a
certain part of a data object and F2 a different part, and in this case there is
actually no real dependence between them; but there may be other instances
where F1 and F2 happen to touch the same region and then there is a
conflict, and other instances where there is no conflict. So today we have to specify
this as a dependence: we have to say there is a dependence and we
serialize that.
We could actually speculate: start both F1 and F2 at the same
time and check whether there has been a conflict between them or not. Probably
the only difference with standard, typical transactional memory is that in this
case we know that, out of these two, if one of them has to fail, the one that
fails is F2. F1 is the real one, okay?
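A conceptual sketch of that kind of speculation without transactional-memory hardware, just to make the idea concrete: F2 runs on a renamed copy, both tasks report the byte range they touched, and F2 commits only if the ranges do not overlap. In the real runtime the two would run concurrently on different workers; they are called sequentially here only to keep the sketch self-contained, and every name is hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct { size_t lo, hi; } range_t;   /* byte range touched by a task */

static bool overlaps(range_t a, range_t b)
{
    return a.lo < b.hi && b.lo < a.hi;
}

void run_speculatively(char *obj, size_t n,
                       range_t (*f1)(char *), range_t (*f2)(char *))
{
    char *copy = malloc(n);       /* renamed instance for the speculative F2 */
    memcpy(copy, obj, n);

    range_t r1 = f1(obj);         /* F1 is the "real" one, works in place    */
    range_t r2 = f2(copy);        /* F2 works on the private copy            */

    if (!overlaps(r1, r2)) {
        /* No conflict: commit only the region F2 wrote. */
        memcpy(obj + r2.lo, copy + r2.lo, r2.hi - r2.lo);
    } else {
        /* Conflict: F2 is the one that fails, so re-run it after F1. */
        f2(obj);
    }
    free(copy);
}
```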
So this is another area where we could use transactional support, for speculation.
How much speculation is necessary in this approach, or in your programs -- at
least in the area we are looking at, which is mostly scientific computing, maybe
not that much. But it's an area. Some type
of transactional memory support could also be used for control speculation,
where you have conditionals and you could execute both branches of the
conditional and just keep the final state of one of those and drop the state
of the other. This is just the same type of
speculation, the same type of thing that processors do with branch
prediction: try to go in certain directions and then graduate or not graduate
certain computations, depending on the later check of the conditions.
And for fault tolerance, something similar again. You could issue tasks that, if they
fail and they have not committed any [inaudible], can just be reissued
on a different resource.
About memory management, just to mention again the way of doing the renaming.
We have played with the typical question:
do you do the renaming very early or very late in the whole process?
Well, our initial implementation was doing the renaming very, very early. As you
know, it's better to do it later. We are now playing with lazy or late
implementations of the renaming. We are doing that in a different implementation
on the Cell Superscalar, using a mechanism based on the
hardware support that is available there for atomic updates of a
cache line.
On the SMPs we also do another implementation of this lazy renaming, which
essentially in this case means that we don't allocate the instances of the objects
until they are needed. And in case we can do in-place updates,
we don't need to reallocate these instances; we reuse a
previously allocated instance.
Just to say that this lazy renaming on the Cell also enabled us to avoid many
transfers: because the renaming was not done in the main
global address space but directly in the address space of
the SPE, you avoid bringing the data in, and you also avoid bringing
the data out, okay? So it saved transfers, which was sort of
interesting.
Another thing we do locally: you have a software cache, which means you
bring an object into your SPE, and if the next task uses the same object, you don't
need to bring it again.
One thing that we have been doing recently: that was done locally -- each SPE
manages its own local cache, okay? So the question, or the potential
interest, is: can you implement a kind of shared cache, so everybody is
able to access not only his own local cache but also the cache of the neighbors,
okay? And we implemented a mechanism which is also equivalent to
bypassing, okay? One functional unit produces one piece of data and maybe this
data is fed back into another functional unit without having to go, let's say, to the
registers or to main memory. Okay?
So this very same mechanism is something that we tried, also based on this
atomic support: it was possible for everybody to know what
everybody else had in their own cache.
And this other thing that we have here: these are the amounts of
memory accesses that are required for a given cache size, and this is
playing with the schedule to maximize the reuse. Essentially, if you unroll the
whole task graph, what you have, for example, is that if you
follow the dependencies in a depth-first mode, you are producing something
and you will consume it immediately, and consume it again immediately;
in this way you are
guaranteeing that you have data that you reuse. You are avoiding sending it
back to memory and bringing it back later.
So essentially the idea is: if we look at the whole task graph, how
would we traverse it in ways that try to maximize the locality? Once we bring
one piece of data in, try to reuse it as much as possible.
I think we have tried ways of doing this, essentially by putting additional
information in the task graph and finding, for example, that if
one son has two parents, when you schedule one parent you
[inaudible] to the other parent and try to schedule both parents
together, so that then you can schedule the son. We have tried to do these things.
Ideally, and this is what we have here, ideally one should be able to write a
program which is, let's say -- and this is a matrix multiply -- one should be able to
write a matrix multiply program where you multiply the whole row by the whole
column, whole row by whole column -- the standard trivial matrix
multiply. And the system should automatically be able to execute a blocked version
of the algorithm: a small part by a small part, a small part by a small
part. Just by traversing the task graph of the matrix multiply, which has very large,
N-squared dependence chains, you should be able to execute just a little bit of
one chain, a little bit of another, a little bit of another, and mimic what you are trying to do
when you write a blocked algorithm. So this would be the ideal thing.
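A sketch of what that ideal source would look like, reusing the StarSs-style annotations from before; N, the transposed-B convenience, and the task granularity (one dot product per element) are illustrative choices, not the experiment's actual code.

```c
#define N 1024   /* illustrative size */

/* One task per row-times-column accumulation: the programmer writes the
 * trivial algorithm, and it is up to the scheduler to walk the resulting
 * N x N grid of dependence chains in a blocked order that maximizes reuse. */
#pragma css task input(row, col) inout(c)
void dot(double row[N], double col[N], double *c)
{
    for (int k = 0; k < N; k++)
        *c += row[k] * col[k];
}

void matmul(double A[N][N], double Bt[N][N], double C[N][N])
{
    /* Bt is B transposed so a "column" is contiguous; purely for convenience. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dot(A[i], Bt[j], &C[i][j]);
}
```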
So we did some experiments with that. Some of that is possible, but I think it's
still far from realistic. And these are the gains: you can reduce a lot the
amount of transfers compared to the original one, in some
cases. In some other cases not that much, but there is still something. And
saving memory, saving bandwidth, I think is going to be good for accelerators, is
going to be good for any type of device. So the scheduler can save you
bandwidth, which I think is important. And the good thing is that,
because StarSs gives you that information -- you have the information of what
data you need -- you should be able to achieve that.
So then, just to conclude, we have an active research project with many ongoing
branches and many, many ongoing directions. We nevertheless think that we
are at a [inaudible] that is stable enough to be used on relevant production
applications, and this is the type of thing that we are trying now. We are trying
internally, in cooperation with other people, both in Europe and in
[inaudible], and putting in proposals to the Commission, to try to take
relevant, significant codes and try to use this. What I think is the most relevant
to start with is the MPI plus SMP Superscalar, okay? This is the most general,
the most flexible. Then you can try, with essentially the same source code --
you can optimize some task for a GPU, for example, and use a version
with SMP Superscalar plus Cell Superscalar.
But we still have quite a lot of things to investigate. And that's essentially the
direction, the lines, the ideas we have. I hope I've been able to explain a little
bit the type of work that we're doing.
>>: May I ask something?
>> Jesus Labarta: Sure.
>>: I can look at this [inaudible] I suffered with four hours of the tutorial. I
suffered more hours but some parts. And you know that I like very much the
work you do here and [inaudible]. But when you look at the second line
[inaudible] I feel that what you were explaining to us is more [inaudible] that you
are solving many things that we cannot do at the computer architecture there, for
example [inaudible] last two slides. When you did with the prefetching [inaudible]
many feel that -- I mean, we have [inaudible] at the hardware level, we don't have
enough bandwidth, the bandwidth [inaudible]. But all this kind of prefetching,
load balancing, synchronization, and then I think that you should add to the -- to
this tutorial this [inaudible]. When you say we minimize the number of buffer you
[inaudible] you know that there are many [inaudible] now that [inaudible] so I will
represent you can -- because I used to be computer architect when I was -- and
now I am doing [inaudible].
>> Jesus Labarta: No. It's --
>>: You should take [inaudible].
>> Jesus Labarta: Yes. We could do a lot of things, a lot of things in terms of
hardware support. Or we could at least discuss more things --
>>: [inaudible] the memories [inaudible] data processor.
>> Jesus Labarta: [inaudible].
>>: [inaudible].
>> Jesus Labarta: I'm not doing anything new. I'm just trying to -- and as
[inaudible] said, there are many things that you have already done in computer
architecture which we have not done because [inaudible].
>>: [inaudible]. When you were talking about how the hardware [inaudible] you
got some question this morning about the overhead of the runtime and you said
okay, when I run this runtime the current microprocessor I [inaudible] scheduling
that but you know that we are working in the hardware.
>> Jesus Labarta: Yes. I --
>>: So I didn't --
>> Jesus Labarta: No, I know that -- so you may do hardware things to help. My
real thought is I just do not know how really necessary they are or not. I always
have in mind discussions with Peter
Hofsdy [phonetic], okay?
His philosophy was: before putting
something in hardware, think very hard about whether you really need it -- whether
you cannot do that in software.
>>: But who said that, you or Peter?
>> Jesus Labarta: Peter. Yes. Because if you put these things in hardware, this
is power -- and he had a very minimalist view of the power --
>>: [inaudible] I mean the energy is powered by [inaudible].
>> Jesus Labarta: No, I'd say it's energy. You are going to put in hardware things
that you could put in software. The other thing is, I would really like to know, and
I'm lacking, the experience. And I know that it's interesting to experiment in
our field; it's interesting to experiment in finding ways of putting these things
in hardware, yes, I agree. And I think it's important. What I am still missing is a
real feeling, a real perception, of to what level it will be economically convenient or
necessary.
>>: I have another thing I think supports your point of this argument, and that is
that a lot of times we don't know what the policy is. And sometimes when we do
know what the policy is, it's hard to get the policy to respond as fast as the
hardware mechanism needs it to respond. Think about scheduling threads in
SMT versus scheduling threads in StarSs. Right? The tasks here are just fine to
schedule at that granularity, but if you start getting into a situation where you really
don't know whether a task is ready to run or not when you're doing things in
hardware, then you have a policy implementation thing you have to do that will
inform the hardware fast enough so that it can keep doing things at these
hardware rates. And that's sometimes challenging to do.
The same thing comes up in other kind of resource allocation strategies where
prefetching is another example. But what is the prefetch policy and how is it set?
Some computer architects that I know, but not well, think prefetching is just
wonderful and we'll just prefetch everything. And that wastes bandwidth
sometimes. Well, where's the policy? And who -- how is it set? And what
mechanism sets that policy? Is it set for the whole app at the beginning or what?
How does that -- so I --
>>: This is very well known. I mean you don't do prefetching always.
>>: Well, so -- how is -- what's the policy then?
>>: So the question is that -- I mean it's easier to [inaudible] without having any
kind of hardware because [inaudible] to be matching with [inaudible] but you
know very well that after some studies you decide that one -- functionality will be
very common normally [inaudible] to put that in hardware, okay? So now we
have -- my opinion is that we have a huge opportunity, from the knowledge
provided by [inaudible], to decide what kind of things should be put in hardware or
not.
But just one aside: prefetching. I would say this guy Jesus proposed it for the
numeric algorithms, because all these did not have the [inaudible] that they
[inaudible] -- he is doing the real prefetching, because
he knows in advance what they are going to need. This is very good. But the
thing that I like more is the reduction he does. Because we know
that in the current [inaudible] we have a [inaudible] with the
bandwidth. So the approach that you propose, this graph, allows that. And the
question of [inaudible] what are the [inaudible] that allow you to do that, I don't
know. But this is a very important thing [inaudible].
>>: You can do a lot just starting here and just working below here. You can do
quite a bit in terms of spatial and temporal locality, for example; you know a great
deal at this level.
>>: Local memory versus cache.
>>: Yeah, yeah, you know a lot of things here.
>>: Local memory is very important to --
>>: Yeah, yeah. And you can also, you know, you can treat the critical path
differently than the non-critical path. And that's [inaudible]. When something on
the critical path is enabled, when all its predecessors are done and you know it's
runnable, put it at the head of the run queue, not at the tail. Don't put it at the
bottom. Get it done now, because it's ready. So things like that.
And all of this can be done at this level. And it's hard to do this lower than this.
And you know, very rich interface to hardware to [inaudible] if you were to do the
hardware. So I think this is a good place to leverage. I always have thought that.
I've always thought that, you know, small -- if you can schedule small tasks
efficiently and build a system like this based on that, you have tremendous
leverage against a lot of important parallel computing problems.
And I bet you agree with that, too. I mean, SOAP works that way, too, right?
SOAP has very much the same kind of philosophy.
>>: Exactly.
>>: And the interesting thing is how eagerly you eliminate anti-dependencies.
Yeah. I mean, the usual thing is you get all of these blocked continuations eating
up space, but you have another source of space consumption, which is all of this
buffering stuff.
>> Jesus Labarta: Yeah. Let's say the current situation, for example in the
different versions that we have: in some of them we don't do any of that renaming;
it depends on the application.
And we have to be very careful to limit the size we
devote to that, because yes, it can be very, very eager. So this is an area where
these versions with a little bit of lazy renaming are, I think, important;
the original one that we did was very, very eager and we had to control the
space. And this is a --
>>: [inaudible] will generate a lot of renaming if you're not careful. Yeah. Yeah.
>> Jesus Labarta: Yes. But it was curious in many cases. In some cases, like
the example I showed of the matrix multiply, [inaudible] that gives
you a lot, I think, with simple algorithms. Because the other
thing people do is they do this by hand, they do expansions --
>>: But [inaudible].
>> Jesus Labarta: No, no, this is -- at the moment.
>>: [inaudible].
>> Jesus Labarta: At the moment we are trying with GROMACS, we are going to
do GADGET, we'll have to do WRF.
>>: [inaudible].
>>: So if you think about this evolving DAG, the way you're generating these
tasks, you can sort of see how wide things are, sort of just in time, to decide
essentially whether to do the update in place or not for the predecessors of a
task. So if this task could run in parallel if it's renamed, but has to wait if it's not,
you could sort of on the fly figure out --
>> Jesus Labarta: Yeah, this is what is done here. This version, what we call
here lazy renaming, is a little bit like that: if I have sufficient
parallelism, we don't do the renaming. If you don't have a lot of parallelism, then
you do it. Yes, that is what it is doing. You try to avoid it as long as you
have parallelism. So we do it because of -- yeah,
this is what you said.
We have to.
>>: [inaudible] memory you do in such a way that you [inaudible] memory. You
mentioned that [inaudible] the allocation of memory, the model, did you mention --
>> Jesus Labarta: Not yet. [inaudible] mention renaming yet. There's another
thing which I didn't mention because I didn't go into the example. One thing
about the way this probably might have some issues with Cilk and
recursion: where you have recursion, these kinds of things -- it was
in the sort and the [inaudible] example, which we have not
tried. This would be [inaudible]: the decision of when to cut the
recursion, when to start generating tasks and when not. That
should be something adapted based on the actual
width of the already unrolled graph and --
>>: Well, or done the way Burt Halstead did it even, right? I mean, you sort of --
or the way Haskell does it actually. Which is you materialize little -- little what
according to Haskell are called sparks and put them on stack actually. That's
what Cilk does too, I think.
>> Jesus Labarta: Yes, Cilk does this kind of --
>>: Yeah. And you put these things on the stack so that if you get to
them, you do them yourself, but they're still runnable by some other -- by
some worker that needs work to do.
>> Jesus Labarta: But still, if you
generate them as tasks, you still have some overhead.
>>: You don't generate them as tasks. They're sort of even pre-tasks. In
Haskell they're called sparks. I don't know what we call them on the MTA. And I
don't know what Cilk calls them either.
>>: We call them like DQ. It's [inaudible].
>>: Yeah. But it's just a description of the work. In fact, it looks like, you know,
task blah, blah, blah.
>> Jesus Labarta: Yeah. A descriptor, yeah, a descriptor which it calls locally --
>>: Yeah. It's just a descriptor and you -- and you --
>> Jesus Labarta: But if you were not even creating the descriptor,
you would also avoid some overheads, okay?
>>: Yeah. Yeah. So the cheapest thing to do is just leave it on the stack and let
it get encountered by -- by whatever core owns that stack at the time. Right?
And then you have to make that work visible to other cores that need -- that need
work to do.
What we did on the MTA was -- David Callahan and I suggested actually putting
it on the stack sort of late in the game before we left [inaudible] to come up here.
We were talking about doing the -- leaving it on the stack, but the way it worked
is we just had a FIFO of these descriptors and we would run unblocked
continuations as long as they existed and [inaudible] but if there were no more
unblocked continuation, no more things with stacks that needed to run, we would
go get work out of this thing.
And that -- that's sort of the way that worked. There's been variations on this
theme, but it's the only way though to really make divide and conquer work well,
recursive divide and conquer. It's really pretty simple when you do it this way.
So it's just more software for you, you know, just [inaudible].
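A rough sketch of the scheduling discipline just described -- run unblocked continuations while any exist, otherwise pop a work descriptor from the FIFO. Every type and function here is a hypothetical placeholder, not an actual MTA, Cilk, or Haskell API.

```c
#include <stddef.h>

typedef struct continuation continuation_t;   /* a blocked-then-ready stack   */
typedef struct descriptor   descriptor_t;     /* a "spark": description of work */

extern continuation_t *pop_unblocked_continuation(void);  /* placeholder */
extern descriptor_t   *pop_descriptor_fifo(void);         /* placeholder */
extern void run_continuation(continuation_t *c);          /* placeholder */
extern void run_descriptor(descriptor_t *d);               /* placeholder */

void worker_loop(void)
{
    for (;;) {
        continuation_t *c = pop_unblocked_continuation();
        if (c) { run_continuation(c); continue; }   /* continuations first */

        descriptor_t *d = pop_descriptor_fifo();
        if (d) { run_descriptor(d); continue; }     /* otherwise take new work */

        /* nothing runnable: idle or back off */
    }
}
```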
>> Jesus Labarta: It would be nice to find what is the minimal functionality that,
implemented in hardware, lets you apply policies, change the policies or adapt the
policies and so forth.
>>: Always. That has been our problem.
>> Jesus Labarta: Change the problem you have to address.
>>: People introduce the prospect and the process [inaudible] hardware this
could [inaudible] okay what part of the runtime you should do the hardware
[inaudible] very carefully [inaudible] I don't know, the [inaudible] I don't know a
[inaudible] we were discussing yesterday with [inaudible].
>>: That kind of synchronization.
>>: What kind of project you need some hardware to do that [inaudible].
>>: Yeah. Just -- I don't know if I talked to you. I talked to --
>>: Osman.
>>: Osman and you, especially Osman and the students, about this mwait
thing. Some of that we did in conjunction with Tim Harris, talking about
work stealing of this kind in Haskell and how to implement it. There are
two things you have to do. One is you have to resolve the race between
the core that owns the -- let me call it a spark, that's what Haskell calls it. So
Haskell's spark is one of these little descriptors of work that, if you encounter it
inline in your own execution on your stack, then you go ahead and do it, but
somebody else could steal it.
If it's stolen, you have to make the local core essentially wait for that thing, which
may mean that it actually yields the core to some other computation. That whole
thing turns into a continuation because it's been stolen. Let's suppose these
two things have to combine to go up. One core goes this way, let's say down this
side of the tree. So you go down this way and there's this
subcomputation rooted right here, okay? So this thing's working down here.
Some other core comes along and says, I want that, and takes that and starts
doing this. This one comes back up here. Now it has to block, waiting for this
value coming from below, or waiting for the side effects, or whatever it is it has to
block on, and then when this arrives there it has to unblock the whole computation.
So it's, you know, sort of dealing with the race between these
two things and also arranging for the right completion semantics. So it's not
hard to do.
We did that with full/empty bits on the MTA, and Haskell does it with regular stock
x86 hardware in some miraculous way -- well, you know, it's AMOs and compare
and swap and things like that. And, yeah. But we made sure we could do that
too. That was one of the important test cases, to be able to do all of that.
>>: Okay. I think that's the end of it. Thank you very much.