Jesus Labarta: So this is the second part of the presentation from the Barcelona Supercomputing Center. Osman Unsal is going to give a brief presentation of the relationship that exists between the center in Barcelona and Microsoft. Then we'll have a five-minute video, or maybe a ten-minute video, I don't know how long. A five-minute video. And then we'll continue with the presentation that Jesus Labarta is doing on StarSs. So, Osman. Or Mateo.

>>: Okay. You can do that, but we thought that we would talk about Barcelona. Probably you know Barcelona because of the Olympics. You don't like the circuit, but you know Barcelona because we organized the Olympics in '92. So this here, this is Barcelona, okay? This is the very famous Las Ramblas, Christopher Columbus -- it goes too fast, I don't know. So just a few figures about Barcelona. This is [inaudible]. You know Gaudi, the architect who built many, many buildings, and this is the Sagrada Familia, one of the most important monuments in Barcelona. It's going too fast, I'm sorry. Okay. So it is a very nice city, okay, where we are located. And this is the university, we are located here, and here is the stadium of the Barcelona soccer team, 120,000 people there. And I gave a keynote at ICS in New York comparing us a little with Manchester -- with the computer architecture that Manchester pioneered, and the same in England for football. But probably now we beat these guys in the final of the European league. As you probably know, we have very good players. So I'm sorry, because I tried to show you Barcelona a little better. And what is the object of it? The object of it is the second part of the movie. As you know, because initially Fabrizio Gagliardi, Tony Hey and many other people [inaudible] met here, we know each other, and since many years ago we established our collaboration, our research center in collaboration with Microsoft. So, the Barcelona Supercomputing Center: we are 300 people, 250 people in research. Jesus Labarta is the director of computer science. But associated with the BSC we have a joint center with Microsoft, okay? And Osman is the person acting as director of that center. After the movie I would like to show you, he will give you three or four minutes on the work we do there. And the object of it, again, is to show Barcelona, to show the center, a very nice video of the supercomputer we have and the work we do here, the work we do with you, and to invite you to come to Barcelona and spend some time with us doing research, on superscalar or StarSs or whatever, computer architecture or the topics we are working on with you. So I don't know how it works, but now we have a video, this one. And the audio? Oh, okay. So [inaudible].

Okay. We are the Barcelona Supercomputing Center, the Spanish national lab on supercomputing.

>>: You can see we had the supercomputing labs up there, okay. Very nice.

>>: The main supercomputer of the BSC, of the Barcelona Supercomputing Center, is MareNostrum, which you can see behind me. This supercomputer was for two and a half years the number one in Europe, and we were number four and number five in the world. It has more than 10,000 processors working together, and the point of using it is that we can do in one hour the work that any powerful laptop would need a year for. Our partners are the Ministry of Education in Spain in Madrid, Catalonia, and our Technical University of Catalonia, UPC. Recently the [inaudible] will join us.
This is the building where many people are. We also do research inside the BSC. We have four groups: computer science, computer applications, life sciences and earth sciences.

>>: Supercomputing is a key tool for the progress of society.

>>: The computer science department at BSC has cooperation with industry in the computer business. We have cooperation with Microsoft on transactional memory, and with IBM on the [inaudible], and [inaudible] will be the next supercomputer to be installed after MareNostrum, targeted to be around a hundred times faster, executing at around 10 petaflops. Our idea is that this computer design will be based on the Cell processor. So our cooperation with IBM is to work with them on things like the design of the next Cell processor, including aspects such as memory bandwidth optimization, tolerance to latency, and scalability. The project has six activities: applications, programming models, performance analysis tools, load balancing, interconnection network design and processor design. The project covers all aspects of the computer design and integrates the experience and know-how of different research groups at BSC in all these areas. This is one of the strengths of BSC. And by focusing all this research towards a unified objective, we intend to have an important impact on the design of future supercomputers.

>>: Fluid communication at the research level between the public and private sectors is key for the competitiveness of the country. High-performance computing is a competitive tool. If we don't compute, we won't be able to compete.

>>: The computer applications [inaudible] department develops parallel software that exploits the whole capabilities of the supercomputer. The software that we develop consists of numerical simulations of complex physical phenomena. Our main software development is Alya. The fundamental characteristic of Alya is its high scalability when using thousands of processors. Alya is able to manage meshes larger than 100 million elements. With Alya we have simulated important engineering problems like the noise produced by a high-speed train, the external aerodynamics of an airplane, or a wind tunnel. In the near future, the most interesting simulations that we are going to do with Alya are biomechanical simulations. The objective of these simulations is to provide a tool to experiment with different surgical possibilities. At present we are developing a simulation of the complex human airways, [inaudible] in the brain and the interior of the nose. We also develop software with external scientific groups. For example, we are developing the Siesta code, which allows performing ab initio molecular dynamics with extremely large molecular systems. Moreover, we develop software with industry. For example, we are in a joint project with Repsol, a Spanish oil company, developing the most powerful seismic imaging tool. This kind of tool is mandatory when you are looking for oil in areas like the Gulf of Mexico, where the oil is hidden by a salt layer. As a conclusion, we can say that supercomputing is a critical tool to maintain the competitiveness of our industry.

>>: Thanks to supercomputers we can model reality, and we can avoid some experiments when those experiments are too expensive, too risky, or just impossible.

>>: In the earth sciences department here at the BSC we have [inaudible]. The third one is about the air quality forecast system.
In particular, we work with an [inaudible] system for Europe, the Iberian Peninsula and [inaudible]. This forecast system has a spatial resolution of four kilometers in the grid size. This is important because it is the operational air quality forecast system with the highest spatial resolution in the world. We also work on [inaudible]. We test our operational modeling system every day to evaluate [inaudible]; the tests go from the satellite to the [inaudible] to America, also [inaudible]. At this moment, this [inaudible] with [inaudible] has been nominated by the [inaudible] as a regional center for the sand and dust storm warning system. This is important to evaluate, at a global level, the problem of [inaudible]. We also work on [inaudible] modeling, with the idea of improving these applications on the MareNostrum supercomputer and preparing the next IPCC modeling exercises.

>>: The research results affect people's lives by finding cures for serious illnesses, by fighting against climate change, or by looking for new sources of energy.

>>: In the life sciences department at the Barcelona Supercomputing Center, we work on the theoretical study of living organisms. We are trying to really understand them, we are trying to simulate them, trying to predict their behavior. And we have a very broad range of interests within this field, starting from the very detailed analysis of protein interactions and protein function up to the analysis of entire genomes or even metagenomic systems. In particular, we work on aspects like the study of protein and nucleic acid flexibility, the study of protein docking systems, [inaudible] the analysis of genome information, and also aspects of drug design. Living organisms are extremely complex, probably the most difficult to represent from the point of view of theory and simulation. And this is why we are such heavy users of MareNostrum and, in general, of supercomputing resources. We need to use very large computers to run simulations of, for instance, the flexibility of chromosomes or, for instance, the dynamics of [inaudible] pathways.

The Barcelona Supercomputing Center is not an end; it is a means toward helping to convert Barcelona into a globally recognized innovation and technology hub.

>>: The Barcelona Supercomputing Center is leading the supercomputing infrastructure in Spain. It is managing the Spanish supercomputing network, with supercomputers in Madrid, Cantabria, Malaga, Valencia, the island of La Palma and Zaragoza. Barcelona is hosting MareNostrum, which is the largest one of this Spanish supercomputing network. We have hundreds of users in these supercomputing facilities, and they do research in different areas such as sun eruptions, earthquake analysis, analysis of the market economy, and some others do analysis of new materials. And also, and this is very important for us, we try to do research with companies. Let me tell you that 40 percent of our budget comes from companies, 30 percent of our budget comes from [inaudible] projects, 20 percent from our partners and 10 percent from Spanish [inaudible]. We have to provide services to research in Spain, so we have a national committee of 44 people receiving all the proposals. They evaluate the quality of the projects, and they give us a list of the projects that will get access to MareNostrum.
We are more than 200 people working at BSC, more than 150 people doing research; and let me tell you that more than 60 of the researchers were not born in Spain. We attract talent from all over the world, and this is a very good asset of BSC. [music played].

>>: So thank you. I think, Osman, you should continue. This is a two-year-old video, so some of the data have changed by now, but it more or less reflects the place where MareNostrum is installed, the people, and the things we do there. Okay.

>> Osman Unsal: I would like to talk about our collaboration with Microsoft Research. It started as a project in April 2006, with an initial duration of two years, and the initial topic was transactional memory. Then we established the BSC-Microsoft Research Centre in January of 2008. We have a very heterogeneous group, mostly from Eastern Europe, and many people are young. So the list is long, but most of them are engineers and PhD students. On the technical side, we are collaborating with Tim Harris and Satnam Singh from Microsoft Research Cambridge and Doug Burger from here. And Fabrizio Gagliardi has been our mentor and has been the most important person in the forming stages of the lab.

>>: [inaudible].

>> Osman Unsal: Yes. Those are implicitly stated.

>>: [inaudible].

>> Osman Unsal: We have worked for about a couple of years on transactional memory, and we are just starting on vector support for low-power hand-held processors and on hardware acceleration for language runtime systems. So I will talk a little bit about those two as well. And this is a pet project of ours that I don't have time to go into in detail. So if you look at the matrix, this is where Microsoft Research Cambridge is: they are more on the functional-language and software transactional memory side, and we are on the other end of the matrix, working more on hardware transactional memory and imperative languages. And we have developed transactional memory applications. We have the Haskell Transactional Benchmark. We worked with recognition-mining-synthesis types of applications and transactified them. We transactified the game server Quake. We have two versions of that: one parallelized from the serial version using OpenMP and transactional memory, and for the other one we took a parallel version based on locks and transactified it. And we have a configurable transactional memory application to stress TM designs. We also worked on the programming model a little bit. We had some optimizations using Haskell STM. We proposed some extensions to OpenMP for transactional memory. And we are currently working on TM-supporting system libraries. On the architecture side, of course, we have our architectural simulators; we are in the computer architecture area, so that's our main forte. We have worked on a hardware transactional memory proposal that was in last year's MICRO conference. We are working on power-efficient transactional memory -- it's a contradiction in terms, but we are still working on that. And we are also working on hardware acceleration of software transactional memory. We have had quite a lot of nice publications on this. So we are right now at the wrapping-up stage of transactional memory research, and hopefully a couple of students will graduate this year, depending on how hard they work. The second topic, which we just started with Doug Burger here, is vector support for emerging processors. The idea is to support future applications for palmtop devices.
And we see that many of those applications require some kind of vector support. So those are the types of applications that we are looking at. On one end of the spectrum we are looking at those applications and how we can give low-power vector support for them, and on the other hand we are leveraging the work done here by Doug Burger's group on the EDGE architecture, which is basically an architecture that is composable: you can form larger, higher-performance cores out of the basic building blocks of smaller cores. And this is good for scalar performance. I will not go into a lot of detail, but we want to extend this EDGE architecture with vector support, looking into things like low-power pipelined floating-point units, vector prefetching, and mapping strategies for vectors -- either SIMD-like or vector-chaining approaches, or a hybrid approach. The idea is to do well both on the scalar code and on the vector code and to have the high-performance, low-power hand-held processor of the future. That's the theme of this collaboration that we have just recently started. And because bandwidth is an issue, we are also looking into how to make use, again in a power-aware way, of stacking, dynamic bank allocation, 3D memory; those are more on the hardware side. And on accelerating language runtime systems we are collaborating with Tim Harris, and we have an upcoming ASPLOS paper on this. The basic idea there -- I'll just go to the end -- is a new instruction called the dynamic filter. The idea is that for a lot of the types of operations where there are read/write barriers, like software transactional memory or garbage collection, we are doing a lot of checks that might be unnecessary. For example, in the case of software transactional memory, we add an element to the read set, and when we do the read again later on, we do the same operation again. What we propose is to have a small hardware structure that associatively checks whether it has seen an address before, and if it has, it does not perform the check again. We have seen that, across a couple of application domains, this simple ISA extension gives us performance benefits. And that's all, really. Those are the main three things we are working on. We also want to look at hardware support for synchronization for many-cores, but that is something that has not started yet. So, in a nutshell, that's it -- I ran way over my three minutes.
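A rough way to picture the dynamic filter just described is the software analogue below; the hardware proposal does the "have I seen this address?" check associatively in a single instruction. All names here (filter_first_time, stm_add_to_read_set) are hypothetical and only illustrate the idea of skipping redundant read-barrier work, not the actual ISA extension or runtime interface.

    #include <stdint.h>

    /* Hypothetical slow path: the expensive barrier work (e.g. logging
     * the location in the transaction's read set). */
    extern void stm_add_to_read_set(const void *addr);

    #define FILTER_SLOTS 64
    static const void *filter[FILTER_SLOTS];   /* recently seen addresses */

    /* Returns 1 the first time an address is seen, 0 on filtered repeats. */
    static int filter_first_time(const void *addr)
    {
        unsigned slot = (unsigned)(((uintptr_t)addr >> 4) % FILTER_SLOTS);
        if (filter[slot] == addr)
            return 0;            /* seen before: skip the redundant check */
        filter[slot] = addr;
        return 1;                /* new address: do the full barrier once */
    }

    void stm_read_barrier(const void *addr)
    {
        if (filter_first_time(addr))
            stm_add_to_read_set(addr);
    }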
But yes?

>>: [inaudible].

>> Osman Unsal: Sure.

>>: I mainly want [inaudible] projects [inaudible] where we have reached some [inaudible].

>> Osman Unsal: We are actually continuing. But it looks like, also from the Microsoft side, the feedback we got was that Tim Harris, who wrote the book on transactional memory, was saying that this topic has kind of reached a plateau. So we are continuing to work on transactional memory; it's just that we are not going to open new lines -- we are just trying to graduate the students that we put on the project. But we are looking at things that could be interesting, like using the transactional memory programming model for doing things other than synchronization. I think that area is quite interesting. One thing in particular that we were looking at was how to combine fault tolerance with transactional memory, or how to combine data flow with transactional memory. So those are the kinds of things that we are looking into now. We are really not done, but we are trying not to open up too many fronts -- the new students will not work on transactional memory.

>> Jesus Labarta: Okay. So the idea now is to continue. We had already seen this example; it was about continuing to the case where you support strided and partially aliased references. So this is the equivalent of the previous slides, but now we support, for example, one task writing this part of the space and another task writing a subset of it, or just this part and then just that part. And the idea, again, is how to compute the dependencies here -- and probably, even before that, how to specify these different regions from the programming model point of view. Essentially, what we want to be able to do is specify, out of a two-dimensional matrix, a set of rows, or a set of columns, or a square block inside it, or a rectangular block inside it. The syntax, just with an example, would look something like this: on one side you have the argument definition and the definition of the size of the arguments, and on the other side you have some type of region descriptors where you specify the subsets of those domains that you access. In this case I'm only accessing from A[5] with size BS, that is, up to A[5+BS], and from A[K] to A[K+BS], okay? So there is a notation where you can specify the ranges by giving the whole dimension, just a single value, a range from here to here, or a starting point with a length. So out of a larger data structure you can specify which inner parts you access. And additionally, you can apply that to libraries which have not been built around blocked data structures but use the more traditional, regular storage -- standard program and class libraries. With this syntax you can taskify and leverage existing codes: you only need to apply the pragma in a .h file. You don't even need to recompile or modify the actual library; the same binaries that we have been using on sequential machines can be reused. Well, to compute dependencies we have to manipulate regions, and yes, it's complicated, okay? A region in two or in three dimensions -- finding out whether two regions overlap or do not overlap is a potentially complex thing. Here is just a very coarse idea of the approach we follow. We represent a region by a representation of the addresses that belong to that region, using an X in the bit positions where the address can be either a zero or a one. So one of these patterns, for example, can represent several addresses at once. This one would be a block of contiguous data, while this other one is a block of non-contiguous data, okay? So one of these vectors represents one region. How do we represent all the regions, all the references of the previous accesses? We have a tree: let's say we put this pattern vertically and traverse it, and for every new address bit we can have either a zero, a one, or an X, okay? And we represent that tree with -- well, there are approximations here. Later we have to compare a new reference by searching it through the tree. We use approximations both in the way of generating the mapping from a region to this representation, as well as in the tree itself.
What do you store in the tree, and what happens when you have a region of a given size and then a new region is a subset of that one -- what do you do with the previous representation? Do you keep both regions in the tree, or do you partition them in such a way that you can separate the parts that are common from the parts that are not? There are different options here. The impact of these decisions -- and of the decision of how to map regions to this representation -- is that the results depend on the approximation: on the base address, on whether we use aligned addresses or not, on whether the leading dimensions are multiples of two or not, and on whether the block sizes are powers of two or not. In general, because we use these approximations, the results are conservative: our approach detects more dependencies than really exist, okay? And this is an example of the dependencies there are in, for example, a 2D FFT, which is just the FFTs by rows, the transpositions, the FFTs by rows again and the transpositions. And this is the graph that would result for the same execution if, for example, the base address of the data is not aligned. Of course, there's much more parallelism here than here, but there is still some parallelism. The question is: does it pay off, is it still usable for some applications or for all applications? The overhead of this search, which is large -- I will show you afterwards -- is it an overhead that pays off? And I have some examples. This is just to show the FFT: you remember this is virtually the same code that we had in the FFT before. The main difference is that now we specify, out of the arguments -- this is the argument with its full size -- which parts of it we are actually using, in the declaration of the functions. And the result is that we don't need those barriers that we had before, which were there just because we could not handle the complex dependencies; now the system handles these complex dependencies. The result is this execution -- this is a trace. In blue we have the first set of FFTs; in this purple we have the first set of transpositions, then the second set of FFTs and the second set of transpositions. The barriers have disappeared, and now the execution proceeds out of order as long as the dependencies are satisfied. For example, here we had one instance of a transposition that happened much later than the others, and we had already started doing several of the FFTs. The results, we think, are very interesting, but we will come back to that. This is a Gauss-Seidel [inaudible], and again, the same thing. You just specify the regions. In this case you have only one argument, but you specify the part of it that you touch, that you really read and write, which is the yellow part, and you specify the parts that you only read, okay? By specifying these different parts of the original argument, the system is able to compute essentially a wave-front execution: you progress through the matrix in a wave-front manner. So compared to other approaches that cannot do this wave-front, this is one of the areas where computing the dependencies is useful.
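To make the region idea concrete, here is a rough sketch of how a Gauss-Seidel block task could be annotated. The bracketed [start;length] region notation and the css pragma prefix are only illustrative of what is being described, not necessarily the exact StarSs spelling, and the update is simplified so that it only reads the row above and the column to the left, matching the declared regions.

    #define BS 128
    #define N  (1 + 8 * BS)          /* row/column 0 acts as the boundary */

    /* Illustrative notation: A[i;BS][j;BS] is meant to read as "BS rows
     * starting at i, BS columns starting at j".  The task updates its own
     * block and only reads the last row of the block above and the last
     * column of the block to its left; declaring exactly these regions is
     * what lets the runtime derive the wave-front. */
    #pragma css task inout(A[i;BS][j;BS]) \
                     input(A[i-1;1][j;BS], A[i;BS][j-1;1])
    void gs_block(float (*A)[N], int i, int j)
    {
        for (int ii = i; ii < i + BS; ii++)
            for (int jj = j; jj < j + BS; jj++)
                A[ii][jj] = 0.5f * (A[ii - 1][jj] + A[ii][jj - 1]);
    }

    void gauss_seidel(float (*A)[N])
    {
        for (int i = 1; i < N; i += BS)
            for (int j = 1; j < N; j += BS)
                gs_block(A, i, j);   /* tasks execute as a wave-front */
    }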
With this you can be much faster than, let's say, pure OpenMP, or than the previous version, which required barriers. We can do sorting, for example, okay? Let me go to this example, which shows a little bit the fact that we can leverage existing implementations of, for example, BLAS routines. So it's computing the Cholesky again, but now the matrix is stored in the traditional row-wise or column-wise storage format. The code at the very first level is still the same. We are leveraging the BLAS libraries, and for every task we specify: this is the size of the argument, of the original data structure -- this gives the compiler the information about the leading dimensions -- and the actual argument that we pass here is the original reference, the original start of the sub-block. And this part here gives the compiler the information of how much of the total data structure we were passed we are actually using. Again, this is an example where the block size is an argument to the call, so this is valid for any block size. So this is the performance -- this is on [inaudible] again, up to 64 processors -- and this is an old version of MKL, I think this is a more up-to-date version of MKL, and this is what we get. What we are actually doing here is just orchestrating, sequencing different invocations of individual MKL kernels. As the actual computation kernel we are still using MKL 10.1, but we are executing several of those invocations concurrently, as long as the dependencies are satisfied. So, this is about the runtime overheads. You see the numbers here, and the numbers are large, okay. The time it takes to detect whether a task has dependencies with the previous ones or not is sometimes very large: 270 microseconds is huge. Nevertheless, is that a lot compared to the total computation time or not? It's typically a small percentage of the total computation time. This other row says that the overhead is actually paid by the master thread only, which is what we said before. Is the master thread just overloaded with that, with no time for anything else, or does it have time for some additional work? Well, when you run with only one thread, for example, in all the cases the amount of work for the master thread, in terms of overhead, is minimal, so the rest of its time can be devoted to computing. When you run with 32 threads, in this case, for example, the main thread devotes 10 percent of its time to the generation and handling of the task graph. That is something, but you still have 31 more threads that are spending their time computing. In some cases the time the master thread spends [inaudible] in the graph is more significant, 25 percent: you lose a quarter of a processor just to handle the task graph. Is that a lot? Out of 32, it is not that much on one side, and on the other side it is the lesser of your problems. What is the larger one? Here you see the graph of the size of the task graphs, okay, and here you see the duration of the tasks.
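Going back to the header-annotation idea just described for reusing an existing BLAS/MKL binary: a sketch of what such a .h file might look like is below. The function names are hypothetical tile-level routines standing in for the annotated library prototypes, and the [start;length] region notation is the same illustrative one used above, meaning "a bs x bs block starting at the pointer passed in, inside a matrix of leading dimension ld".

    /* blas_tasks.h -- sketch only.  The already compiled library is reused
     * unchanged; the pragma on the declaration is what turns every call
     * into a task whose dependences come from the declared sub-blocks. */

    #pragma css task input(bs, ld) inout(A[0;bs][0;bs])
    void potrf_tile(int bs, int ld, double *A);

    #pragma css task input(bs, ld, A[0;bs][0;bs], B[0;bs][0;bs]) \
                     inout(C[0;bs][0;bs])
    void gemm_tile(int bs, int ld, double *A, double *B, double *C);

    /* The Cholesky driver keeps calling these routines exactly as the
     * sequential code did; only the declarations above are annotated. */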
What is the largest of your problems? The largest of your problems is that the machine is a [inaudible] machine with limited bandwidth. So as you increase the number of processors, what happens is that the memory accesses of the individual tasks take more time: there is more contention, more conflicts, and the individual access times increase. Sometimes you have cases where the average access times increase tremendously, okay? And this is the average per task; it depends -- there are very short tasks and very long tasks -- so this ratio does not directly say how it affects the total execution time. But it is true that in some cases the difference is not huge, while in other cases the difference in the execution time of the tasks is very significant. What does that mean? It means that a significant part of your problem is the locality issue, the access to memory. So that was about the overhead. Our perception, our feeling, is that in many of these cases you can really pick a granularity that is more than enough and, as we said, it can be changed just by changing this block size prior to invocation. And this is another thing: with this approach we are not setting up a layout of the data, which is what you do in typical message-passing programming -- you do it even in PGAS languages -- where you partition your data, and partitioning the data implies that you are very statically determining the schedules. This is much more dynamic. But still, the type of problems are the same. The problem is how you handle the locality, and the question is: do we have in our approach the mechanisms to address that locality? Our view is that you do have the information: you have information on which data every task accesses and uses, and this can be used to improve locality. One example is again the Cholesky. This is more or less what I showed you before with 32 processors: this is the original MKL version, this is the improved MKL version, and this is our version with our StarSs approach as is, with no special handling except the dynamic asynchronous execution of the tasks. What is the difference with these other plots? In these other plots we have done two things. One of them is reshaping: changing the layout of the data. Because here we have information on how the data is laid out in main memory and on which part of the data we are using, the compiler has the information to determine: well, it might be good to take that data, bring it in, compact it together in a local storage area, and access it with unit stride, for example. The compiler can handle that: it can make the copies, change the layout, and invoke the task on the new layout. It's what programmers normally do by hand, using work arrays, moving data from the global structure to the work arrays and copying it back. The compiler can do that on its own. So here is a picture of the different benefits of this type of transformation, and the picture is a histogram. We have executed Cholesky, and this is a histogram of the IPCs of the 32 processors. How do you read this histogram? There is one row per process, okay? The different columns are the different bins of the histogram. So this axis is IPC: this end is probably an IPC of 0.1 or 0.02, an IPC of four or five is somewhere around here, and six is here, okay.
So if I have an IPC of six and I have a hundred bins here, or however many, each bin would represent increments of IPC of 0.06. And the value of the histogram entry is the color: dark blue is a high value, light green is a low value. So how do we read this thing? What we read is that the first four threads -- one, two, three and four; there's an empty one here, which is an internal thread of the system -- the first four user-level threads have a high value of IPC, here, close to six. So these four have very high IPC. Where is everybody else? Everybody else has an IPC which is not half of that but about two-thirds of it -- much worse IPC. So even if we achieve that everybody is working, they are working at different IPCs. Why? In this case, the data was allocated in the memory of the node where the first four processors are. So the first four processors were accessing local memory while everybody else was accessing that memory remotely. The result is that these first four processors run much better, get much better IPC. One advantage of the dynamic scheduling is that at least they will also execute more tasks, okay? The scheduling is not static, so at least they will keep busy and execute more tasks. And the others are going to have much worse IPC; apart from the distance, there is also the contention in the access to that memory module. So one way of improving the behavior of the program is just initializing the matrix scattered, distributed, interleaved across memory modules, okay, and keeping the same dynamic mechanism, which is not locality-[inaudible] -- just because, if the data is distributed, there will be less contention for a given memory position. How does the same program run then, how does the IPC behave? What would we expect? We would expect less contention, so better IPC, okay? The result is this one. These guys improve their IPC very significantly; these other guys go a little bit worse -- they were around here before -- but overall it is significantly better. So without any real locality awareness, just by avoiding conflicts on a single memory bank, memory module, node, we improve the IPC quite significantly. Here every task still accesses its data with a non-unit stride. So what happens if our runtime brings the data together into contiguous regions? Just by bringing the data together into contiguous regions we do get an improvement, from here to here -- an improvement of maybe 10, 15 percent in the IPC, okay? And, by the way, these are some of the additional task activities that we generate, which are just the copies of the data, from the strided to the contiguous storage. Can we do a little bit better? Well, this was doing the copy from the strided to the contiguous storage, but we could allocate the contiguous copy in the same place where we are going to execute, and then we get this IPC -- still a little bit better. So, just to say that the importance of the [inaudible], of handling the [inaudible] properly, is really high.
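One concrete way to get the "matrix interleaved across memory modules" effect mentioned above is sketched below using libnuma, so that no single node's memory becomes the hot spot. The talk does not say which mechanism was actually used; first-touch initialization from different threads would be another common option.

    #include <numa.h>        /* libnuma; link with -lnuma */
    #include <stdlib.h>

    /* Allocate n_elems doubles spread round-robin across the NUMA nodes.
     * Free the buffer with numa_free() when NUMA is available. */
    double *alloc_interleaved(size_t n_elems)
    {
        if (numa_available() < 0)                      /* no NUMA support: plain malloc */
            return malloc(n_elems * sizeof(double));
        return numa_alloc_interleaved(n_elems * sizeof(double));
    }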
This is the same histogram in terms of cache misses. With one of these transformations -- when you reshape, when you bring the data together -- you go from very high cache miss ratios to much lower cache miss ratios, okay. So that was about locality. That was only on the Cholesky; we are evaluating it now on other algorithms. But we consider this a very important set of transformations: dynamically changing the layout of the data, dynamically generating and handling work arrays with unit stride, as well as making them local to the actual execution. So now what I'm going to do is go through different things, with a few slides for each of them. The first one is how we address heterogeneity, how we look at systems with different types of cores. Back again to Cholesky. This is the standard Cholesky that we have seen until now, with its dependence graph. One of the things that we could do with these tasks that we have here is say: well, maybe this task is very appropriate, or not, to be run on a GPU, okay? This task may be appropriate for another device, and this task, for example, performs well on both this and that device. So the idea is that for each task you should essentially be able to say which device, or devices, are appropriate for it. What we have added is this target device clause, which essentially tells the compiler what an appropriate device for that task is. What the compiler essentially does is that the code outlined from this routine is actually fed into the specific compiler for the specific accelerator, okay? This is the idea; this is the directive that we have implemented. Let me first comment on nesting, and then I'll comment on the marriage of the two things. Nesting: there would be two ways of looking at the hierarchical nesting of the generation of tasks. One of them is: inside a coarse-grain task, generate fine-grain tasks. You have a program which at the outermost level generates coarse-grain tasks, and then inside each of them the execution actually instantiates a new StarSs engine that, within that environment, generates a new task graph and executes that task graph locally. The good thing is that the outer task is a container that hides the external world from this local set of tasks. There is another advantage: the outer engine generates only coarse-grain, very big tasks, so its overhead with respect to the total time is very small, and when you start the execution of the individual tasks, you [inaudible] for each of them a new engine. So what we said before, about there being only one thread, one process, generating all the work -- in this approach, in this situation, that is not the case. Every one of these tasks has an independent generator, so the overhead is amortized much more: you can do the generation in parallel. So this is one of the potential approaches. There would be another approach, which would be to let every single task generate additional tasks, but I do think this is extremely complicated, because you then have to handle a single task graph with both the green and the red tasks adding tasks to it.
And the issue we have there is that in the original case at least we have somebody that sets up an order; here you lose the order. Should this dependence come from here, or should it come from there? This is something that you cannot really control. In this approach there is not really a good way to set up what is the real, let's say, program order. In the other case, you do have such a program order. So we have followed this case on the left, this example on the left. Essentially you can complement any program: each of the tasks you can complement internally, sometimes with very trivial dependencies because the algorithm is very simple, sometimes with not so trivial dependencies because the algorithm is more complicated. Or this could be a matrix multiply where the outer program first calls a level of matrix multiplies and this one calls a second level of matrix multiplies. For example, this is what we have implemented, and it runs on a board with two Cell chips. What happens is that we consider the two Cell chips as each being able to execute one of the coarse-grain tasks, and for the execution of each coarse-grain task the PPU generates work for the SPUs, okay. So this is our current implementation of the hierarchical nested model, which essentially means something like having SMP Superscalar on top -- the two PPUs, the two main processors -- and each of them running Cell Superscalar with its local SPUs, okay. So now let me merge a little bit the previous thing and this one. This is an example where you have again a matrix multiply. This matrix multiply at the outer level calls this first level of matrix multiply, and this first level of matrix multiply calls SDM-1, but for SDM-1 you have two implementations. One is the generic implementation, which we say is good for an SMP, and is possible for a Cell. And we have another implementation for which we have the implements clause, saying this is an implementation of that function, of that task; and this function is another implementation of the multiplication. So essentially here we have our circled task: one implementation of the fine-grain task for the SMP, for the host processor, and one implementation for the Cell. This implementation for the Cell will be generated by giving this source code to the Cell compiler, so it will be very poor in performance, but we will have that implementation. And this is yet another implementation, for the SPU, that will be good, because we are actually using here an assembly version of that routine. So what we have is that this task can be executed with this code running on the PPU, or with this code or this other code running on the SPU. And the thing is, the runtime should be able to find out which of these is best. We had an example before where we said that sometimes a version compiled with GCC is better than the version with OpenCL, and sometimes it's the other way around. So the idea here is something like that: you put both versions, one compiled with this compiler and one compiled with OpenCL, or one written in hand-tuned assembly language, okay? And the runtime should be able to identify which is most appropriate.
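A sketch of the two clauses just mentioned -- the target device clause and the implements clause -- is below. The clause names and their placement are approximate, taken from the description in the talk rather than from a language manual, and the function names are hypothetical.

    #define BS 64

    /* The task and its dependences are declared once, together with the
     * devices it can run on. */
    #pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS]) \
                     target device(smp, cell)
    void block_mm(float *A, float *B, float *C);

    /* An alternative, hand-tuned implementation of the same task (for
     * example an SPU assembly kernel underneath).  The implements clause
     * ties it to block_mm; the runtime is then free to schedule each
     * invocation on whichever version and device it finds best. */
    #pragma css task implements(block_mm) target device(cell)
    void block_mm_spu(float *A, float *B, float *C);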
And not only identify which is most appropriate, but also give some of the tasks to one of the devices and some to the other, and keep everybody busy, everybody contributing within its capabilities, within its reach. This is an example of what happens in this situation for a sparse matrix-vector multiply. What we have done is: there were routines for gathering, which are only run by the PPU, because for these gatherings we considered that the SPU would not perform well; and then there were other operations, the actual matrix-vector multiplies, for which we provided an implementation for the PPU and an implementation for the SPU. Then the system essentially decided where to run the different tasks, okay, and it actually came out that it executed this many on the PPU and those many on the SPU. It's a case of very fine granularity, and even if the SPU is fast, the overhead of sending the data to the SPU, executing, and sending it back at these granularities meant that it was better to do a large part of the computation on the PPU, okay? But still, the SPUs did some of the executions. This is a very preliminary implementation, and we think it has to be more extensively studied and developed, okay? That was about heterogeneity and nesting. Now I'll talk a very little bit about Nanos, about the OpenMP compiler that is going on behind and trying to catch up by integrating into OpenMP those features of StarSs that we consider interesting. What is the current proposal here? Let's look at this example. First, we of course have the input/output clauses for tasks, but we now do not force all arguments to be part of these input/output clauses: we don't need to specify the directionality of every argument. The in/out clauses are just used to compute dependencies, and you have to put in only the ones you consider necessary. So this saves you the burden of having to specify inputs/outputs for everything: if you know what you are doing, it just saves overhead; if not, it might be a little bit risky. So the current proposal is: well, leave this to the responsibility of the programmer and enable him at least to not put some of these clauses on some of the arguments if he is sure they are not necessary. About heterogeneity, the same thing: we [inaudible] this target device CUDA, or target device Cell, or target device SMP, which is the default -- for heterogeneity, or even if the device is the same but you have several implementations: for a generic method you can have different implementations, okay? And finally, the inputs and outputs that we have specified here: in StarSs, the meaning was that they were always used for dependencies, and whether they were also used for data transfers or not depended on the actual implementation. Here that is actually separated: the inputs/outputs are just used for dependencies, and if the user needs a copy in or copy out for a given device, the transfer part is specified separately, okay? So if I need to copy the arguments in or out, I have to specify that separately. So I may have combinations here where, for example, for a given argument I don't put anything about in/out, because I don't want to compute dependencies on that parameter -- I know it's not the one that dominates the dependencies --
and yet I can put the copy clauses, because I know that the device has explicit, separate local memory and I really have to transfer from the global space to this separate local memory. This is another example, the sparse LU. And there's one more thing: here we have put the pragmas inline, okay? Until now in StarSs the pragmas went just before a function declaration, okay? In OpenMP, pragmas have traditionally been put inline, so when comparing the two things we asked ourselves what the difference is, and what the difference in use is. It turns out that putting them inline saves you the outlining of the code, because the compiler does it; and that is nice, it saves you some work. But we have also found that if you keep the pragmas inline, the tasks have no name, so you cannot apply to them all these things that we have mentioned about heterogeneity and multiple implementations: you cannot provide multiple implementations and let the runtime decide which is best. And that may be an interesting feature. So essentially, in this proposed extension to OpenMP we support both things, which traditionally OpenMP does not: in OpenMP, traditionally, the pragma had to be inline. Our proposal is that you can also put the pragma before a function declaration, and that means that every single invocation of that function is going to constitute a task. More things about this -- well, there's Cholesky again, with the same issues: specifying the different devices -- I think this is not really much different from what I just mentioned. Here I've annotated it so that the pragmas go before the declaration of the function, which means every single invocation of this function becomes a task; this is the typical [inaudible] way of doing it, and this is the typical OpenMP way of doing it. We think supporting both is interesting -- it has the benefits of both worlds. Finally, there are the array sections. This is not yet implemented. They are thinking of a slightly different syntax from what we saw in StarSs and what people are proposing for OpenMP, so we still have to discuss this more, or unify it, or decide which is better; but the idea would be to support what I've been explaining for StarSs, okay? And they are also proposing ways of dealing with the cases where you don't know the size of the arguments -- you need to know the size of the arguments in order to be able to really move the data -- so there is a syntax for specifying the size of the arguments.
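A sketch of how these proposed OpenMP extensions might look together is below: a task pragma on a function declaration, a directionality clause only on the argument that matters for ordering, and separate copy clauses for a device with its own local memory. All clause spellings, and the [start;length] section notation, are illustrative rather than the final OpenMP syntax.

    #include <stddef.h>

    /* Every call to scale_block becomes a task.  Only "block" carries a
     * dependence clause; the copy clauses say what must be moved to and
     * from the device's local memory. */
    #pragma omp task inout(block) target device(cell) \
                     copy_in(block[0;bs*bs]) copy_out(block[0;bs*bs])
    void scale_block(float *block, int bs, float alpha);

    void scale_all(float *M, int nblocks, int bs, float alpha)
    {
        for (int b = 0; b < nblocks; b++)
            scale_block(M + (size_t)b * bs * bs, bs, alpha);
        /* The classic inline "#pragma omp task" form remains available for
         * code that does not need named, replaceable tasks. */
        #pragma omp taskwait
    }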
That was about the basic programming model. As I said, even if we develop this cluster Superscalar version, we should be able -- if the granularity of the tasks is large -- to work on clusters. I don't expect it to work on a thousand processors, certainly not on a thousand, maybe not on a hundred, but on a few nodes things should work. It is not something that we propose for the very large end. What we think is most appropriate for the very large end is this hierarchical hybrid: MPI plus SMP Superscalar, or MPI plus StarSs in general. How does it work? I'm going to show it on two examples. One of them is just a very simple matrix multiply code. In this situation, assume this is your MPI program: every process has a set of rows and a set of columns of the matrix -- so this is a very specific distribution, maybe not really realistic for general-purpose applications, but it's easy. You have a set of rows and a set of columns, and then you have to generate the blocks of the result. In order to do that, you do your multiplication and you generate a small block; you shift the matrix, do the multiplication again, generate another block; shift the matrix, okay? And in order to do that shift, essentially you have an intermediate buffer, and there are odd and even iterations, because it is a double-buffering scheme. In one iteration you multiply these two things and at the same time you are sending your data to the next processor and receiving the previous processor's data in the other buffer. Next iteration, you multiply this by that and receive here, and the next iteration you multiply this by that and receive here. This is the idea. This is the source code: you do the multiplication -- you call some [inaudible] here -- and after that you do a sendrecv to send to the next processor and receive from the previous one. If you execute it, you get a trace something like this. This is eight processes. White is computing the matrix multiply and blue is the communication. So in white you compute, and then you rotate, compute, rotate, compute and rotate. So the first consideration is: what is the typical way of doing hybrid programming? You would say, well, let me take the expensive part, the computation, and let me parallelize it. So essentially you take the expensive computation part and, in some way or another, with some type of mechanism, you parallelize it, okay? This is the trace for that case. We still have eight processes here, and each of these eight processes has been run with three threads, okay? So these are the three matrix multiply parts: what previously we had here is now split among three threads, and for the second process among three threads, and for the third process among three threads. So you parallelize that part, and then you have a barrier and you wait until you have finished this matrix multiply, and then you do your MPI point-to-point exchanges. So you are parallelizing the computation part but not the communication part; from the local point of view of the process, the communication is serialized. So this is what you would expect. There would be another way of trying to parallelize that loop: because we have to do the matrix multiply and the exchange, the question is, can we do both of them at the same time? You can write that with an OpenMP syntax something like this: you do the matrix multiplication and the send-receive at the same time. What is the situation here? You do the matrix multiply and the send-receive at the same time, but the send-receive is much shorter in this case than the matrix multiply, so you are not gaining much over the original program; you are just overlapping the blue and the white, okay?
So this is what you would get in MPI plus OpenMP if you tried this parallelization: you still have these phases that are dominated by the matrix multiply time. Nevertheless, what happens if you still think of the same approach, okay, but instead of doing it with OpenMP directives, with this fork-join type of semantics, you do it by saying: this is a task, the multiply, and this is a task, the send-receive? So the thing here is to encapsulate the actual MPI calls inside StarSs tasks. And for the MPI code, for the send-receive, you specify, from the local view of the process, what the inputs and the outputs are, okay? For a send-receive, what this internal send-receive will transmit over the wire is, from the point of view of this task, an input; what the send-receive will get from the wire is, from the point of view of this local address space, an output, okay? So you just specify: this is the input buffer, this is the output buffer. For the matrix multiply, these are the two input blocks and the input/output block, the result block. What happens when you execute it this way? You would expect to see the same thing as before, because all I have done is encapsulate this as a task and that as a task. But if you look at it and you unroll the whole graph, your graph is something like this: you have the matrix multiply and the exchange, the matrix multiply and the exchange, the matrix multiply and the exchange, but you have these dependencies -- one exchange is needed before the next one, and that one before the next. What you get out of this exchange you need for the next matrix multiply, and what you get out of that exchange you need for the following matrix multiply; but you have an anti-dependence here: while you are multiplying this thing, you cannot reuse the input buffer of this thing, so you have an anti-dependence to here -- you cannot receive what will go into that buffer. So essentially you would expect the same type of behavior: these two things go in parallel, then these two, then these two, then these two. But when you execute it, the behavior is not like that. The behavior is that, because we have implemented this renaming mechanism, this anti-dependency is broken, so you can actually proceed through this path at full speed because of the renaming, and then you can also proceed through this path as long as you have received the communications. The result, if you run it with three threads -- here we have three threads for one process, three for another, three for another -- is that you start the communication at the same time as you start the multiplication: communication and multiplication together. When you finish the communication, you can immediately start that multiplication, but you can also immediately start this next communication. Why? Because the system renames the memory, and this communication is going to receive into a newly allocated space of memory. So you can actually get this for a very coarse-grained type of program; you don't need to go down to the much finer granularity of parallel programming that the other approach required.
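The central trick just described -- wrapping the MPI transfer in a task whose clauses are written from the local point of view of the process -- could look roughly like this. The css pragma spelling and the function names are illustrative; only the MPI call itself is standard.

    #include <mpi.h>

    #define BS 256                       /* block size, illustrative */

    /* What goes onto the wire is an input of this task; what comes off the
     * wire is an output.  Because the receive buffer is an output, the
     * runtime can rename it and start the next exchange while the multiply
     * that still reads the previous buffer is running. */
    #pragma css task input(sendbuf[BS*BS]) output(recvbuf[BS*BS])
    void exchange(double *sendbuf, double *recvbuf, int next, int prev)
    {
        MPI_Sendrecv(sendbuf, BS * BS, MPI_DOUBLE, next, 0,
                     recvbuf, BS * BS, MPI_DOUBLE, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    #pragma css task input(a[BS*BS], b[BS*BS]) inout(c[BS*BS])
    void multiply(const double *a, const double *b, double *c);

    /* The main loop then simply alternates multiply(...) and exchange(...)
     * over the double buffer; the task graph provides the overlap. */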
So here you still have coarse-grain parallel programming, but the mechanisms of the node-level programming model -- the StarSs inside the node -- actually propagate to the whole MPI application, because you see that all the processes are doing this at the same time: these blue things communicate between them, these other blue things communicate between them, and the same for these others, okay? What I'm trying to show on this other example, which is Linpack, is the same idea. The algorithm is essentially this: if you are the owner of a given panel, you factor it and you send it to start the broadcast of that panel. If you are not the owner of that panel, you essentially receive it and retransmit it, if necessary; and the retransmission is typically on a ring -- there are different approaches, but say a one-dimensional ring or two rings. So the thing is essentially: you factor and send, and if you are not the owner, you just receive and retransmit. And then, with the panel that you have received, you have to update the trailing matrix. The essential idea in merging MPI plus StarSs is that you also taskify the MPI calls: you encapsulate the MPI calls in tasks. If the task, if the [inaudible], is going to send things, then from the point of view of StarSs this is an input argument; if you are receiving things, this is an output argument. So what happens when I execute this? Each of the processes locally unrolls the whole execution graph of its Linpack code. You start with the factorizations, and the dependencies are: on this factorization depend this send, which you have to send, and the updates. Once you have done this update, you can actually receive the next data and update the next part, and then receive again the next data and update the next part. On the other processor -- this one, for example -- you receive the data, and what has been received has to be sent on; with what you have received, you have to update your trailing matrix; and if you are the next owner, you have to factor, in this case, and the whole thing repeats, okay? So essentially you are unrolling a data-flow graph locally inside each process, but there are links between some tasks in one node and some tasks in another, whose relationship, whose order, is maintained by the MPI point-to-point semantics. What is the result? The result is that this guy will try to progress as fast as possible through his task graph, this guy the same, and this guy the same. And it would actually be nice to be able to progress as fast as possible through the critical path of the application, okay? That you can do: you can hint the scheduler, saying, well, tasks like the factorization and the sends and receives are high priority, and the updates are not so high priority; you can use the updates to fill in the holes. What is the result of this? We ran this, let's say, three times. Here we have 16 processes, this is time, and the yellow lines are communications, okay? By luck, one of the times the communications were spread over the whole execution of the Linpack run. Another time, also by luck, it randomly came out that there were a lot of communications in the very first phase, but there was a long phase here without any communication at all, and then again communications. The color represents the panel you are updating, okay?
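A minimal sketch of the scheduler hint just mentioned: panel factorization and the panel communication sit on the critical path, so they carry a high-priority hint, while trailing-matrix updates are left to fill the holes. The clause spelling and the function names are illustrative, not the exact StarSs or HPL code.

    #define BS 256                        /* panel width, illustrative */

    #pragma css task inout(panel[BS*BS]) highpriority
    void factorize_panel(double *panel);

    #pragma css task input(panel[BS*BS]) highpriority
    void send_panel(double *panel, int dest);

    #pragma css task input(panel[BS*BS]) inout(block[BS*BS])
    void update_trailing(const double *panel, double *block);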
What is the result? We ran this three times with 16 processes; the horizontal axis is time and the yellow lines are communications. By luck, in one of the runs the communications were spread over the whole execution of the Linpack run. In another run, again by chance, a lot of communications came out in the very first phase, then there was a long phase with no communication at all, and then communications again. The color represents the panel you are updating. What happened in that case is that the processes went through the critical path so fast that, thanks to renaming, everybody got a lot of panels in advance, enough to do all the updates they had pending; so for a long time there was no need to generate further communication.

This is another thing we have to handle: if you are too greedy with the renaming, you essentially exhaust the memory of your system. So you have to put in a limit: if I have allocated this much memory for renaming, I stall the task-generation engine until I free some of that space. Anyway, you see three different runs with very different time behaviors, all of them taking essentially the same total elapsed time, because the problem is not limited by communication. You can spread the communication uniformly, compact it at the beginning, compact it at the end; your limit is computation.

We ran this on 128 processors and compared it with MPI and with MPI plus OpenMP; this is MPI and this is MPI plus SMP Superscalar. We are looking at small problem sizes. This is not where you typically run Linpack to report to the Top500; you run it much, much larger, and there these lines converge again, because overlapping communication and computation is irrelevant when communication is irrelevant compared to computation. But we are looking at this situation because we think it is relevant: as I mentioned before, it is the situation of strong scaling, where you have fine granularities compared to your communication costs. And you see that at 128 processors MPI plus SMP Superscalar behaves significantly better than the other two approaches. MPI plus OpenMP is just the opposite of MPI plus StarSs: it is much too rigid. With the barriers in OpenMP, if any thread suffers any delay, everybody suffers it. Here, if one thread suffers some delay or some wait, then as long as it is not always the same one suffering, sometimes others take those penalizations and the whole thing more or less keeps progressing. We have the same with 512 processors and with 1,000 processors, and again at the small problem sizes there is this difference.

What I have in this slide is two things. On top is the sensitivity to bandwidth. We took this case with 512 processors, for a given problem size and a large problem size, and ran it with the nominal bandwidth, so the standard Linpack and the standard MPI plus SMP Superscalar. Then we modified the application so that for every message we send, we send another fake message of a given size, mimicking the effect that the available bandwidth is less than the machine's original bandwidth. That is what we report going towards the left: the further left, the less bandwidth we are simulating. What do we see? At the beginning neither of them is sensitive; the original Linpack at this problem size is also tolerant to bandwidth reductions. But there is a point where the original HPL version starts feeling the pain of less available bandwidth, while the MPI plus SMP Superscalar version feels it much later. This connects to what we saw before: if communication bandwidth is low, the bursty pattern simply never happens, the communication stays more or less spread across the execution; and if bandwidth is plentiful the bursts may happen, but the system tolerates that situation too.
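The bandwidth-reduction experiment can be mimicked with a thin wrapper around the send: for every real message, an extra dummy message of a configurable size goes through the same link, so the bandwidth left for the application shrinks as the dummy grows. This is a plausible sketch, not the actual instrumentation used; the wrapper name, the environment variable and the reserved tag are invented, and the peer side would post and discard a matching receive for that tag.

    #include <mpi.h>
    #include <stdlib.h>

    /* Extra bytes sent alongside every real message; raising this value
     * simulates a machine with less usable bandwidth. */
    static int   fake_bytes = 0;
    static char *fake_buf   = NULL;
    #define FAKE_TAG 9999          /* tag reserved for the dummy traffic */

    void fake_bw_init(void)
    {
        const char *s = getenv("FAKE_MSG_BYTES");   /* invented knob */
        fake_bytes = s ? atoi(s) : 0;
        fake_buf   = malloc(fake_bytes > 0 ? fake_bytes : 1);
    }

    /* Drop-in replacement for MPI_Send used by the benchmark. */
    int send_with_fake(void *buf, int count, MPI_Datatype type,
                       int dest, int tag, MPI_Comm comm)
    {
        int err = MPI_Send(buf, count, type, dest, tag, comm);
        if (fake_bytes > 0)        /* burn extra bandwidth on the same link */
            MPI_Send(fake_buf, fake_bytes, MPI_BYTE, dest, FAKE_TAG, comm);
        return err;
    }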
This other measurement studies the impact of operating system noise. We took two configurations where MPI plus SMP Superscalar was better than plain MPI, but I am not interested in the absolute performance, only in the effect of introducing more and more system noise. How did we introduce it? Simple: a process that sleeps for a while, wakes up and computes for a while, sleeps again, wakes up again. Going towards the left means the preemptions are more frequent, or the interloper computes for longer when it wakes up; the axis is the period of the preemption. What happens? The original HPL, as we preempt more and more frequently, starts feeling the pain in a linear way. With MPI plus SMP Superscalar, over a certain interval, as long as not everybody suffers all the penalization at once, if I go slow now you catch up, the penalization is distributed uniformly across the processes, and the whole system tolerates much more perturbation, much more system noise. Of course there is a point where the noise is so large that everybody feels it.
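The noise injection described here amounts to a periodic interloper per core: a process that wakes up every period, burns the CPU for a fixed busy time, and sleeps again; shortening the period corresponds to moving left on the plot. A minimal sketch with made-up parameter values, assuming the busy time is shorter than the period:

    #include <unistd.h>
    #include <time.h>

    static double now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
    }

    /* Steal busy_us microseconds of CPU every period_us microseconds. */
    void noise_loop(long period_us, long busy_us)
    {
        for (;;) {
            usleep(period_us - busy_us);     /* sleep the rest of the period */
            double t0 = now_us();
            while (now_us() - t0 < busy_us)  /* spin to preempt the application */
                ;
        }
    }

    int main(void)
    {
        noise_loop(10000, 500);   /* e.g. preempt for 0.5 ms every 10 ms */
        return 0;
    }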
But I have the strong feeling that this asynchronous data-flow type of execution is the real way of fighting operating system noise. Actually, it is not just operating system noise; it is variance in general, and fighting variance head-on I think is a lost battle.

>>: There are other sources besides the operating --

>> Jesus Labarta: Yes, there are many, many other sources, the hardware itself among them. You will think you are gaining something here, and then a new source will appear somewhere else. So I think the important thing is to learn how to live with it, how to survive the variance in a dynamic way, and I think asynchrony is a real alternative, a real way to do that.

So, just another topic. On this one I only have a few slides from Rosa; Rosa Badia will be coming here in a couple of months, I think, and she can explain it in much more detail. We have a version of this that works in Java. Essentially, using normal Java mechanisms, you specify for the different classes and methods the type and the directionality clauses of the arguments, so you get essentially the same model. There are also some things available there that we have not implemented in SMP Superscalar, such as resource constraints: you can say that a given method needs a given type of resource. This has been implemented in Java using ProActive, with Javassist for instrumentation through a custom class loader. Apart from the fact that they have already done some significant runs on Marenostrum with biological codes, I am really not very familiar with this work, and I would prefer that you attend Rosa's presentation in a couple of months from now.

>>: So [inaudible].

>> Jesus Labarta: There is another thing about load balance, just one very simple comment. As I mentioned, the way we see it for large systems is MPI plus StarSs. But in large systems you will have large shared-memory nodes, and the question is how many processes you put in every node. If you have eight-way nodes, do you put eight single-threaded processes? Four processes with two threads each? Two processes with four threads each? The way we see a node is as a domain within which it is easy to switch resources from one entity to another: you can easily switch cores from one process to another within a node. So we have implemented a runtime that actually does that. When one MPI process gets blocked, it lends its cores to another of the MPI processes inside the same node. The StarSs mechanism makes this natural: there are task queues, and an additional core is just an additional worker that goes and picks tasks from the pool.

We have actually implemented this on MPI plus OpenMP, which is more restrictive in that you cannot change the number of threads working in a process at an arbitrary point in time; it has to be done when you enter the parallel directive. But even then we have obtained benefits like this one: a 2.5x speedup on a real application, in real production mode, on 800 cores. What happened with that application? It was extremely imbalanced in some parts of the computation. So what did we do? We put OpenMP in, and we start the application with as many MPI processes as cores, essentially the original MPI application, with the number of threads set to one. If the original MPI application is well balanced, it just runs fine. If it is imbalanced and some MPI processes get blocked, then the runtime lends the blocked process's cores to another, and that other one can grow. That is what we see here. Without load balancing, one process takes very long, another a bit less, and the others almost nothing. With load balancing, one process, after a small initial period, immediately gets the four threads (this is Marenostrum, four cores per node) working for it. The whole system is still imbalanced, because there are processes in other nodes that are idling, but within the node the cores are used.
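A rough sketch of the lending idea on top of MPI plus OpenMP, under the restriction just mentioned that the thread count can only change when a parallel region is entered. The lend_my_cores / reclaim_my_cores / cores_available_to_me helpers are hypothetical stand-ins for the shared-memory bookkeeping the runtime does inside the node (for example, a per-node shared counter of lent cores); they are not an existing API.

    #include <mpi.h>
    #include <omp.h>

    /* Hypothetical node-local bookkeeping (not a real library):       */
    int lend_my_cores(void);         /* called before blocking in MPI  */
    int reclaim_my_cores(void);      /* called after the MPI call ends */
    int cores_available_to_me(void); /* my cores + whatever was lent   */

    void exchange_and_compute(double *sendbuf, double *recvbuf, int n, int peer)
    {
        /* While we are blocked in MPI our cores are useless to us:
         * advertise them to the other MPI processes in the node. */
        lend_my_cores();
        MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, peer, 0,
                     recvbuf, n, MPI_DOUBLE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        reclaim_my_cores();

        /* OpenMP only lets us pick the thread count when we open the
         * parallel region, so we read the current allowance here. */
        omp_set_num_threads(cores_available_to_me());
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            recvbuf[i] += sendbuf[i];   /* placeholder computation */
    }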
>>: What is the time scale here? How long does it take to give a core away?

>> Jesus Labarta: How long does it take? We are looking at tens, maybe hundreds of microseconds. It is not extremely fine grain, but that is the level of granularity we are looking at. And we still have to evolve this: the system may become unstable, because you give the cores to one process and then the message arrives for the one that lent them, and for a moment you have two processes that want to work at the same time. So there are issues there; we are still developing it, and I think we will have to put some [inaudible] mechanisms in.

>>: If you only had one process per node, as you could do with StarSs, then the load balancing inside the process is sort of automatic, and the only time you feel this few-microsecond granularity is when you go even further afield, when you have some other process running alongside your [inaudible] something. So it seems to me StarSs is finer than --

>> Jesus Labarta: Than this? Yes. Internally, StarSs is much finer grain: going from one task to another is one or two microseconds or less. But the problem is that if you start with only one StarSs process per node, the load balancing inside the node goes well, but you cannot balance between MPI processes.

>>: Right.

>> Jesus Labarta: And actually, the more I look at it, the more I see this as a way of fighting Amdahl's law. People have MPI applications, which may be relatively well balanced or may have some imbalance at some points in time. If they want to add OpenMP, what they do is take this part, which is long, and put OpenMP there; then another part, and another; essentially you end up having to put OpenMP everywhere, to parallelize the whole application again, just to be competitive with the original MPI code run with more MPI processes. With this way of running, even forgetting load balancing for a moment, you do not need to parallelize everything: typically you only need to parallelize the parts that are imbalanced, the big parts at the end. So with an incremental parallelization of only small parts, plus the load-balancing mechanism, you get the benefit of a really better usage of the whole system. And I think that is interesting. At the very end, for very large systems, I think the current situation with hundreds of thousands of MPI processes is too many processes from my point of view. I would prefer not so many, and each of them multi-threaded with many threads.

>>: [inaudible] load balancing due to blocking is one thing. But suppose the work grows to a large extent, so an MPI process gets oversubscribed with work. You would still like to borrow some workers from neighbors who don't have so much work to do, but you have to wait for them to run out of work before you get any workers.

>> Jesus Labarta: The problem, if you have different nodes, is that in order to borrow work from the others you need some mechanism, and the premise is that StarSs relies on shared memory.
We would have to use something like the Cluster Superscalar version, which, as I said, I don't [inaudible] for a thousand processors, but maybe for 32 it might work. One way might be to use Cluster Superscalar across a few nodes: spread every process across four nodes, for example, with four such processes all across the same four nodes, using Cluster Superscalar as the mechanism to move work across nodes. I don't know if I'm explaining myself, but I think there are interesting things there. For the moment we are doing the load balancing within the node, and with the architectures we are seeing, with a few tens of cores per node (this experiment was with four cores per node, which is very little), I think that will be a very reasonable, very interesting situation. But it is still true that you have the boundary of the node, and you might like processors in one node to help processors in another node. That is a bit more challenging, because the address spaces are different.

Okay, so that is what we do at the level of load balancing. About transactional memory, I have shown some examples, though I'm not sure I went through all the detail. In the example of the [inaudible] simulation code, one of the tasks does updates of three different positions, and you need to do that atomically. What we leverage there is that our compiler is source-to-source and we then use a back-end compiler: for this atomic update we simply put an OpenMP directive, an omp atomic, in front of each of those lines. Our compiler sees the OpenMP atomic and lets it pass through to the back-end OpenMP compiler, which is capable of generating the instructions for that with a typical load-linked/store-conditional type of mechanism. So we are leveraging the atomic support OpenMP has, which covers only one statement. For examples that are not based on a single update of a single variable, where you need atomicity across several updates, we would probably leverage transactional memory support if it existed. We are just starting here, and what we have is very, very rudimentary, but we think this is an area where transactional memory would help.
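The pattern that goes through the source-to-source compiler looks roughly like this (a sketch of the pattern, not the actual application code): an omp atomic in front of each single update, which the back-end OpenMP compiler turns into an atomic read-modify-write.

    #include <omp.h>

    /* Scatter a particle's contribution into three shared array positions.
     * Each single update is protected by an OpenMP atomic, which is the
     * one-statement atomicity OpenMP provides. */
    void accumulate_force(double *fx, double *fy, double *fz,
                          int i, double dfx, double dfy, double dfz)
    {
        #pragma omp atomic
        fx[i] += dfx;
        #pragma omp atomic
        fy[i] += dfy;
        #pragma omp atomic
        fz[i] += dfz;
    }

Note that the three updates are each atomic individually but not atomic as a group; that multi-update atomicity is exactly the case where transactional memory support would help, as described above.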
There is another area, which is speculative dependences. Consider a situation where we have one task and another, and we know there is, or most probably is, or can be, a dependence between them, but we also know that this dependence may not actually materialize: sometimes F1 touches one part of a data object and F2 a different part, and then there is no real dependence between them, while in other instances F1 and F2 happen to touch the same region and there is a conflict. Today we have to declare the dependence and serialize them. We could instead speculate: start both F1 and F2 at the same time and check whether there has been a conflict. Probably the only difference with standard transactional memory is that here we know that, out of the two, if one has to fail it is F2; F1 is the real one. So this is another area where we could use transactional support, for speculation. How much speculation is really needed in this approach, or in your programs? At least in the area we are looking at, which is mostly scientific computing, maybe not that much. But it is an area.

Some type of transactional memory support could also be used for control speculation, where you have a conditional, you execute both branches, keep the final state of one and drop the state of the other. This is the same kind of speculation that processors do with branch prediction: go down a certain direction and then graduate or squash computations depending on the later check of the condition. For fault tolerance, something similar again: tasks that fail and have not committed any [inaudible] can simply be reissued on a different resource.

About memory management, just to mention again what we did with the renaming: we have played with the typical question of whether you rename very early or very late in the whole process. Our initial implementation did the renaming very, very early; as you know, it is better to do it later, so we are now playing with lazy, or late, implementations of the renaming. On Cell Superscalar we use the hardware support available there for atomic updates of a cache line. On the SMPs we do another implementation of lazy renaming, which in this case essentially means that we don't allocate the instances of the objects until they are needed; and when we can do the update in place, we don't need to allocate a new instance at all, or we reuse a previously allocated one. The lazy renaming on the Cell also let us avoid many transfers: because the renaming is not done in the main global address space but directly in the address space of the SPE, you avoid bringing the data in and also avoid bringing it back out, which saved transfers in a way that was quite interesting.

Another thing we do locally is a software cache, which means you bring an object into your SPE and, if the next task uses the same object, you don't need to bring it again. Until recently this was purely local: each SPE managed its own cache. The potentially interesting question is whether you can implement a kind of shared cache, so that everybody can access not only his local cache but also the caches of the neighbors. We implemented that mechanism, which is also equivalent to bypassing.
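A minimal sketch of the kind of software cache described: a small directory in the accelerator's local store that maps global block addresses to local copies, so a task whose data a predecessor already brought in skips the transfer. The directory layout, the sizes and the dma_get stub are illustrative; the real CellSs code uses the Cell DMA primitives and its own replacement policy.

    #include <stdint.h>
    #include <stddef.h>

    #define SLOTS    8                       /* blocks kept in the local store */
    #define BLOCK_SZ (32*32*sizeof(double))  /* illustrative block size        */

    /* Stand-in for the real DMA transfer (e.g. mfc_get on the SPE). */
    void dma_get(void *local, uint64_t global_addr, size_t size);

    static struct {
        uint64_t tag;            /* global address of the cached block;
                                  * 0 doubles as "empty" (blocks never live at 0) */
        char     data[BLOCK_SZ];
    } cache[SLOTS];
    static int next_victim = 0;  /* trivial round-robin replacement */

    /* Return a local pointer to the block at global_addr, fetching it
     * only if no previous task already brought it in. */
    void *cache_lookup(uint64_t global_addr)
    {
        for (int i = 0; i < SLOTS; i++)
            if (cache[i].tag == global_addr)
                return cache[i].data;        /* hit: no transfer needed */

        int v = next_victim;
        next_victim = (next_victim + 1) % SLOTS;
        dma_get(cache[v].data, global_addr, BLOCK_SZ);
        cache[v].tag = global_addr;
        return cache[v].data;
    }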
Think of one functional unit producing a value that is fed directly into another functional unit without having to go to the registers or to main memory: that is what bypassing does, and it is the very same mechanism we tried here. Based on that atomic support, it was possible for every SPE to know what every other SPE had in its own cache.

The other thing we have here is the number of memory accesses required for a given cache size, playing with the schedule to maximize reuse. Essentially, if you unroll the whole task graph and follow the dependences depth-first, you produce something and consume it immediately, and consume it again immediately; you guarantee that you reuse the data, and you avoid sending it back to memory only to bring it back in later. So the idea is: looking at the whole task graph, how do we traverse it in ways that maximize locality, so that once we bring in a piece of data we reuse it as much as possible? We have tried to do this by putting additional information in the task graph: for example, if one child has two parents, when you schedule one parent you give a hint [inaudible] to the other parent and try to schedule both parents together, so that you can then schedule the child.

Ideally, and this is what we have here, one should be able to write the standard, trivial matrix multiply, whole row times whole column, and the system should automatically execute a blocked version of the algorithm, a small part by a small part, just by traversing the task graph of the matrix multiply. That graph is a large set of N-squared dependence chains, so you should be able to execute a little bit of one chain, a little bit of another, a little bit of another, and mimic what you do when you write a blocked algorithm by hand. That would be the ideal. We did some experiments with that; some of it is possible, but I think it is still far from realistic. In some cases you can reduce the amount of transfers a lot compared to the original schedule; in other cases not that much, but there is still something. And saving memory, saving bandwidth, is going to be good for accelerators, good for any type of device, so a scheduler that can save you bandwidth is important. The good thing is that StarSs gives you the information of what data each task needs, so you should be able to achieve that.
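One simple way to act on that information is at the point where a worker picks its next ready task: prefer a task whose input blocks are already resident over one that would force new transfers. This is only a sketch of that scheduling idea, with invented structures and an invented residency hook; it is not the scheduler shipped with StarSs.

    #include <stddef.h>
    #include <stdbool.h>

    typedef struct task {
        void (*run)(void **args);
        void **args;               /* pointers to the data blocks it reads */
        int  nargs;
        struct task *next;
    } task_t;

    /* Invented hook: does this worker already hold the block locally? */
    bool block_is_resident(void *block);

    /* Pick from the ready list the task with the most resident inputs,
     * falling back to plain FIFO order when nothing is cached. */
    task_t *pick_next(task_t **ready_list)
    {
        task_t *best = NULL, *best_prev = NULL, *prev = NULL;
        int best_score = -1;
        for (task_t *t = *ready_list; t; prev = t, t = t->next) {
            int score = 0;
            for (int i = 0; i < t->nargs; i++)
                if (block_is_resident(t->args[i]))
                    score++;
            if (score > best_score) {
                best_score = score;
                best = t;
                best_prev = prev;
            }
        }
        if (best) {                /* unlink the chosen task */
            if (best_prev) best_prev->next = best->next;
            else           *ready_list     = best->next;
        }
        return best;
    }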
So, just to conclude: we have an active research project with many ongoing branches and many ongoing directions. We nevertheless think it is stable enough to be used on relevant production applications, and that is what we are trying now, internally and in cooperation with other people, both in [inaudible] and in proposals to the Commission, to take relevant, significant codes and try to use it. What I think is the most relevant to start with is MPI plus SMP Superscalar: it is the most general, the most flexible. Then, with essentially the same source code, you can optimize some tasks for a GPU, for example, or use a version with SMP Superscalar plus Cell Superscalar. But we still have quite a lot of things to investigate. That is essentially the direction, the lines, the idea we have. I hope I have been able to explain a little bit the type of work we are doing.

>>: May I ask something?

>> Jesus Labarta: Sure.

>>: I can look at this [inaudible]; I suffered through four hours of the tutorial, more hours in fact, but some parts. And you know that I like very much the work you do here and [inaudible]. But when you look at the second line [inaudible], I feel that what you were explaining to us is more [inaudible], that you are solving many things that we cannot do at the computer architecture level, for example [inaudible] the last two slides. What you did with the prefetching [inaudible]: I mean, we have [inaudible] at the hardware level, we don't have enough bandwidth, the bandwidth [inaudible]. But all this kind of prefetching, load balancing, synchronization -- I think you should add this [inaudible] to the tutorial. When you say you minimize the number of buffers [inaudible], you know that there are many [inaudible] now that [inaudible]. Because I used to be a computer architect, and now I am doing [inaudible].

>> Jesus Labarta: No, it's --

>>: You should take [inaudible].

>> Jesus Labarta: Yes. We could do a lot of things in terms of hardware support, or we could at least discuss more things --

>>: [inaudible] the memories [inaudible] data processor.

>> Jesus Labarta: [inaudible].

>>: [inaudible].

>> Jesus Labarta: I'm not doing anything new. I'm just trying to -- and as [inaudible] said, there are many things that you have already done in computer architecture which we have not done because [inaudible].

>>: [inaudible]. When you were talking about the hardware [inaudible], you got a question this morning about the overhead of the runtime, and you said, okay, when I run this runtime on a current microprocessor I [inaudible] the scheduling; but you know that we are working on doing it in hardware.

>> Jesus Labarta: Yes, I --

>>: So I didn't --

>> Jesus Labarta: No, I know. You may do things in hardware to help. My real thought is that I just do not know how necessary they really are. I always have in mind the discussions with Peter Hofstee: his philosophy was, before putting something in hardware, think very hard about whether you really need it -- whether you cannot do it in software.

>>: But who says that, you or Peter?

>> Jesus Labarta: Peter. Because when you put these things in hardware, that is power, and he had a very minimalist view of the power --

>>: [inaudible] I mean the energy is powered by [inaudible].

>> Jesus Labarta: No, I mean energy: you are going to put in hardware things that you could put in software.
The other thing I would really like to know, and I am lacking the experience -- I know it is interesting to experiment in our field, to find ways of putting these things in hardware, and I agree it is important -- what I am still missing is a real perception of at what level it will be economically convenient or necessary.

>>: I have another thing that I think supports that side of the argument, which is that a lot of times we don't know what the policy is. And sometimes, when we do know what the policy is, it is hard to get the policy to respond as fast as the hardware mechanism needs it to respond. Think about scheduling threads in SMT versus scheduling tasks in StarSs. The tasks here are fine to schedule at that granularity, but if you get into a situation where you really don't know whether a task is ready to run or not when you're doing things in hardware, then you have a policy implementation that has to inform the hardware fast enough that it can keep going at hardware rates, and that is sometimes challenging. The same thing comes up in other kinds of resource-allocation strategies; prefetching is another example. What is the prefetch policy and how is it set? Some computer architects that I know, but not well, think prefetching is just wonderful and will prefetch everything -- and that wastes bandwidth sometimes. So where is the policy? How is it set, and by what mechanism? Is it set for the whole application at the beginning, or what?

>>: This is very well known. I mean, you don't do prefetching always.

>>: Well then, what is the policy?

>>: The question is that it is easier to [inaudible] without having any kind of hardware, because [inaudible] to be matching with [inaudible]. But you know very well that after some studies you decide that one functionality is very common, and normally [inaudible] you put that in hardware. So my opinion is that we have a huge opportunity, from the knowledge provided by [inaudible], to decide what kind of things should be put in hardware or not. But just one aside about prefetching: I would say that what Jesus proposes for these numeric algorithms is real prefetching, because he knows in advance what data is going to be needed. That is very good. But the thing I like more is the reuse he gets, because we know that in the current [inaudible] we have a [inaudible] with the bandwidth, and the graph he proposes allows that. And what [inaudible] allows you to do that, I don't know. But this is a very important thing [inaudible].

>>: You can do a lot just starting here and working below here. You can do quite a bit in terms of spatial and temporal hints, for example; you know a great deal at this level.

>>: Local memory versus cache.

>>: Yes, you know a lot of things here.

>>: Local memory is very important to --

>>: Yes. And you can also treat the critical path differently from the non-critical path. When something on the critical path is enabled -- when all its predecessors are done and you know it is runnable -- put it at the head of the run queue, not at the tail.
Don't put it at the bottom; get it done now, because it is ready. Things like that. All of this can be done at this level, and it is hard to do it lower than this. And you would have a very rich interface to hardware [inaudible] if you were to do the hardware. So I think this is a good place to leverage. I have always thought that if you can schedule small tasks efficiently and build a system like this based on that, you have tremendous leverage against a lot of important parallel computing problems. And I bet you agree with that too. I mean, SOAP works that way too, right? SOAP has very much the same kind of philosophy.

>>: Exactly.

>>: And the interesting thing is how eagerly you eliminate anti-dependences. The usual thing is that you get all of these blocked continuations eating up space, but you have another source of space consumption, which is all of this buffering.

>> Jesus Labarta: Yes. In the current situation, in the different versions we have, some of them don't do any of that renaming; it depends on the application. And we have to be very careful to limit the space we devote to it, because yes, it can be very, very eager. So this is an area where we have done versions with somewhat lazy renaming, which I think are important, because the original one we did was very, very eager and we had to control the space.

>>: [inaudible] will generate a lot of renaming if you're not careful.

>> Jesus Labarta: Yes. But it was curious: in some cases, like the matrix multiply example I showed, renaming gives you a lot with simple algorithms. Because the other thing people do is do this by hand; they do expansions --

>>: But [inaudible].

>> Jesus Labarta: No, no, this is -- at the moment.

>>: [inaudible].

>> Jesus Labarta: At the moment we are trying GROMACS, we are going to do GADGET, and we will have to do WRF.

>>: [inaudible].

>>: So if you think about this evolving DAG, the way you are generating these tasks, you can see how wide things are, sort of just in time, to decide whether to update in place or not for the predecessors of a task. If a task could run in parallel when renamed, but has to wait if it is not, you could figure that out on the fly --

>> Jesus Labarta: Yes, that is what is done here. The version we call lazy renaming is a bit like that: if I have sufficient parallelism, we don't do the renaming; if we don't have a lot of parallelism, then we do it. You try to avoid it as long as you have parallelism. So yes, this is what you said; we have to.

>>: [inaudible] memory you do in such a way that you [inaudible] memory. You mentioned [inaudible] the allocation of memory, the model -- did you mention --

>> Jesus Labarta: Not yet; [inaudible] mention renaming yet. There is another thing I didn't mention because I didn't go into the example: something that probably has some relation to Cilk and recursion, these kinds of things, as in the sort and the [inaudible] examples. We have not tried it.
The decision would be when to cut the recursion, when to start generating tasks and when not. That should be something adapted based on the actual width of the already unrolled graph --

>>: Well, or done the way Bert Halstead did it, or the way Haskell does it, actually: you materialize little things that Haskell calls sparks and put them on the stack. That is what Cilk does too, I think.

>> Jesus Labarta: Yes, Cilk does this kind of --

>>: Yes. And you put these things on the stacks so that you do them yourself, but they are still runnable by some other worker that needs work to do.

>> Jesus Labarta: But still, if you generate them as tasks, you have some overhead.

>>: You don't generate them as tasks. They are sort of pre-tasks. In Haskell they are called sparks. I don't know what we called them on the MTA, and I don't know what Cilk calls them either.

>>: We call it something like a deque. It's [inaudible].

>>: Yes. But it is just a description of the work. In fact, it looks like, you know, task blah, blah, blah.

>> Jesus Labarta: Yes, a descriptor, a descriptor which is called locally --

>>: Yes, it is just a descriptor and you --

>> Jesus Labarta: But if you were not even creating the descriptor, you would also avoid some overheads.

>>: Yes. So the cheapest thing to do is just leave it on the stack and let it get encountered by whatever core owns that stack at the time, and then you have to make that work visible to other cores that need work to do. What we did on the MTA -- David Callahan and I suggested actually putting it on the stack rather late in the game, before we left [inaudible] to come up here; we were talking about leaving it on the stack, but the way it actually worked is that we just had a FIFO of these descriptors, and we would run unblocked continuations as long as they existed, and if there were no more unblocked continuations, no more things with stacks that needed to run, we would go get work out of this FIFO. That is sort of the way it worked. There have been variations on this theme, but it is the only way to really make recursive divide and conquer work well, and it is really pretty simple when you do it this way. So it is just more software for you, you know, just [inaudible].
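A sketch of the descriptor idea being discussed: the parent pushes a small work descriptor onto a per-worker deque instead of creating a full task; the owner pops from its own end, so an uncut recursion just runs inline in LIFO order, while an idle worker can steal from the other end. The structures and the locking are simplified (a mutex instead of the lock-free protocols real runtimes use), so treat it as an illustration of the mechanism rather than of any particular runtime.

    #include <pthread.h>

    /* A "spark": just a description of work, not yet a task. */
    typedef struct {
        void (*fn)(void *);
        void *arg;
    } spark_t;

    #define DEQ_CAP 1024

    typedef struct {
        spark_t buf[DEQ_CAP];
        int top, bottom;         /* owner works at bottom, thieves at top */
        pthread_mutex_t lock;    /* simplification: real deques are lock-free */
    } deque_t;

    void deque_push(deque_t *d, spark_t s)     /* owner only */
    {
        pthread_mutex_lock(&d->lock);
        d->buf[d->bottom++ % DEQ_CAP] = s;
        pthread_mutex_unlock(&d->lock);
    }

    int deque_pop(deque_t *d, spark_t *out)    /* owner: LIFO end */
    {
        pthread_mutex_lock(&d->lock);
        int ok = d->bottom > d->top;
        if (ok) *out = d->buf[--d->bottom % DEQ_CAP];
        pthread_mutex_unlock(&d->lock);
        return ok;
    }

    int deque_steal(deque_t *d, spark_t *out)  /* idle worker: FIFO end */
    {
        pthread_mutex_lock(&d->lock);
        int ok = d->bottom > d->top;
        if (ok) *out = d->buf[d->top++ % DEQ_CAP];
        pthread_mutex_unlock(&d->lock);
        return ok;
    }

The owner keeps popping from the bottom, which is what keeps recursive divide and conquer cheap; a worker with nothing runnable calls deque_steal on a victim's deque, which is the FIFO-of-descriptors behavior described above.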
>> Jesus Labarta: It would be nice to find the minimal functionality that, implemented in hardware, lets you apply policies, change the policies or adapt the policies and so forth.

>>: Always. That has been our problem.

>> Jesus Labarta: Change the problem you have to address.

>>: People introduce the prospect and the process [inaudible] hardware this could [inaudible] okay, what part of the runtime you should do in hardware [inaudible] very carefully [inaudible] I don't know, the [inaudible] we were discussing yesterday with [inaudible].

>>: That kind of synchronization.

>>: What kind of project needs some hardware to do that [inaudible].

>>: Yeah. I don't know if I talked to you. I talked to --

>>: Osman.

>>: Osman and you, especially Osman and the students, about this mwait thing. Some of that we did in conjunction with Tim Harris, talking about work stealing of this kind in Haskell and how to implement it. There are two things you have to do. One is to resolve the race between the core that owns the -- let me call it a spark; that is what Haskell calls it. A Haskell spark is one of these little descriptors of work: if you encounter it inline in your own execution, on your own stack, you go ahead and do it, but somebody else could steal it. If it is stolen, you have to make the local core essentially wait for that thing, which may mean it actually yields the core to some other computation; the whole thing turns into a continuation because it has been stolen. Suppose these two results have to combine on the way up. One core goes down this side of the tree, and there is a subcomputation rooted right here, so it is working down here. Some other core comes along, says "I want that", takes it, and starts doing this part. The first one comes back up; now it has to block, waiting for the value coming from below, or waiting for the side effects, whatever it is; and when that arrives, it has to unblock the whole computation. So it is dealing with the race between these two things and also arranging for the right completion semantics. It is not hard to do. We did it with full/empty bits on the MTA, and Haskell does it with regular stock x86 hardware in some miraculous way -- well, you know, it is atomic memory operations and compare-and-swap and things like that. But we made sure we could do that too; that was one of the important test cases, being able to do all of that.

>>: Okay. I think that's the end of it. Thank you very much.