>> Paul Larson: So we are pleased to welcome Pinar Tozun here today. Pinar is going to talk about how to speed up transaction processing and improve scalability. Many of you already know Pinar from various conferences and various discussions, but just for the record, she works with Anastasia Ailamaki's group at EPFL, where they have been pounding away at how to speed up transactions for quite some time. So welcome.

>> Pinar Tozun: Thanks Paul, and thanks a lot for joining my talk. I am really happy to be here today. So when we run one of the most fundamental data management applications on today's commonly used server hardware, most of the underlying infrastructure is actually wasted. We can hardly utilize all of the processing units, and more than half of the execution time goes to waiting for some item to come from memory when we execute transactions. So today I will initially talk about why this is the case, and then we will see some techniques and insights to overcome these problems.

I wasn't sure about the scope of people who might be watching, so: what is transaction processing? Well, it was the primary reason why the relational database model was invented by Codd, and today it is still one of the most fundamental applications in data management, mainly because it is crucial while handling money in banks, the finance sector, and so on. Some of the characteristics of these applications are that you see many concurrent requests to the data, and these requests usually touch a small portion of the whole data. For example, in banks you usually deal with a few accounts in the scope of a transaction. They also require high and predictable performance. Now, today we also have many online applications that deal with transactions, and even though these have different higher-level functionalities and goals, the core characteristics don't really change. In fact, these online applications amplify these characteristics, meaning that you see a lot more concurrent requests to the data. And you have much bigger data, meaning that touching a small portion of it requires really smart indexing and caching mechanisms to be able to maintain high and predictable performance.

Now, moving from the application space to the hardware we run these applications on. Before the last decade, hardware vendors initially innovated on implicit parallelism through techniques like pipelining, instruction-level parallelism, and simultaneous multi-threading. But around 2005, due to power concerns, vendors stopped putting more complexity into a single core and started to add more cores within a processor, so that we would still be able to turn Moore's Law into more performance. And today we increasingly have multiple processors in one machine, so the core-to-core communication latencies are no longer uniform. Finally, in the future, and we already start seeing some examples of this today, there will be even more cores in one processor and more processors in one machine, and we also expect them to be heterogeneous in terms of their functionality. So looking at this picture, basically what hardware has been giving us over the years was more and more opportunities for parallelism, even though its level and type changed over time. And it is also becoming more and more non-uniform and heterogeneous in terms of memory management and functionality.
So what is crucial for us, for software applications, is that we need to be exploiting this parallelism, because it is highly unlikely to disappear with the emerging hardware. And we also need to become aware of the non-uniformity and the heterogeneity, to know how to optimally do our memory management and task scheduling over these heterogeneous cores.

So let's now see how well traditional transaction processing systems can exploit what's given by the modern hardware. Here we consider a very simple transaction that just looks for one customer in the database and reads the customer's balance. We are running this transaction on the Shore-MT storage manager, which is an open-source storage manager that has traditional data management system components and is also specifically designed for the multi-core era, and we are using an Intel Sandy Bridge server here. On the x-axis we are increasing the number of worker threads executing this transaction, and the y-axis plots the speedup over the throughput of a single worker thread. Ideally, what we would like to see is that as we increase the number of worker threads they would all be able to utilize a corresponding number of cores in the same way, so we can achieve linear speedup. And that's what the red line here shows. But what happens with the traditional system is that, it's not bad, you still see a straight line, but the problem is that the gap between the ideal line and the conventional line increases as you increase the parallelism in the system. So this is not a problem just for the current hardware we have; this will become even worse with the next generations of hardware. Yeah?

>>: [inaudible]?

>> Pinar Tozun: It's Shore-MT that we are using. So it's similar to really traditional commercial systems in terms of the components and how it is designed. On the other hand, let's also look at whether a single thread can actually get the highest throughput it could get on this hardware, that is, how well we can utilize a single core regardless of the parallelism that we have. For that we are looking at the instructions retired in a cycle when we execute a single thread. The hardware we use has 4-way issue processors, meaning that in each cycle each core can actually execute up to 4 instructions. But for transaction processing you can barely retire one. So there is quite a bit of underutilization at this level as well. To sum up the problems of traditional transaction processing: one is that we cannot really take advantage of the available multi-core parallelism, and there is quite a significant underutilization of the micro-architectural resources in a core. Yeah?

>>: Do you know current Intel CPUs? Like how, like how free of structural conflicts is the 4-way issue? Can you really issue 4 instructions and have them execute like top to bottom through the entire pipeline, or will you end up running into a bunch of structural conflicts if you issue 4 instructions at the same time?

>> Pinar Tozun: I mean it's hard to get to 4, that is true, but with applications that have smaller instruction footprints, let's say, you can usually go for more than 2 at least. Like with the SPEC benchmarks they can go way higher.

>>: Ok.
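As a rough, self-contained illustration of the kind of per-core utilization measurement discussed here, the following sketch counts instructions retired and cycles around a workload using Linux perf_event counters and reports the resulting IPC. It is only a sketch under stated assumptions: run_transactions() is a hypothetical stand-in for the actual benchmark, and this is not necessarily the tooling used for the numbers in the talk.

```cpp
// Minimal sketch: measuring instructions retired per cycle (IPC) for a code
// region on Linux via the perf_event_open syscall.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

static int open_counter(uint64_t config) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;        // e.g. PERF_COUNT_HW_INSTRUCTIONS
    attr.disabled = 1;           // start disabled, enable explicitly
    attr.exclude_kernel = 1;     // count user-level work only
    attr.exclude_hv = 1;
    // pid = 0 (this thread), cpu = -1 (any CPU), no group, no flags.
    return static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
}

static void run_transactions() {
    // Hypothetical workload: stand-in for executing a batch of transactions.
    volatile uint64_t sink = 0;
    for (uint64_t i = 0; i < 100000000ULL; ++i) sink += i;
}

int main() {
    int instr_fd  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int cycles_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    if (instr_fd < 0 || cycles_fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(instr_fd,  PERF_EVENT_IOC_RESET, 0);
    ioctl(cycles_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(instr_fd,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(cycles_fd, PERF_EVENT_IOC_ENABLE, 0);

    run_transactions();

    ioctl(instr_fd,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(cycles_fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t instructions = 0, cycles = 0;
    read(instr_fd,  &instructions, sizeof(instructions));
    read(cycles_fd, &cycles, sizeof(cycles));

    // On a 4-way issue core the theoretical maximum is 4.0; the talk reports
    // that OLTP workloads barely reach 1.0.
    std::printf("IPC = %.2f (%llu instructions / %llu cycles)\n",
                cycles ? (double)instructions / (double)cycles : 0.0,
                (unsigned long long)instructions, (unsigned long long)cycles);
    return 0;
}
```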
>> Pinar Tozun: So during the next half an hour I will present several techniques that aim to overcome this problem of underutilization, but I want to initially give the higher-level insight behind all of them. When we traditionally think of transactions, we tend to think of them as one big, single task that we need to assign somewhere and execute. So they are a black box to us, and this leads to suboptimal memory management and scheduling decisions for the tasks at hand. Now, instead of doing this, we will think of transactions as being formed of finer-grained tasks or parts. We will determine each of these parts based on the data and instructions accessed in each part, and based on the locality you want to improve you can give more importance to data or to instructions. Then we will assign each of these parts to one of the cores and basically schedule transactions dynamically over these cores. And by doing that, if we go back to the graphs we have seen before, now with the green line and bar added, we can get much closer to the ideal lines that we would like to achieve, both in terms of scalability at the whole-machine level and in terms of utilization in a single core.

So for the rest of the talk we will first focus on scaling up transaction processing on multi-core processors, and then we will see how we can improve utilization in a single core. In the end I will conclude with some higher-level take-away messages that I think are applicable to software systems in general, and show what there is to focus on next.

So let's start with scalability. On this picture we see a sketch of a shared-everything architecture, the traditional shared-everything architecture for transaction processing. At the higher level we have the management of the worker threads in the system, and at the lower levels we have the management of database pages. These pages can be either the data pages that keep the actual database records or the index pages that allow fast accesses to the data pages. Now, when requests come from different clients, we usually assign them in a random fashion to the worker threads of the system, and here we have the blue and green workers. Since the data is shared, we need to ensure safe accesses to this shared data, because threads might have conflicts over the data. So we have a lock manager structure that basically ensures isolation among different transactions and protects database records. And at the lower levels of the system we have short-term page latches that ensure consistent read/write operations on this data. So basically, whether through locks or latches, and you can have different methods, in the end you clutter the execution of a transaction with many critical sections.

If we look at the critical section breakdown in this architecture, using again a very simple transaction that just reads one customer and then updates the balance for this customer, we see that even for this simple transaction you need to go through around 70 critical sections. And what's more problematic is that most of these critical sections belong to a type that we call un-scalable critical sections. These are the ones where, as you increase the parallelism in your system, the contention on these critical sections has a high chance of increasing as well. And database locking/latching belongs to this type. Yeah?

>>: Do you know off hand how many critical sections it would take to transfer money from one account to another [inaudible]? Would that be roughly double, like 150?

>> Pinar Tozun: Yeah, it will roughly double this, because both are like reading one record through the index structure and so on.

>>: Excellent.

>> Pinar Tozun: The other two types of critical sections I show in lighter colors because they create less contention.
So the cooperative critical sections are the ones where threads contending for a resource can combine their requests and execute them all at once later. In databases, group commit for logging can be an example of this. And we also have critical sections that create fixed contention. They usually satisfy some producer-consumer-like relationship, and they happen only among a fixed number of threads regardless of the parallelism around. Some critical sections within the transaction manager are a good example of this. Yes?

>>: Is this data-size independent? Is it for a particular [inaudible]?

>> Pinar Tozun: So this is, I'd say, not exactly independent of the data size, because if the index size increases, for example, you will probably need to latch more pages, so that might increase, but not exactly proportionally, like linearly.

>>: So what is an un-scalable critical section again? Sorry.

>> Pinar Tozun: Uh, so those are the ones where, let's say, you go from 4 cores to 8 cores; the contention on these, the threads contending for these critical sections, might have a similar increase. For example, previously on this page you only had 4 threads that might try to get this latch, whereas if you move to 8 or more cores you have a higher number of threads that might want to get this latch, so that kind of increases. So these are the ones that might become even worse when you have more cores in your system. That's why we call them un-scalable.

>>: Hm.

>> Pinar Tozun: So to sum up, the unpredictability that we have while accessing the data leads us to be pessimistic, and therefore we execute many un-scalable critical sections during a transaction, and this eventually hinders scalability in our systems. Now, to overcome this we will redesign both layers of this architecture. At the higher level we will assign specific parts of the data to specific worker threads. For example, here, customers whose names start with A to M will be given to the blue worker and the rest of the customers will be given to the green one. This way we can basically decentralize the lock manager and get rid of the un-scalable critical sections on the lock manager. And at this level this is just a logical partitioning of the data. Now, going further down, we will replace the single-rooted index structure with a multi-rooted one, and we will ensure that each sub-tree in this structure maps to one of the partition ranges at the higher level. And further down, by slightly changing the database insert operation, we will ensure that each data page is pointed to by a single index leaf page. By doing that we can ensure single-threaded accesses to both database records and database pages. This architecture we call physiological partitioning. So even though you partition the data structures, you still have a single database instance, so you are still in a shared-everything system. Due to partitioning you might have some overheads with multi-partition transactions and so on, but it's less compared to a pure shared-nothing architecture.
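To make the logical-partitioning idea concrete, here is a minimal sketch of routing transaction work to range-owning worker threads, so that all accesses to a given key range stay on one thread and need no latching among workers. It is an illustration under simplifying assumptions, not Shore-MT's actual physiological partitioning code; all class and function names are hypothetical.

```cpp
// Sketch: each worker thread owns a key range (and, conceptually, the index
// sub-tree and lock state for that range); requests are routed by key.
#include <cctype>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct RangeWorker {
    char lo, hi;                                   // e.g. 'A'..'M'
    std::queue<std::function<void()>> work;        // per-partition work queue
    std::mutex m;                                  // protects only the queue
    std::condition_variable cv;
    std::thread th;
    bool stop = false;

    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&]{ return stop || !work.empty(); });
                if (stop && work.empty()) return;
                task = std::move(work.front());
                work.pop();
            }
            task();   // executed by exactly one thread per key range
        }
    }
};

int main() {
    // Two partitions, as in the talk's example: customers A-M and N-Z.
    std::vector<RangeWorker> workers(2);
    workers[0].lo = 'A'; workers[0].hi = 'M';
    workers[1].lo = 'N'; workers[1].hi = 'Z';
    for (auto& w : workers) w.th = std::thread(&RangeWorker::run, &w);

    auto route = [&](const std::string& customer, std::function<void()> txn_part) {
        char k = customer.empty()
                     ? 'A'
                     : static_cast<char>(std::toupper(static_cast<unsigned char>(customer[0])));
        RangeWorker& w = (k <= 'M') ? workers[0] : workers[1];
        { std::lock_guard<std::mutex> lk(w.m); w.work.push(std::move(txn_part)); }
        w.cv.notify_one();
    };

    // A "read customer balance" transaction part, routed by customer name.
    route("Alice", []{ std::puts("blue worker reads Alice's balance"); });
    route("Zoe",   []{ std::puts("green worker reads Zoe's balance"); });

    for (auto& w : workers) {
        { std::lock_guard<std::mutex> lk(w.m); w.stop = true; }
        w.cv.notify_one();
        w.th.join();
    }
    return 0;
}
```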
Yeah?

>>: I am a little unclear on your goal here. So when you showed the ideal chart, [inaudible] is that correct? Because everybody is [inaudible] there is nothing much you can do, right? The ideal [inaudible] does not look like that.

>> Pinar Tozun: Um, yes, but still that's the ideal for you. That's what you would like to achieve.

>>: So you won't achieve that [inaudible]?

>> Pinar Tozun: Yeah, or at least get as close to that as possible, because for that chart what I was executing were contention-free transactions.

>>: [inaudible].

>> Pinar Tozun: Yes. So for that one at least you should be able to achieve this.

>>: And [inaudible] there is no such thing as a non-scalable, un-scalable part?

>> Pinar Tozun: Um, there actually still is, especially because of the lock manager, because in order to get a lock there you first need to latch the lock manager and so on. So you still have, unfortunately, those critical sections. So with this design, if we look back again at the critical section breakdown, we have now eliminated the majority of the un-scalable critical sections and turned some of them into the fixed type of critical sections. And this way what remains now are mostly lighter critical sections in the execution.

>>: So if you are looking up a code name or customer name [inaudible]? The top level of the B-tree will already have it. So what [inaudible]?

>>: [inaudible].

>>: I see.

>>: So what happens when you want to operate across accounts here though?

>> Pinar Tozun: Yeah, so for those types of things, if the requests are independent of each other they can happen in parallel, or if they are dependent then you will need to pass control from one thread to the other. So of course there is some cost associated with that, but as long as you can keep such communications within one processor, like if you are not doing [indiscernible] communication, you still see the benefits with this design. That is our experience at least.

To also show some performance numbers, here again we are looking at the transaction from before and increasing the number of worker threads executing those transactions on the x-axis, and we plot the throughput on the y-axis, which is transactions executed in a second, in thousands. The physiologically partitioned design performs better and scales better compared to the conventional design, both on the Sun Niagara architecture, which has support for more hardware contexts, and on the AMD architecture, which has faster CPUs. And what's more important for us is that you see more benefits as you increase the parallelism in the system. And that's what we want to see, especially for scalability.

>>: And can you just explain what the critical sections are [indiscernible] because of partitioning?

>> Pinar Tozun: So we eliminate, basically, latching on the lock manager. You still need to lock the database records, but you have no contention on the records, and you have no latch for the lock manager itself. So that's what you eliminate, and you also eliminate latching on the index pages. Even if you are read-only, sometimes if you see many requests rushing to the index root it also creates a huge contention, even for some read-only transactions. Do you have any more questions related to this part?

>>: What if the workload becomes skewed over time and you are essentially running single threaded through one of the partitions at that point, do you re-partition?

>> Pinar Tozun: Yeah, yeah, we have a mechanism, like we have a monitoring thread that basically monitors the sub-ranges. A user can determine the granularity there, but we monitor the sub-ranges and also thread accesses to see how contended partitions are. And we can repartition online based on that.
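A rough sketch of the kind of monitoring loop just described, under simplifying assumptions: a background thread samples per-sub-range access counters and flags a sub-range for online repartitioning when it absorbs too large a share of the load. The window, threshold, and the placeholder rebalance() call are hypothetical, not Shore-MT's actual mechanism.

```cpp
// Sketch: monitor per-sub-range access counts and trigger repartitioning.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct SubRange {
    char lo, hi;
    std::atomic<unsigned long long> accesses{0};   // bumped by worker threads
};

static void rebalance(std::vector<SubRange>& ranges, size_t hot) {
    // Placeholder: in the real system this would hand the hot sub-range to a
    // less loaded worker; only the sub-tree root pointer moves, not the data.
    std::printf("repartition: moving hot range %c-%c\n",
                ranges[hot].lo, ranges[hot].hi);
}

static void monitor(std::vector<SubRange>& ranges, std::atomic<bool>& stop) {
    while (!stop.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        unsigned long long total = 0;
        for (auto& r : ranges) total += r.accesses.load();
        if (total == 0) continue;
        for (size_t i = 0; i < ranges.size(); ++i) {
            // Flag any sub-range that absorbs more than half of the accesses.
            if (ranges[i].accesses.load() * 2 > total) rebalance(ranges, i);
            ranges[i].accesses.store(0);            // start a new window
        }
    }
}

int main() {
    std::vector<SubRange> ranges(4);
    const char bounds[5] = {'A', 'G', 'M', 'S', 'Z'};
    for (int i = 0; i < 4; ++i) { ranges[i].lo = bounds[i]; ranges[i].hi = bounds[i + 1]; }

    std::atomic<bool> stop{false};
    std::thread mon(monitor, std::ref(ranges), std::ref(stop));

    // Simulated skewed workload: most accesses hit the first sub-range.
    for (int i = 0; i < 100000; ++i) ranges[i % 10 == 0 ? 1 : 0].accesses++;
    std::this_thread::sleep_for(std::chrono::milliseconds(250));

    stop = true;
    mon.join();
    return 0;
}
```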
>> Pinar Tozun: And also, in this system for example, if you want to repartition, let's say you want to move this part of the range over here, you just copy the content from here to here and you don't have to worry about what's underneath. So you don't have to move so much data at once. Yeah?

>>: [inaudible].

>> Pinar Tozun: So on this hardware, in the end, at the highest worker thread counts the OS is actually the problem, because you start to hit some [indiscernible] in terms of scheduling. It's the boundary for the machine, so that's what creates the problem for this experiment. And you also still have some un-scalable critical sections here which come either from the buffer pool or some --. Again, you still have some centralized internal structures. And we have recent work that also partitions them at socket boundaries, so you can eliminate these as well.

>>: So is it a CPU or an IO [indiscernible]?

>> Pinar Tozun: We don't have IO in those experiments, so it's mostly --.

>>: [inaudible].

>> Pinar Tozun: I would say for these machines it's mostly the operating system, because we don't enforce any core affinities in these experiments. So it's the way the scheduling happens that becomes the problem towards the end. Yeah?

>>: You get much more improvement on the AMD than the Niagara. What is the reason, the simple reason?

>> Pinar Tozun: So that's actually, I think, just related to a core being faster there. It's just the processors being faster; it gives more benefits.

>>: So I am trying to understand, so between 16 and 32, let's say, in one of your graphs. So the 16 one has 16 partitions in the B-tree and the 32 one has 32 partitions in the B-tree. Are these two different B-trees or is [inaudible]?

>> Pinar Tozun: Okay, for this experiment you actually have the whole data in place, but for 32 worker threads, yeah, you deal with 32 partitions.

>>: So what happens if you have a transaction that's complex, with multiple B-trees? Is it partitioned across multiple workers or is a single worker responsible for [indiscernible] transfers?

>> Pinar Tozun: No, it's partitioned across different workers. So for each table you have a different set of workers.

>>: Is the Shore architecture able to support that?

>> Pinar Tozun: Uh, yeah, so it's just for this design that we have that. In default Shore you would have a single thread executing a single transaction. So, for this design we --.

>>: It looks like a big change; I mean how transactions are executed.

>> Pinar Tozun: Um, yeah, we had to change some things at the thread management layer. Quite a bit of things at the thread management layer as well, that's true. But you can automate it to some extent using the SQL front end, because we were using the SQL front end of [indiscernible] to determine basically the different independent parts of a transaction, so you can automate it to some extent.

Um, so we have seen how we can improve utilization at the whole-processor level. Now, let's see what the problems are within a core for transaction processing. Okay, so initially let's remind ourselves how the memory hierarchy of a typical Intel processor looks today. Closest to the core we have the L1 caches, which are split between instructions and data. Then there are the L2 caches, which are also private per core. Then there is a shared L3 cache, or last-level cache. And in the end we have the main memory, which today is large enough to fit the working set sizes of most transaction processing applications.
So as we go down in this hierarchy, the access latencies drastically increase, as expected. But in practice, if you find an item in L1, because of out-of-order execution you usually pay no cost for it. If you cannot find the instructions or data you need in the L1 caches, then you need to go further down, and the core might stall because of this. Now, let's see how much of these stalls we actually have for transaction processing. On this graph we are breaking the execution cycles into busy cycles and stall cycles. We are looking at the standardized TPC-C and TPC-E benchmarks here, we are running them on Shore-MT, and we are again using an Intel Sandy Bridge server. Busy cycles for us are the cycles in which you can retire at least one instruction; stall cycles are the ones where you cannot retire any instructions. From these two bars we see that more than half of the execution time goes to various stalls.

Now, if we want to investigate why these stalls happen, the graph on the right-hand side shows the breakdown of the stall cycles in a period of a thousand instructions, into where they come from in the hierarchy and also whether they happen due to data or instructions. For example, L3-D means the stall time due to data misses from the L3 cache. As a side note, drawing these components on top of each other is a bit misleading, because in practice some of these stall times will overlap with others, again due to out-of-order execution, but it doesn't change our higher-level conclusions here. So looking at both of the bars, we see basically the L1 instruction misses being a significant contributor to the stall time for these workloads. You might be able to overlap the --. Huh?

>>: Does someone have a question?

>>: Yeah, no, I just want to understand what the complete software stack here was. You were running Shore on the bottom, or did you have [indiscernible]?

>> Pinar Tozun: Uh, we have the application layer for Shore, which has the benchmarks implemented in C++.

>>: Okay, so it's just a test driver?

>> Pinar Tozun: Yes, yes, yes, so there is no communication.

>>: And does this include the test driver part of it as well, or just Shore?

>> Pinar Tozun: It includes the test driver part as well, but compared to Shore it's not so big.

>>: [inaudible].

>> Pinar Tozun: Yeah, yeah, yeah. So here you might be able to overlap some of the data misses to the L1 caches with some of the long-latency data misses from L3, but we cannot really overlap all of the instruction-related stall time here with something else. And also, it's harder to overlap instruction miss penalties, because when you miss an instruction you don't know what to do next. So it's harder compared to data.

Seeing the problem related to instructions, now let's see what instructions transactions actually execute and where they come from. When we look at that, we basically see that no matter how different transactions are from each other, they execute a subset of database operations from a predefined set. These operations might be an index probe, an index scan, or record update and insert operations. So they are not so numerous. Now, let's also say that we have an example control flow for a possible transaction, and it has a conditional branch in it; based on that branch it might insert a record into table Z or not. We also have two threads in our system executing this transaction with different input parameters.
So T1 might execute it without taking the branch, whereas T2 might execute it by taking the branch and inserting a record. So overall, even threads executing the same transaction might not execute exactly the same instructions, but they still share a lot of common instructions, because they share database operations. To quantify this commonality, we analyzed the memory access traces from the standardized TPC benchmarks to see the instruction and data overlaps across different instances of transactions. Here I have the results for the TPC-C benchmark: we have the TPC-C mix and then the top two most frequent transactions in the TPC-C mix, which are New Order and Payment. Each of the pies here represents the instruction and data footprint for the indicated transaction type, and each of the slices indicates the frequency of that portion of the footprint appearing across different instances. The darkest red part, for example, represents the instructions and data that are common in all the instances of these transactions, and the light blue part is for instructions and data that are common in less than 30 percent of the instances. From these charts, what we see is that there is a significant, really non-negligible amount of overlap for instructions, whereas we don't see as much overlap for data. Here we have 100 gigabytes of data, meaning that at the database record level, considering random accesses, you don't see that much overlap, and what is common is mostly the metadata information or index roots. Also, if we look at the granularity of a single transaction type we naturally see more overlaps, because the same type of transactions have higher chances of executing the same database operations. Yeah?

>>: [inaudible].

>> Pinar Tozun: Okay, so the problem is that even though you share instructions, the instruction footprint is too big to fit in the L1 caches. So even though you have common instruction parts, you keep missing on them again because of the footprint. So it's capacity related, and for data it's easier to overlap the data misses and also, when you bring --. It's mostly related to overlaps at the hardware level.
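A small sketch of how such an overlap analysis can be computed from traces: each transaction instance is reduced to the set of instruction cache lines it touched, and each line is then bucketed by the fraction of instances that share it. The trace encoding and the example instances here are hypothetical; the talk's numbers come from real traces of the TPC benchmarks on Shore-MT.

```cpp
// Sketch: bucket the instruction footprint by how many instances share each line.
#include <cstdint>
#include <cstdio>
#include <set>
#include <unordered_map>
#include <vector>

using CacheLine = uint64_t;                        // address >> 6 (64B lines)
using Instance  = std::set<CacheLine>;             // one transaction instance

static void overlap_report(const std::vector<Instance>& instances) {
    std::unordered_map<CacheLine, size_t> freq;    // line -> #instances touching it
    for (const auto& inst : instances)
        for (CacheLine line : inst) ++freq[line];

    size_t in_all = 0, common = 0, rare = 0;
    for (const auto& kv : freq) {
        double share = double(kv.second) / double(instances.size());
        if (share >= 0.999)     ++in_all;          // common to all instances
        else if (share >= 0.30) ++common;          // common to 30-99% of instances
        else                    ++rare;            // shared by fewer than 30%
    }
    std::printf("footprint: %zu lines | in all: %zu | in >=30%%: %zu | rare: %zu\n",
                freq.size(), in_all, common, rare);
}

int main() {
    // Hypothetical instances of the same transaction type: they share most of
    // their instruction lines (the common database operations) but diverge on
    // a conditional branch, as in the insert-or-not example above.
    std::vector<Instance> instances = {
        {1, 2, 3, 4, 5, 10},          // took the branch (extra line 10)
        {1, 2, 3, 4, 5},              // did not take the branch
        {1, 2, 3, 4, 5, 10, 11},
    };
    overlap_report(instances);
    return 0;
}
```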
>> Pinar Tozun: So if you want to exploit this instruction commonality across transactions, we decided to design an alternative scheduling mechanism. For that, to give an example, we again have two threads executing the same transaction type, but because of different inputs they don't execute exactly the same thing. The different colored parts here indicate the different code parts, and they are picked at a granularity that can fit in an L1 instruction cache.

>>: Before you get into that I have a question. So the instruction cache misses, how far do they go? Can most of them be satisfied from the L2 cache? Did you look at that?

>> Pinar Tozun: Um, yes, yes, for sure, yes you can. So you don't have to go off the processor. You mainly satisfy them either from L2, or some of them from L3.

>>: From L3, so that's what [inaudible].

>> Pinar Tozun: So in the conventional case, what we would do is assign these transactions to one of the idle cores, and unless they see an IO operation or some context switching they will finish their execution on that core. So let's say we scheduled T1 to one of the idle cores; it will need to bring in the instructions it needs for the red code part. So it will observe a penalty for the instruction misses it sees for this part, and I am going to count such misses on the side as cache fills. So next, when T2 comes to the system, again both of these threads will observe instruction misses for the code parts they need to execute. And this will continue for the rest of the execution.

Now, rather than doing this, we will do the following: we will again schedule T1 to one of the idle cores, and it will again observe instruction misses for the red code part, but once it fills up the L1 instruction cache we are going to migrate it to another idle core and instead schedule the newly arriving thread to the initial core. Yeah?

>>: Why would [indiscernible] instruction [indiscernible]?

>> Pinar Tozun: But you still have some, like, either jumps or branches in the code, so the next [indiscernible] doesn't handle all of it. I mean, the stall time results I showed are from a run on the real hardware, so it cannot get rid of all the instruction misses. You can try to have a better code layout, but it sometimes changes from workload to workload, so it's not that easy. So yeah, this time T1 will still observe instruction misses, but T2 can reuse the instructions already brought in by T1. And again, when these two fill up their instruction caches, we will migrate T1 to another idle core and T2 to the core that already has the instructions.

>>: Can T2 follow the same branches as T1?

>> Pinar Tozun: It doesn't have to be exactly the same, but at least similar. So we are assuming that you try to batch the same type of transactions as much as possible, but we evaluated it without considering that as well and it still works.

>>: Just curious, what is the overhead of scheduling a thread on another core versus reading from an L2 [inaudible]? Are they comparable or is it much worse?

>> Pinar Tozun: So the scheduling cost, of course, if you consider just one L2 read versus just one thread migration, the migration is costlier, but in the end you have this instruction locality so you reduce many reads from L2. In the end it pays off. Yeah?

>>: This is touching on the crucial part of it. When you migrate a thread, that's when it costs something, that's overhead, and then it's going to execute for a little while on that target core, and it really boils down to: how long does it need to execute on that target core for that overhead to get properly amortized? Do you have a feeling for that, a thousand instructions, 10,000 or 100,000?

>> Pinar Tozun: Um, so we didn't exactly measure that. We just left the transaction mixes running over these cores and we still saw improvement, but usually, since you see lots of instruction commonality, you benefit from this type of locality. But I don't have an exact number in mind.

>>: So you don't know how big the chunks were that they actually executed once they migrated?

>> Pinar Tozun: So most of the time it's close to the cache size, but it doesn't have to be exactly all the instructions in a cache.

>>: So it's following the instruction cache. Okay.

>>: Something related to this: what exactly is a red chunk and a yellow chunk? Is it based on the structure of the transaction or is it based on [inaudible] instructions? How do you [inaudible]?

>> Pinar Tozun: So for this technique, basically it all happens at the hardware level, so it's just L1-cache-sized chunks as far as the hardware is concerned. It's a programmer-transparent way of doing this. I will show another technique that tries to align this with database operations, but the important thing is that it just has to be L1 cache size. That's all the restriction there is.
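To make this scheduling idea concrete, here is a toy, simulator-level sketch: a thread runs until it has filled roughly one L1 instruction cache's worth of new lines and then moves on to a fresh core, while a later thread that needs lines another core already holds migrates there instead of re-fetching them. The cache size, the reuse heuristic, and all names are simplified assumptions for illustration, not the actual SLICC hardware mechanism.

```cpp
// Toy model of migrating threads to chase warm L1 instruction caches.
#include <cstdio>
#include <unordered_set>
#include <vector>

struct Core {
    std::unordered_set<int> lines;   // instruction cache lines currently resident
};

constexpr size_t kL1Lines = 4;       // toy L1-I capacity, in cache lines

// Replays one thread's instruction-line trace; returns its instruction misses.
static int run_thread(const std::vector<int>& trace, std::vector<Core>& cores, int start_core) {
    int core = start_core, misses = 0;
    size_t fills = 0;                // new lines brought into the current core
    for (int line : trace) {
        if (cores[core].lines.count(line)) continue;          // L1-I hit
        // The line is not here. If another core already cached it (left behind
        // by an earlier transaction), migrate there and reuse it. Per the talk,
        // such a handoff carries only registers plus the program counter.
        int reuse = -1;
        for (size_t c = 0; c < cores.size(); ++c)
            if ((int)c != core && cores[c].lines.count(line)) { reuse = (int)c; break; }
        if (reuse >= 0) { core = reuse; fills = 0; continue; }
        // Otherwise it is a real miss: fill the line here, and once this core
        // holds one L1-I's worth of our lines, move on to spread the footprint.
        ++misses;
        cores[core].lines.insert(line);
        if (++fills >= kL1Lines) { core = (core + 1) % (int)cores.size(); fills = 0; }
    }
    return misses;
}

int main() {
    // Two transactions of the same type share their instruction lines
    // (the common database operations), as in the example above.
    std::vector<int> t1 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    std::vector<int> t2 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

    std::vector<Core> cores(4);
    int m1 = run_thread(t1, cores, 0);   // T1 pays the cache fills...
    int m2 = run_thread(t2, cores, 0);   // ...T2 reuses the lines T1 left behind
    std::printf("T1 misses: %d, T2 misses: %d\n", m1, m2);
    return 0;
}
```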
>>: [inaudible].

>> Pinar Tozun: Yes, yes, I will show numbers related to that, that's a good point.

>>: Do you know, just how long does the handoff take? How long does a handoff between two threads take in general?

>> Pinar Tozun: So what we do there is basically carry what's in the register file and the last program counter. So we measured it based on the number --. So we put them in the last-level cache, migrate the thread, and read them back from there, and we calculated that it's around 90 cycles. That's the cost of doing that, as long as you are within the same processor and can use the shared cache.

>>: It's synchronous too, so there is some kind of cache line going back and forth [inaudible].

>> Pinar Tozun: So yeah, by doing this we basically both exploit the aggregate L1 instruction cache capacity for a transaction and we localize the instructions in the caches, so you can exploit the instruction overlaps. And this technique we call SLICC; it comes from self-assembly of L1 instruction caches. As I mentioned, it's a programmer-transparent hardware technique, so you need to dynamically track the recent misses for a transaction and also the cache contents in all of the caches, to be able to know when and where to migrate. Of course this incurs some hardware cost. It's not so big, it's less than a kilobyte of space, but still, we are doing something specific for our application. So what we want to investigate is basically: can we have more software awareness to eliminate some of these hardware costs at runtime?

For that we looked at the instruction overlaps again, but this time at a much finer granularity: the granularity of database operations. These charts again show the instruction overlaps for different instances of the update, probe, and insert operations called within the New Order and Payment transactions of TPC-C. And if we compare these overlaps with the previous ones, the dark red part becomes a lot more apparent. So if we can align our migration decisions with the boundaries of database operations, or cache-sized chunks within each database operation, we might eliminate some of the hardware costs and also achieve more precise locality for transactions.

To do that we designed ADDICT, which comes from advanced instruction chasing for transactions. It is formed of two phases: you initially need to do a profiling pass to determine the migration points in the transaction, and then the second phase does the actual migrations based on the decisions of the first phase. For the profiling phase, as an example, we again have the transaction from before, and let's say we determine the following migration points. Initially you need to mark the starting point, the starting instruction, for a transaction. Then, whenever you enter a database operation, you need to mark the starting point for this operation as well. Here we have an index probe, and as you are executing the index probe, when you fill up your L1 instruction cache you need to mark that point as another migration point and start the second chunk of the probe. And we do this for all the operations called within each transaction type. In practice you do this analysis for several transaction instances and you pick the most frequently occurring migration points for the transactions. Then, during the second phase, we will first assign each of the migration points to a core, and now when a thread comes you will start it on the start core.
Once it enters the probe operation we migrate it to the core for the probe operation and instead schedule the next thread in the queue to the start core. And this continues as long as you hit the migration points. So this time we again exploit the aggregate instruction cache capacity, and we again localize instructions in the caches so we can minimize the instruction misses, but compared to the previous technique, SLICC, we don't have to make as many hardware changes. All we need to check is whether the current instruction is a migration point or not. And we can also enforce more precise instruction locality by doing this.
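As a rough illustration of these two phases, the sketch below profiles a hypothetical instruction trace annotated with database-operation boundaries to pick migration points, then replays execution and migrates whenever the current instruction address hits one of those points. The trace encoding, core assignment, and all names are simplified assumptions, not the actual ADDICT implementation.

```cpp
// Sketch of ADDICT-style profiling (phase 1) and migration at profiled points (phase 2).
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct TraceEntry {
    unsigned long long pc;   // instruction address (cache-line granularity)
    std::string op;          // enclosing database operation, e.g. "index_probe"
};

constexpr size_t kL1Lines = 4;                     // toy L1-I capacity

// Phase 1: choose migration points from a profiling trace. A point is added at
// the entry of each operation, and again whenever one operation's footprint
// overflows one L1-I capacity.
static std::vector<unsigned long long> profile(const std::vector<TraceEntry>& trace) {
    std::vector<unsigned long long> points;
    std::string cur_op;
    size_t lines_in_chunk = 0;
    for (const auto& e : trace) {
        if (e.op != cur_op || lines_in_chunk >= kL1Lines) {
            points.push_back(e.pc);                // operation entry or chunk overflow
            cur_op = e.op;
            lines_in_chunk = 0;
        }
        ++lines_in_chunk;
    }
    return points;
}

// Phase 2: execute, migrating whenever the current pc is a migration point.
static void execute(const std::vector<TraceEntry>& trace,
                    const std::unordered_map<unsigned long long, int>& point_to_core) {
    int core = 0;
    for (const auto& e : trace) {
        auto it = point_to_core.find(e.pc);
        if (it != point_to_core.end() && it->second != core) {
            core = it->second;                     // lightweight handoff: registers + PC
            std::printf("migrate to core %d at pc %llu (%s)\n", core, e.pc, e.op.c_str());
        }
        // ... execute the instruction on `core` ...
    }
}

int main() {
    // Hypothetical trace: transaction start, then an index probe whose
    // footprint spans two L1-I-sized chunks, then a record update.
    std::vector<TraceEntry> trace;
    for (unsigned long long pc = 0;  pc < 2;  ++pc) trace.push_back({pc, "txn_start"});
    for (unsigned long long pc = 10; pc < 17; ++pc) trace.push_back({pc, "index_probe"});
    for (unsigned long long pc = 30; pc < 33; ++pc) trace.push_back({pc, "update_record"});

    // Assign each migration point to a core, round-robin for the sketch.
    std::unordered_map<unsigned long long, int> point_to_core;
    int next_core = 0;
    for (unsigned long long pc : profile(trace)) point_to_core[pc] = next_core++ % 4;

    execute(trace, point_to_core);
    return 0;
}
```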
>> Pinar Tozun: Now, looking at how these techniques perform in practice: the evaluation we do for this part is on a simulator, because thread migrations, if you try to do them with the current operating systems we have, carry a lot of cost. The operating system does a lot of bookkeeping for the context switching, and when you do it dynamically with some of the pthread affinity functions, sometimes it doesn't migrate the threads exactly where you want them to go if the destination core is already full. So we implemented our ideas on a simulator which simulates an x86 architecture. Here we have 16 out-of-order cores and 32-kilobyte L1 caches. And we are replaying 1,000-transaction traces on this simulator; they are taken as we run the workload mixes for TPC-C and TPC-E on Shore-MT. So, on this graph we have the L1 instruction misses in a period of 1,000 instructions, and we are normalizing the numbers to the numbers for conventional scheduling. Both the programmer-transparent technique, SLICC, and the more software-aware technique, ADDICT, can significantly reduce the L1 instruction misses. But, as you also pointed out, there is a catch here, and that is that when you migrate a thread you leave its data behind. So if we also look at the L1 data misses in a period of 1,000 instructions, we increase the data misses in a non-negligible way. Our claim, however, is that data misses at the L1 level are not on the critical path for transaction execution, because they are easier to overlap with out-of-order execution. And the important thing here is to not increase the long-latency misses from the last-level cache, so not to migrate a thread from one processor socket to another one. Yeah?

>>: So I want to understand this. So actually physically moving a thread from one core to the other, you say that it's 90 cycles and so on, but that does not include all the bookkeeping that the OS maintains for the threads, right?

>> Pinar Tozun: Yes.

>>: I mean that's a lot of bookkeeping. So in practice how would you get rid of that bookkeeping?

>> Pinar Tozun: So in practice what you can do is maybe hack the OS to have some sort of more lightweight context switching specifically for this case, where you just copy the register values and the last program counter and don't care about the rest of the things. And actually there was this work called STEPS, which tried to context-switch transactions, a batch of transactions, on the same core, to basically take advantage of a similar overlap, and for that work they actually hacked the OS to have a more lightweight context switching operation like that, and they inlined all these context-switching points in the code. So we wanted to be a bit more programmer transparent while doing this, but of course both have tradeoffs. In one case you need to put more burden on the program developer, and in the other case you need to add some stuff in the hardware. Yeah?

>>: It looks like the scheduling that you have is very fine grained and is also very specific. So what happens when you migrate a transaction when it's in the middle of a lock? It tries to get a lock, and it's able to get the lock, and then needs to be scheduled out. So [inaudible]? Does it help us to do such things?

>> Pinar Tozun: So you don't really affect the correctness of the thing, but because of those cases, by doing this you might maybe add some additional deadlocks to the system. But, at least in our traces, considering that at the higher level you design your system in a way that you don't have such high contention, that shouldn't occur so much.

>>: So going back to Paul's question, do you model thread migration at all in your simulation?

>> Pinar Tozun: Yeah, yeah, so we add a cost, that 90-cycle cost, whenever we do that migration.

>>: Yeah, so you wouldn't actually use like an OS-level handoff here. Do you use your [inaudible]?

>> Pinar Tozun: So the way we do it now is just hardware threads, or hardware contexts. You can again do it maybe with user-level threads as well, but you need specialized migration calls at the OS layer to be able to do this in an efficient way, or you can completely change your software architecture if you really want to take it on: have threads fixed on a core, make them responsible for a specific code portion that is small, and hand off, like not migrate the threads, but hand off work in a way, attaching the transactions to different threads along the way.

So let me, I have just a few slides left. To also see the throughput numbers with the same setup: basically, improving instruction locality, even though you hinder data locality a bit, pays off. We get around 40 percent improvement with the programmer-transparent technique, and we can get around 80 percent improvement with the more transaction-aware one.

Now, to sum up what we have seen so far: we initially looked at scalability issues at the whole-machine level and how we can improve utilization there for transaction processing systems. For that, we initially analyzed the critical sections in the system and focused on eliminating the most problematic ones. We did that in our case using physiological partitioning, to have much more predictable data accesses, and this helped us eliminate the majority of the un-scalable critical sections. Then we looked at the major sources of underutilization within a core, and we saw that L1 instruction misses play a significant role there. But we also observed that transactions have a lot of common instructions. So, to exploit that, we developed two alternative scheduling mechanisms, one programmer transparent and one more transaction aware, to maximize instruction locality and thereby minimize the instruction-related stall time. And we also saw that being more software aware actually reduces some of the drawbacks these types of techniques might have at the hardware level. Now I will mention some of the higher-level messages from this line of work that I think are valid for software systems in general, and what there is to focus on in the near future.
So throughout my PhD I focused on exploiting hardware for transaction processing applications, but some of these things, both regarding scalability and scheduling, I think are applicable to software systems in general. Regarding scalability, when we show how well we scale we tend to show the performance numbers we have, which is also valuable, but they don't indicate scalability on the next generations of hardware that you might have. So I think we need more robust metrics to measure our scalability, and at least I think that looking at the critical sections in a system, analyzing them, and categorizing them based on which ones hinder scalability the most might help us better measure our scalability.

Also, with the increased parallelism and heterogeneity, scheduling tasks in an optimal way will become a real challenge for our systems. And the crucial thing here is to know where you need the locality to be. For instructions, for transaction processing systems, this was the L1 level; for data, I didn't go into detail on that line of work, but it's important to keep data as much as possible either in your local L3 or in your local memory bank. So basically we need to adapt our scheduling, putting emphasis on the type of locality we want to have for instructions and data.

And for the future, we already start seeing some proposals, both from academia and from some industrial places, on specializing hardware for various database operations. Most of this work now focuses on analytics workloads, but there is some that deals with transactions as well. And even companies like Google and Amazon, who didn't use to like this idea very much, are now realizing that when you have an application that runs on thousands of servers in many datacenters, it's better to have more specialized hardware for it, to be more cost effective and performance effective. And also, from the software side, we basically need to be a lot more aware of the underlying hardware and the hardware topology, to be able to know how to do the memory management and how to schedule tasks in an optimal way. And we also need to know which type of specialized hardware we can actually exploit for which type of task. And in order not to bloat our code for all the different hardware types, we can also maybe get the help of the compiler community to dynamically generate some of the code for these specialized cases.

So with that, I also like building systems; it's not the work of a single person, so these are all the people I had a chance to work with during my PhD, and with that I am going to conclude. If you want to know more details about the various projects, or see papers or presentations, here is my website, and the code base we use is also online, available and well documented, so you can check it out. Thank you very much and I will be happy to take any further questions you might have.

[clapping]

>> Paul Larson: All right, any more questions, or did we --? That's it, everything's answered. Wow, all right.

>>: Yes, yes.

>> Paul Larson: All right, let's thank the speaker again.

>> Pinar Tozun: Thank you.

[clapping]