>> Paul Larson: So we are pleased to welcome Pinar Tozun here today.
Pinar is going to talk about how to speed up transaction processing
and improve scalability. Many of you already know Pinar from various
conferences and various discussions, but just for the record she works
with Anastasia Ailamaki’s group at EPFL where they have been pounding
away at how to speed up transactions for quite some time. So welcome.
>> Pinar Tozun: Thanks Paul and thanks a lot for joining my talk. I
am really happy to be here today. So as we run one of the most
fundamental data management applications on today’s commonly used
server hardware most of the underlying infrastructure is actually
wasted. We can hardly utilize all of the processing units and more
than half of the execution time goes to waiting for some item to come
from memory when we execute transactions.
So today I will initially talk about why this is the case and then we
will see some techniques and insights to overcome these problems. So I
wasn’t sure about the scope of people who might be watching so: What is
transaction processing? Well, it was the primary reason why the relational
database model was invented by Codd and today it is still
one of the most fundamental applications in data management. Mainly
because it is crucial while handling money in banks and finance sectors
and so on. And some of the characteristics for these applications are
that you see many concurrent requests to the data and these requests
usually touch a small portion of the whole data.
For example, in banks you usually deal with a few accounts in the scope
of a transaction. And they also require high and predictable
performance. Now, today we also have many online applications that
deal with transactions and even though these have different higher
level functionalities and goals the core characteristics don’t really
change. And in fact these online applications they amplify these
characteristics, meaning that you see a lot more concurrent requests to
the data. And you have a lot bigger data, meaning that touching a
small portion of it requires really smart indexing and caching
mechanisms to be able to maintain high and predictable performance.
Now moving from application space to the hardware we run these
applications on. Before the last decade hardware initially innovated
on implicit parallelism through techniques like pipelining, instruction
level parallelism and simultaneous multi-threading. But around 2005,
due to power concerns, vendors stopped putting more complexity in a
single core, but they started to add more cores within a processor so
that we would still be able to turn Moore’s Law into more performance.
And today we increasingly have multiple processors in
one machine so the core to core communication latencies are no longer
uniform. And finally in the future, and we already start seeing some
examples of this today, there will be even more cores in one processor
and more processors in one machine. And also we expect them to be
heterogeneous in terms of their functionality.
So looking at this picture basically what hardware has been giving us
over the years was more and more opportunities for parallelism, even
though its level and type changed over time. And also it is
becoming more and more non-uniform and heterogeneous in terms of memory
management and functionality. So what is crucial for us for software
applications is that we need to be exploiting this parallelism because
it’s highly unlikely to disappear with the emerging hardware. And also
we need to become aware of the non-uniformity and the heterogeneity to
know how to optimally do our memory management and task scheduling over
these heterogeneous cores.
So let’s now see how well traditional transaction processing systems
can exploit what’s given from the modern hardware. Here we consider a
very simple transaction that just looks for one customer in the
database and reads the customer’s balance. We are running this
transaction on Shore-MT Storage Manager, which is an open source
storage manager that has traditional data management system components
and also it’s specifically designed for the multi-core era and we are
using an Intel Sandy Bridge Server here. So, on the X axis we are
increasing the number of worker threads executing this transaction and
Y axis plots the speedup over the throughput of a single worker thread.
So ideally what we would like to see is as we increase the number of
worker threads they would all be able to utilize a corresponding number
of cores in the same way so we can achieve linear speedup. And that’s
what the red line here shows. But what happens with the traditional
system is that it’s not bad, you still see a straight line, but the
problem is that the gap between the ideal line and the conventional
line increases as you increase the parallelism in the system. So
this is not a problem just for the current hardware we have, this will
become even worse with the next generations of hardware.
Yeah?
>>: [inaudible]?
>> Pinar Tozun: It’s Shore-MT that we are using. So it’s similar to
like really traditional commercial systems in terms of the components
and how it is designed.
So on the other hand, let’s see whether a single thread can actually get the highest
throughput it can get on this hardware, that is, how well we can utilize a
single core regardless of the parallelism that we have. And for that
we are looking at the instructions retired in a cycle when we execute a
single thread. And the hardware we use has 4-way issue processors,
meaning that in each cycle each core can actually execute up to 4
instructions. But, for transaction processing you can barely retire
one. So there is quite a bit of under utilization in this level as
well.
To sum up the problems of traditional transaction processing: one is
that we cannot really take advantage of the available multi-core
parallelism and there is quite a significant under utilization of the
micro architecture resources in a core.
Yeah?
>>: Do you know current Intel CPUs? Like how, like how free of
structural conflicts is the 4 way issue? Can you really issue 4
instructions and have them execute like top to bottom through the
entire pipeline or will you end up running into a bunch of structural
conflicts if you issue 4 instructions at the same time?
>> Pinar Tozun: I mean it’s hard to get to 4, that is true, but with
applications that have smaller instruction footprints, let’s say, you
can usually go for more than 2 at least. Like with the SPEC
benchmarks they can go way higher.
>>: Ok.
>> Pinar Tozun: So during the next half an hour I will present several
techniques that aim to overcome this problem of under utilization, but
I want to initially give the higher level insight behind all of them.
So when we traditionally think of transactions we tend to think of them
as one, big single task that we need to assign somewhere and execute.
So they are a black box to us and this leads to sub optimal memory
management decisions and scheduling decisions for the tasks at hand.
Now, instead of doing this we will think of transactions as formed of
finer-grained tasks or parts. And we will determine each of these
parts based on the data and instructions accessed in each part. And
based on the locality you want to improve you can give more importance
to data or instructions. Then we will assign each of these cores to
one of these parts and basically schedule transactions dynamically over
these cores.
And by doing that if we go back to the graphs we have seen before, now
I added the green line and the bar, we can get much closer to the ideal
lines that we would like to achieve, both in terms of scalability at
the whole machine level and in terms of utilization in a single core.
So for the rest of the talk we will first focus on scaling up
transaction processing on multi-core processors and then we will see
how we can improve utilization in a single core. And in the end I will
conclude with some higher level take away messages that I think are
applicable for software systems in general and show what there is to
focus on at this step.
So let’s start with scalability. Now on this picture we see a sketch
of a shared-everything architecture, a traditional shared-everything
architecture for transaction processing. At the higher level we have
the management of the worker threads in the system. And at the lower
levels we have the management of database pages. And these pages can
be either the data pages that keep the actual database records or the
index pages that allow fast accesses to the data pages.
Now when requests come from different clients we usually assign them in
a random fashion to the worker threads of the system and here we have
the blue and green workers. So since the data is shared we need to
ensure safe accesses to this shared data because threads might have
conflicts over the data. So we have a lock manager structure that
basically ensures isolation among different transactions and protects
database records. And at the lower levels of the system we have
short-term page latches that ensure consistent read/write operations to this
data.
So basically, with all this locking and latching, or whatever other
methods you might have, in the end you clutter the execution of a transaction with many
critical sections. If you look at the critical section breakdown in
this architecture using again a very simple transaction that just reads
one customer and then updates the balance for this customer we see that
even for this simple transaction you need to go through around 70
critical sections. And what’s more problematic is that most of these
critical sections belong to a type that we call un-scalable critical
sections. So these are the ones that as you increase the parallelism
in your system the contention on these critical sections has a high
chance of increasing as well. And database locking/latching belongs
to this type.
Yeah?
>>: Do you know off hand how many critical sections would it take to
transfer money from one account to another [inaudible]? Would that be
roughly double, like 150?
>> Pinar Tozun: Yeah, it will roughly double this because both are like
reading one data through the index structure and so on.
>>: Excellent.
>> Pinar Tozun: The other two types of critical sections I show them in
lighter colors because they create less contention. So the cooperative
critical sections are the ones where threads contending for a resource
can combine their requests and do them all at once later. So in
databases group commit for logging can be an example of this. And we
also have critical sections that create fixed contention. They usually
satisfy some producer-consumer-like relationship and they happen only
among a fixed number of threads regardless of the parallelism around. So
some critical sections within the transaction manager are a good
example for this.
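To make the cooperative category a bit more concrete, here is a minimal C++ sketch (hypothetical code and names, not the Shore-MT implementation) of a simplified group commit: whichever thread is not blocked by an in-progress flush writes out every log record accumulated so far, so the requests of contending threads are combined into one flush. Un-scalable critical sections (for example a centralized lock manager) and fixed-contention ones (producer/consumer pairs) lack this combining property.

```cpp
// Toy sketch of a cooperative critical section (hypothetical, not Shore-MT):
// a simplified group commit where one thread flushes the whole pending batch.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

class GroupCommit {
  std::mutex m_;
  std::condition_variable cv_;
  std::vector<std::string> pending_;
  bool flush_in_progress_ = false;
  long next_id_ = 0, flushed_up_to_ = 0;

 public:
  void commit(const std::string& rec) {
    std::unique_lock<std::mutex> lk(m_);
    long my_id = next_id_++;
    pending_.push_back(rec);
    while (flush_in_progress_) cv_.wait(lk);   // cooperate: wait for the current flusher
    if (my_id < flushed_up_to_) return;        // our record already went out with that flush
    flush_in_progress_ = true;                 // otherwise become the flusher ourselves
    std::vector<std::string> batch;
    batch.swap(pending_);
    long covered = next_id_;                   // every id below this is in the batch
    lk.unlock();
    std::cout << "flushed " << batch.size() << " record(s) in one call\n";  // stand-in for log I/O
    lk.lock();
    flushed_up_to_ = covered;
    flush_in_progress_ = false;
    cv_.notify_all();
  }
};

int main() {
  GroupCommit gc;
  std::vector<std::thread> ts;
  for (int i = 0; i < 8; ++i)
    ts.emplace_back([&gc, i] { gc.commit("txn-" + std::to_string(i)); });
  for (auto& t : ts) t.join();
}
```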
Yes?
>>: Is this data-size dependent?
Is it for a particular [inaudible]?
>> Pinar Tozun: So this is, I say, it’s not exactly independent of the
data size because like as you, if the index size increases for example,
probably you will need to latch more pages so that might increase, but
not exactly proportional, like linear.
>>: So what is an un-scalable critical section again? Sorry.
>> Pinar Tozun: Uh, so those are the ones where you, let’s
say, go from 4 cores to 8 cores. The contention on these, the threads
contending for these critical sections, might have a similar
increase. For example, like previously on this page you only have 4
threads that might get this latch. Whereas if you move to 8 or more
cores you have a higher number of threads that might be wanting to get
this latch, so that kind of increases. So these are the ones that
might become like even worse when you have more cores in your system.
That’s why we call them un-scalable.
>>: Hm.
>> Pinar Tozun: So to sum up the unpredictability that we have while
accessing the data leads us to be pessimistic and therefore we execute
many un-scalable critical sections during a transaction and this
eventually hinders scalability in our systems. Now to overcome this we
will redesign both layers of this architecture. At the higher level we
will assign specific parts of the data to specific worker threads. For
example here customers whose names start from A to M will be given to
the blue worker and the rest of the customers will be given to the
green one. So this way we can basically decentralize the lock manager
and get rid of the un-scalable critical sections on the lock manager.
And at this level this is just a logical partitioning of the data. Now
going further down we will replace the single rooted index structure
with a multi-rooted one. And we will ensure that each sub tree in this
structure will map to one of the partition ranges at the higher level.
And further down by slightly changing the database insert operation we
will ensure that each data page is pointed by a single index leaf page.
So by doing that we can ensure single threaded accesses to both
database records and database pages.
And this architecture we call physiological partitioning. So even
though you partition the data structures you still have a single
database instance so you are still in a shared everything system. So
due to partitioning you might have some overheads with multi partition
transactions and so on, but it’s less compared to pure shared nothing
architecture.
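As a rough illustration of the routing idea, here is a minimal C++ sketch (hypothetical names and structures, not the actual Shore-MT/PLP code): the key space is range-partitioned, each range maps to its own sub-tree, and exactly one worker thread executes the requests routed to a range, so record and page accesses inside a partition need neither lock-manager latches nor index-page latches.

```cpp
// Minimal sketch of the physiological-partitioning idea (hypothetical code).
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <thread>
#include <vector>

struct Partition {
  std::string upper_bound;              // keys < upper_bound land here
  std::map<std::string, long> subtree;  // stand-in for one index sub-tree and its pages
  std::vector<std::function<void(std::map<std::string, long>&)>> requests;
};

// Logical partitioning: route a key to the worker that owns its range.
Partition& route(std::vector<Partition>& parts, const std::string& key) {
  for (auto& p : parts)
    if (key < p.upper_bound) return p;
  return parts.back();
}

int main() {
  // Two partitions: the "blue" worker owns A-M, the "green" worker owns N-Z.
  std::vector<Partition> parts{{"N", {}, {}}, {"~", {}, {}}};

  // Incoming requests are split by key range before execution.
  route(parts, "Alice").requests.push_back([](auto& t) { t["Alice"] = 100; });
  route(parts, "Zoe").requests.push_back([](auto& t) { t["Zoe"] = 50; });
  route(parts, "Alice").requests.push_back([](auto& t) {
    std::cout << "Alice balance: " << t["Alice"] << "\n";
  });

  // One worker thread per partition executes its requests serially, so no
  // latches are needed on the records or pages inside a range.
  std::vector<std::thread> workers;
  for (auto& p : parts)
    workers.emplace_back([&p] {
      for (auto& req : p.requests) req(p.subtree);
    });
  for (auto& w : workers) w.join();
}
```

A multi-partition transaction would have to hand work to more than one of these workers, which is where the overheads mentioned above come from.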
Yeah?
>>: I am a little unclear on your goal here. So when you showed the
ideal chart, [inaudible] is that correct? Because everybody is
[inaudible] there is nothing much you can do, right? The ideal
[inaudible] does not look like that.
>> Pinar Tozun: Um, yes, but still like that’s the ideal for you.
That’s what you would like to achieve.
>>: So you won’t achieve that [inaudible]?
>> Pinar Tozun: Yeah, or like get close to that as much as possible,
because for that chart what I was executing were contention free
transactions.
>>: [inaudible].
>> Pinar Tozun: Yes. So for that one at least you should be able to
achieve this.
>>: And [inaudible] there is no such thing as non-scalable, un-scalable
part?
>> Pinar Tozun: Um, so there actually still is because, especially
because of the lock manager, because in order to get a lock there you
first need to latch the lock manager and so on. So you still have,
unfortunately, those critical sections.
So with this design, if we look back again to critical section
breakdown we now eliminated the majority of the un-scalable critical
sections and turned some of them to fixed type of critical sections.
And this way what remains now are mostly lighter critical sections in
the execution.
>>: So if you are looking up a code name or customer name [inaudible]?
The top level of the B tree will already have it. So what [inaudible]?
>>: [inaudible].
>>: I see.
>>: So what happens when you want to operate across accounts here
though?
>> Pinar Tozun: Yeah, so for those types of things you can actually,
like if the requests are independent of each other those can happen
like in parallel or if they are dependent then like you will need to
pass control from one thread to the other. So of course there is some
cost associated with that, but as long as you can keep such
communications within one processor, like if you are not doing
cross-socket communication, you still see the benefits with this
design. That is our experience at least.
So to also show some performance numbers here again we are looking at
the transaction from before and increasing the number of worker threads
executing those transactions on the X axis and we plot the throughput
on the Y axis, which is transactions executed in a second and it’s in
thousands. So the physiologically partitioned design performs better
and scales better compared to the conventional design. Both on the Sun
Niagara architecture that has support for more hardware contexts and on
the AMD architecture that has faster CPUs. And what’s more important
also for us is that you see more benefits as you increase the
parallelism in the system. And that’s what we want to see especially
for scalability.
>>: And can you just explain what the critical sections are
[indiscernible] because of partitioning?
>> Pinar Tozun: So we eliminate basically latching on the lock manager.
So you still need to lock for the database records, but like you have
no contention on the records and again you have no latch for the lock
manager itself. So that’s what you eliminate and also you eliminate
latching on the index pages and latching on data pages. Like even if
you are read only, sometimes like if you see many requests rushing to
the index root it also creates a huge contention, even for some read
only transactions.
Do you have any more questions related to this part?
>>: What if the workload becomes skewed over time and you are
essentially running single threaded through one of the partitions at
that point, do you re-partition?
>> Pinar Tozun: Yeah, yeah, we have a mechanism, like we have a
monitoring thread that basically monitors the sub-ranges. And a user
can determine the granularity there, but we monitor the sub-ranges and
also thread accesses to see how contended partitions are. And we can
repartition online based on that. And also, like in this system for
example, if you want to repartition, let’s say you want to put this
part of the range to here you just copy what the content is here to
here and you don’t have to worry about what’s underneath. So you don’t
have to move so much data at once.
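Here is a toy sketch of that kind of monitoring (hypothetical code and thresholds, not the actual mechanism): a monitor keeps per-sub-range access counts and, when one range dominates, splits it and hands the new sub-range to another worker, which only requires adjusting range boundaries and index entries rather than moving the underlying data pages.

```cpp
// Toy sketch of skew monitoring and online repartitioning (hypothetical code).
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Range {
  std::string lo, hi;   // [lo, hi)
  long accesses = 0;    // counted by the monitoring thread
};

// If one range receives more than `threshold` of all accesses, split it at the
// midpoint of its boundary characters and reset the counters.
void maybe_repartition(std::vector<Range>& ranges, double threshold) {
  long total = 0;
  for (auto& r : ranges) total += r.accesses;
  if (total == 0) return;
  auto hot = std::max_element(
      ranges.begin(), ranges.end(),
      [](const Range& a, const Range& b) { return a.accesses < b.accesses; });
  if (static_cast<double>(hot->accesses) / total <= threshold) return;
  char mid = static_cast<char>((hot->lo[0] + hot->hi[0]) / 2);
  Range second{std::string(1, mid), hot->hi, 0};
  hot->hi = second.lo;                 // shrink the hot range...
  ranges.insert(hot + 1, second);      // ...and give the rest to a new worker
  for (auto& r : ranges) r.accesses = 0;
}

int main() {
  std::vector<Range> ranges{{"A", "N", 0}, {"N", "Z", 0}};
  ranges[0].accesses = 900;            // skewed workload: A-M is hot
  ranges[1].accesses = 100;
  maybe_repartition(ranges, 0.7);
  for (auto& r : ranges) std::cout << "[" << r.lo << ", " << r.hi << ")\n";
}
```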
Yeah?
>>: [inaudible].
>> Pinar Tozun: So on this hardware, like in the end for 64 and 16 worker threads actually the OS is the problem, because you start to hit
some [indiscernible] in terms of scheduling. Like it’s the boundary
for the machine so that’s what creates the problem for this experiment.
And also you still have like some un-scalable critical sections here
which either comes from buffer pool or some --. Again, you still have
some centralized structures, internal structures. And we have recent
work actually that also partitions them at second boundaries so you can
eliminate these as well.
>>: So it’s a CPU or an IO [indiscernible]?
>> Pinar Tozun: We don’t have IO in those experiments, so it’s mostly --.
>>: [inaudible].
>> Pinar Tozun: I will say for these machines it’s mostly the operating
system because we don’t enforce any core affinities in these
experiments. So it’s like the way the scheduling becomes the problem
towards the end.
Yeah?
>>: You get much more improvement on the AMD than on the Niagara. What is
the reason, the simple reason?
>> Pinar Tozun: So that’s actually I think just related to a core being
faster there. So it’s just the processors being faster, it gives more
benefits.
>>: So I am trying to understand, so between 16 and 32 let’s say in one
of your graphs. So the 16 one has 16 partitions in the B-tree and the
32 one has 32 partitions in the B-tree. Are these two different B-trees or is
[inaudible]?
>> Pinar Tozun: Okay, for this experiment it’s like you actually have
the whole data in place, but for 32 worker threads yeah, you deal with
32 partitions.
>>: So what happens if you have a transaction that’s complex, with
multiple B trees? Is it partitioned across multiple workers or is a
single worker responsible for [indiscernible] transfers?
>> Pinar Tozun: No, it’s partitioned across different workers. So for
each table you have a different set of workers.
>>: Does the Shore architecture support that?
>> Pinar Tozun: Uh, yeah, so this is just for this design that we have
that. So in default Shore you will have a single thread executing a
single transaction. So, for this design we --.
>>: It looks like a big change; I mean how transactions are executed.
>> Pinar Tozun: Um, yeah, we had to change some things at the thread
management layer. Quite a bit of things at the thread management layer
as well, that’s true. But, you can automate it to some extent using
the SQL frontend, because we were using the SQL frontend of
Postgres to determine basically the different independent
parts of a transaction, like you can automate it to some extent.
Um, so we have seen how we can improve utilization at the whole
processor level. Now, let’s see what the problems are within a core
for transaction processing. Okay, so initially let’s remind ourselves
how the memory hierarchy of a typical Intel processor looks today.
So closest to the core we have the L1 caches, which are split
between instructions and data. Then there are the L2 caches which are
also private per core. And then there is a shared L3 cache or last
level cache. And in the end we have the main memory, which today is
large enough to fit the working set sizes for most transaction
processing applications.
So as we go down in this hierarchy basically the access latencies
drastically increase as expected, but like in practice if you find some
item in L1, because of out-of-order execution, you usually pay no
cost for this. But, if you cannot find the instructions or data you
need in L1 caches then you need to go further down and the core might
stall because of this.
Now, let’s see how much of these stalls we actually have for
transaction processing. On this graph we are breaking the execution
cycles into busy cycles and the stall cycles. We are looking at the
standardized TPC-C and TPC-E benchmarks here. We are running
them on Shore-MT and we are using again an Intel Sandy Bridge
server. Now busy cycles for us mean the cycles where you can
retire at least one instruction. Stall cycles are the ones where you
cannot retire any instructions.
So from these two bars we see that more than half of the execution time
goes to various stalls. Now if you want to investigate why these
stalls happen, the graph on the right hand side shows the breakdown of
the stall cycles in a period of a thousand instructions into where they
come from in the hierarchy and also whether they happen due to data or
instructions. For example L3-D means the stall time due to data misses
from the L3 cache.
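As a back-of-the-envelope illustration of how such a breakdown can be put together (the counter values and penalties below are made-up placeholders, not the numbers from the talk), multiply each miss count by an assumed per-miss penalty and normalize per 1,000 instructions; this simple model also ignores the overlap between components discussed next.

```cpp
// Simplified stall-cycle breakdown per 1,000 instructions from hardware
// counters. All numbers here are placeholders, and overlap between stall
// components is ignored.
#include <cstdio>

struct Component {
  const char* name;
  double misses;          // e.g. read from performance counters over a run
  double penalty_cycles;  // assumed average penalty per miss
};

int main() {
  const double instructions = 1.0e9;  // total retired instructions (placeholder)
  const Component comps[] = {
      {"L1-I", 2.0e7, 20.0},
      {"L1-D", 1.5e7, 20.0},
      {"L3-I", 1.0e6, 60.0},
      {"L3-D", 3.0e6, 200.0},
  };
  for (const auto& c : comps) {
    double stalls_per_ki = c.misses * c.penalty_cycles / (instructions / 1000.0);
    std::printf("%-5s stall cycles per 1,000 instructions: %.1f\n", c.name, stalls_per_ki);
  }
}
```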
As a side note, drawing these components on top of each other is a
bit misleading because in practice some of these stall times will be
overlapped with others, again due to out-of-order execution, but it
doesn’t change our higher level conclusions here. So looking at both
of the bars we see basically the L1 instruction misses being a
significant contributor to the stall time for these workloads. You
might be able to overlap the --. Huh?
>>: Does someone have a question?
>>: Yeah, no, I just want to understand what the complete software stack
here was. You were running Shore on the bottom or did you have
[indiscernible]?
>> Pinar Tozun: Uh, we have the application layer for Shore, which has
the benchmarks implemented in C++.
>>: Okay, so it’s just a test driver?
>> Pinar Tozun: Yes, yes, yes, so there is no communication.
>>: And does this include the test driver part of it as well or just
the Shore?
>> Pinar Tozun: It includes the test driver part as well, but it’s not,
like compared to Shore, it’s not so big.
>>: [inaudible].
>> Pinar Tozun: Yeah, yeah, yeah.
So here you might be able to overlap some of the data misses to L1
caches with some of the long latency data misses from L3, but we cannot
really overlap all of the instruction related stall time here with
something else. And also, it’s harder to overlap instruction miss
penalties because when you miss an instruction you don’t know what to
do next. So it’s harder compared to data.
Seeing the problem related to instructions, now let’s see what
instructions transactions actually execute and where they come from.
So when we look at that we basically see that no matter how different
transactions are from each other they execute a subset of database
operations from a predefined set. And these operations might be an
index probe, an index scan, or record update and insert operations. So
they are not so numerous.
Now, let’s also say that we have an example control flow for a possible
transaction and basically it has a conditional branch and based
on that branch it might insert a record to table Z or not. So we also
have two threads in our system executing this transaction with
different input parameters. So T1 might execute it without taking the
branch, whereas T2 might execute it by taking the branch and inserting
a record.
So overall even threads executing the same transactions might not
exactly execute the same instructions, but still they share a lot of
common instructions because they share database operations. To
quantify this commonality we have analyzed the memory access traces
from the standardized TPC benchmarks to see the instruction and data
overlaps across different instances of transactions. And here I have
the results for the TPC-C benchmark. We have the TPC-C mix then the
top two most frequent transactions in the TPC-C mix, which are New Order
and Payment.
So each of the pies here represents the instruction and data footprint
for the indicated transaction type and each of the slices indicates the
frequency of that portion of the footprint appearing across different
instances. So the darkest red part for example represents the
instructions and data that are common in all the instances for these
transactions and the light blue part is for instructions and data that
are only common in less than 30 percent of the instances.
So from these charts what we see is that there is a significant, like a
really non-negligible amount of overlap for instructions, whereas we
don’t see as much overlap for data. And here we have 100 gigabytes of
data meaning that at the database records level, considering random
accesses, you don’t see that much overlap and what is common is mostly
the metadata information or index roots. Also, if we look at the
granularity of a single transaction type we see more overlaps
naturally; because the same type of transactions have higher chances of
executing the same database operations.
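Here is a rough sketch of how such an overlap analysis can be done on traces (hypothetical trace format and tooling, not the actual analysis code): count, for every cache-block address, in how many transaction instances it appears, then bucket the footprint by that frequency, which is what the pie slices summarize.

```cpp
// Rough sketch of the trace-based overlap analysis (hypothetical trace format).
#include <cstdint>
#include <iostream>
#include <set>
#include <unordered_map>
#include <vector>

int main() {
  // Each inner set is the block-granular instruction footprint of one
  // transaction instance; real footprints would come from memory access
  // traces collected with an instrumentation tool.
  std::vector<std::set<std::uint64_t>> instances = {
      {0x10, 0x11, 0x12, 0x20},
      {0x10, 0x11, 0x12, 0x30},
      {0x10, 0x11, 0x40, 0x41},
  };

  std::unordered_map<std::uint64_t, int> freq;  // block -> #instances containing it
  for (const auto& inst : instances)
    for (std::uint64_t block : inst) ++freq[block];

  int in_all = 0, in_some = 0, in_one = 0;
  for (const auto& kv : freq) {
    if (kv.second == static_cast<int>(instances.size())) ++in_all;
    else if (kv.second > 1) ++in_some;
    else ++in_one;
  }
  std::cout << "blocks common to all instances:  " << in_all << "\n"
            << "blocks common to some instances: " << in_some << "\n"
            << "blocks unique to one instance:   " << in_one << "\n";
}
```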
Yeah?
>>: [inaudible].
>> Pinar Tozun: Okay, so the problem is that even though you share
instructions the instruction footprint is too big to fit in the L1
caches. So even though you have common instruction parts you keep
missing them again because of the footprint. So it’s like capacity related
and for data it’s easier to overlap the data misses and also, when you
bring --. It’s mostly related to overlaps at the hardware level.
So if you want to basically exploit this instruction commonality across
transactions, we decided to design an alternative scheduling mechanism,
and for that, to give an example, we again have two
threads executing the same transaction type, but because of different
inputs they don’t execute exactly the same thing. The different colorful
parts here indicate the different code parts and they are picked at a
granularity that can fit in an L1 instruction cache.
>>: Before you get into that I have a question. So the instruction
cache misses and how far do they go? Can most of them be satisfied
from the L2 cache? Did you look at that?
>> Pinar Tozun: Um, yes, yes, for sure you can. So you don’t have
to go off the processor. You mainly satisfy them either from L2 or
some of them from L3.
>>: From L3, so that’s what [inaudible].
>> Pinar Tozun: So in the conventional case what we would do is to
assign these transactions to one of the idle cores and unless you see
an IO operation or some context switching they will finish their
execution on that core. So let’s say we scheduled T1 to one of the
idle cores and it will need to bring the instructions it needs for the
red code part. So it will observe a penalty for the instructions it
will see for this part and I am going to count such misses on the side
as cache fills. So next when T2 comes to the system again both of
these threads will observe instruction misses for the code parts they
need to execute. And this will continue for the rest of the
execution.
Now, rather than doing this we will do the following: we will again
schedule T1 to one of the idle cores and it will again observe
instruction misses for the red code part, but once it fills up the L1
instruction cache we are going to migrate it to another idle core and
instead schedule the new thread to the initial core.
Yeah?
>>: Why would [indiscernible] instruction [indiscernible]?
>> Pinar Tozun: But, you still have some, like either jumps or branches
in the code. So the next-line prefetcher doesn’t handle all of it. I
mean the stall time results I show, they are from a run on the real
hardware. So it cannot get rid of all the instruction misses. I mean
you can try to have a better code layout, but like it changes sometimes
from workload to workload so it’s not that easy.
So yeah, this time T1 will still observe instruction misses, but T2 can
reuse the instructions already brought by T1. And again, when these
two fill up their instruction caches we will migrate T1 to another idle
core and T2 to the core that already has the instructions.
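Here is a toy software model of the migration policy just described (a sketch under simplifying assumptions, not the actual SLICC hardware): a transaction is treated as a sequence of L1-I-cache-sized code parts, and when a thread moves on to its next part it goes to the core that already holds that part if there is one, otherwise to an idle core where it pays the cache fill.

```cpp
// Toy model of the SLICC-style scheduling described above (software sketch).
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
  // Code parts of one transaction type; each part is assumed to fill L1-I.
  std::vector<std::string> parts = {"red", "yellow", "blue", "green"};
  int num_cores = 8;

  std::unordered_map<std::string, int> part_to_core;  // which core holds which part
  int next_idle_core = 0;
  int cache_fills = 0, reuses = 0;

  // Two threads (T1, T2) executing the same transaction type back to back.
  for (int thread = 0; thread < 2; ++thread) {
    for (const auto& part : parts) {
      auto it = part_to_core.find(part);
      if (it != part_to_core.end()) {
        ++reuses;                           // migrate to the core that is already warm
      } else {
        int core = next_idle_core++ % num_cores;
        part_to_core[part] = core;          // fill that core's L1-I with this part
        ++cache_fills;
      }
    }
  }
  std::cout << "cache fills: " << cache_fills << ", warm reuses: " << reuses << "\n";
  // With conventional scheduling each thread would fill its own core's L1-I
  // for every part: 2 threads x 4 parts = 8 fills instead of 4.
}
```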
>>: Can T2 follow the same branch as T1?
>> Pinar Tozun: It doesn’t have to be exactly the same, but similar at
least. So we are assuming that you are trying to batch as much as
possible the same type of transactions, but we deployed it without
considering that as well and it still works.
>>: Just curious, what is the overhead of scheduling a thread onto
another core versus reading from the L2 [inaudible]? Are they comparable
or is it much worse?
>> Pinar Tozun: So the scheduling cost, of course, if you consider just
an L2 access versus just one thread migration, the migration is
costlier, but in the end you have this instruction locality so you
reduce many reads from L2. Like in the end it pays off.
Yeah?
>>: This is touching on the crucial part of it. When you migrate a
thread, that’s when it costs something, that’s overhead and then it’s
going to execute for a little while from that target core and it really
boils down to: how long does it need to execute on that target core for
that overhead to get properly amortized? Do you have a feeling for
that, a thousand instructions, 10,000 or 100,000?
>> Pinar Tozun: Um, so we didn’t exactly measure that. So we just like
left the transaction mixes running over these cores and we saw
improvement still, but usually since you see lots of instruction
commonality you benefit from this type of locality. But, I don’t have
an exact number in mind.
>>: So you don’t know how big the chunks were when they actually
executed once they migrated?
>> Pinar Tozun: So most of the time it’s close to the cache size, but
it doesn’t have to be like exactly all the instructions in a cache.
>>: So it’s following the instruction cache. Okay.
>>: Something related to this: what exactly is a red chunk and a
yellow chunk? Is it based on the structure of the transaction or is
it based on [inaudible] instructions? How do you [inaudible]?
>> Pinar Tozun: So for this technique basically it all happens at the
hardware level so it’s just L1 cache size chunks as far as the hardware
is concerned. Like, it’s a programmer-transparent way of doing this.
I will show another technique that will try to align this with database
operations, but the important thing is that it just has to be L1 cache
size. That’s all the restriction there is.
>>: [inaudible].
>> Pinar Tozun: Yes, yes, I will show numbers related to that, that’s a
good point.
>>: Do you know, just how long does the handoff take? How long does a
handoff between two threads take in general?
>> Pinar Tozun: So what we do there is basically you need to carry
what’s in the register files and the last program counter. So we
measured it based on the number --. So we put them in the last level
cache, migrate the thread and read them back from there and we
calculated it, like it’s around 90 cycles. That’s the cost to do that
as long as you are within the same processor and can use it.
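For reference, here is a minimal sketch of the state that moves in such a handoff (a hypothetical struct, not the actual mechanism): just the general-purpose registers and the last program counter, a few cache lines' worth of data that are written to the shared last-level cache and read back on the destination core, which is roughly what the 90-cycle figure covers.

```cpp
// Minimal sketch of the state carried on a thread handoff (hypothetical
// struct; FP/SIMD state is ignored for simplicity).
#include <array>
#include <cstdint>
#include <iostream>

struct HandoffState {
  std::array<std::uint64_t, 16> gpr;  // general-purpose registers (x86-64)
  std::uint64_t pc;                   // last program counter to resume from
};
// No kernel stack, signal state, or scheduler bookkeeping is moved, which is
// what makes this much cheaper than a full OS context switch.

int main() {
  std::cout << "state moved per handoff: " << sizeof(HandoffState) << " bytes\n";
}
```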
>>: It’s synchronous too, so there is some kind of cache line going
back and forth [inaudible].
>> Pinar Tozun: So yeah, by doing this basically we both exploit the
aggregate L1 instruction cache capacity for a transaction and we
localize the instructions to caches so you can exploit instruction
overlaps. And we call this technique SLICC; it comes from self-assembly
of L1 instruction caches.
And as I mentioned it’s a programmer transparent hardware technique so
you need to dynamically track the recent misses for a transaction and
also the cache contents in all of the caches to be able to know when
and where to migrate. So this of course incurs some hardware
cost. It’s not so big, it’s less than a kilobyte of space still, but
we are doing something specific for our application. So what we want
to investigate is basically: can we have more software awareness
to eliminate some of these hardware costs at runtime?
And for that we looked at the instruction overlaps again, but this time
at a much finer granularity. It’s the granularity of database
operations, and these charts again show the instruction overlaps
for different instances of the update, probe, and insert operations called
within the New Order and Payment transactions of TPC-C. And if we
compare these overlaps with the previous ones the dark red part
basically becomes a lot more apparent. So if we can align our
migration decisions with the boundaries of database operations or cache
size chunks in each database operation, we might eliminate some of the
hardware cost and also achieve more precise locality for transactions.
And to do that we designed ADDICT, which comes from advanced instruction
chasing. So it’s formed of two phases: you need to do initially a
profiling pass to determine the migration points in the transaction, and
the second phase does the actual migrations based on the decisions of
the first phase. So for the profiling phase, as an example, we
again have the transactions from before and let’s say we determine the
following migration points. So initially you need to mark the starting
point, the starting instruction, for a
transaction. Then whenever you enter a database operation you need to
mark the starting point for this operation as well.
And here we have an index probe and as you are executing the index
probe while you fill up your L1 instruction cache you need to mark that
point as another migration point and start the second phase of the
probe. And we do this for all the operations called within each
transaction type. And in practice you do this analysis for several
transaction instances and you pick the most frequently occurring
migration points for the transactions. So then during the second phase
we will first assign each of the migration points to a core and now
when a thread comes you will start it from the start core. Once it
enters the probe operation we migrate it to the core for the probe
operation and instead schedule the next thread in the queue to the
start core. And this continues as long as you hit the migration
points.
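Here is a rough sketch of the two phases in code (hypothetical names and trace format, not the ADDICT implementation): the profiling pass emits a migration point at the start of the transaction, at the entry of each database operation, and wherever an operation's instruction footprint exceeds the L1-I capacity; the runtime pass then maps each point to a core.

```cpp
// Rough sketch of the two ADDICT-style phases described above (hypothetical).
#include <iostream>
#include <string>
#include <vector>

struct TraceEntry { std::string op; int bytes; };  // operation name, profiled instruction bytes

// Phase 1: profiling. Emit a migration point when the operation changes or the
// current operation has filled one L1 instruction cache.
std::vector<std::string> find_migration_points(const std::vector<TraceEntry>& trace, int l1i_bytes) {
  std::vector<std::string> points;
  std::string cur_op;
  int used = 0;
  for (const auto& e : trace) {
    if (e.op != cur_op) {                 // entering the transaction or a new operation
      points.push_back(e.op + "/phase0");
      cur_op = e.op;
      used = 0;
    }
    used += e.bytes;
    if (used > l1i_bytes) {               // operation spills out of L1-I: split it
      points.push_back(e.op + "/phase1");
      used = 0;
    }
  }
  return points;
}

int main() {
  // Profiled trace of one transaction: start code, an index probe, an update.
  std::vector<TraceEntry> trace = {
      {"txn-start", 8 * 1024}, {"index-probe", 24 * 1024}, {"index-probe", 20 * 1024},
      {"update-record", 16 * 1024}};
  auto points = find_migration_points(trace, 32 * 1024);

  // Phase 2: each migration point gets its own core; a thread reaching point i
  // migrates to core i, and the vacated core takes the next thread in the queue.
  for (size_t i = 0; i < points.size(); ++i)
    std::cout << "migration point " << points[i] << " -> core " << i << "\n";
}
```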
So this time we again exploit the aggregate instruction cache capacity.
We again localize instructions to caches so we can minimize the
instruction misses, but compared to the previous
technique, SLICC, we don’t have to have as many hardware changes. All we
need to check is whether the current instruction is a migration point
or not. And also we can enforce more precise instruction locality by
doing this. Now looking at how these techniques perform in practice,
the evaluation we do for this part is on a simulator, because thread
migrations, if you try to do them with the current operating systems we
have, have a lot of cost.
The operating system has a lot of bookkeeping for the context switching
and doing that dynamically with some of the pthread affinity
functions sometimes doesn’t migrate the threads exactly where you
want them to go if the destination core is already full. So we
implemented our ideas on a simulator which simulates an x86
architecture. Here we have 16 out-of-order cores and 32 kilobyte L1
caches. And we are running and replaying 1,000-transaction traces
on this simulator, and they are taken
as we run the workload mixes for TPC-C and TPC-E on Shore-MT.
So, on this graph we have the L1 instruction misses in a period of
1,000 instructions and we are normalizing the numbers based on the
numbers for the conventional scheduling. So both the programmer
transparent technique SLICC and the more software aware technique
ADDICT can significantly reduce the L1 instruction misses. But,
as you also pointed out there is a catch here and that is when you
migrate a thread you leave its data behind. So if we also look at the
L1 data misses in a period of 1,000 instructions we increase the data
misses in a non-negligible way. But, our claim is that for data misses
at the L1 level they are not on the critical path for transaction
execution because they are easier to overlap with out of order
execution.
And the important thing here is to basically not increase long latency
misses from the last level cache, that is, not migrating a thread from one
processor socket to another one.
Yeah?
>>: So I want to understand this. So actually physically sort of
moving a thread from one core to the other, you say that it’s 90 cycles
and so on, but that does not include all the bookkeeping that the OS
maintains for the threads, right?
>> Pinar Tozun: Yes.
>>: I mean that’s a lot of bookkeeping. So in practice how would you
get rid of that bookkeeping?
>> Pinar Tozun: So in practice what you can do is you can maybe
hack the OS to have some sort of more lightweight context
switching specifically for this case, where you just copy the register
values and the last program counter and don’t care about the rest of
the things. And actually there was this work called STEPS which tried
to context-switch transactions, a batch of transactions, on the same core
to basically take advantage of a similar overlap, and for that work they
actually hacked the OS to have a more lightweight context switching
operation like that, and they inlined all these context switching
points in the code.
So we wanted to be like a bit more programmer transparent while doing
this as well, but of course both have tradeoffs. In one case you need
to put more burden on the program developer and in the other case you
need to add some stuff in the hardware.
Yeah?
>>: It looks like the scheduling that you have is very fine
grained and is also very specific. So what happens with a
transaction when it’s in the middle of a lock? It tries to get a
lock, and it’s able to get a lock and needs to be scheduled out. So
[inaudible]? Does it help us to do such things?
>> Pinar Tozun: So you don’t really, like you don’t affect correctness
of the thing, but because of those cases you might, like by doing this,
you might add some additional deadlocks maybe to the system. But, like
at least in our traces, like considering that at the higher level you
design your system in a way that you don’t have such high contention
that shouldn’t occur so much.
>>: So going back to Paul’s question, do you model thread migration at
all in your simulation?
>> Pinar Tozun: Yeah, yeah, so we add a cost, that 90 cycle cost,
whenever we do that migration.
>>: Yeah, so you wouldn’t actually use like OS-level handoff here. Do
you use your [inaudible]?
>> Pinar Tozun: So the way we do it now is just hardware threads or
hardware contexts. You can again do it maybe with user level threads as
well, but you need specialized migration calls at the OS layer to be
able to do this in an efficient way, or you can like completely change
your software architecture if you really want to take it on. Like have
threads fixed on a core and like make them responsible for a specific
code portion that is small and hand off, like not migrate the threads,
but hand off work in a way, passing the transactions along to the
different threads on the way.
So let me -- I have just a few slides. So to also see the throughput
numbers with the same setup: basically improving instruction locality,
even though you hinder data locality a bit, pays off. So we get like
around 40 percent improvement with the programmer transparent technique
and we can get around 80 percent improvements with the more transaction
aware one.
Now to sum up what we have seen so far: we initially looked at
scalability issues at the whole machine level and how we can improve
utilization there for transaction processing systems. And for that we
initially analyzed the critical sections in the system and focused on
eliminating the most problematic ones. And we did that in our case
using physiological partitioning to have much more predictable data
accesses and this helped us in eliminating the majority of the
un-scalable critical sections.
Then we looked at the major sources of under utilization within a core
and we saw that L1 instruction misses play a significant role there.
But, we also observed that transactions have a lot of common
instructions. So, to exploit that we developed two alternative
scheduling mechanisms: one was programmer transparent and one was more
transaction aware, to maximize instruction locality and thereby minimize
the instruction related stall time. And we also saw that being more
software aware actually reduces some of the drawbacks these types of
techniques might have at the hardware level.
Now I will mention some of the higher level messages from this line of
work that I think are valid for software systems in general and what
there is to do, to focus on, in the near future. So throughout my PhD I
focused on exploiting hardware for transaction processing applications,
but some things, both regarding scalability and scheduling, I
think are applicable to software systems in general. So regarding
scalability when we show how well we scale we tend to show the
performance numbers we have, which is also valuable, but they don’t
indicate scalability on the next generations of hardware that you might
have.
So I think we need more robust metrics to measure our scalability and at
least I think that looking at critical sections and analyzing critical
sections in a system categorizing them based on which are the ones that
hinder scalability the most might help us in terms of better measuring
our scalability. Also, with the increased parallelism and
heterogeneity basically scheduling tasks in an optimal way will become
a real challenge for our systems. And the crucial thing here is to
know where you need the locality to be. Like for instructions, for
transaction processing systems this was the L1 level,
and for data, I didn’t go into detail on that line of work, but for data
it’s important to keep data as much as possible either in your local L3
or in your local memory bank. So basically we need to adapt our
scheduling putting emphasis on the type of locality we want to have for
instructions and data.
And for the future, we already start seeing some proposals both
from academia and some industrial places on specializing hardware for
various database operations. Most of this work now focuses on
analytics workloads, but there is some that deals with transactions as
well. And even companies like Google and Amazon, who used to not
like this idea very much, are now realizing that when you
have an application that runs on thousands of servers on many
datacenters it’s better to have more specialized hardware for it to be
more cost effective and performance effective.
And also from the software side basically we need to be a lot more
aware of the underlying hardware and hardware topology to be able to
know how to do the memory management and how to schedule tasks in an
optimal way. And also we need to know which type of special hardware
we can actually exploit for which type of task. And in order not to
bloat our code for all the different hardware types, we can also get
some help from the compiler community to dynamically generate some of
the code for these specialized cases.
So with that -- I also like building systems; it’s not the work of a
single person, so these are all the people I had a chance to work with
during my PhD, and with that I am going to conclude. If you want to
know more details about the various projects, or see papers, or
presentations, here is my website and the code base we use is also
online, available and well documented so you can check it out.
Thank you very much and I will be happy to have any further questions
you might have.
[clapping]
>> Paul Larson: All right, any more questions or did we --? That’s it,
everything’s answered. Wow, all right.
>>: Yes, yes.
>> Paul Larson: All right, let’s thank the speaker again.
>> Pinar Tozun: Thank you.
[clapping]