Composability for Application-Specific
Transactional Optimizations
Rui Zhang
Zoran Budimlić
William N. Scherer III
Department of Computer Science, Rice University, Houston, Texas 77005
Email: {ruizhang, zoran, scherer}@rice.edu
Software Transactional Memory (STM) has made
great advances towards acceptance into mainstream
programming by promising a programming model
that greatly reduces the complexity of writing concurrent programs. Unfortunately, the mechanisms
in current STM implementations that enforce the
fundamental properties of transactions — atomicity,
consistency, and isolation — also introduce considerable performance overhead. This performance
impact can be so significant that in practice, programmers are tempted to leverage their knowledge
of a specific application to carefully bypass STM
calls and instead access shared memory directly.
While this technique can be very effective in improving performance, it breaks the consistency and
isolation properties of transactions, which have to
be handled manually by the programmer for the
specific application. It also tends to break another
desirable property of transactions: composability.
In this paper, we identify the composability problem and propose two STM system extensions to
provide transaction composability in the presence
of direct shared memory reads by transactions. Our
proposed extensions give the programmer a level of flexibility and performance similar to existing practices when optimizing an STM application, while preserving composability. We evaluate our
extensions on several benchmarks on a 16-way
SMP. The results show that our extensions provide performance competitive with hand-optimized
non-composable techniques, while still maintaining
transactional composability.
I. INTRODUCTION
Achieving higher performance is a major driving
force for concurrent programming. With the "power wall" crisis that the hardware industry has been facing for several years now acting as a catalyst, the transition from sequential to concurrent programming has arrived quickly and decisively. This creates an urgent need for programming tools that lower the complexity of concurrent
programming. These tools need to provide elegant,
easy-to-use programming constructs as well as competitive performance. As a construct, Transactional
Memory [10] provides many appealing features
that conceal most of the complexity in concurrent
programming, which greatly reduces the difficulty
of writing concurrent programs.
Though transactions have been widely and successfully used for a long time in databases, they
still face many problems as a general programming
construct. One key obstacle to widespread adoption
is their performance overhead. This overhead can be
sufficiently large to compel programmers to carefully bypass certain TM calls based on application-specific knowledge. Such an optimization usually breaks isolation [10] and consistency [10], two key properties of transactions, so the programmer has to hand-encode consistency checking into the application. However, this type of optimization can also
break composability [10], another desirable property
of transactions; nesting transactions optimized in
such a way can lead to incorrect results. In this paper
we identify the composability problem associated
with optimizing Software Transactional Memory
(STM) [4], [5], [6], [7], [9], [12], [17] application
performance using application-specific knowledge.
We also propose an extension to STM systems
that addresses this problem. Our proposed extension
offers comparable flexibility in optimizing STM
performance while maintaining composability.
Why would anyone want to bypass the TM system by accessing the shared memory directly in the first place and lose isolation, consistency, composability and reuse guarantees? Would it not be a better idea to forego transactional memory completely and just design a non-blocking algorithm from the ground up, if performance is of such importance? The answer is that the TM system still guarantees transaction atomicity, a very important property, even in the presence of optimizations that bypass the TM system calls and read the shared memory directly. Ensuring application correctness without the atomicity guarantees provided by the TM system would require very significant effort on the part of the programmer — a non-blocking algorithm for even simple data structures such as queues is a publishable result, for example.

A. Performance Overhead in STM

Performance overhead comes from several different aspects of an STM system. For example, an STM system must perform additional work to guarantee consistency, even though consistency can be trivially guaranteed in a sequential program. In order to provide this guarantee, an STM system has to perform sufficient recording of transactional reads to validate them when needed. This additional work can sometimes entirely negate the initial purpose of using a parallel implementation: better performance [18]. The argument for parallelization does not hold if the performance of a parallel version trails that of a sequential counterpart that takes less effort to develop. Though a great deal of research has been conducted on designing efficient STM systems to reduce the overhead, the performance problem of software transactional memory is still far from solved.

Although part of the overhead of STM, such as consistency checking in general, is unavoidable due to the nature of concurrency, other parts arise because of the conservative assumptions made by the STM system. Designed as a general programming model, the STM system lacks application-specific knowledge. This lack of information forces the STM system to perform additional work that a more application-specific approach could avoid. For example, many STM systems keep a set of past transactional reads [9], [12] and validate that a transaction is still consistent by checking whether any of these variables has changed. If any of the past reads has changed, the transaction is deemed to be inconsistent and aborted. This type of validation is very conservative and gives false positives in many execution scenarios, since a change to a value that has been previously read by a transaction does not necessarily imply an ordering cycle among transactions. Consequently, the STM system will abort many transactions that could have been allowed to complete.

In timestamp-based STM systems such as TL II [3], the system often assumes that a conflict has occurred if an object being opened is newer than the transaction's timestamp, another conservative design. Conservativeness can also be found in other aspects of STM design. For example, some systems [16] mix timestamp-based and list-based validation to reduce the number of false positives.

Another part of the overhead comes from the optimistic nature of STM systems. The STM system allows multiple transactions to run concurrently, hoping that they will not conflict. But transactions do conflict and thus abort. Worse, the amount of conflict often increases with the number of threads concurrently running in the system, which in turn leads to yet more wasted work.

For some applications, the overhead from the STM system can be high enough to make programming based on STM less desirable. Consequently, for certain applications, the programmer may have enough performance incentive to design custom consistency guarantee mechanisms, effectively reducing part of the overhead associated with the STM system. Importantly, though the performance gains can be substantial, the work to apply such optimizations is not trivial and requires a good understanding of both the application and concurrent programming.

B. Optimization vs. Composability

In the course of optimizing STM applications by reading shared data directly, programmers can rely on the STM system for atomicity guarantees, but must rely on themselves for consistency and isolation guarantees. We note that atomicity is guaranteed provided that user-level code only reads directly from the shared data and all write access to shared data is performed via transactional interfaces. Guaranteeing atomicity for user-level direct write updates would require mechanisms such as user-level cleanup procedures, which is beyond the scope of this work.
There is a lurking composability problem associated with the existing practice of optimizing
transactions by performing direct reads to shared
memory (which we explain in Section III). When
a transaction is optimized based on application-specific information this way, it can no longer be
safely composed to write larger transactions. In
order to compose these optimized transactions, a
developer would need to both understand the implementation details of the optimized transaction and
reason about the correctness. This is very unfriendly
to a software developer.
To meet the need of further optimizing STM application performance, we propose STM system extensions that enable programmers to perform such optimizations. These extensions hide the implementation details, so the programmer can reuse the optimized transactions by understanding only the programming interface, not the actual implementation details. Importantly, these extensions
provide composability support and thus allow optimized transactions to be composed into larger ones.
We illustrate the composability problem itself in Section III. In Section IV we present our proposed interface extensions. We then show the experimental results in Section V. Finally, we discuss the results and conclude in Sections VI and VII, respectively.
II. RELATED WORK
Improving STM performance has recently been
a focus of numerous research efforts. Researchers
have explored different STM designs, careful performance tuning of the system, and different validation techniques such as Lazy Snapshot [16],
Transactional Locking II [3], etc. These efforts focus
primarily on pure system performance, allowing the
programmer to write transactions following strict
transaction specifications. However, there are also
other types of work that consider approaches to
performance improvement through relaxing some of
the transactional properties and extending the tools
programmers can use to encode program-specific
knowledge in their transactional applications. Using
these techniques sacrifices some programmability
for better performance. Since these techniques do not strictly conform to the TM requirements, programmers
need to spend more time and effort on reasoning
about correctness of their applications. These techniques include early release [9], open nesting [13],
[14], [15], and lowering the overhead of shared
memory access [2].
Early release is a technique first proposed by
Herlihy, et al. [9] to reduce contention in their
DSTM system. Early release allows a transaction
to release some transactional reads before the transaction commits. A transactional read, once released,
does not conflict with other concurrent transactions
or incur the associated validation overhead. This
reduces the probability of aborting the transaction
and improves the overall performance. Early release
requires more programming effort because it lays
the burden of ensuring there are no conflicts on
the programmer. Besides the programming effort,
transactions using early release are not composable,
so this technique affects program modularity as
well. For example, consider a transaction A that verifies whether a node exists in a sorted linked list. It can traverse the list and early-release every node it opens. Thus, when transaction A finishes, all reads have already been released.
If the information gathered in transaction A is used
in a later nested transaction B, the later transaction
can no longer validate if the condition is still true
because all relevant nodes have been released.
Open nesting [13], [14], [15] is a technique to
exploit a higher semantic level of concurrency than
is defined at the physical memory access layer
where TM systems usually reside. It allows transactions to commit even in the presence of a physical
conflict not affecting the application’s semantics.
The system usually needs to support necessary
virtual constructs (not necessarily locks) that can be
mapped to the semantic level where the significant
conflicts actually happen. A TM system supporting
open nesting works with these high level constructs
rather than constructs associated with raw memory
access. For example, consider a nested transaction
composed of two transactions, each of which inserts
a node to a list. If the application only requires
that these two transactions either both complete
or neither of them complete but does not care
if another transaction intervenes in between, open
nesting can be used to improve performance. Similar
to early release, using open nesting requires deep
understanding of the application’s semantics and is
more difficult than a pure transactional approach.
Moreover, open nesting is not composable.
SNAP is a low overhead interface for shared
memory access [2]. It provides functions to get,
validate and upgrade the snapshot of an object.
SNAP read, unlike a regular transactional read,
does not involve bookkeeping of any information.
Instead, the read returns a snapshot of the object.
SNAP validation can verify if a snapshot held by
the program is still valid. The programmer can also
upgrade a read snapshot to a write snapshot when
needed. Memory accesses in SNAP mode do not
suffer the transactional overhead of recording the
reads and validating them. This is possible through
additional programming effort of making correctness guarantees at the application level. It gives the
programmer the flexibility to optimize some memory accesses. The programmer explicitly manages
the way a memory location is accessed and the
switch between transactional and non-transactional
modes. SNAP access mode is not composable either.
III. COMPOSABILITY PROBLEM

In this section, we identify the composability problem in the user practice of improving an STM application's performance by partially bypassing the TM system's consistency guarantee mechanisms. This type of user practice integrates application-specific knowledge into the code in order to recover performance lost due to the conservativeness of the STM system. For example, a programmer may choose not to use the general validation technique of the TM system, but instead to read directly from the shared memory and maintain the necessary data structures himself to ensure the correctness of the transaction. The programmer can still freely rely on the TM system for other guarantees such as atomicity and consistency of the transactional reads. One motivating example of this composability problem is the Labyrinth application in the STAMP [1] benchmark suite.

The Labyrinth benchmark uses Lee's algorithm [11] to find routes for a set of sources and destinations in a three-dimensional matrix modeled after circuit board routing. For each pair of source and destination, the routing process expands its neighbor frontier till the destination is found. Then the routing process rewinds back to the source for a feasible route. The benchmark handles multiple pairs of sources and destinations in parallel. A straightforward pure TM implementation conducts the computation in the shared matrix, and generates many transactional reads along the search from the source to the destination. Given multiple concurrent threads, these reads have a high probability of conflicting with transactional writes and greatly decrease performance. In order to improve performance, the benchmark is implemented so that the matrix is first copied to a local array through non-transactional reads, and then all subsequent computation is done on the local array. The implementation employs a user-customized consistency checking algorithm. The consistency algorithm is used when writing the routes found back to the shared memory. The customized consistency checking only needs to verify the nodes on the route and therefore greatly reduces the contention level and achieves much better performance.

Figure 1 is a pseudo code sketch of the algorithm used in Labyrinth. In the beginning of the transaction, a local matrix is created by copying from the global matrix. The read from the global matrix is a direct memory read that does not use the TM interface. When a route is written back to the shared memory, the program only writes it back when all cells on the route have not been changed in the meantime. The check of whether the cells have changed is done by the programmer.

atomic {
  copy global_matrix to local_matrix;
  find a route in local_matrix;
  if a route is found
    add the route found to global_matrix
}
Fig. 1. Labyrinth Transaction Pseudo Code

This transaction is well formed to run concurrently with other transactions even though it reads from the global matrix non-transactionally. The transaction is atomic, consistent and isolated within the application's context because the programmer carefully manages the consistency issues. But the composability problem arises when several of these transactions are merged into a larger transaction. For example, if the user wants to route a pair of sources and destinations in one transaction, he cannot use a nested transaction that contains two of the routing transactions from Labyrinth. One of the problems is that in a delayed update TM model, the updates of the first transaction are delayed until the top level transaction commits. This makes the read of the array in the second transaction get the stale value and leads to incorrect results.
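As a concrete illustration of this user-level check, the short C sketch below validates every cell on a found route against the private copy before writing the route back. The flat-array grid, the restart-by-return-value convention, and all names here are our own illustrative assumptions rather than the actual STAMP code.

/* Minimal sketch (not the actual STAMP code) of the user-level consistency
 * check Labyrinth performs before writing a found route back to the shared
 * grid. The flat-array grid and all names here are illustrative assumptions. */
#include <stdbool.h>

enum { GRID_CELLS = 512 * 512 * 7 };

long shared_grid[GRID_CELLS];   /* shared maze, normally updated through the TM */
long local_grid[GRID_CELLS];    /* private copy taken with direct reads */

/* Returns false if any cell on the route changed since the local copy was
 * taken; the caller then restarts the transaction and re-routes. Otherwise
 * the route is written back (TM_WRITEs in the real benchmark). */
bool try_write_back(const int *route, int route_len, long route_mark) {
    for (int i = 0; i < route_len; i++)
        if (shared_grid[route[i]] != local_grid[route[i]])
            return false;       /* user-level validation failed */
    for (int i = 0; i < route_len; i++)
        shared_grid[route[i]] = route_mark;
    return true;
}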
Below, we will take a look at a simple linked
list example to illustrate the composability problem
more clearly. Lists are frequently used in transactional applications. For example, both Genome and
Intruder benchmarks from the STAMP benchmark
suite [1] use a linked list and it is reasonable to
expect that the programmer would want to optimize
the list transactions in ways similar to the one
explained above. Unfortunately, list transactions are
very often called within nested transactions. User
optimizations would very likely break the composability of the transactions, making the code with
nested transactions incorrect. This is exactly where
the composability support described in this paper
can help the programmer.
For example, imagine a set data structure based
on a sorted linked list that supports three transactions: insert, remove, and lookup, where the insert
transaction incorporates two steps. The first step
is to iterate through the list to find the location
to insert the new node. The second step performs
node insertion. A very straightforward approach,
illustrated by the pseudo code in Figure 2, would
be to use pure transactions and open every shared
object transactionally.
While this is a very straightforward and simple
way of creating such transactions, this approach
does not scale very well with multiple threads. The
main reason is that the nodes opened up to the
insert point will incur many conflicts with other
transactions, even though many of these conflicts
can be shown to be benign. Knowing this, one
of the ways to improve performance is to use a technique similar to the lazy concurrent list-based set algorithm developed by Heller et al. [8].
In the optimized version shown in figure 3, the
code searching for the location to insert the node
reads the intermediate nodes directly (without going
atomic insert(List* list_ptr, int key) {
  Node* prev = TM_READ(list_ptr->head);
  Node* curr = TM_READ(prev->next);
  while (TM_READ(curr->key) < key) {
    prev = curr;
    curr = TM_READ(curr->next);
  }
  if (TM_READ(curr->key) != key) {
    Node* new_node = TM_MALLOC(sizeof(Node));
    new_node->next = curr;
    new_node->key = key;
    TM_WRITE(prev->next, new_node);
  }
}
Fig. 2. Insert - Pure TM version
through the TM system calls). The correctness of
this approach is ensured by validating that the
two nodes neighboring the insertion point did not
change during the insertion by using an additional
field, marked, in each node that indicates whether
a node has been removed or not. The optimized
algorithm relies on the TM system to make sure
the neighboring two nodes are consistently opened.
The correctness is ensured by additional consistency
checking after the neighboring nodes have been
opened transactionally.
As we will illustrate later in Section V, the differences in overall application performance between
accessing the data directly and accessing through
the transactional interface can be very significant,
suggesting that the optimized version is the way to
go. Unfortunately, even though the insert transaction
works well by itself within the context of the list
based set, it is not composable. For example, the
programmer may want to have two insert transactions in the nested transaction as in figure 4.
Another example is that the programmer may
want to insert a node to the set when such a node
does not exist in the set and do this in a single
transaction. The transaction will look similar to
figure 5.
In a pure TM implementation, the above two
examples will work as expected. However, neither
of the above nested transactions work correctly
when the optimized versions of insert and lookup
are used. There are two reasons that make the
optimized version uncomposable.

atomic insert_optimized(List* ptr, int key) {
  Node* prev = ptr->head;
  Node* curr = prev->next;
  while (curr->key < key) {
    prev = curr;
    curr = curr->next;
  }
  bool_t p_m = (bool_t)TM_READ(prev->marked);
  bool_t c_m = (bool_t)TM_READ(curr->marked);
  Node* next = TM_READ(prev->next);
  if (!p_m && !c_m && next == curr) {
    if (TM_READ(curr->key) != key) {
      Node* new_node = (Node*)TM_MALLOC(sizeof(Node));
      new_node->next = curr;
      new_node->marked = FALSE;
      new_node->key = key;
      TM_WRITE(prev->next, new_node);
    }
  } else {
    TM_RESTART();
  }
}
Fig. 3. Insert - Uncomposable version

atomic {
  insert(x);
  insert(y);
}
Fig. 4. A Composed List Insertion Example

atomic {
  if (lookup(value) == FALSE)
    insert(value);
}
Fig. 5. A Composed Conditional List Insertion Example

The first problem arises in delayed-update STM systems [9], [12]. In a delayed-update STM system, transactional writes are first made to a cached copy rather than into the shared variable. The cached copy is only committed, and made visible to other threads, when the transaction commits. In the nested transaction, transactional writes in the first insert are not committed until the entire nested transaction commits. So the direct reads in the second insert read stale values of the transactional writes in the first insert. Consequently, the computation of the nested transaction is no longer consistent. In comparison, in the pure TM version, all reads from the shared memory are transactional. Transactional reads respect the read-after-write dependencies across transactions and therefore do not incur this problem. We call this problem the hidden update problem, and it is tied to the implementation choice of using delayed updates within the STM. The hidden update problem does not occur in STM systems that use eager updates, which immediately apply the transactional writes.

The hidden update problem can be seen more clearly in the following, even simpler example in figure 6.

atomic A {
  local_a = global_a;
  local_a++;
  TM_WRITE(global_a, local_a);
}
atomic B {
  A;
  A;
}
Fig. 6. A Hidden Update Example

Transaction A reads the shared global_a variable and saves it to a local copy local_a through a direct memory read. It then increments local_a and writes the new value to global_a using a transactional write. Note that the read of global_a is not through the TM interface and therefore does not suffer the associated performance overhead. Let us also assume that the application's semantics allows transaction A to be implemented this way so that it does not interfere with other transactions in the application.

The hidden update problem arises when we nest two calls to transaction A in a nested transaction B. Suppose global_a is initialized to 0. The expected value of global_a is 2 after executing transaction B. But the second A's read of global_a returns the stale value 0; the final result of transaction B will be 1.

The second problem is a consistency issue. Direct reads do not force bookkeeping of any information in the transactional system. Therefore later transactions are not able to validate the consistency of previous optimized nested transactions. For example, consider a nested transaction in figure 5 that
is composed of one lookup transaction and one
insert transaction. This transaction inserts node x
only when lookup(x) indicates x is not already in
the list. Note that all direct reads in lookup(x) are
not recorded in the transaction and therefore cannot
be validated later in insert(x). insert(x) is dependent
on the validity of the condition of lookup(x), so
if another concurrent transaction inserts x into the
list after lookup(x) but before insert(x), the resulting list
will have a duplicate element. The nested transaction is not able to discover this inconsistency. A
pure TM implementation does not have this problem
since all necessary information is recorded for all
transactional reads and later nested transactions are
able to validate the consistency of previous transactions.
To summarize, allowing optimizations that bypass
TM system calls and access the shared memory
directly breaks transactional composability because
the read-after-write and write-after-read dependences
might not be respected properly when such transactions are nested.
Even though the read-after-write problem only
occurs in the STM systems with a delayed write,
we will address the composability for such systems
as well, since delayed writes are frequently used
in existing STM systems because they have the
advantage of having a smaller conflict window and
enable greater potential concurrency.
IV. FAST READ INTERFACE EXTENSION

To meet both the need of optimizing the performance of STM applications using application-specific knowledge and the need of providing composability for optimized transactions so they can be more widely reused, we propose to extend the transactional memory programming interface with two additional operations. We introduce these two operations and their semantics next. We designed these extensions to maximize the extent to which programmers can benefit from the optimizations embedded in the optimized transaction.

• TxFastRead encapsulates a fast read operation from shared memory. It also provides a hook for the operations necessary to guarantee composability. TxFastRead provides comparable performance to a raw read of a shared memory location. TxFastRead alone does not completely guarantee the consistency of fast reads; rather, it must work together with our TxFlush extension operation. TxFastRead should be used at every place where a direct shared read would have been used in an optimized transaction. TxFastRead does not employ the same heavyweight bookkeeping as does a regular transactional read, but adapts to execution context instead. It incurs no performance penalty in non-nested contexts, yet provides composability when nested. Full details of this operation may be found later in this section.

• TxFlush is the counterpart operation for TxFastRead. First, it ensures that read-after-write dependencies are respected. Second, it provides the necessary operations to guarantee consistency. TxFlush should be placed at places where these properties might be broken; typically, this is immediately before and after an optimized sub-transaction.

A. Composability Mechanisms

We experimented with two mechanisms for ensuring composability; we call them the lookup scheme and the partial commit scheme. With both, fast-path TxFastReads return the shared value without any additional bookkeeping or validation. On the slow path, TxFastRead does perform some bookkeeping and validation. We carefully designed these two schemes so that the optimizations applied by the programmer can be preserved as much as possible. Within the optimized transaction, even when it is in a nested transaction, the TM system validation does not validate the fast reads. These fast reads are only validated when they are merged into the transactional read set. Therefore we expect fewer conflicts in the optimized transactions even when they are used as nested transactions.

Lookup Scheme: Our lookup scheme solves read-after-write dependencies by searching previous transactional writes. It ensures read consistency by recording the transactional reads in a fast read set. This scheme does not commit transactional writes before the entire nested transaction commits; therefore, transactional writes are not committed to the shared memory when TxFlush is called. In this scheme, TxFastRead either looks up the address from within the write list or reads directly from shared memory. In particular, in the context of a nested transaction, TxFastRead first searches the write list; however, in a non-nested context, this is skipped and the read is performed directly from shared memory. If the transaction is nested and the address is not found in the write set, then the transactional read set needs to be validated (note that the fast read set is not validated here).

Figure 7 shows the pseudo code of the TxFastRead function in the lookup scheme.

TxFastRead(addr) {
  if not nested
    return *addr;
  else if addr is in write set
    return value in write set;
  else if validate succeeds
    record in fast read set;
    return *addr;
  else
    abort;
}
Fig. 7. Fast Read in Lookup Scheme

In the lookup scheme, TxFlush merges the fast read set into the read set. Following nested transactions can validate the consistency of the enclosing transaction by validating only the read set. This addresses the inconsistency problem in nested transactions. Figure 8 shows the pseudo code for the TxFlush operation in the lookup scheme.

TxFlush() {
  merge fast read set to read set;
  clear fast read set;
}
Fig. 8. Flush in Lookup Scheme

Partial Commit Scheme: Our partial commit scheme (PCM) solves the read-after-write dependencies by eagerly updating shared memory when a nested transaction commits. It solves the inconsistent read problem by recording fast reads.

In PCM, if the transaction is not nested, the shared value is returned directly. Otherwise, fast-read operations first check if the address to be read is locked. If it is locked, the transaction either aborts or waits for a while; otherwise, the fast read performs the necessary validation and returns the shared data or aborts. In the partial commit scheme, the read does not search the write set. This is particularly useful when the write set is large and there are many variables in the write set. Note that the fast read is not a transactional read because it does not provide a mechanism to guarantee consistency with other reads. The flush operation performs a partial commit that commits all pending writes to shared memory and locks them. It also merges the fast read set into the transactional read set. The nested transactions also need to save the information necessary to clean up the committed writes if the entire transaction fails. Figure 9 shows the pseudo code for the TxFastRead operation in the partial commit scheme.

TxFastRead(addr) {
  if not nested
    return *addr;
  else if addr is locked
    cleanup partial commits;
    abort;
  else if validate succeeds
    record in fast read set;
    return *addr;
  else
    cleanup partial commits;
    abort;
}
Fig. 9. Fast Read in PCM Scheme

In PCM, TxFlush not only merges the fast read set into the read set, but also commits the write set to shared memory. The transaction holds locks for the committed writes, so accesses from other threads will detect a conflict and abort. Figure 10 shows the pseudo code for the TxFlush operation in the partial commit scheme.

TxFlush() {
  merge fast read set with read set;
  clear fast read set;
  commit current transactional writes;
}
Fig. 10. Flush in PCM Scheme

Both schemes have their advantages and drawbacks. On the fast path (when the transaction is not nested), both schemes perform similarly. They
both perform a direct read of the shared memory
and return that value. The difference is in their
slow path. The lookup scheme’s slow path needs
to search the write set for every fast read so it suffers the associated performance penalty. The PCM
scheme’s slow path does not search the write set
and therefore can be faster here especially when a
large write set needs to be searched. In the lookup
scheme, transactional writes are only locked when
the enclosing transaction commits and therefore the
locks are held for a shorter period of time. This can
reduce the contention over the locks. In the partial
commit scheme, transactional writes are committed
in steps when each nested transaction commits.
Therefore the locks of early transactions are held
for a longer time compared to lookup mode. This
can lead to higher contention on the locks.
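To connect Figures 7 and 8 back to the hidden update example of figure 6, the following self-contained, single-threaded C sketch implements a toy version of the lookup scheme. The data structures, the nesting_depth flag, and the absence of real validation and locking are simplifying assumptions of ours, not the TL II implementation. Because the slow path of TxFastRead consults the write set, the second optimized sub-transaction observes the buffered write and the composed result is 2 rather than 1.

/* Toy lookup-scheme sketch: the fast path reads raw memory, the slow path
 * honors read-after-write dependencies via the write set and records fast
 * reads so that TxFlush can merge them into the transactional read set. */
#include <stdio.h>

#define MAX_ENTRIES 64

typedef struct { int *addr; int value; } WriteEntry;

static WriteEntry write_set[MAX_ENTRIES];  static int n_writes = 0;
static int *read_set[MAX_ENTRIES];         static int n_reads = 0;
static int *fast_read_set[MAX_ENTRIES];    static int n_fast_reads = 0;
static int nesting_depth = 0;              /* > 0 inside a nested transaction */

static int TxFastRead(int *addr) {                 /* cf. Fig. 7, validation elided */
    if (nesting_depth == 0)
        return *addr;                              /* fast path: raw read */
    for (int i = 0; i < n_writes; i++)             /* slow path: search write set */
        if (write_set[i].addr == addr)
            return write_set[i].value;
    fast_read_set[n_fast_reads++] = addr;          /* remember for merging */
    return *addr;
}

static void TxFlush(void) {                        /* cf. Fig. 8 */
    for (int i = 0; i < n_fast_reads; i++)
        read_set[n_reads++] = fast_read_set[i];    /* merge into read set */
    n_fast_reads = 0;
}

static void TxWrite(int *addr, int value) {        /* delayed update into write set */
    write_set[n_writes].addr = addr;
    write_set[n_writes].value = value;
    n_writes++;
}

static int global_a = 0;

static void txn_A(void) {                          /* optimized sub-transaction */
    int local_a = TxFastRead(&global_a);
    TxWrite(&global_a, local_a + 1);
}

int main(void) {
    nesting_depth = 1;                             /* compose two calls to A */
    txn_A();
    TxFlush();                                     /* conceptually between sub-transactions */
    txn_A();                                       /* finds the buffered write: no hidden update */
    TxFlush();
    printf("buffered global_a = %d (expected 2)\n", write_set[n_writes - 1].value);
    return 0;
}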
TxFlush can be exposed to the programmer or inserted by a compiler after every nested transaction. While exposing it may allow the programmer to identify the cases where TxFlush is not necessary
and further optimize the performance, it could also
increase complexity of the code and the possibility
of writing erroneous code.
B. Composability vs. Reuse
We would like to point out here that the techniques we have described in this paper do not
guarantee full reuse of the transactions, only composability. Composability is only a part of the reuse,
even though a very important one. While this can
be seen as a limitation of our approach, we want to
point out that hand-optimized transaction code that
bypasses the TM interface also cannot be reused in
general, within a nested transaction or otherwise.
Therefore our original claim that we provide composability for optimized code that bypasses the pure
TM interface still stands.
To illustrate the reuse issue, let us consider the
sorted linked list described earlier. It supports three
transactions: lookup, insert, and remove. Suppose
the programmer wants to add a new increment
transaction that increments a value of a node by
1 (swapping it with the next node if necessary to
preserve the order). Even if this new transaction is
implemented using the pure TM interface without
any optimizations, it will still break the existing
hand-optimized code. The reason is that adding a
transaction that changes a node’s value breaks the
assumptions made in the original hand-optimized
code, namely that only insert, remove and lookup
can be performed on the list, which made the
optimization possible. This is true even if the programmer has used our extended TM interface to implement the optimizations. The programmer would
have to revisit the assumptions made about the
whole application and re-implement the optimized
transactions with the new increment transaction in
mind.
However, we also note that our TM interface
extension does allow for the increment transaction to
be implemented by simply making it a composition
of a remove and an insert transaction. No changes
to the existing code would be necessary.
Our extension still enables a significant amount
of reuse, with a restriction that the new code does
not contain transactional writes. If the added transactions only perform transactional reads and/or call
the existing transactions, everything will perform
as expected. Otherwise, a programmer will have to
revisit the assumptions about the whole application
and re-implement the optimized transactions.
C. Composability Guarantees
As discussed in Section III, the composability
problem can arise due to two issues — ignored read-after-write dependences and consistency violations.
Here, we provide an informal argument as to
why our extensions guarantee transaction composability. The argument for general composability of
transactions in our scheme can be made inductively
from the four claims below for two transactions that
are nested within a larger one. When composing
new transactions, transactional reads in the new
code containing calls to the nested transactions can
be considered as simple and very small transactions. Transactional writes are not allowed in the
new code, as discussed in the previous subsection.
Following the strict nested transaction semantics,
larger transactions containing more than two nested
transactions can be rewritten with deeper nesting
so that every level of nesting contains only two
transactions. There is (conceptually) a TxFlush at
the beginning and at the end of every transaction
that is optimized by using fast reads.
1) Pure transaction followed by a pure transaction does not violate consistency or read-after-write dependencies. This is trivially true in all
STM systems.
2) Pure transaction followed by an optimized
transaction. The reads in the pure transaction
are recorded in the read set, which is validated
later in the optimized transaction on every fast
read. The writes within the pure transaction
will be kept in the write set in the lookup
scheme, which will be validated (in both
schemes) against fast reads in the following
optimized transaction. The writes from the
pure transaction will be committed to shared
memory in the partial commit scheme during
the TxFlush between the two transactions, so they will be locked against access from other
threads, but can be accessed (either through
fast reads or through transactional reads) from
the following optimized transaction.
3) Optimized transaction followed by a pure
transaction. In both schemes, the TxFlush
between the transactions will merge the fast
reads from the optimized transaction into
the pure transactional read set, so the pure
transaction will validate all memory accesses
against those reads. The writes in both the
optimized and the pure transaction are transactional so they will be either kept in the write
set or partially committed in the TxFlush
operation and validated accordingly in the
pure transaction.
4) Optimized transaction followed by an optimized transaction. The fast reads from the first optimized transaction will be merged into the pure read set during the TxFlush operation, so the fast reads within the second optimized transaction will be validated against them. Transactional writes can be shown to be handled correctly in the same way as in cases 2) and 3) above.
V. EXPERIMENTAL RESULTS
We conducted our experiments on six benchmarks: Labyrinth, Genome, Intruder, Vote, sorted
linked list, and nested sorted linked list. The experiments are performed on a 16-core SMP machine
with four quad-core Intel Xeon CPU E7330 running
at 2.40GHz. Three of the benchmarks in our experi-
ments - Labyrinth, Genome, and Intruder - are from
the STAMP [1] benchmark suite. The other three
were developed previously by the authors [19].
We have (where applicable) four versions of
each benchmark — pure TM, uncomposable, lookup
scheme, partial commit scheme. We refer to the
version that strictly follows TM requirements as the
pure TM version. In the pure TM version, every
read from or write to the shared memory passes
through the TM interface. The pure TM version
enjoys all the benefits from the TM system and
requires the least effort to develop. We refer to the
version using direct shared memory reads as the
uncomposable version. The programmer is responsible for ensuring correctness through implementing
isolation and consistency by hand. This version requires much more programmer effort. The versions
that use our proposed fast read interface include the
lookup scheme and partial commit scheme. In both
schemes, the direct accesses in the uncomposable
version were replaced with the calls to the fast read
interface. Similar to the uncomposable version, the
programmer needs to guarantee the correctness. But
unlike the uncomposable version, the transactions
using our fast read interface can still be composed
into larger transactions.
We implemented our extension on top of TL
II [3]. All of the experiments are performed using
the lazy acquire mode in TL II.
• The Labyrinth benchmark is a maze routing
application. It uses Lee’s algorithm [11] to
find the shortest-distance routes for a set of
sources and destinations in a three dimensional
matrix. The algorithm expands from the source
point using a breadth first search. The search
is guaranteed to reach the destination, and a reverse search back toward the source then forms the route. The
maze size used in the benchmark is 512*512*7.
The routing process involves many memory
writes that can have an enormous impact on
performance if all are performed in the shared
memory. In order to improve performance, the
program is optimized by first copying the matrix to a local array through non-transactional
reads and doing all subsequent computation
on the local array. When a route is found,
it is written back to the shared memory. The
version provided within STAMP benchmark
suite is the uncomposable version. We created
two composable versions by changing the local
array copy to fast reads. The pure TM version
is created by changing the local array copy to
transactional reads.
• The Genome benchmark is a gene sequencing
program. It takes a number of DNA segments
and matches them to reconstruct the original
genome. In phase one of the benchmark,
all segments are put into a hash set to remove
segment duplicates. To allow concurrent access, the hash set is implemented as a set of
unique buckets, each of which is implemented
as a linked list. In our experiments, we created
different versions of the Genome by modifying
the list data structure.
• The Intruder benchmark is a signature-based
network intrusion detection application. It
scans network packets for matches against a
known set of intrusion signatures. The main
data structure in the capture phase is a simple
non-transactional FIFO queue. Its reassembly
phase uses a dictionary (implemented by a self-balancing tree) that contains lists of packets
that belong to the same session. We modified
the list data structure used in the dictionary to
create the different versions of Intruder.
• The Vote benchmark is an application that simulates a voting process. It supports three operations — vote, count and modify. The underlying data structure maintaining the voting information is a binary search tree. Each node contains two fields — the voter and the candidate voted for by that voter. The vote(ssn, candidate) transaction casts a vote for a candidate on behalf of the voter with the given ssn. The count(candidate) transaction returns the total number of votes a candidate has received. The vote transaction is a nested transaction composed of two transactions, verify and cast_vote. The verify(ssn) transaction verifies whether the voter with a given ssn has voted. The cast_vote(ssn, candidate) transaction casts the actual vote if this voter has not voted yet. The modify(ssn, candidate) transaction changes a voter's vote to the new candidate if the voter
with a given ssn exists.

Fig. 11. Labyrinth (execution time in seconds vs. number of threads; uncomposable version, lookup scheme, and partial commit scheme)

Since vote(ssn, candidate) is a nested transaction, the uncomposable version does not work correctly
here. So we created three different versions
of the benchmark: pure TM (where shared
data is accessed through transactional memory
interface), and two versions that access the data
through our fast reads composable interface
(with both the lookup scheme and the partial
commit scheme as the underlying implementation). There are 65,536 possible unique voters.
The mix of operations of count, vote and
modify is 10%, 80% and 10%.
• The set benchmark is an application that implements a set using a sorted linked list. It
supports three operations and each is implemented as a transaction - insert, remove and
lookup. Insert(key) inserts a key to the
set. Remove(key) removes a key from the
set. Lookup(key) searches for the key in the
set. In our experiments, there are altogether
512 possible unique keys in the set. The operation mix of insert, remove and lookup is 10%,
10% and 80%.
• The nested set benchmark is a nested version of
the application above. It has three nested transactions: nested insert, nested remove, and nested lookup. Each of the nested transactions contains two of the corresponding single transactions within
it. For example, a nested insert transaction is
shown in figure 4. The number of unique keys
and operation mix is the same as the set above.
Fig. 12. Genome (execution time in seconds vs. number of threads; pure TM, lookup scheme, and partial commit scheme)

Fig. 13. Vote (execution time vs. number of threads; pure TM, lookup scheme, and partial commit scheme)

Fig. 14. Set (execution time per 1M transactions in seconds vs. number of threads; uncomposable version, lookup scheme, partial commit scheme, and pure TM)

Fig. 15. Nested Set (execution time per 1M transactions in seconds vs. number of threads; pure TM, lookup scheme, and partial commit scheme)

Fig. 16. Intruder (execution time in seconds vs. number of threads; pure TM, lookup scheme, and partial commit scheme)
VI. DISCUSSION
In Labyrinth, Set, Vote, and Genome we observe
that the optimized version’s performance improves
by a significant margin relative to a pure TM implementation. We also observe some performance
overhead compared to the direct read version in
cases where an uncomposable version exists.
Labyrinth is the application that shows the largest
performance gain, clearly illustrating that certain
types of applications can achieve enormous performance benefits by applying application-specific
optimizations. In fact, the pure TM version of Labyrinth runs so slowly that we were
unable to finish the experiments for cases with more
than 4 threads. The two thread case takes over a
day to complete (compared to about 50 seconds
for the hand-optimized uncomposable version and
the composable version using our TM interface
extension). This is because many of the transactional
reads in the pure TM version cause an immense
number of conflicts and effectively create a live lock
in the application. For this reason, we have omitted
the pure TM results for Labyrinth in Figure 11,
in order to illustrate the performance differences
between the uncomposable optimized version of the
benchmark and the composable version using our
TM interface.
In Genome, the list operations are either a stand
alone transaction or a part of a nested transaction
that can be easily shown to be independent of other
list operations. As shown in Figure 12, we observe
a performance improvement of 26% in the 4 thread
case using our lookup scheme. Across all thread
counts, we observe a performance improvement
ranging from 18% to 26%. The lookup scheme fares
better in this benchmark than the partial commit
scheme. The partial commit scheme encounters performance issues for more than 4 threads and the
results are clipped in Figure 12.
Vote has a nested transaction vote that is composed of a lookup transaction to verify whether
the voter has voted and an insert transaction that
casts the actual vote. Figure 13 shows the results.
We observe a performance improvement up to 150%
when comparing our composable version with the
pure TM version. There is no uncomposable version
of this benchmark. The results of lookup scheme
and PCM scheme are very close and are almost
overlapped on the graph.
For the set benchmark based on a sorted linked
list, Figures 14 and 15 show the results for single
and nested transactions. We observe up to a 1.8x
performance improvement over the pure TM version
with our lookup scheme. We also observe that the
performance overhead compared with the uncomposable version is up to 47% for the single thread
case. The overhead comes mostly from the extra work of maintaining the fast read set and merging it into the transactional read set. The performance overhead decreases as the amount of parallelism increases, which indicates better scalability. For the case of
nested transactions, the performance improvement
is up to 1.3x. Though the two nested transactions clearly have no dependencies, the benchmark is implemented by assuming they might and inserting TxFlush between them. Better performance could be achieved if the programmer were to take advantage of the fact that there is no dependence between the transactions and remove the TxFlush call.
The last benchmark, Intruder, does not show
performance improvement (Figure 16), but rather a
minimal (a couple of percent) performance degradation when compared to the pure TM version. The
reason is that the list operations in Intruder are only
used in nested transactions and infrequently at that.
The trade-off of spending additional time to set up
a separate read set and merge it to the transactional
read set does not pay off in this case. Note that an
uncomposable version of this benchmark does not
even exist.
Following are some of the design choices and
issues that arose while developing the TM interface
extension, in no particular order:
• Lookup Scheme vs. Partial Commit Scheme:
In our experiments, the lookup scheme outperforms the partial commit scheme in most cases.
The larger time window in which we hold
locks in the partial commit scheme increases
the possibility of a conflict. The advantage of
the partial commit scheme is that it does not
require later nested transactions to search the
past write set. For applications that feature an
expensive search process, we expect that the
partial commit scheme would achieve better
performance.
• Non-in-place update STM: We only evaluate an
in-place update TM system in our experiments.
For STM systems that do not use in-place
update, similar problems will arise when the
programmer wants to read the visible copy directly. We will explore non-in-place update
STM systems in the future.
• Fast Read Advantages and Disadvantages: Our
fast read extension addresses the problem that
application-specific optimized transactions do
not integrate with the STM runtime/library to
provide composability guarantees. Under the
hood, the extension provides the necessary link
between the programmer and the consistency
guarantee mechanism integrated in the STM
system. Therefore the places to use the extension are similar to ones where the original
optimizations apply. The major performance
incentive of the optimization is to eliminate
part of the time-consuming validation work
involved in a pure STM implementation. If
an application’s consistency depends on most
of its past reads, the work it can save could
become rather limited and might not be worth
the effort of developing an optimized version.
• Pure TM version implementation of Labyrinth:
The pure TM version of Labyrinth is implemented by first copying the entire matrix to a
local version transactionally. Then, routing is
performed in the local matrix. The route found
is written back to the global matrix. This is
very similar to the original algorithm used in
the Labyrinth benchmark.
A second approach would be to perform the
routing computation in the global matrix directly. The first step of the routing algorithm
is a breadth first search from the source to the
destination. In this phase, each cell visited
is marked with a distance to the source. The
first step stops when the destination is found.
The second step of the algorithm searches from
the destination backward to the source for the
shortest path from source to destination. The
third step is to clean up the marks left by the
first step. We expect that this approach would
perform just as badly as the first one, since the
updates of marks in the first and third step will
cause an enormous number of conflicts and the
application would be very vulnerable to live
locks. We implemented the first approach in
our experiments.
• TxFlush Programmer Visibility: For correctness, a TxFlush is required between every transactional write to a memory location loc and a subsequent fast read of loc. If TxFlush is exposed to the programmer, he/she can reason about the dependencies between nested transactions and only insert TxFlush where
strictly necessary. Alternatively, one can envision a compiler that inserts TxFlush before
and after every nested transaction that uses
the fast read interface, then analyzes the dependencies between shared memory accesses
and removes the unnecessary TxFlush calls,
further improving the performance of our proposed techniques. In this paper, we have used a
straightforward and conservative approach that
inserts TxFlush before and after every nested
transaction that uses the fast read interface.
VII. CONCLUSIONS AND FUTURE WORK
In this paper, we have identified that common
programmer practices in optimizing transactional
applications by bypassing TM system calls and
accessing the shared memory directly, in addition
to breaking consistency and isolation (which have
to be handled by the programmer manually), also
break another desirable property of transactions:
composability.
We have proposed two extensions to the TM system interface: T xF astRead and T xF lush. These
extensions enable the programmer to access the
shared memory in a much more controlled manner, without breaking transaction composability. We
have presented two techniques for implementing
these system extensions: lookup scheme and partial
commit scheme. We implemented these techniques
on top of the TL II software transactional memory implementation and demonstrated on a set of
benchmarks that we can obtain performance that is
competitive to the non-composable hand optimized
code, while preserving composability.
In summary, compared to the existing practices,
our system extensions require similar programmer
effort (the programmer still needs to manually
ensure isolation and consistency), provide similar
performance, and preserve composability.
For future work, we will investigate the composability issues in non-in-place updates for transactional writes. We will also explore compiler optimizations to further simplify the extended TM interface available to the programmer for optimizations.
REFERENCES
[1] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun.
STAMP: Stanford transactional applications for multiprocessing. In IISWC ’08: Proceedings of The IEEE
International Symposium on Workload Characterization,
September 2008.
[2] C. Cole and M. Herlihy. Snapshots and software transactional memory. In PODC Workshop on Concurrency
and Synchronization in Java Programs, 2004.
[3] D. Dice, O. Shalev, and N. Shavit. Transactional locking II. In Proceedings of the 20th International Symposium on
Distributed Computing, DISC 2006. Springer, Sep 2006.
[4] K. Fraser. Practical lock freedom. PhD thesis, Cambridge
University Computer Laboratory, 2003. Also available as
Technical Report UCAM-CL-TR-579.
[5] R. Guerraoui, M. Herlihy, and B. Pochon. Polymorphic
contention management. In DISC ’05: Proceedings of
the nineteenth International Symposium on Distributed
Computing, pages 303–323. LNCS, Springer, Sep 2005.
[6] T. Harris and K. Fraser. Language support for lightweight
transactions. In OOPSLA ’03: Proceedings of the 18th
annual ACM SIGPLAN conference on Object-oriented
programing, systems, languages, and applications, pages
388–402, New York, NY, USA, 2003. ACM Press.
[7] T. Harris, M. Plesko, A. Shinnar, and D. Tarditi. Optimizing memory transactions. SIGPLAN Not., 41(6):14–25,
2006.
[8] S. Heller, M. Herlihy, V. Luchangco, M. Moir, W. N. Scherer III, and N. Shavit. A lazy concurrent list-based set algorithm, 2005.
[9] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer
III. Software transactional memory for dynamic-sized
data structures. In PODC ’03: Proceedings of the twentysecond annual symposium on Principles of distributed
computing, pages 92–101, New York, NY, USA, 2003.
ACM Press.
[10] J. R. Larus and R. Rajwar. Transactional Memory.
Morgan & Claypool, 2006.
[11] C. Y. Lee. An algorithm for path connection and its
application. In IRE Trans. Electronic Computer, EC-10,
1961.
[12] V. J. Marathe, M. F. Spear, C. Heriot, A. Acharya,
D. Eisenstat, W. N. Scherer III, and M. L. Scott. Lowering the overhead of nonblocking software transactional
memory. In TRANSACT 06’: Proceedings of the Workshop on Languages, Compilers, and Hardware Support
for Transactional Computing, 2006.
[13] J. E. B. Moss. Open nested transactions: Semantics and
support. In WMPI Poster, 2005.
[14] J. E. B. Moss and A. L. Hosking. Nested transactional
memory: model and preliminary architecture sketches. In
OOPSLA Workshop on Synchronization and Concurrency
in Object-Oriented Languages (SCOOL), 2005.
[15] Y. Ni, V. S. Menon, A.-R. Adl-Tabatabai, A. L. Hosking,
R. L. Hudson, J. E. B. Moss, B. Saha, and T. Shpeisman.
Open nesting in software transactional memory. In
PPoPP ’07: Proceedings of the 12th ACM SIGPLAN
symposium on Principles and practice of parallel programming, pages 68–78, New York, NY, USA, 2007.
ACM.
[16] T. Riegel, P. Felber, and C. Fetzer. A lazy snapshot
algorithm with eager validation. In Proceedings of the
20th International Symposium on Distributed Computing,
DISC 2006, volume 4167 of Lecture Notes in Computer
Science, pages 284–298. Springer, Sep 2006.
[17] B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. C.
Minh, and B. Hertzberg. McRT-STM: a high performance
software transactional memory system for a multi-core
runtime. In PPoPP ’06: Proceedings of the eleventh
ACM SIGPLAN symposium on Principles and practice
of parallel programming, pages 187–197, New York, NY,
USA, 2006. ACM Press.
[18] R. M. Yoo, Y. Ni, A. Welc, B. Saha, A.-R. Adl-Tabatabai,
and H.-H. S. Lee. Kicking the tires of software transactional memory: why the going gets tough. In SPAA
’08: Proceedings of the twentieth annual symposium on
Parallelism in algorithms and architectures, pages 265–
274, New York, NY, USA, 2008. ACM.
[19] R. Zhang, Z. Budimlić, and W. N. Scherer III. Commit
phase variations in timestamp-based STM. In SPAA
’08: Proceedings of the twentieth annual symposium on
Parallelism in algorithms and architectures, pages 326–
335, 2008.