Noninvasive concurrency with Java STM

advertisement
Noni nvasi ve concur rency w it h Java ST M
Guy Korland1 and Nir Shavit1 and Pas cal Felb
e r2
Com puter Science Department
Uni versi ty, Israel ,
guykorla@post.tau.ac.il, shanir@post.tau.ac.il
1 Tel-Aviv
2Com
puter Science Department
Univers ity of Ne uchˆate l, Swi tze
rland,
pascal.felber@unine.ch
Abstra ct . In this pap e r we pres ent a com plete, compil er i ndep endent, Java
STM fram ework call ed Deuce , intended as a developm ent platform for s cal
able concurrent appli cati ons and as a re search to ol for de signi ng STM al
gorithm s. Deuce prov ides several b enefits over ex is ti ng Java STM
frameworks : it avoi ds any changes or addi tions to the JVM, i t do es not req ui
re l anguage ex tens ions or intrus ive APIs , and i t do es not i mp ose any m
emory fo otprint or GC overhead. To s upp ort legacy l ibrari es, Deuce dynam
icall y i nstruments clas ses at l oad ti me and uses an original “field-based” l o
cki ng strategy to im prove concurre ncy. Deuce als o provi des a s im ple i
nternal API al lowing different STMs algori thm s to b e pl ugged in. We show
em piri cal re sults that highl ight the scalabi li ty of our framework running b
enchm arks with hundreds of concurrent threads . Thi s pap er show s, for the
first tim e, that one can actuall y des ign a Java ST M w ith reasonabl e p erform
ance , wi thout compi ler supp ort.
1 Int ro ducti on
Multicore CPUs have b ecom e c ommonplac e, with dual-c ore s p owering almos t
any mo dern p ortable or de sktop c ompute r, and quad-c ore s b eing the norm for
ne w se rvers. While multicore hardware has b een advancing at an acce lerated
pace , software for multicore platform s see ms to b e at a cross roads .
Curre ntly, two dive rging programming metho ds are com monly used. The firs t
exploits c onc urrency by s ynchroniz ing multiple thre ads bas ed on lo cks . This
approach is we ll know n to b e a two-edged sword: on the one hand, lo cks give the
program mer a fine-graine d way to c ontrol the applications c ritical se ctions ,
allowing an e xp ert to provide great sc alability; on the other hand, b eca us e of the
risk of deadlo cks, starvation, priority inversions, etc ., the y im p ose a gre at burden
on non- exp e rt programm ers, ofte n leading them to pre fe r the safer but non-sc
alable alternative of coarse -graine d lo cking.
The sec ond m etho d is a shared-nothing mo de l com mon in Web base d
architec ture s. The applications usually c ontain only the bus ine ss logic ,
deployed in a containe r, while the s tate is saved in an exte rnal multi-versioned c
ontrol s ystem , such as databas e, me ss age que ue , or dis tributed cache.
While this m etho d re move s the burde n of handling c onc urrency in the applic
ation, it im p os es a huge p erform anc e impact on data acc es se s and, for m
any typ es of applications , is difficult to apply.
The hop e is that trans actional mem ory (TM) [11] will s im plify concurrent
programming by c ombining the des irable feature s of b oth me tho ds, providing
state-aw are shared memory with simple concurrency con tro l.
In the past se veral ye ars the re has b ee n a flurry of des ign and im plem entation
work on the s oftware transac tional m emory (STM) [21]. Howe ver, the state of the art
to day is less than app e aling. With the notable exc eption of transac tional C/C ++
com pilers [15], m ost ST M initiative s have remained academic e xp erime nts ,
applicable only to “toy applic ations”, and though we have learne d much from the pro c
es s of de veloping them, they have not reached a state that will allow them to b e se
riously fie ld tes ted. There are s everal re asons for this . Among the m are the
problematic handling of many feature s of the targe t language, the large p erform anc e
overhe ads , and the lack of supp ort for le gacy libraries . Moreover, many of the
published res ults have b ee n based on prototyp e s whose s ource c o de is
unavailable or p o orly maintaine d, making a reliable c omparison b etween the various
algorithm s and de signs ve ry diffic ult.
In this pap er, we intro duce Deuce , our nove l op e n- sourc e Java frame work for
trans actional me mory. Deuce has se veral de sired features not found in e arlier Java
STM fram eworks . As we dis cuss in Se ction 2, there c urrently do e s not e xis t an
efficient Java STM fram ework that de live rs a full s et of feature s and can b e added
to an existing applic ation witho ut changes to its c ompile r or libraries . It was not c
lear if one could build s uch an an effic ient “c ompiler inde p e ndent” Java ST M.
Deuce is inte nded to fill this gap. I t is non- intrus ive in the s ense that no m o
difications to the Java virtual machine (JVM) or e xtensions to the language are nec es
sary. It use s, by default, an original lo cking des ign that de tec ts c onflicts at the leve l
of individual fields without a significant incre ase in the m emory fo otprint (no extra
data is adde d to any of the c lass es) and there fore there is no GC ove rhead. This lo
cking scheme provides fine r granularity and b e tte r parallelism than former ob jec
t-base d lo ck de signs. Deuce provides weak atomicity, i.e., do e s not guarante e that
c onc urrent ac ce sse s to a share d me mory lo c ation from b oth ins ide and outside
a transac tion are consis te nt. This is in orde r to avoid a p e rform anc e p e nalty.
Finally, it s upp orts a pluggable STM back- end and an op e n and e asily extendable
archite cture , allowing re se arche rs to integrate and tes t the ir own STM algorithms
within the Deuce fram ework.
Deuce has b e en heavily optimize d for effic iency and, while the re is s till ro om
for im prove ments , our p erformance e valuations on s eve ral high- end machines (up
to 128 hardware thre ads on a 16-core Sun Niagara-bas ed m achine and a 96-c ore Az
ul machine) de mons trate that it sc ales we ll. Our b enchm arks show that it outp
erform s the main comp eting JVM-inde p endent Java STM, the DSTM2 frame work
[9], in many c ase s by two orde rs of m agnitude , and in general sc ales well on m any
workloads. This pap er shows for the first time that one can ac tually de sign a Java
STM with reas onable p erformance without c ompile r supp ort.
The Deuce fram ework has b ee n in deve lopme nt for more than two ye ars , and
we b e lie ve it has re ache d a le vel of maturity s uffi cient to allow it to b e use d by
develop ers with no prior e xp ertise in trans actional m em ory. It is our hop e that
Deuce will he lp demo c ratize the use of STMs among de velop e rs , and that its op
en-s ource nature will encourage STM the m to e xte nd the infras tructure with nove l
fe atures , as has b ee n the c ase , for instanc e, w ith the Jike s RVM [1].
The re st of the pap er is organize d as follow s. We firs t ove rview re late d work in
Section 2. Se ction 3 des crib es Deuce from the p ersp ec tive of the de velop e r of a
concurre nt application. I n Section 4 we disc us s the im ple mentation of the
framework, and the n show how it can b e extended, by the means of the STM
backends , in Sec tion 5. Section 6 pres ents a com panion front-e nd to Deuce that s
upp orts trans actional Java language extensions. Finally, in Sec tion 7, we e valuate
the p erformance of Deuce .
2 R elat ed wo rk
Several to ols have b e en prop ose d to allow the intro duc tion of STM into Java. They
diff er in their program ming interfac e and in the way the y are im ple mente d. One of
the first prop osals is Harris and Fras er’s CCR [8] that use d a C- bas ed STM imple
me ntation underne ath the JVM and a program matic API to us e STM in Java.
DST M [10] is anothe r pione ering Java STM imple mentation. It use d an explicit
API to dem arc ate trans actions and re ad/w rite s hare d data. Trans actional ob jec ts
had to additionally provide som e pre-defined functionality for cloning their state.
More rece ntly, the DSTM2 [9] frame work was intro duce d, prop osing a se t of
higher-le vel m echanisms to supp ort transac tions in Java. DSTM2 is a pure Java
library that do e s not require any change to the JVM nor to the com piler. It use s a s p
ec ial annotation to mark an atomic interfac e. Only a class that implem ents an
annotated inte rfac e c an partic ipate in a transac tion. This clas s is create d us ing a
sp ec ial fac tory and it can only supp ort trans actional acc es s to primitive typ e s or
other annotated c lasse s. Transactions must b e starte d us ing a Callable ob jec t.
This design cre ate s an API that, while p owe rful and e legant, requires imp ortant
refactoring of legac y co de, and intro duc es m any lim itations, s uch as the lack of
supp ort for e xis ting librarie s.
The LSA-STM [19] is a Java STM that als o relies primarily on annotations for TM
programm ing. Trans actional ob je cts, to b e acc es sed in the c onte xt of a trans
action, are annotated as @Transactional, and me tho ds are de clared as @Atomic.
Additional annotations can b e use d to indicate that s ome m etho ds w ill not m o dify
the state of the target ob ject, (re ad- only) or to sp ecify the b ehavior in cas e an exce
ption is throw n within a transac tion. Trans actional ob je cts are implem ented in the
LSA-STM us ing inheritanc e and must supp ort state cloning. As such it do es not fully
supp ort le gacy co de, and unlike Deuce is not transparent or non- invasive. Among
the limitations of the fram ework, one can me ntion that acc es se s to a field are only
supp orte d as part of metho ds from the owner class . The re fore , public and static
fields cannot b e acc es se d trans actionally. Note that this limitation also applies to
other Java STM imple me ntations that use ob jec t-leve l c onflic t dete ction such as
DSTM and DST M2.
AtomJava [12] take s a different approach to Java STM des ign by providing a
source -to-s ource translator bas ed on Polyglot [16], an exte ns ible c ompile r
framework. AtomJava adds a new atomic keyword to mark atom ic blo cks , and p e
rforms s ome ma jor extensions during co de trans formation, such as adding an e
xtra instance fie ld to e ach and every c las s. Atom Java re lie s on this field to
maintain an ob jec t-bas ed lo ck schema. This schema is s how n to re duce lo ck ac
ce ss ove rhead, but imp ose s a mem ory ove rhead which impacts not only transac
tio nal ob je cts but all the application ob je cts. The s ource co de translation
approach s im plifie s the tuning and ve rification of STM ins trume ntation, but it m
akes it harder for the programm er to debug her original
co de. Also, by translating s ource c o de , Atom Java im p os es a ma jor lim itation on
us ers in that it pre vents the m from us ing com piled libraries, and re nders the ir sourc
e co de with atomic blo cks incompatible with re gular Java c ompile rs (unlike
approaches such as DSTM2 and LSA- ST M that are bas ed on annotations ).
Multiverse [14] is another STM implem entation for Java that p erform s instrum
entation drive n by annotations. Similarly to Deuce , Multiverse supp orts field-bas ed
conflic t de tection granularity, but fields can b e monitored as part of a transac tion only
if the ir ow ne r clas s was annotate d as @TransactionalObject. This require s the
develop er to b e aware of all the data s tructure s that might b e acc es se d as part of
a trans action. Neve rthele ss, this approach provides Multive rs e with the ability to
maintain s trong atom icity by pre venting any acce ss to transactional ob jec t memb e
rs that is not part of a trans action’s c onte xt. Multive rse provides s everal other fe
atures , such as a retry/ orelse construct, diff erent leve ls of optim istic and p e ss im is
tic b ehavior, and limited c ommuting op e rations.
Mo dific ations to the JVM have b ee n prop os ed [22] to replace Java m onitors by
trans actions . Atomos [4] is a Java extension that replac es the synchronized keyword
by atomic, and wait/notify op e rations by retry state ments . Such c onve rs ions are
quite c omplex, b ec aus e the trans forme d c o de must maintain the sam e s emantics
as the original c o de, and has to s upp ort op erations like irre vo c able actions or s
ignaling. In the end, the dec is ion to replace a monitor dep ends on the tradeoff b
etwee n STM ove rhead and the gain in paralle lis m.
There exist se veral other STM implem entations for various programm ing
languages. An exhaustive list is outside of the sc op e of this pap e r, but the inte re ste
d readers can consult the online TM bibliography available from the We b page at
http://www.cs.wisc.edu/trans- memory/biblio/index.html.
3 C oncur re nt Pro gr am mi ng w it h D EUCE
One of the main goals in des igning the Deuce API was to ke ep it s im ple . A s urvey
of the past des igns (se e previous s ection) reveals thre e m ain approache s: (i)
adding a new re se rve d keyword to m ark a blo ck as atomic , e.g., atomic in
AtomJava [12]; (ii) us ing e xplicit m etho d c alls to dem arc ate trans actions and acc
es s s hare d ob je cts, e.g., DST M2 [9] and CCR [8]; or (iii) requiring atom ic c las se s
to im plem ent a pre de fine d interfac e or extend a base class , and to impleme nt a s
et of metho ds [10]. The firs t approach require s m o difying the com piler and/or the
JVM, w hile the othe rs are intrusive from the program me r’s p ersp ec tive as the y re
quire significant mo difications to the co de (eve n though some sys te ms us e se
mi-automatic weaving m echanism s such as asp ect- orie nte d program ming to e ase
the task of the program me r, e .g., [19]).
In c ontras t, Deuce has b e en des igne d to avoid any addition to the language or
changes to the JVM. In partic ular, no sp ecial c o de is ne ce ssary to indic ate w hich
ob jec ts are transac tional, no me tho d ne eds to b e im plem ented to supp orting
trans actional ob je cts (e .g., cloning the s tate as in [10]). This allows Deuce to se
amles sly supp ort transactional acce ss es to compile d libraries . T he only pie ce of
information that mus t b e sp ec ifie d by the programm er is , obvious ly, which part of
the c o de s hould exe cute atom ic ally in the c ontext of a trans action.
To tha t end, Deuce relies on Java annotations . Intro duce d as a new fe ature in
Java 5, annotations allow programm ers to mark a m etho d with me tadata that can
be
consulted at c las s loading tim e. Deuce intro duce s ne w typ e s of annotations to
mark me tho ds as atomic : the ir e xec ution will take place in the context of a
transaction.
This approach has se veral advantages , b oth technica lly and se mantic ally. First
technically, the sm alle st co de blo ck Java can annotate is a me tho d, w hich sim
plifie s the ins trume ntation pro ce ss of Deuce and provides a s im ple mo de l for the
program me r. Sec ond, atom ic annotations op e rate at the s ame le vel as syn chron
ized me tho ds, which e xe cute in a mutually e xclus ion m anner on a give n ob je ct;
the refore , atomic me tho ds provide a familiar program ming mo de l.
1 pu
blic int sum(L ist l ist ) {2 int to ta l =
0;
3 atomic {4 for ( N o de n : list )
5 to ta l += n. getVa lue( );
6 }7 return to ta l;
8}
Fig. 1. Atomic blo ck example.
From a s em antic p oint of view, imple menting atomic blo cks at the granularity of
me tho ds rem oves the nee d to deal with lo cal variable s as part of the trans action. I
n particular, s inc e Java do e sn’t allow any acc es s to s tack variable s outside the
current me tho d, the STM can avoid logging many me mory acc es se s. For ins tance
, in Figure 1, a finer-granularity atom ic blo ck would re quire cos tly logging of the total
lo cal variable (otherwis e the metho d would yield incorre ct res ults up on ab ort)
whereas no logging would b e nec es sary when conside ring atom ic blo cks at the
granularity of individual m etho ds. I n case s when fine r granularity is desirable ,
Deuce can b e us ed in combination with the TMJava front-e nd dis cuss ed in Se ction
6.
To illus trate the use and imple me ntation of Deuce , we will c ons ider a
well-known but non- trivial data structure : the skip list [17]. A skip lis t is a probabilistic
struc ture based on multiple parallel, s orte d linked lis ts, w ith e fficie nt search tim e in
O (log n ).Figure 2 shows a pa rtial implem entation of skip list, with an inner class re
pres enting no de s and a me tho d to search for a value through the lis t. The ke y obs
ervation in this c o de is that transactifying an applic ation is as e asy as adding
@Atomic annotations to me tho ds that should e xecute as trans actions . No c o de
nee ds to b e change d within the metho d or else where in the c las s. Interes tingly,
the linked list direc tly manipulate s arrays and ac ce ss es public fields from outside
their enclos ing ob je cts, which would not b e p oss ible with DSTM2 or LSA-STM.
One c an als o obse rve that the @Atomic annotation provide s one configurable
attribute , retries, to optionally limit the am ount of retrie s the transac tion attem pts (at
m ost 64 tim es in the e xample ). A TransactionException is thrown in cas e this limit is
re ache d. Alternatively one can envision providing time out instead (Which we might
add in future ve rs ions ).
A Deuce application is c ompiled with a re gular Java compiler. Up on exec ution,
one ne eds to s p ec ify a Java age nt that allows Deuce to inte rc ept eve ry clas s
loade d and manipulate it b e fore it is loade d by the JVM. T he agent is simply sp e
cified on the c ommand line as a parame ter to the JVM, as follow s:
1 pu
blic cla ss Ski pList {2 private s tat ic
cla ss No de {3 pu blic fin al int val ue ;
4 pu blic fin al No de[ ] fo rwar d;
5 // ...
6 pu blic No de( int level, int v) {7 val ue = v;
8 fo rwar d = ne w No de[ level + 1] ;
9 }10 // .. .
11 }12 private stat ic int MAX LEVEL = 32;
13 private int level ;
14 private final No de hea d;
15 // Co ntinued in nex t column...
16 @Atomic
(r etr ies=64 )
blic b o olea n contai ns( int v) {18 No de
no de = hea d;
19 for ( int i = level ; i >= 0; i --) {20 No de next = no
de.fo rwar d[i ];
21 wh ile (nex t.va lue < v) {22 no de = next;
23 next = no de. for war d[ i ] ;
24 }25 }26 no de = no de.f orwa rd[0 ];
27 return (no de.value == v) ;
28 }29 // ...
30 }
17 pu
Fig. 2. @Atomic me tho d example.
java -javaagent:deuceAgent.jar MainClass args... As will b e disc us se
d in the ne xt s ection, Deuce ins trume nts e very c las s that may b e
used from within a transac tion, not only clas ses that have @Atomic annotations
. If it is known that a clas s will ne ver b e us ed in the conte xt of a trans action,
one c an pre vent it from b e ing instrum ented by providing e xclus ion lis ts to
the Deuce agent. This w ill s p e ed up the applic ation loading tim e ye t s hould
not aff ec t exe cution s p ee d.
4 D EUCE Impl em entat io n
This sec tion de sc rib e s the im plem entation of the Deuce fram ework. We
first give a high- le vel overvie w of its main c omp onents. The n, we explain
the pro ce ss of co de ins trume ntation. Finally, we de sc rib e various
optimizations that e nhanc e the p e rformance of trans actional c o de .
JVM
DEUCE
agent

Application
DEUCE runtime
TL2 LSALock
Jav
a
cla
sse
s
Fig. 3. Main comp one nts of the Deuce archi tecture.
4.1 Overview The Deuce fram ework is conce ptually m ade up of 3 layers, as
s how n in Figure 3:
1. The application layer, w hich consis ts of us er c lass es written without any
relations hip to the STM implem entation, exce pt for annotations added to
atomic me tho ds.
2. The Deuce runtime laye r, which orche strates the inte rac tions b etwee n transac
tions exe cuted by the application and the unde rlying STM impleme ntation.
3. The layer of ac tual STM librarie s that imple me nt the Deuce conte xt API (s ee
Section 5), inc luding a s ingle- global-lo ck impleme ntation (denoted as “Lo ck” in the
figure, and used as a p erform anc e “refere nc e p oint” for all othe r libraries ).
In addition, the Deuce agent inte rc epts c lass es as the y are loade d and instrume
nts them b e fore they start exe cuting.
4.2 I nstrumentat ion Framework Deuce ’s ins trumentation engine is based on ASM
[3], an all-purp ose Java byte co de
manipulation and analys is fram ework. It c an b e use d to mo dify existing class es , or
to dynamically ge ne rate class es , dire ctly in binary form. The Deuce Java age nt use
s ASM to dynam ically ins trume nt c las se s as they are loaded, b efore the ir first ins
tanc e is create d in the virtual machine. During instrum entation, new fields and me tho
ds are adde d, and the c o de of transac tional me tho ds is trans forme d. We now
desc rib e the different m anipulations p e rforme d by the Java age nt.
Fiel ds. For e ach instanc e field in any loaded clas s, Deuce adds a synthe tic c ons
tant field ( final static public) that re pres ents the relative p osition of the field in the clas
s. This value, together with the instanc e of the clas s, unique ly identifies a field, and is
used by the STM implem entation to log fie ld ac ce sse s.
blic cla ss Ski pList {2 pu blic sta tic fi nal lon g CL ASS BASE
= .. .
3 pu blic sta tic fi nal lon g MAX LEVEL addr ess = .. .
4 pu blic sta tic fi nal lon g level addr ess = . ..
5 // ...
6 private stat ic int MAX LEVEL = 32;
7 private int level ;
8 // ...
9}
1 pu
Fig. 4. Fields addres s.
Static fields in Java are eff ective ly fields of the e nc los ing clas s. To de signate
static fields , Deuce define s for each c lass a c ons tant that represe nts the bas e clas
s, and can b e use d in c ombination with the field p osition ins te ad of the clas s
instance .
For instance , in Figure 4, the level field is re pres ented by level ADDRESS while
the SkipList base class is re pre sente d by CLASS BASE .
Ac cessors. For e ach field of any loaded clas s, Deuce adds synthe tic static acc es
sors us ed to trigger field’s ac ces s eve nts on the lo cal c onte xt.
Figure 5 shows the synthe tic ac ce ssors of class SkipList. The getter level Getter$ rec
eives the curre nt fie ld va lue and a c onte xt, and triggers two events :
beforeReadAccess and onReadAccess; the result of the latter is returned by the ge tte r.
The setter rece ives a conte xt and yie lds a s ingle eve nt: onWriteAccess. The re ason
for having two eve nts in the getter is technical: the “b e fore ” and “after” eve nts allow
the STM
1 pu
blic cla ss Ski pList {2 private i nt
level ;
3 // ...
4 5 // Synthetic getter
6 pu blic int level Gett er$ ( Co nt ex t c) {7 c . b ef oreR eadAccess( thi
s, l evel ADD RESS );
8 return c.o nRea dA ccess( thi s, l evel , level ADD RESS );
9 }10 // Synthetic se tter
11 pu blic vo id level Sett er$ ( int v , C ontext c) {12 c . onWri teAccess(
thi s, v , level ADD RESS );
13 }14 // ...
15 }
Fig. 5. Fields ac ces sors.
backe nd to verify that the value of the field, which is ac ce sse d b etwee n b oth e
vents without us ing c ostly refle ction me chanis ms , is consiste nt. This is typic ally
the cas e in tim e-base d ST M algorithm s like TL2 and LSA that s hip with Deuce .
1 private
s tat ic cla ss No de {2 pu blic No de[ ]
fo rwar d;
3 // ...
4 5 // Or igi na l method
6 pu blic vo id setFor wa rd( int level, No de next) {7 fo rwar d[ level ] =
nex t;
8 }9 10 // Synthetic dup licate method
11 pu blic vo id setFor wa rd( int level, No de next, Co ntext c) {12 No de[ ] f = fo rwar
d Gett er$ (c) {13 fo rwar d Sett er$ ( f , l evel , c )
14 }15 // ...
16 }
Fig. 6. Duplicate metho d.
Dupli cate metho ds. In orde r to avoid any p e rformance p enalty for non
transactional co de , and since Deuce provides weak atom ic ity, Deuce duplic ates
each me tho d to provide two distinc t versions. T he firs t version is identical to the
original me tho d: it do e s not trigge r an eve nt up on me mory acc es s and, conse
que ntly, do es not imp os e any transac tional overhe ad. The s econd ve rs ion is a
synthe tic me tho d w ith an extra Context param eter. In this instrum ented c opy of
the original m etho d, all field acc es se s (e xce pt for final fields ) are re plac ed by
calls to the as so c iate d transac tional acc es sors. Figure 4 shows the two ve rs
ions of a metho d of clas s Node. T he s econd synthe tic ove rload re plac es
forward[level] = next by calls to the synthetic ge tte r and se tte r of the forward array.
T he first call obtains the re fe re nc e to the array ob jec t, while the se cond one
changes the sp ec ified array e leme nt (note that ge tters and s ette rs for array e le
me nts have an extra index param ete r).
1 pu
23 //
3 4 //
24 if
blic cla ss Ski pList {2 // ...
Or igi na l method instr umented
5 pu blic b o olea n contai ns( int v) {6 Thr owa
ble thr owa ble = null ;
7 Co nt ext context =
8 Co nt extDel egat or .g etInst ance() ;
9 b o olea n commi t = true ;
10 b o olea n result ;
11 12 for ( int i = 6 4; i > 0; --i) {13 context . ini t ();
14 try {15 result = cont ai ns( v, cont ext );
16 } ca tch (Tra nsa ct io nExceptio n ex) {17 // Must rol
lback
18 commi t = false ;
19 } ca tch (T hrowabl e ex ) {20 thr owa ble
= ex;
21 }22 // Co ntinued in nex t column...
Tr y to commi t
( co mmit ) {25 if ( co ntex t . co mmit () )
{26 if ( t hrowabl e == null )
27 return result ;
28 // Rethrow app lication e xcep tio n
29 throw (IOExcepti on) thr owabl e;
30 }31 } els e {32 context . ro llba ck ();
33 commi t = true ;
34 }35 } // Retry loop36 throw ne w Tr ansacti
onExcept io n( );
37 }38 39 // Synthetic dup licate method
40 pu blic b o olea n contai ns( int v, Context c) {41 No de no
de = hea d Gett er$ (c);
42 // ...
43 }44 }
Fig. 7. Atomic me tho d.
At omic m etho ds. The duplication pro c ess des crib e d ab ove has one e xc eption:
a metho d annotated as @Atomic do e s not need the firs t uninstrum ented ve rsion.
Instead, the original me tho d is replace d by a trans actional ve rs ion that calls the ins
trume nte d ve rs ion from within a trans action that exec ute s in a lo op. The pro c es
s rep e ats as long as the trans action ab orts and a b ound on the numb er of allowe d
retrie s is not re ached. Figure 7 s hows the trans actional version of m etho d
contains.
4.3 Sum mary To summ ariz e, Deuce p e rforms the following ins trumentation op
erations on the Java
byte co de at load-time . – For eve ry fie ld, a c ons tant is added to kee p the re lative
lo cation of the fie ld in
the c las s for fast ac ce ss . – For e very field, 2 acc es sors are adde d (ge tter
and sette r). – For e very clas s, a c ons tant is adde d to keep the re fe re nc e to the
class definition
and allow fas t acc es s to static fie lds. – Every (non- @Atomic) me tho d is
duplicated to provide an atom ic and a nonatomic version. – For e very @Atomic me tho d, an atomic ve rsion is c re ated
and the original version
is replace d with a re try lo op c alling the atom ic vers ion in the c onte xt of a ne w
trans action.
4.4 Opti mi zations During the ins trume ntation, we p e rform s eve ral optimizations to
im prove the p e rformance of Deuce . Firs t, we do not instrum ent acc es ses to final fie lds as they
cannot b e m o difie d afte r creation. This optim iz ation, toge ther with the de claration
of final fields wheneve r p os sible in the application co de , dramatic ally reduc es the
overhe ad.
Second, fie lds ac ce sse d as part of the c ons tructor are ignore d as the y are not
acc es sible by concurre nt threads until the c ons tructor returns.
Third, instead of ge ne rating ac ce ss or me tho ds, Deuce actually inlines the co de
of the gette rs and s etters directly in the transactional co de . We have obse rve d a s
light p e rformance improve me nt from this optimization.
Fourth, we chos e to use the sun.misc.Unsafe pseudo-standard inte rnal library to
implem ent fast re fle ction, as it proved to b e vas tly more effic ient than the standard
Java reflec tion mechanism s. T he implem entation using sun.misc.Unsafe eve n outp
erforme d the approach taken in AtomJava [12], w hich is based on using an
anonymous class p er field to replac e re fle ction.
Finally, we tried to lim it as much as p ossible the stres s on the garbage collec tor,
notably by us ing ob je ct p o ols when ke eping track of acc es se d fie lds (re ad and
write se ts) in threads. In orde r to avoid any contention o n the p o ol, we had e ach
thre ad kee p a s eparate ob ject p o ol as part of its context.
Together, the ab ove optim iz ations he lp ed to signific antly de creas e the im plem
entation ove rhead, in som e of our b enchm arks this improve d p erformance by almos
t an orde r of magnitude (i.e ., tenfold faster) as com pared to our initial imple me
ntation.
5 C usto mi zing Co ncurr ency Co ntro l
Deuce was des igned to provide a re se arch platform for STM algorithms. I n order to
provide a simple API for re se archers to plug in the ir own STM algorithm’s imple me
ntation, Deuce define s the Context API as s hown in Listing 8. T he API include s an
init me tho d, called once b efore the transac tion s tarts and the n up on each re try,
allowing the trans action to initializ e its internal data s tructure s. The atomicBlockId
argum ent allow s the transaction to log inform ation ab out the sp ec ific atom ic blo ck
(statically ass igne d in the bytec o de ).
1 pu
blic interface Co nt ext {2 voi d ini t ( int at
omicBlo ckId) ;
3 b o olea n commi t( );
4 voi d ro ll back () ;
5 6 voi d b efor eReadA ccess(Ob ject ob j, lon g field) ;
7 8 Ob j ect onR eadAccess( Ob ject ob j, Ob j ect value, lon g field) ;
9 int onR eadAccess( Ob ject ob j, int val ue , lon g field ) ;
10 lon g onR eadAccess( Ob ject ob j, lon g val ue , lon g field ) ;
11 // ...
12 13 voi d onWri teAccess( Ob j ect ob j, Ob ject val ue, lon g field) ;
14 voi d onWri teAccess( Ob j ect ob j, int val ue , lon g field ) ;
15 voi d onWri teAccess( Ob j ect ob j, lon g val ue , lon g field ) ;
16 // ...
17 }
Fig. 8. Context interfac e.
One of the heuris tic s we adde d to the LSA im plem entation is , following [6], that
each one of the atomic blo cks will initially b e a re ad- only blo ck. It will b e conve rte d
to b ec ome a writable blo ck up on re try, once it encounte rs a firs t write . Using this
me tho d, re ad- only blo cks c an save mos t of the overhead of logging the fields’ ac
ces s.
Another heuris tic is that commit is c alle d in cas e the atomic blo ck finishes without
a TransactionException and c an re turn false in case the c ommit fails , which in
turn w ill cause a retry. A rollback is called whe n a TransactionException is thrown
during the atomic blo ck (this c an b e us ed by the bus ine ss logic to force a retry).
The re st of the m etho ds are called up on field ac ces s: a field read e vent will
trigger a beforeReadAccess eve nt followed by a onReadAccess eve nt. A write field e
vent will trigger an onWriteAccess eve nt. Deuce curre ntly inc ludes three Context
imple me ntations , tl2.Context, lsa.Context, and norec.Context, im plem enting the TL2
[6], LSA [19], and NOrec [5] ST M algorithm s. Deuce also provide s a re fe re nce
imple me ntation bas ed on a s ingle global lo ck. Since a global lo ck do e sn’t log any
fie ld acc es s, it do e sn’t im plem ent the Context interfac e and do es n’t imp ose any
ove rhe ad on the fields ’ acc es s.
TL2 Context . The TL2 context is a straightforward im plem entation of the TL 2
algorithm [6]. The general princ iple is as follow s (m any de tails omitted).
TL2 uses a s hared array of revokable versioned locks, with each ob jec t field b
eing mapp e d to a s ingle lo ck. Each lo ck has a ve rs ion that corres p onds to the c
ommit times tamp of the transac tion that last up date d some field protec ted by this lo
ck. Time stamps are ac quire d up on comm it from a global time bas e, im ple mented
as a sim ple counte r, up dated as infreque ntly as p ossible by writing transactions.
Whe n re ading s ome data, a transac tion checks that the asso c ia te d tim es tamp
is valid, i.e., not more re ce nt than the time whe n the trans action s tarte d, and ke eps
track of the acc es sed lo c ation in its re ad s et. Up on write, the transaction buffers
the up date in its write set.
At comm it tim e, the transac tion acquires (using an atomic com pare- and-se t
operation) the lo cks protec ting all w ritten fie lds, verifie s that all e ntries in the re ad
se t are s till valid, ac quires a unique com mit tim es tamp, w rites the mo difie d fields
to me mory, and re le ase s the lo cks. I f lo cks cannot b e acquires or validation fails ,
the trans action ab orts , dis carding buff ered up date s.
LSA Context. The LSA c onte xt us es the LSA algorithm [19] that was de velop e d
concurre ntly with TL2 and is base d on a similar design. The main differe nc es are
that (1) LSA acquires lo cks as fie lds are writte n (encounter o rder), instead of at c
ommit time, and (2) it p e rforms “increm ental validation” to avoid ab orting trans
actions that read data that has b ee n mo dified m ore re cently than the s tart tim e of
the trans action.
Both TL2 and LSA take a s im ple approach to conflic t managem ent: they s imply
ab ort and res tart (p os sibly afte r a de lay) the trans action that encounte rs the
conflict. This strategy is s imple and e fficie nt to im plem ent, b ec aus e transac tions
unilaterally an ab ort without any s ynchroniz ation with others. Howe ver, this c an som
etime s hamp e r progre ss by pro ducing live lo cks. Deuce also s upp orts variants o f
the T L2 and LSA algorithm s that provide mo dular conte ntion managem ent s trate
gie s (as firs t prop ose d in [10]), at the price of som e additional s ynchroniz ation
overhead at runtime.
NOre c Context. The NOre c context implem ents the lightweight algorithm prop os
ed in [5]. Roughly sp eaking, NOre c diff ers from TL2 and LSA in that it use s a
single ownership re cord (a global vers ioned lo ck) and re lie s on value -based
validation in addition to time- bas ed validation. The rationale of this de sign is to
avoid having
to pay the runtime ove rheads as so ciated with ac ce ss ing multiple owne rship
records while s till providing a re asonable leve l of concurre nc y thanks value -base d
validation which helps ide ntify fals e conflic ts. T his algorithm was s how n to b e effic
ient at low thre ad c ounts, but tends to fall apart as c onc urrency increas es. In c
ontras t, TL2 and LSA provide b e tte r sc alability, p erform ing b ette r as concurre nt
threads are added or when there are more frequent but disjoint comm its of s oftware
trans actions .
6 J ava l anguag e exte nsi ons
Deuce uses annotation to indicate trans action dem arc ation. Therefore , it only s upp
orts trans actions at the granularity of whole m etho ds, and cannot dec lare s horter
“atomic blo cks ” as in Listing 9. In this e xam ple , one wants to transfer the balanc e of
a s et of ac counts to another ac count. Transfer op e rations mus t b e atomic , but us
ing a s ingle transac tion for all transfers would pro duc e an unnec es sarily long trans
action, and would inc re ase the risk of conflic ts with concurre nt op e rations. In c
ontrast, us ing one transaction p er trans fe r would re duce the like liho o d of ab orts,
and allow for b e tter concurrency. Note that the lo cal variable total, which holds the
total am ount trans fe rred, mus t b e ac ce ss ed trans actionally, w ith its pre vious
value res tore d up on trans actional ab ort.
1 pu
blic int tr ansferA ll ( A cco unt[ ] src , Account dst ) {2 int to ta l = 0
;
3 for
( A ccount acc : sr c ) {4 atomic {5 int amount
= acc. bala nce() ;
6 acc . wi thdr aw (a mo unt) ;
7 dst . dep osit ( a mo unt) ;
8 to ta l += a mo unt;
9 }10 }11 return to ta l;
12 }
Fig. 9. Trans actions at the granularity of atomic blo cks .
Intro duc ing ne w keywords to supp ort trans actional me mory at the language le
vel do e s not only provide finer trans action granularity, but also allows for more s
ophisticate d c ons tructs to c ontrol re cove ry, re trie s, etc. This allows trans actions to
b e use d in more p owe rful ways. Such transac tional extensions do, howe ver, require
e ither using a pre -pro c es sor, or mo difying the Java com piler to supp ort the ne w
constructs.
We have therefore de velop e d a front- end to ol, TMJava, that trans form s the
source c o de from a program written with the language e xte ns ion for atom ic blo cks,
into a program that is suitable for instrum entation by Deuce .
The trans formation p erform ed by the front-e nd to ol m aps the atom ic blo cks in
the e xte nded language into annotated atom ic me tho ds. In other words, the
frontend to ol analyze s the co de to find the atomic blo cks (atomic keyword) ins ide
clas s me tho ds; it then c reates ne w metho ds whos e b o dies c ons is t of the
content of the atomic blo cks, and replac es the blo cks w ith c alls to the se new me
tho ds . Howeve r, mapping an atomic blo ck into an annotate d atom ic me tho d is
not trivial as we nee d to take into acc ount s everal is sues, notably:
1. Variable s and ob je cts, that are ac ce ss ible ins ide the scop e of the metho d in w
hich the atomic blo ck is dec lare d, should als o b e available inside the s cop e of the
atomic blo ck.
2. Mo dific ations inside an atomic blo ck of variables and ob jec ts that are define d
outs ide the s cop e of the blo ck, s hould b e visible outside the atom ic blo ck.
Variable s and ob je cts that are define d outs ide the atomic blo ck, but are use d
and/or mo dified inside the atom ic blo ck, are pas se d as param eters to the corre sp
onding annotated atom ic m etho d. As Java only supp orts paramete r pass ing
by-value for primitive typ e s, variables that are mo dified inside the atom ic blo ck are
pas se d in arrays , and up date d value s are c opied back to the variables when re
turning from the atomic me tho d (in c ase a s ingle variable is mo difie d, its ne w value
is s imply returne d by the me tho d). Lis ting 10 shows the Deuce -c ompatible c o de
pro duc ed by TMJava from Listing 9.
1 pu
blic cla ss Bank {2 pu blic int tr ansferA ll ( A cco unt[ ] src , Account
dst ) {3 int to ta l = 0 ;
4 for ( A ccount acc : sr c ) {5 to ta l = t ra nsfer All ab1 ( t ota l ,
dst , acc );
6 }7 return to ta l;
8 }9 @Atomic
10 private final int tr ansferA ll ab1 ( int to ta l , Account dst, A cco unt a cc) {11 int amount = acc.
bala nce() ;
12 acc . wi thdr aw (a mo unt) ;
13 dst . dep osit ( a mo unt) ;
14 to ta l += a mo unt;
15 return to ta l;
16 }17 }
Fig. 10. Co de of Listing 9 trans form ed by TMJava for Deuce .
In addition to s upp orting basic atomic blo cks, TMJava provides se veral additional
ke ywords , notably retry, next, leave (for c ontrol flow), either/or/otherwise (for alte
rnative s), and on failure (for atom ic e xc eption handling).
The TMJava front-e nd to ol is base d on the JastAd d exte ns ible Java com piler
[7]. In a first ste p, an abstract syntax tre e (AST ) is c ons tructe d from the s ource
core. The AST is then analyz ed to lo c ate trans actional language e xte ns ions, and is
mo difie d acc ordingly. Finally, Deuce -c ompatible annotate d s ource co de is
generated from the mo dified AST. The re fore , TMJava trans forms Java co de with T
M e xte ns ions into “pure” Java source co de that can b e com piled us ing a standard
compile r, but with the s tructure and annotations exp e cted by Deuce for transac
tifying the application.
7 Per fo rm ance Evaluati on
We e valuate d the p e rformance of Deuce on a Sun UltraSPARCTMT2 Plus multicore
machine (2 CPUs e ach with 8 cores at 1.2 GHz, e ach with 8 hardware threads) and
an Azul Vega2(2 CPUs each with 48 cores ).TM
7.1 Deuce overheads
firs t briefly disc us s the ove rheads intro duced by the Deuce agent when ins
trume nting Java clas se s. Table 1 s hows the me mory fo otprint and the pro c es sing
overhead when pro ce ssing com piled librarie s: rt.jar is the runtime library c ontaining
all Java built- in c las se s for JDK 6; cache4j.jar (version 0.4) is a wide ly us ed inmem ory cache for Java ob jec ts; JavaGran de (version 1.0) is a we ll- know n collec
tion of Java b e nchmarks . I ns trumentation time s we re meas ure d on an Inte l Core
2 DuoCPU running at 1.8 GHz. Note that ins trumentation was e xe cuted se rially, i.e .,
without exploiting multiple cores .
T MWe
App lication Memory In st rumentation Ori ginal Instrum ented
Time
rt.jar 50 MB 115 MB 29 s cache4j.jar 248 KB
473 KB < 1 s JavaGrande 360 KB 679 KB < 1 s
Table 1. Mem ory fo otprint and pro ce ssi ng overhead.
As e xp ec ted, the size of the c o de approximate ly doubles b e cause Deuce
duplic ate s e ach metho d, and the pro c es sing ove rhead is prop ortional to the s iz e
of the co de. However, b e cause Java us es laz y c lass loading, only the class es that
are actually used in an application will b e ins trume nte d at load time . Therefore , the
overhe ads of Deuce are negligible considering individual clas se s are loade d on dem
and.
In terms of e xe cution time , instrum ented class es , that are not invoke d w ithin a
trans action, incur no p erform anc e p enalty as the original metho d exec ute s.
7.2 Be nchmarks
We te ste d the four built-in Deuce STM options: LSA, TL2, NOre c, and the simple
global lo ck. Our LSA ve rs ion captures m any of the prop e rtie s of the LSA-STM [20]
fram ework. The TL2 [6] form provides an alternative STM imple mentation that acquire
s lo cks at c omm it tim e using a w rite se t, as o pp os ed to the encounte r order lo
cking approach used in LSA.
Our e xp eriments inc luded three clas sical m ic ro-b e nchm arks : a red/black tre
e, a sorted linked lis t, and a skip list, on which we p erforme d c onc urrent inse rt, de
lete, and se arch op erations. We c ontrolled the siz e of the data s tructures, which rem
aine d constant for the duration of the exp e rime nts, and the mix of op e rations.
We also e xp erimente d with three re al- world applications, the first implem ents the
Lee routing algorithm (as pre sente d in Lee- TM [2]). It is worth noting that adapting
the application to Deuce was straightforward, with little m ore than a change of s
ynchroniz ed blo cks into atom ic m etho ds. The sec ond application is c alle d
Cache 4J, an LRU c ache imple me ntation. Again adapting the application to Deuce
was s traightforward. The third application com es from the wide ly us ed STAMP [13]
TM b enchmark suite.
Micro-b enchmarks. We b e gan by c omparing Deuce to DST M2, the only c omp e
ting STM frame work that do e s not re quire compiler s upp ort. We b enchmarke d
DST M2 and Deuce based on the Red/Black tree b enchmark provide d with the DST
M2 frame work. We tried many variations of op eration c ombinations and leve ls of
concurre nc y on b oth the Azul and Sun machine s. C omparison res ults for a tre e
with 10k e lem ents are s hown in Figure 11, we found that in all the b enchmarks DST
M2 was ab out 100 tim es (two orde rs of magnitude ) s lowe r than Deuce . Our res
ults are c ons is tent with those published in [9], where DSTM2’s throughput, me asured
in op e rations p er se cond, is in the thous ands, while Deuce ’s throughput is in the m
illions . Bas ed on thes e e xp erime nts , we b elieve one can conclude that Deuce is
the first viable c ompile r and JVM indep ende nt Java STM fram ework.
1
2
1
4
0
Number
of
threads
96
6
96
RB tree,
updates
0
0
.
2
Throughput (× )10txs/s
2
4
6
8
1
0
80
6
6
Throughput (× )10txs/s
1 16 32 48 64
112 128
Number of threads
0
RB tree, 10k elements, 10%
updates
DSTM2
0.5
Lock
LSA TL2
1
1.5
2
2.5 1 16 32 48 64 80
3 112 128
Throughput (× 10
txs/s)
RB, 10k elements, 0%
updates
DSTM2
Lock
LSA TL2
10k elements, 20%
DSTM2
Lock
LSA TL2
0
.
14 16 32 48 64
112
128
0
.
Number
of threads
6
0
.
8
1
1
.
2
1
.
4
1
.
6
80
96
Fig. 11. The red/bl ack tre e b enchm ark (Sun).
On the othe r hand we obse rve that the Deuce STMs sc ale in an impre ss ive way
eve n with 20% up date s (up to 32 thre ads ). While DST M2 shows no s calability. Whe
n we investigated dee p er we found out that m ost of DSTM2 ove rhe ad res t in two are
as. The mos t signific ant are a is the contention manager w hich acts as a c ontention p
oint, the s ec ond area is the reflec tion m echanism us ed by DST M2 to re trie ve and
ass ign values during the transac tion and in c ommit.
Next we pres ent the re sults of the te sts with the linked list and the skip list on the
Sun and Az ul machines . The linke d lis t b enchmark is known to pro duce a large
degree of contention as all thre ads mus t traverse the s ame long s eque nc e of no des
to re ach the targe t e le me nt. Res ults of exp e rim ents are s how n in Figure 12 and
Figure 13 for a linked list with 256 e leme nts and different up date rates. We observe that
b ecause there is little p otential for parallelism, the s ingle lo ck, w hich p e rforms
significantly le ss lo ck acquis ition and tracking work than the STMs, wins in all three b e
nchmarks . The STMs s cale slightly until 30 threads w he n there are up to 5% up dates ,
and then drop due to a ris e in ove rhea d w ith no b e ne fit in parallelism. With 20% up
date s, the m ax is reached at ab out 5 thre ads , b e cause as concurrency furthe r
increas es one c an e xp e ct at leas t 2 c onc urrent up date s, and the STMs suffer from
rep e ated trans actional ab orts .
Finally, we consider res ults from the skip lis t b e nchmark. Skip lists [18] allow
significantly greate r p otential for parallelism than linked lists b e cause different
threads are likely to travers e the data struc ture following dis tinct paths . Res ults for a
list with 16k e leme nts are shown in Figure 14 and Figure 15. We obs erve that the
STMs sc ale in an impre ssive way, as long as there is an inc rease in the leve l of
parallelism, and provide gre at sc alability e ven with 20% up dates as long as the ab
ort rate remains
1
16
32
48
96
Number of threads
Fig. 12. The l inked li st b enchm ark (Sun).
Fig. 13. The l inked li st b enchm ark (Azul).
TL2 LSA Lock
3
1
10
15
20
25
30
10 30 50 70 100
Number of threads
TL2 LSA Lock
6
6
Throughput (× 10
txs/s)
0
5
1 16 48 96 128 196 256
1 16 48 96 128 196 256
Number of threads
Number of threads
SL, 16k elements, 0% Fig. 14. The sk ipl is t b enchmark (Sun).
updates
SL, 16k elements, 20% updates
TL2
LSA
0 TL2
LSA
0 Lock
5 Lock
2
1
0
0
4
1
0
5
6
2
0
0
8
2
0
5
1 30 50 70 100 140 180
1 30 50 70 100 140 180
1
3
0
0
0
Number of threads
Numb
er
of
threa
ds
txs/s)
Throughput (× 10
3
1 16 32 48 96
Number of threads
SL, 16k elements, 50%
updates
0
LL, 256 elements, 20%
1000 updates
2000
TL2
3000
LSA
Lock
4000
5000
6000
3
SL, 16k elements, 20%
updates
0
LL, 256 elements, 5%
1000 updates
2000
TL2
3000
LSA
Lock
4000
5000
6000
Throughput (× 10
txs/s)
3
txs/s)
Throughput (× 10
SL, 16k elements, 0%
updates
0
LL, 256 elements, 0%
0 500 updates
2 0
0 100
4 00
0 150
TL2
6 00
LSA
Lock
0 200
8 00
0 250
1 00
0 300 1 10 30 50 70 100
0 00
Number of threads
Throughput (× 10
txs/s)
TL2
LSA
Lock
1000 LL, 256 elements, 20%
2000 updates
3000
TL2
4000
LSA
Lock
5000
0
Throughput (× 10
txs/s)
1 16 32 48 96
Number of threads
LL, 256 elements, 5%
updates
1
10 30 50 70 100
Number of threads
TL2 LSA Lock
15
10
5
0
1 16 48 96 128 196 256
Number of threads
SL, 16k elements, 50%
updates
TL2
4
LSA
Lock
6
TL2
LSA
Lock
5000
0
3
3
txs/s)
Throughput (× 10
10000
Throughput (× 10
txs/s)
15000
0
10
00
20
00
30
00
40
00
50
00
Throughput (× 10
txs/s)
LL, 256 elements, 0%
20000 updates
3
2
1
1 30 50 70 100
Number of threads0
140
180
6
Throughput (× 10
txs/s)
6
Throughput (× 10
txs/s)
6
txs/s)
Throughput (× 10
Fig. 15. The sk ipl is t b enchmark (Azul ).
at a reas onable leve l. We adde d a b enchmark with 50% up date s to s how that there is
a lim it to their scalability. When we inc re ase the frac tion of up date s to 50% , the STMs
in Figure 15 reach a “m elting p oint” much earlie r (at ab out 10 threads versus 100
threads in the 20% b e nchmark) and overall the lo ck wins again b ec aus e the ab ort
rate is high and the STMs inc ur and overhead w ithout a gain in paralle lis m. We note
that typically se arch s tructure s have ab out 10% up date s.
Re al appl icati ons. Our last b enchmarks dem ons trate how simple it is to replace the
critic al se ctions with trans actions . The firs t take s a s erial ve rsion of the Lee rout-
ing algorithm and dem ons trate s how a s im ple re place me nt of the critic al sec tions
by trans actions signific antly improves sc alability. The sec ond take s a non-multi-thre
ade d lo ck bas ed a LRU c ache implem entation (Cache4J) and shows that it is s
traightforward to replace the c ritical s ec tions, but sc alability is n’t promise d.
Circuit routing is the pro ce ss of automatically pro duc ing an interconne ction b e
tween e le ctronic com p onents. Lee ’s routing algorithm is an attractive candidate for
parallelization s inc e circuits (as shown in [2]) c ons ist of thous ands of routes, e ach
of which c an p otentially b e routed concurre ntly.
The graph in Figure 16 s how s exe cution time (that is , latency, not throughput) for
three diffe re nt b oards with the TL2, LSA, and NOrec algorithm s. As c an b e s ee n,
Deuce sc ales well with all algorithms , w ith the overall time de creas ing eve n in the
cas e of a large b oard (MemB oard). NOre c exhibits b etter p erformance with a s
ingle thre ad due to the lower overhead of its single ow ne rs hip rec ord. With more
threads, the p erformance of all algorithms is virtually identic al. This c an b e explained
by the natural parallelism and low contention of this b enchmark.
Lee-TM, MemBoard
Execution time
TL2
LSA
NOrec
0
5
0
1
0
0
1
5
0
2
1 10 15 20
25 30
0
0
Number of threads
2
5
0
3
0
0
Fig. 16. The Lee R outi ng b enchm ark (Sun).
Cache 4J is an LRU lo ck- bas ed cache imple me ntation. The implem entation is
based on two inte rnal data struc tures , a tree and a hash-m ap. The tre e m anage s
the LRU while the hash-map holds the data. The Cache4J implem entation is base d on
a single global lo ck which naturally leads to no sc alability. T he graph in Figure 17 s
hows the res ult of replacing the single lo ck with transac tions us ing the LSA
algorithm. As can b e se en, Deuce do e sn’t s cale well, with the ove rall throughput
slightly decre asing. A quick profiling shows that the fact that Cache 4J is an LRU
cache implie s that eve ry ge t op eration also up date d the inte rnal tre e. This alone
make s eve ry transac tion an up date transac tion. Yet, the fact that the total
throughput re mains almost the same even with 80 threads is encouraging due to the
fact that trans actional me mory’s main advantages , b esides scalability, are c o de s
implicity and robus tne ss.
Finally, we tes ted Deuce on the Vac ation b e nchm ark from STAMP [13], w hich is
the mos t wide ly us ed TM b e nchm ark s uite. As STAMP was b e en originally
written in C, we use d a Java p ort of the original b e nchmarks adapted for Deuce .
The Vacation application m o dels an online travel re se rvation syste m, imple mente
d as a s et of tree s that ke ep track of c us tomers and their res ervations . T hreads
sp end time e xe cuting
Cache4j (LSA)
2
4
6
0% updates
10%
updates
8
1
0
1 16 32 48 64
Number 1of threads
2
80
1
4
1
6
Fig. 17. The Cache 4J b enchm ark (Sun).
trans actions , and the contention le vel is mo de rate . Figure 18
s how s the e xe cution time of Vac ation with the TL2, LSA, and
NOre c algorithms. We can obse rve that b oth TL2 and LSA
scale well with the numb er of cores . In contras t, NOrec s how
s almos t no s calability, although it p e rforms b e tter than TL2
and LSA unde r low thre ad c ounts. This c an b e e xplaine d by
the fac t that (1) NOre c’s single ownership rec ord is p enaliz ed
by the s ize o f the trans actions , and (2) value -base d
validation is quite exp e ns ive in Java since it require s ac ce
ssing me mory via APIs that are s lower than the direct mem ory
acc ess es provided by unmanaged language s.
STAMP: Vacation
TL2
LSA
NOrec
0
2
0
4
0
6
0
8
0
1 2 4 8 161 32
0
Number of threads
0
1
2
0
1
4
0
Fig. 18. The Vacati on b enchm ark from the STAMP s
uite (Sun).
In conclus ion, Deuce shows the typical p e rform anc e patte
rns of an STM, go o d sc alability w he n there is a p ote ntial for
paralle lis m, and unimpre ss ive p erform anc e when paralle lis
m is low. T he res ults, howe ver, are very encouraging c ons
ide ring the fact that Deuce , a pure Java library with all the im
plied ove rhe ads , shows sc alability eve n at low thread counts.
8 C oncl usio n
We intro duce d Deuce , a nove l op e n- sourc e Java fram
ework for transac tional mem ory. As we showe d, Deuce is
Execution time
3
txs/s)
Throughput (× 10
the first e fficie nt fully feature d Java ST M frame work
that c an b e adde d to an existing applic ation without changes to its compile r or
libraries . I t dem ons trate s that one can de sign an e fficie nt, pure Java ST M without
a c ompile r s upp ort, eve n though language supp ort as provided by the com panion
front-e nd TMJava provides additional express ivenes s.
Though much work obvious ly rem ains to b e done , we b elieve Deuce is ready for
use by deve lop ers. It is fre ely dow nloadable from http://code.google.com/p/deuce
under the Apache lic ense .
1In conclus ion, Table 2 shows a general overvie w c omparing the two we ll- know n
Java ST Ms Atom Javaand DT MS2 vs . Deuce . This comparis on shows that Deuce
ranks highe st overall, and provides nove l c apabilities w hich yield much b etter us
ability and p erformance than forme rly e xis ted.
AtomJava DSTM2 Deuce Locks Ob je ct based Ob je
ct based Fiel d bas ed
Instrumentation Source Runtim e Runtim e API Non-Intrus ive Intrusi ve Non-Intrus ive
Libraries support None None Runtim e & offl ine Fields access Anonym ous class es
Re flecti on Low l evel unsafe Execution overhead Medi um High Medi um Extensible
Yes Yes Yes Transaction context atomic blo ck Callable ob ject @Atomic annotation
Table 2. Com paring Atom Java, DST M2, and Deuce
Ackno w ledgemen ts. This pap er was s upp orted in part by grants from Sun Mic ros
ystem s, I nte l Corp oration, Microsoft Inc., as well as a grant 06/1344 from the Israe li
Science Foundation, Europ ean Union grant FP7- ICT- 2007-1 (pro je ct VELOX), and
Swiss National Foundation grant 200021-118043. The TMJava front-e nd has b een
develop ed by Derin Harm anc i.
Refer ence s
1. B. Al p ern, S. Augart, S. M. Bl ackburn, M. Butrico, A. Co cchi , P. Cheng, J. Dol by, S. Fink ,
D. Grove, M. Hind, K. S. McKi nley, M. Mergen, J. E . B. Mos s, T. Ngo, and V. Sarkar. T he
jikes res earch v irtual m achine pro ject: bui ldi ng an op en-source re search communi ty. IBM
Sy st. J. , 44(2): 399–417, 2005.
2. M. Ans ari , C . Kotsel idi s, I. Wats on, C. Ki rkham, M. L uj´an, and K. Jarvis . L ee-tm : A
non-tri vi al b enchm ark sui te for transactional mem ory. In Proc. of the International
Conference on Alg orithms an d Architectures for Paral lel P rocessing (ICA3P P), pages
196–207, Berl in, Hei delb erg, 2008. Springer-Verlag.
3. W. Binder, J. Hul aas, and P. Moret. Advanced java byteco de ins trum entation. In Proc. of
the International Symposium on P rin cipl es and Prac tice of Programming in Jav a (PPP J),
pages 135–144, New York, NY, USA, 2007. ACM.
4. B. D. C arls trom , A. McD onal d, H. Chafi, J. Chung, C. C. Minh, C. Kozy rak is , and K. Ol
ukotun. The atom os transactional programmi ng l anguage. SIGPLAN Not., 41(6): 1–13,
2006.
1 We did not s how compari son b e nchmark re sul ts w ith Atom Java due technical l im
itati ons and bugs found in its current im plem entation, but we have observed on smal l b
e nchmark s that Deuce outp erform s Atom Java by a comfortable margi n.
5. L. Dal ess andro, M. Sp ear, and M. Scott. Norec: stream li ning stm by ab oli shing ow nershi
p records. In Proc. of the ACM SIGPLAN symposium on Principles an d practice of paral lel
prog ramming (PP oP P), pages 67–78, J an 2010.
6. D. Di ce, O. Shalev , and N. Shavi t. Trans acti onal lo cki ng i i. In Proc. of the International
Sy mposium on Distributed Computing (DISC), pages 194–208, 2006.
7. T. Ek man and G. Hedi n. T he JastAdd e xtensi ble J ava compi ler. SIGPLAN Not. ,
42(10):1–18, 2007.
8. T. Harris and K. Frase r. Language s upp ort for li ghtweight transacti ons. In Proc. of the
ACM SIGPLAN Confe rence on Objec t-Oriented Programming S ystems, Lan guag es an d
Appl ications (OOPSLA), pages 388–402, New York, NY, USA, 2003. ACM.
9. M. Herl ihy, V. L uchangco, and M. Moir. A flexi ble framework for i mpl eme nting software
transacti onal mem ory. In Proc. of the ACM SIGP LAN Conference on Object-O rien ted
Programming Systems, Languages and Applications (OOPSLA), pages 253–262, New York,
NY, USA, 2006. ACM.
10. M. Herl ihy, V. Luchangco, M. Moi r, and W. N. Scherer I I I. Software transactional me mory
for dy nami c-si zed data s tructures . In Proc. of the Ann ual ACM S ymposium on P rin cipl es of
Distributed Computin g (PO DC), pages 92–101, Ne w York , NY, USA, 2003. ACM.
11. M. Herli hy and E. Mos s. Transacti onal mem ory: Archi te ctural supp ort for lo ck-free data s
tructures. In Proceedings of the Twentieth Annual International Symposium on Compute r
Architecture , 1993.
12. B. Hi ndman and D. G rossm an. Atomici ty vi a source-to-source transl ati on. In Proc. of the
Work shop on M emory Sy stem Performance and Correctne ss (MSPC), pages 82–91, New
York, NY, USA, 2006. ACM.
13. C. C. Minh, J. C hung, C. Kozy raki s, and K. Olukotun. STAMP: Stanford transactional appli
cati ons for multi -pro ce ssi ng. In Proceedings of IISW C, Septemb er 2008.
14. Multi ve rse. http: //mul tiverse .co dehaus .org/. 15. Y. Ni , A. Wel c, A. -R . Adl -Tabatabai ,
M. Bach, S. Berkowi ts , J. Cownie , R. Ge va,
S. Kozhukow, R. Narayanaswamy, J. Ol iv ie r, S. Prei s, B. Saha, A. Tal , and X. Ti an. Des
ign and im plem entation of trans actional cons tructs for C/C+ +. In Proc. of the ACM
SIGPLAN Confe rence on Objec t-Oriented Programming S ystems, Lan guag es an d Appl
ications (OOPSLA), pages 195–212, New York, NY, USA, 2008. ACM.
16. N. Nys trom , M. R. Cl ark son, and A. C. Mye rs. Poly glot: An extens ibl e compil er fram
ework for java. In Proc. of the Intern ational Conferen ce on Compil er Construction (CC), pages
138–152, 2003.
17. W. P ugh. Sk ip li sts: a probabi li stic alternative to balanced tre es. Commun. ACM, 33(6):
668–676, 1990.
18. W. Pugh. Sk ip l ists : a probabil isti c al te rnati ve to bal anced trees . ACM Transac tions on
Database Sy stems, 33(6): 668–676, 1990.
19. T. R iegel , P. Fel b er, and C. Fetzer. A lazy snaps hot al gorithm wi th eager vali dati on. In
Proc. of the Inte rn ational S ymposium on Distrib uted Computing (DISC), page s 284–298, Sep
2006.
20. T. R iegel , C. Fetzer, and P. Fel b er. Ti me-based trans acti onal m emory wi th scal abl e ti
me bases . In Proc. of the Annual ACM Symposium on Paral le l Algorithms and Architecture s
(SPAA) , pages 221–228, New York, NY, USA, 2007. ACM.
21. N. Shavi t and D. Toui tou. Software trans actional m emory. Distrib ut ed Computing, 10(2):
99–116, February 1997.
22. L. Zi arek, A. Wel c, A. -R . Adl -Tabatabai , V. Menon, T. Shp e ism an, and S.
Jagannathan. A uniform transactional e xecution envi ronm ent for java. In ECOOP ’08:
Proceedings of the 22nd European confe ren ce on Object-Oriented Programming , page s
129–154, Berl in, Heidel b erg, 2008. Spri nger-Ve rlag.
Download