Noni nvasi ve concur rency w it h Java ST M Guy Korland1 and Nir Shavit1 and Pas cal Felb e r2 Com puter Science Department Uni versi ty, Israel , guykorla@post.tau.ac.il, shanir@post.tau.ac.il 1 Tel-Aviv 2Com puter Science Department Univers ity of Ne uchˆate l, Swi tze rland, pascal.felber@unine.ch Abstra ct . In this pap e r we pres ent a com plete, compil er i ndep endent, Java STM fram ework call ed Deuce , intended as a developm ent platform for s cal able concurrent appli cati ons and as a re search to ol for de signi ng STM al gorithm s. Deuce prov ides several b enefits over ex is ti ng Java STM frameworks : it avoi ds any changes or addi tions to the JVM, i t do es not req ui re l anguage ex tens ions or intrus ive APIs , and i t do es not i mp ose any m emory fo otprint or GC overhead. To s upp ort legacy l ibrari es, Deuce dynam icall y i nstruments clas ses at l oad ti me and uses an original “field-based” l o cki ng strategy to im prove concurre ncy. Deuce als o provi des a s im ple i nternal API al lowing different STMs algori thm s to b e pl ugged in. We show em piri cal re sults that highl ight the scalabi li ty of our framework running b enchm arks with hundreds of concurrent threads . Thi s pap er show s, for the first tim e, that one can actuall y des ign a Java ST M w ith reasonabl e p erform ance , wi thout compi ler supp ort. 1 Int ro ducti on Multicore CPUs have b ecom e c ommonplac e, with dual-c ore s p owering almos t any mo dern p ortable or de sktop c ompute r, and quad-c ore s b eing the norm for ne w se rvers. While multicore hardware has b een advancing at an acce lerated pace , software for multicore platform s see ms to b e at a cross roads . Curre ntly, two dive rging programming metho ds are com monly used. The firs t exploits c onc urrency by s ynchroniz ing multiple thre ads bas ed on lo cks . This approach is we ll know n to b e a two-edged sword: on the one hand, lo cks give the program mer a fine-graine d way to c ontrol the applications c ritical se ctions , allowing an e xp ert to provide great sc alability; on the other hand, b eca us e of the risk of deadlo cks, starvation, priority inversions, etc ., the y im p ose a gre at burden on non- exp e rt programm ers, ofte n leading them to pre fe r the safer but non-sc alable alternative of coarse -graine d lo cking. The sec ond m etho d is a shared-nothing mo de l com mon in Web base d architec ture s. The applications usually c ontain only the bus ine ss logic , deployed in a containe r, while the s tate is saved in an exte rnal multi-versioned c ontrol s ystem , such as databas e, me ss age que ue , or dis tributed cache. While this m etho d re move s the burde n of handling c onc urrency in the applic ation, it im p os es a huge p erform anc e impact on data acc es se s and, for m any typ es of applications , is difficult to apply. The hop e is that trans actional mem ory (TM) [11] will s im plify concurrent programming by c ombining the des irable feature s of b oth me tho ds, providing state-aw are shared memory with simple concurrency con tro l. In the past se veral ye ars the re has b ee n a flurry of des ign and im plem entation work on the s oftware transac tional m emory (STM) [21]. Howe ver, the state of the art to day is less than app e aling. With the notable exc eption of transac tional C/C ++ com pilers [15], m ost ST M initiative s have remained academic e xp erime nts , applicable only to “toy applic ations”, and though we have learne d much from the pro c es s of de veloping them, they have not reached a state that will allow them to b e se riously fie ld tes ted. There are s everal re asons for this . Among the m are the problematic handling of many feature s of the targe t language, the large p erform anc e overhe ads , and the lack of supp ort for le gacy libraries . Moreover, many of the published res ults have b ee n based on prototyp e s whose s ource c o de is unavailable or p o orly maintaine d, making a reliable c omparison b etween the various algorithm s and de signs ve ry diffic ult. In this pap er, we intro duce Deuce , our nove l op e n- sourc e Java frame work for trans actional me mory. Deuce has se veral de sired features not found in e arlier Java STM fram eworks . As we dis cuss in Se ction 2, there c urrently do e s not e xis t an efficient Java STM fram ework that de live rs a full s et of feature s and can b e added to an existing applic ation witho ut changes to its c ompile r or libraries . It was not c lear if one could build s uch an an effic ient “c ompiler inde p e ndent” Java ST M. Deuce is inte nded to fill this gap. I t is non- intrus ive in the s ense that no m o difications to the Java virtual machine (JVM) or e xtensions to the language are nec es sary. It use s, by default, an original lo cking des ign that de tec ts c onflicts at the leve l of individual fields without a significant incre ase in the m emory fo otprint (no extra data is adde d to any of the c lass es) and there fore there is no GC ove rhead. This lo cking scheme provides fine r granularity and b e tte r parallelism than former ob jec t-base d lo ck de signs. Deuce provides weak atomicity, i.e., do e s not guarante e that c onc urrent ac ce sse s to a share d me mory lo c ation from b oth ins ide and outside a transac tion are consis te nt. This is in orde r to avoid a p e rform anc e p e nalty. Finally, it s upp orts a pluggable STM back- end and an op e n and e asily extendable archite cture , allowing re se arche rs to integrate and tes t the ir own STM algorithms within the Deuce fram ework. Deuce has b e en heavily optimize d for effic iency and, while the re is s till ro om for im prove ments , our p erformance e valuations on s eve ral high- end machines (up to 128 hardware thre ads on a 16-core Sun Niagara-bas ed m achine and a 96-c ore Az ul machine) de mons trate that it sc ales we ll. Our b enchm arks show that it outp erform s the main comp eting JVM-inde p endent Java STM, the DSTM2 frame work [9], in many c ase s by two orde rs of m agnitude , and in general sc ales well on m any workloads. This pap er shows for the first time that one can ac tually de sign a Java STM with reas onable p erformance without c ompile r supp ort. The Deuce fram ework has b ee n in deve lopme nt for more than two ye ars , and we b e lie ve it has re ache d a le vel of maturity s uffi cient to allow it to b e use d by develop ers with no prior e xp ertise in trans actional m em ory. It is our hop e that Deuce will he lp demo c ratize the use of STMs among de velop e rs , and that its op en-s ource nature will encourage STM the m to e xte nd the infras tructure with nove l fe atures , as has b ee n the c ase , for instanc e, w ith the Jike s RVM [1]. The re st of the pap er is organize d as follow s. We firs t ove rview re late d work in Section 2. Se ction 3 des crib es Deuce from the p ersp ec tive of the de velop e r of a concurre nt application. I n Section 4 we disc us s the im ple mentation of the framework, and the n show how it can b e extended, by the means of the STM backends , in Sec tion 5. Section 6 pres ents a com panion front-e nd to Deuce that s upp orts trans actional Java language extensions. Finally, in Sec tion 7, we e valuate the p erformance of Deuce . 2 R elat ed wo rk Several to ols have b e en prop ose d to allow the intro duc tion of STM into Java. They diff er in their program ming interfac e and in the way the y are im ple mente d. One of the first prop osals is Harris and Fras er’s CCR [8] that use d a C- bas ed STM imple me ntation underne ath the JVM and a program matic API to us e STM in Java. DST M [10] is anothe r pione ering Java STM imple mentation. It use d an explicit API to dem arc ate trans actions and re ad/w rite s hare d data. Trans actional ob jec ts had to additionally provide som e pre-defined functionality for cloning their state. More rece ntly, the DSTM2 [9] frame work was intro duce d, prop osing a se t of higher-le vel m echanisms to supp ort transac tions in Java. DSTM2 is a pure Java library that do e s not require any change to the JVM nor to the com piler. It use s a s p ec ial annotation to mark an atomic interfac e. Only a class that implem ents an annotated inte rfac e c an partic ipate in a transac tion. This clas s is create d us ing a sp ec ial fac tory and it can only supp ort trans actional acc es s to primitive typ e s or other annotated c lasse s. Transactions must b e starte d us ing a Callable ob jec t. This design cre ate s an API that, while p owe rful and e legant, requires imp ortant refactoring of legac y co de, and intro duc es m any lim itations, s uch as the lack of supp ort for e xis ting librarie s. The LSA-STM [19] is a Java STM that als o relies primarily on annotations for TM programm ing. Trans actional ob je cts, to b e acc es sed in the c onte xt of a trans action, are annotated as @Transactional, and me tho ds are de clared as @Atomic. Additional annotations can b e use d to indicate that s ome m etho ds w ill not m o dify the state of the target ob ject, (re ad- only) or to sp ecify the b ehavior in cas e an exce ption is throw n within a transac tion. Trans actional ob je cts are implem ented in the LSA-STM us ing inheritanc e and must supp ort state cloning. As such it do es not fully supp ort le gacy co de, and unlike Deuce is not transparent or non- invasive. Among the limitations of the fram ework, one can me ntion that acc es se s to a field are only supp orte d as part of metho ds from the owner class . The re fore , public and static fields cannot b e acc es se d trans actionally. Note that this limitation also applies to other Java STM imple me ntations that use ob jec t-leve l c onflic t dete ction such as DSTM and DST M2. AtomJava [12] take s a different approach to Java STM des ign by providing a source -to-s ource translator bas ed on Polyglot [16], an exte ns ible c ompile r framework. AtomJava adds a new atomic keyword to mark atom ic blo cks , and p e rforms s ome ma jor extensions during co de trans formation, such as adding an e xtra instance fie ld to e ach and every c las s. Atom Java re lie s on this field to maintain an ob jec t-bas ed lo ck schema. This schema is s how n to re duce lo ck ac ce ss ove rhead, but imp ose s a mem ory ove rhead which impacts not only transac tio nal ob je cts but all the application ob je cts. The s ource co de translation approach s im plifie s the tuning and ve rification of STM ins trume ntation, but it m akes it harder for the programm er to debug her original co de. Also, by translating s ource c o de , Atom Java im p os es a ma jor lim itation on us ers in that it pre vents the m from us ing com piled libraries, and re nders the ir sourc e co de with atomic blo cks incompatible with re gular Java c ompile rs (unlike approaches such as DSTM2 and LSA- ST M that are bas ed on annotations ). Multiverse [14] is another STM implem entation for Java that p erform s instrum entation drive n by annotations. Similarly to Deuce , Multiverse supp orts field-bas ed conflic t de tection granularity, but fields can b e monitored as part of a transac tion only if the ir ow ne r clas s was annotate d as @TransactionalObject. This require s the develop er to b e aware of all the data s tructure s that might b e acc es se d as part of a trans action. Neve rthele ss, this approach provides Multive rs e with the ability to maintain s trong atom icity by pre venting any acce ss to transactional ob jec t memb e rs that is not part of a trans action’s c onte xt. Multive rse provides s everal other fe atures , such as a retry/ orelse construct, diff erent leve ls of optim istic and p e ss im is tic b ehavior, and limited c ommuting op e rations. Mo dific ations to the JVM have b ee n prop os ed [22] to replace Java m onitors by trans actions . Atomos [4] is a Java extension that replac es the synchronized keyword by atomic, and wait/notify op e rations by retry state ments . Such c onve rs ions are quite c omplex, b ec aus e the trans forme d c o de must maintain the sam e s emantics as the original c o de, and has to s upp ort op erations like irre vo c able actions or s ignaling. In the end, the dec is ion to replace a monitor dep ends on the tradeoff b etwee n STM ove rhead and the gain in paralle lis m. There exist se veral other STM implem entations for various programm ing languages. An exhaustive list is outside of the sc op e of this pap e r, but the inte re ste d readers can consult the online TM bibliography available from the We b page at http://www.cs.wisc.edu/trans- memory/biblio/index.html. 3 C oncur re nt Pro gr am mi ng w it h D EUCE One of the main goals in des igning the Deuce API was to ke ep it s im ple . A s urvey of the past des igns (se e previous s ection) reveals thre e m ain approache s: (i) adding a new re se rve d keyword to m ark a blo ck as atomic , e.g., atomic in AtomJava [12]; (ii) us ing e xplicit m etho d c alls to dem arc ate trans actions and acc es s s hare d ob je cts, e.g., DST M2 [9] and CCR [8]; or (iii) requiring atom ic c las se s to im plem ent a pre de fine d interfac e or extend a base class , and to impleme nt a s et of metho ds [10]. The firs t approach require s m o difying the com piler and/or the JVM, w hile the othe rs are intrusive from the program me r’s p ersp ec tive as the y re quire significant mo difications to the co de (eve n though some sys te ms us e se mi-automatic weaving m echanism s such as asp ect- orie nte d program ming to e ase the task of the program me r, e .g., [19]). In c ontras t, Deuce has b e en des igne d to avoid any addition to the language or changes to the JVM. In partic ular, no sp ecial c o de is ne ce ssary to indic ate w hich ob jec ts are transac tional, no me tho d ne eds to b e im plem ented to supp orting trans actional ob je cts (e .g., cloning the s tate as in [10]). This allows Deuce to se amles sly supp ort transactional acce ss es to compile d libraries . T he only pie ce of information that mus t b e sp ec ifie d by the programm er is , obvious ly, which part of the c o de s hould exe cute atom ic ally in the c ontext of a trans action. To tha t end, Deuce relies on Java annotations . Intro duce d as a new fe ature in Java 5, annotations allow programm ers to mark a m etho d with me tadata that can be consulted at c las s loading tim e. Deuce intro duce s ne w typ e s of annotations to mark me tho ds as atomic : the ir e xec ution will take place in the context of a transaction. This approach has se veral advantages , b oth technica lly and se mantic ally. First technically, the sm alle st co de blo ck Java can annotate is a me tho d, w hich sim plifie s the ins trume ntation pro ce ss of Deuce and provides a s im ple mo de l for the program me r. Sec ond, atom ic annotations op e rate at the s ame le vel as syn chron ized me tho ds, which e xe cute in a mutually e xclus ion m anner on a give n ob je ct; the refore , atomic me tho ds provide a familiar program ming mo de l. 1 pu blic int sum(L ist l ist ) {2 int to ta l = 0; 3 atomic {4 for ( N o de n : list ) 5 to ta l += n. getVa lue( ); 6 }7 return to ta l; 8} Fig. 1. Atomic blo ck example. From a s em antic p oint of view, imple menting atomic blo cks at the granularity of me tho ds rem oves the nee d to deal with lo cal variable s as part of the trans action. I n particular, s inc e Java do e sn’t allow any acc es s to s tack variable s outside the current me tho d, the STM can avoid logging many me mory acc es se s. For ins tance , in Figure 1, a finer-granularity atom ic blo ck would re quire cos tly logging of the total lo cal variable (otherwis e the metho d would yield incorre ct res ults up on ab ort) whereas no logging would b e nec es sary when conside ring atom ic blo cks at the granularity of individual m etho ds. I n case s when fine r granularity is desirable , Deuce can b e us ed in combination with the TMJava front-e nd dis cuss ed in Se ction 6. To illus trate the use and imple me ntation of Deuce , we will c ons ider a well-known but non- trivial data structure : the skip list [17]. A skip lis t is a probabilistic struc ture based on multiple parallel, s orte d linked lis ts, w ith e fficie nt search tim e in O (log n ).Figure 2 shows a pa rtial implem entation of skip list, with an inner class re pres enting no de s and a me tho d to search for a value through the lis t. The ke y obs ervation in this c o de is that transactifying an applic ation is as e asy as adding @Atomic annotations to me tho ds that should e xecute as trans actions . No c o de nee ds to b e change d within the metho d or else where in the c las s. Interes tingly, the linked list direc tly manipulate s arrays and ac ce ss es public fields from outside their enclos ing ob je cts, which would not b e p oss ible with DSTM2 or LSA-STM. One c an als o obse rve that the @Atomic annotation provide s one configurable attribute , retries, to optionally limit the am ount of retrie s the transac tion attem pts (at m ost 64 tim es in the e xample ). A TransactionException is thrown in cas e this limit is re ache d. Alternatively one can envision providing time out instead (Which we might add in future ve rs ions ). A Deuce application is c ompiled with a re gular Java compiler. Up on exec ution, one ne eds to s p ec ify a Java age nt that allows Deuce to inte rc ept eve ry clas s loade d and manipulate it b e fore it is loade d by the JVM. T he agent is simply sp e cified on the c ommand line as a parame ter to the JVM, as follow s: 1 pu blic cla ss Ski pList {2 private s tat ic cla ss No de {3 pu blic fin al int val ue ; 4 pu blic fin al No de[ ] fo rwar d; 5 // ... 6 pu blic No de( int level, int v) {7 val ue = v; 8 fo rwar d = ne w No de[ level + 1] ; 9 }10 // .. . 11 }12 private stat ic int MAX LEVEL = 32; 13 private int level ; 14 private final No de hea d; 15 // Co ntinued in nex t column... 16 @Atomic (r etr ies=64 ) blic b o olea n contai ns( int v) {18 No de no de = hea d; 19 for ( int i = level ; i >= 0; i --) {20 No de next = no de.fo rwar d[i ]; 21 wh ile (nex t.va lue < v) {22 no de = next; 23 next = no de. for war d[ i ] ; 24 }25 }26 no de = no de.f orwa rd[0 ]; 27 return (no de.value == v) ; 28 }29 // ... 30 } 17 pu Fig. 2. @Atomic me tho d example. java -javaagent:deuceAgent.jar MainClass args... As will b e disc us se d in the ne xt s ection, Deuce ins trume nts e very c las s that may b e used from within a transac tion, not only clas ses that have @Atomic annotations . If it is known that a clas s will ne ver b e us ed in the conte xt of a trans action, one c an pre vent it from b e ing instrum ented by providing e xclus ion lis ts to the Deuce agent. This w ill s p e ed up the applic ation loading tim e ye t s hould not aff ec t exe cution s p ee d. 4 D EUCE Impl em entat io n This sec tion de sc rib e s the im plem entation of the Deuce fram ework. We first give a high- le vel overvie w of its main c omp onents. The n, we explain the pro ce ss of co de ins trume ntation. Finally, we de sc rib e various optimizations that e nhanc e the p e rformance of trans actional c o de . JVM DEUCE agent Application DEUCE runtime TL2 LSALock Jav a cla sse s Fig. 3. Main comp one nts of the Deuce archi tecture. 4.1 Overview The Deuce fram ework is conce ptually m ade up of 3 layers, as s how n in Figure 3: 1. The application layer, w hich consis ts of us er c lass es written without any relations hip to the STM implem entation, exce pt for annotations added to atomic me tho ds. 2. The Deuce runtime laye r, which orche strates the inte rac tions b etwee n transac tions exe cuted by the application and the unde rlying STM impleme ntation. 3. The layer of ac tual STM librarie s that imple me nt the Deuce conte xt API (s ee Section 5), inc luding a s ingle- global-lo ck impleme ntation (denoted as “Lo ck” in the figure, and used as a p erform anc e “refere nc e p oint” for all othe r libraries ). In addition, the Deuce agent inte rc epts c lass es as the y are loade d and instrume nts them b e fore they start exe cuting. 4.2 I nstrumentat ion Framework Deuce ’s ins trumentation engine is based on ASM [3], an all-purp ose Java byte co de manipulation and analys is fram ework. It c an b e use d to mo dify existing class es , or to dynamically ge ne rate class es , dire ctly in binary form. The Deuce Java age nt use s ASM to dynam ically ins trume nt c las se s as they are loaded, b efore the ir first ins tanc e is create d in the virtual machine. During instrum entation, new fields and me tho ds are adde d, and the c o de of transac tional me tho ds is trans forme d. We now desc rib e the different m anipulations p e rforme d by the Java age nt. Fiel ds. For e ach instanc e field in any loaded clas s, Deuce adds a synthe tic c ons tant field ( final static public) that re pres ents the relative p osition of the field in the clas s. This value, together with the instanc e of the clas s, unique ly identifies a field, and is used by the STM implem entation to log fie ld ac ce sse s. blic cla ss Ski pList {2 pu blic sta tic fi nal lon g CL ASS BASE = .. . 3 pu blic sta tic fi nal lon g MAX LEVEL addr ess = .. . 4 pu blic sta tic fi nal lon g level addr ess = . .. 5 // ... 6 private stat ic int MAX LEVEL = 32; 7 private int level ; 8 // ... 9} 1 pu Fig. 4. Fields addres s. Static fields in Java are eff ective ly fields of the e nc los ing clas s. To de signate static fields , Deuce define s for each c lass a c ons tant that represe nts the bas e clas s, and can b e use d in c ombination with the field p osition ins te ad of the clas s instance . For instance , in Figure 4, the level field is re pres ented by level ADDRESS while the SkipList base class is re pre sente d by CLASS BASE . Ac cessors. For e ach field of any loaded clas s, Deuce adds synthe tic static acc es sors us ed to trigger field’s ac ces s eve nts on the lo cal c onte xt. Figure 5 shows the synthe tic ac ce ssors of class SkipList. The getter level Getter$ rec eives the curre nt fie ld va lue and a c onte xt, and triggers two events : beforeReadAccess and onReadAccess; the result of the latter is returned by the ge tte r. The setter rece ives a conte xt and yie lds a s ingle eve nt: onWriteAccess. The re ason for having two eve nts in the getter is technical: the “b e fore ” and “after” eve nts allow the STM 1 pu blic cla ss Ski pList {2 private i nt level ; 3 // ... 4 5 // Synthetic getter 6 pu blic int level Gett er$ ( Co nt ex t c) {7 c . b ef oreR eadAccess( thi s, l evel ADD RESS ); 8 return c.o nRea dA ccess( thi s, l evel , level ADD RESS ); 9 }10 // Synthetic se tter 11 pu blic vo id level Sett er$ ( int v , C ontext c) {12 c . onWri teAccess( thi s, v , level ADD RESS ); 13 }14 // ... 15 } Fig. 5. Fields ac ces sors. backe nd to verify that the value of the field, which is ac ce sse d b etwee n b oth e vents without us ing c ostly refle ction me chanis ms , is consiste nt. This is typic ally the cas e in tim e-base d ST M algorithm s like TL2 and LSA that s hip with Deuce . 1 private s tat ic cla ss No de {2 pu blic No de[ ] fo rwar d; 3 // ... 4 5 // Or igi na l method 6 pu blic vo id setFor wa rd( int level, No de next) {7 fo rwar d[ level ] = nex t; 8 }9 10 // Synthetic dup licate method 11 pu blic vo id setFor wa rd( int level, No de next, Co ntext c) {12 No de[ ] f = fo rwar d Gett er$ (c) {13 fo rwar d Sett er$ ( f , l evel , c ) 14 }15 // ... 16 } Fig. 6. Duplicate metho d. Dupli cate metho ds. In orde r to avoid any p e rformance p enalty for non transactional co de , and since Deuce provides weak atom ic ity, Deuce duplic ates each me tho d to provide two distinc t versions. T he firs t version is identical to the original me tho d: it do e s not trigge r an eve nt up on me mory acc es s and, conse que ntly, do es not imp os e any transac tional overhe ad. The s econd ve rs ion is a synthe tic me tho d w ith an extra Context param eter. In this instrum ented c opy of the original m etho d, all field acc es se s (e xce pt for final fields ) are re plac ed by calls to the as so c iate d transac tional acc es sors. Figure 4 shows the two ve rs ions of a metho d of clas s Node. T he s econd synthe tic ove rload re plac es forward[level] = next by calls to the synthetic ge tte r and se tte r of the forward array. T he first call obtains the re fe re nc e to the array ob jec t, while the se cond one changes the sp ec ified array e leme nt (note that ge tters and s ette rs for array e le me nts have an extra index param ete r). 1 pu 23 // 3 4 // 24 if blic cla ss Ski pList {2 // ... Or igi na l method instr umented 5 pu blic b o olea n contai ns( int v) {6 Thr owa ble thr owa ble = null ; 7 Co nt ext context = 8 Co nt extDel egat or .g etInst ance() ; 9 b o olea n commi t = true ; 10 b o olea n result ; 11 12 for ( int i = 6 4; i > 0; --i) {13 context . ini t (); 14 try {15 result = cont ai ns( v, cont ext ); 16 } ca tch (Tra nsa ct io nExceptio n ex) {17 // Must rol lback 18 commi t = false ; 19 } ca tch (T hrowabl e ex ) {20 thr owa ble = ex; 21 }22 // Co ntinued in nex t column... Tr y to commi t ( co mmit ) {25 if ( co ntex t . co mmit () ) {26 if ( t hrowabl e == null ) 27 return result ; 28 // Rethrow app lication e xcep tio n 29 throw (IOExcepti on) thr owabl e; 30 }31 } els e {32 context . ro llba ck (); 33 commi t = true ; 34 }35 } // Retry loop36 throw ne w Tr ansacti onExcept io n( ); 37 }38 39 // Synthetic dup licate method 40 pu blic b o olea n contai ns( int v, Context c) {41 No de no de = hea d Gett er$ (c); 42 // ... 43 }44 } Fig. 7. Atomic me tho d. At omic m etho ds. The duplication pro c ess des crib e d ab ove has one e xc eption: a metho d annotated as @Atomic do e s not need the firs t uninstrum ented ve rsion. Instead, the original me tho d is replace d by a trans actional ve rs ion that calls the ins trume nte d ve rs ion from within a trans action that exec ute s in a lo op. The pro c es s rep e ats as long as the trans action ab orts and a b ound on the numb er of allowe d retrie s is not re ached. Figure 7 s hows the trans actional version of m etho d contains. 4.3 Sum mary To summ ariz e, Deuce p e rforms the following ins trumentation op erations on the Java byte co de at load-time . – For eve ry fie ld, a c ons tant is added to kee p the re lative lo cation of the fie ld in the c las s for fast ac ce ss . – For e very field, 2 acc es sors are adde d (ge tter and sette r). – For e very clas s, a c ons tant is adde d to keep the re fe re nc e to the class definition and allow fas t acc es s to static fie lds. – Every (non- @Atomic) me tho d is duplicated to provide an atom ic and a nonatomic version. – For e very @Atomic me tho d, an atomic ve rsion is c re ated and the original version is replace d with a re try lo op c alling the atom ic vers ion in the c onte xt of a ne w trans action. 4.4 Opti mi zations During the ins trume ntation, we p e rform s eve ral optimizations to im prove the p e rformance of Deuce . Firs t, we do not instrum ent acc es ses to final fie lds as they cannot b e m o difie d afte r creation. This optim iz ation, toge ther with the de claration of final fields wheneve r p os sible in the application co de , dramatic ally reduc es the overhe ad. Second, fie lds ac ce sse d as part of the c ons tructor are ignore d as the y are not acc es sible by concurre nt threads until the c ons tructor returns. Third, instead of ge ne rating ac ce ss or me tho ds, Deuce actually inlines the co de of the gette rs and s etters directly in the transactional co de . We have obse rve d a s light p e rformance improve me nt from this optimization. Fourth, we chos e to use the sun.misc.Unsafe pseudo-standard inte rnal library to implem ent fast re fle ction, as it proved to b e vas tly more effic ient than the standard Java reflec tion mechanism s. T he implem entation using sun.misc.Unsafe eve n outp erforme d the approach taken in AtomJava [12], w hich is based on using an anonymous class p er field to replac e re fle ction. Finally, we tried to lim it as much as p ossible the stres s on the garbage collec tor, notably by us ing ob je ct p o ols when ke eping track of acc es se d fie lds (re ad and write se ts) in threads. In orde r to avoid any contention o n the p o ol, we had e ach thre ad kee p a s eparate ob ject p o ol as part of its context. Together, the ab ove optim iz ations he lp ed to signific antly de creas e the im plem entation ove rhead, in som e of our b enchm arks this improve d p erformance by almos t an orde r of magnitude (i.e ., tenfold faster) as com pared to our initial imple me ntation. 5 C usto mi zing Co ncurr ency Co ntro l Deuce was des igned to provide a re se arch platform for STM algorithms. I n order to provide a simple API for re se archers to plug in the ir own STM algorithm’s imple me ntation, Deuce define s the Context API as s hown in Listing 8. T he API include s an init me tho d, called once b efore the transac tion s tarts and the n up on each re try, allowing the trans action to initializ e its internal data s tructure s. The atomicBlockId argum ent allow s the transaction to log inform ation ab out the sp ec ific atom ic blo ck (statically ass igne d in the bytec o de ). 1 pu blic interface Co nt ext {2 voi d ini t ( int at omicBlo ckId) ; 3 b o olea n commi t( ); 4 voi d ro ll back () ; 5 6 voi d b efor eReadA ccess(Ob ject ob j, lon g field) ; 7 8 Ob j ect onR eadAccess( Ob ject ob j, Ob j ect value, lon g field) ; 9 int onR eadAccess( Ob ject ob j, int val ue , lon g field ) ; 10 lon g onR eadAccess( Ob ject ob j, lon g val ue , lon g field ) ; 11 // ... 12 13 voi d onWri teAccess( Ob j ect ob j, Ob ject val ue, lon g field) ; 14 voi d onWri teAccess( Ob j ect ob j, int val ue , lon g field ) ; 15 voi d onWri teAccess( Ob j ect ob j, lon g val ue , lon g field ) ; 16 // ... 17 } Fig. 8. Context interfac e. One of the heuris tic s we adde d to the LSA im plem entation is , following [6], that each one of the atomic blo cks will initially b e a re ad- only blo ck. It will b e conve rte d to b ec ome a writable blo ck up on re try, once it encounte rs a firs t write . Using this me tho d, re ad- only blo cks c an save mos t of the overhead of logging the fields’ ac ces s. Another heuris tic is that commit is c alle d in cas e the atomic blo ck finishes without a TransactionException and c an re turn false in case the c ommit fails , which in turn w ill cause a retry. A rollback is called whe n a TransactionException is thrown during the atomic blo ck (this c an b e us ed by the bus ine ss logic to force a retry). The re st of the m etho ds are called up on field ac ces s: a field read e vent will trigger a beforeReadAccess eve nt followed by a onReadAccess eve nt. A write field e vent will trigger an onWriteAccess eve nt. Deuce curre ntly inc ludes three Context imple me ntations , tl2.Context, lsa.Context, and norec.Context, im plem enting the TL2 [6], LSA [19], and NOrec [5] ST M algorithm s. Deuce also provide s a re fe re nce imple me ntation bas ed on a s ingle global lo ck. Since a global lo ck do e sn’t log any fie ld acc es s, it do e sn’t im plem ent the Context interfac e and do es n’t imp ose any ove rhe ad on the fields ’ acc es s. TL2 Context . The TL2 context is a straightforward im plem entation of the TL 2 algorithm [6]. The general princ iple is as follow s (m any de tails omitted). TL2 uses a s hared array of revokable versioned locks, with each ob jec t field b eing mapp e d to a s ingle lo ck. Each lo ck has a ve rs ion that corres p onds to the c ommit times tamp of the transac tion that last up date d some field protec ted by this lo ck. Time stamps are ac quire d up on comm it from a global time bas e, im ple mented as a sim ple counte r, up dated as infreque ntly as p ossible by writing transactions. Whe n re ading s ome data, a transac tion checks that the asso c ia te d tim es tamp is valid, i.e., not more re ce nt than the time whe n the trans action s tarte d, and ke eps track of the acc es sed lo c ation in its re ad s et. Up on write, the transaction buffers the up date in its write set. At comm it tim e, the transac tion acquires (using an atomic com pare- and-se t operation) the lo cks protec ting all w ritten fie lds, verifie s that all e ntries in the re ad se t are s till valid, ac quires a unique com mit tim es tamp, w rites the mo difie d fields to me mory, and re le ase s the lo cks. I f lo cks cannot b e acquires or validation fails , the trans action ab orts , dis carding buff ered up date s. LSA Context. The LSA c onte xt us es the LSA algorithm [19] that was de velop e d concurre ntly with TL2 and is base d on a similar design. The main differe nc es are that (1) LSA acquires lo cks as fie lds are writte n (encounter o rder), instead of at c ommit time, and (2) it p e rforms “increm ental validation” to avoid ab orting trans actions that read data that has b ee n mo dified m ore re cently than the s tart tim e of the trans action. Both TL2 and LSA take a s im ple approach to conflic t managem ent: they s imply ab ort and res tart (p os sibly afte r a de lay) the trans action that encounte rs the conflict. This strategy is s imple and e fficie nt to im plem ent, b ec aus e transac tions unilaterally an ab ort without any s ynchroniz ation with others. Howe ver, this c an som etime s hamp e r progre ss by pro ducing live lo cks. Deuce also s upp orts variants o f the T L2 and LSA algorithm s that provide mo dular conte ntion managem ent s trate gie s (as firs t prop ose d in [10]), at the price of som e additional s ynchroniz ation overhead at runtime. NOre c Context. The NOre c context implem ents the lightweight algorithm prop os ed in [5]. Roughly sp eaking, NOre c diff ers from TL2 and LSA in that it use s a single ownership re cord (a global vers ioned lo ck) and re lie s on value -based validation in addition to time- bas ed validation. The rationale of this de sign is to avoid having to pay the runtime ove rheads as so ciated with ac ce ss ing multiple owne rship records while s till providing a re asonable leve l of concurre nc y thanks value -base d validation which helps ide ntify fals e conflic ts. T his algorithm was s how n to b e effic ient at low thre ad c ounts, but tends to fall apart as c onc urrency increas es. In c ontras t, TL2 and LSA provide b e tte r sc alability, p erform ing b ette r as concurre nt threads are added or when there are more frequent but disjoint comm its of s oftware trans actions . 6 J ava l anguag e exte nsi ons Deuce uses annotation to indicate trans action dem arc ation. Therefore , it only s upp orts trans actions at the granularity of whole m etho ds, and cannot dec lare s horter “atomic blo cks ” as in Listing 9. In this e xam ple , one wants to transfer the balanc e of a s et of ac counts to another ac count. Transfer op e rations mus t b e atomic , but us ing a s ingle transac tion for all transfers would pro duc e an unnec es sarily long trans action, and would inc re ase the risk of conflic ts with concurre nt op e rations. In c ontrast, us ing one transaction p er trans fe r would re duce the like liho o d of ab orts, and allow for b e tter concurrency. Note that the lo cal variable total, which holds the total am ount trans fe rred, mus t b e ac ce ss ed trans actionally, w ith its pre vious value res tore d up on trans actional ab ort. 1 pu blic int tr ansferA ll ( A cco unt[ ] src , Account dst ) {2 int to ta l = 0 ; 3 for ( A ccount acc : sr c ) {4 atomic {5 int amount = acc. bala nce() ; 6 acc . wi thdr aw (a mo unt) ; 7 dst . dep osit ( a mo unt) ; 8 to ta l += a mo unt; 9 }10 }11 return to ta l; 12 } Fig. 9. Trans actions at the granularity of atomic blo cks . Intro duc ing ne w keywords to supp ort trans actional me mory at the language le vel do e s not only provide finer trans action granularity, but also allows for more s ophisticate d c ons tructs to c ontrol re cove ry, re trie s, etc. This allows trans actions to b e use d in more p owe rful ways. Such transac tional extensions do, howe ver, require e ither using a pre -pro c es sor, or mo difying the Java com piler to supp ort the ne w constructs. We have therefore de velop e d a front- end to ol, TMJava, that trans form s the source c o de from a program written with the language e xte ns ion for atom ic blo cks, into a program that is suitable for instrum entation by Deuce . The trans formation p erform ed by the front-e nd to ol m aps the atom ic blo cks in the e xte nded language into annotated atom ic me tho ds. In other words, the frontend to ol analyze s the co de to find the atomic blo cks (atomic keyword) ins ide clas s me tho ds; it then c reates ne w metho ds whos e b o dies c ons is t of the content of the atomic blo cks, and replac es the blo cks w ith c alls to the se new me tho ds . Howeve r, mapping an atomic blo ck into an annotate d atom ic me tho d is not trivial as we nee d to take into acc ount s everal is sues, notably: 1. Variable s and ob je cts, that are ac ce ss ible ins ide the scop e of the metho d in w hich the atomic blo ck is dec lare d, should als o b e available inside the s cop e of the atomic blo ck. 2. Mo dific ations inside an atomic blo ck of variables and ob jec ts that are define d outs ide the s cop e of the blo ck, s hould b e visible outside the atom ic blo ck. Variable s and ob je cts that are define d outs ide the atomic blo ck, but are use d and/or mo dified inside the atom ic blo ck, are pas se d as param eters to the corre sp onding annotated atom ic m etho d. As Java only supp orts paramete r pass ing by-value for primitive typ e s, variables that are mo dified inside the atom ic blo ck are pas se d in arrays , and up date d value s are c opied back to the variables when re turning from the atomic me tho d (in c ase a s ingle variable is mo difie d, its ne w value is s imply returne d by the me tho d). Lis ting 10 shows the Deuce -c ompatible c o de pro duc ed by TMJava from Listing 9. 1 pu blic cla ss Bank {2 pu blic int tr ansferA ll ( A cco unt[ ] src , Account dst ) {3 int to ta l = 0 ; 4 for ( A ccount acc : sr c ) {5 to ta l = t ra nsfer All ab1 ( t ota l , dst , acc ); 6 }7 return to ta l; 8 }9 @Atomic 10 private final int tr ansferA ll ab1 ( int to ta l , Account dst, A cco unt a cc) {11 int amount = acc. bala nce() ; 12 acc . wi thdr aw (a mo unt) ; 13 dst . dep osit ( a mo unt) ; 14 to ta l += a mo unt; 15 return to ta l; 16 }17 } Fig. 10. Co de of Listing 9 trans form ed by TMJava for Deuce . In addition to s upp orting basic atomic blo cks, TMJava provides se veral additional ke ywords , notably retry, next, leave (for c ontrol flow), either/or/otherwise (for alte rnative s), and on failure (for atom ic e xc eption handling). The TMJava front-e nd to ol is base d on the JastAd d exte ns ible Java com piler [7]. In a first ste p, an abstract syntax tre e (AST ) is c ons tructe d from the s ource core. The AST is then analyz ed to lo c ate trans actional language e xte ns ions, and is mo difie d acc ordingly. Finally, Deuce -c ompatible annotate d s ource co de is generated from the mo dified AST. The re fore , TMJava trans forms Java co de with T M e xte ns ions into “pure” Java source co de that can b e com piled us ing a standard compile r, but with the s tructure and annotations exp e cted by Deuce for transac tifying the application. 7 Per fo rm ance Evaluati on We e valuate d the p e rformance of Deuce on a Sun UltraSPARCTMT2 Plus multicore machine (2 CPUs e ach with 8 cores at 1.2 GHz, e ach with 8 hardware threads) and an Azul Vega2(2 CPUs each with 48 cores ).TM 7.1 Deuce overheads firs t briefly disc us s the ove rheads intro duced by the Deuce agent when ins trume nting Java clas se s. Table 1 s hows the me mory fo otprint and the pro c es sing overhead when pro ce ssing com piled librarie s: rt.jar is the runtime library c ontaining all Java built- in c las se s for JDK 6; cache4j.jar (version 0.4) is a wide ly us ed inmem ory cache for Java ob jec ts; JavaGran de (version 1.0) is a we ll- know n collec tion of Java b e nchmarks . I ns trumentation time s we re meas ure d on an Inte l Core 2 DuoCPU running at 1.8 GHz. Note that ins trumentation was e xe cuted se rially, i.e ., without exploiting multiple cores . T MWe App lication Memory In st rumentation Ori ginal Instrum ented Time rt.jar 50 MB 115 MB 29 s cache4j.jar 248 KB 473 KB < 1 s JavaGrande 360 KB 679 KB < 1 s Table 1. Mem ory fo otprint and pro ce ssi ng overhead. As e xp ec ted, the size of the c o de approximate ly doubles b e cause Deuce duplic ate s e ach metho d, and the pro c es sing ove rhead is prop ortional to the s iz e of the co de. However, b e cause Java us es laz y c lass loading, only the class es that are actually used in an application will b e ins trume nte d at load time . Therefore , the overhe ads of Deuce are negligible considering individual clas se s are loade d on dem and. In terms of e xe cution time , instrum ented class es , that are not invoke d w ithin a trans action, incur no p erform anc e p enalty as the original metho d exec ute s. 7.2 Be nchmarks We te ste d the four built-in Deuce STM options: LSA, TL2, NOre c, and the simple global lo ck. Our LSA ve rs ion captures m any of the prop e rtie s of the LSA-STM [20] fram ework. The TL2 [6] form provides an alternative STM imple mentation that acquire s lo cks at c omm it tim e using a w rite se t, as o pp os ed to the encounte r order lo cking approach used in LSA. Our e xp eriments inc luded three clas sical m ic ro-b e nchm arks : a red/black tre e, a sorted linked lis t, and a skip list, on which we p erforme d c onc urrent inse rt, de lete, and se arch op erations. We c ontrolled the siz e of the data s tructures, which rem aine d constant for the duration of the exp e rime nts, and the mix of op e rations. We also e xp erimente d with three re al- world applications, the first implem ents the Lee routing algorithm (as pre sente d in Lee- TM [2]). It is worth noting that adapting the application to Deuce was straightforward, with little m ore than a change of s ynchroniz ed blo cks into atom ic m etho ds. The sec ond application is c alle d Cache 4J, an LRU c ache imple me ntation. Again adapting the application to Deuce was s traightforward. The third application com es from the wide ly us ed STAMP [13] TM b enchmark suite. Micro-b enchmarks. We b e gan by c omparing Deuce to DST M2, the only c omp e ting STM frame work that do e s not re quire compiler s upp ort. We b enchmarke d DST M2 and Deuce based on the Red/Black tree b enchmark provide d with the DST M2 frame work. We tried many variations of op eration c ombinations and leve ls of concurre nc y on b oth the Azul and Sun machine s. C omparison res ults for a tre e with 10k e lem ents are s hown in Figure 11, we found that in all the b enchmarks DST M2 was ab out 100 tim es (two orde rs of magnitude ) s lowe r than Deuce . Our res ults are c ons is tent with those published in [9], where DSTM2’s throughput, me asured in op e rations p er se cond, is in the thous ands, while Deuce ’s throughput is in the m illions . Bas ed on thes e e xp erime nts , we b elieve one can conclude that Deuce is the first viable c ompile r and JVM indep ende nt Java STM fram ework. 1 2 1 4 0 Number of threads 96 6 96 RB tree, updates 0 0 . 2 Throughput (× )10txs/s 2 4 6 8 1 0 80 6 6 Throughput (× )10txs/s 1 16 32 48 64 112 128 Number of threads 0 RB tree, 10k elements, 10% updates DSTM2 0.5 Lock LSA TL2 1 1.5 2 2.5 1 16 32 48 64 80 3 112 128 Throughput (× 10 txs/s) RB, 10k elements, 0% updates DSTM2 Lock LSA TL2 10k elements, 20% DSTM2 Lock LSA TL2 0 . 14 16 32 48 64 112 128 0 . Number of threads 6 0 . 8 1 1 . 2 1 . 4 1 . 6 80 96 Fig. 11. The red/bl ack tre e b enchm ark (Sun). On the othe r hand we obse rve that the Deuce STMs sc ale in an impre ss ive way eve n with 20% up date s (up to 32 thre ads ). While DST M2 shows no s calability. Whe n we investigated dee p er we found out that m ost of DSTM2 ove rhe ad res t in two are as. The mos t signific ant are a is the contention manager w hich acts as a c ontention p oint, the s ec ond area is the reflec tion m echanism us ed by DST M2 to re trie ve and ass ign values during the transac tion and in c ommit. Next we pres ent the re sults of the te sts with the linked list and the skip list on the Sun and Az ul machines . The linke d lis t b enchmark is known to pro duce a large degree of contention as all thre ads mus t traverse the s ame long s eque nc e of no des to re ach the targe t e le me nt. Res ults of exp e rim ents are s how n in Figure 12 and Figure 13 for a linked list with 256 e leme nts and different up date rates. We observe that b ecause there is little p otential for parallelism, the s ingle lo ck, w hich p e rforms significantly le ss lo ck acquis ition and tracking work than the STMs, wins in all three b e nchmarks . The STMs s cale slightly until 30 threads w he n there are up to 5% up dates , and then drop due to a ris e in ove rhea d w ith no b e ne fit in parallelism. With 20% up date s, the m ax is reached at ab out 5 thre ads , b e cause as concurrency furthe r increas es one c an e xp e ct at leas t 2 c onc urrent up date s, and the STMs suffer from rep e ated trans actional ab orts . Finally, we consider res ults from the skip lis t b e nchmark. Skip lists [18] allow significantly greate r p otential for parallelism than linked lists b e cause different threads are likely to travers e the data struc ture following dis tinct paths . Res ults for a list with 16k e leme nts are shown in Figure 14 and Figure 15. We obs erve that the STMs sc ale in an impre ssive way, as long as there is an inc rease in the leve l of parallelism, and provide gre at sc alability e ven with 20% up dates as long as the ab ort rate remains 1 16 32 48 96 Number of threads Fig. 12. The l inked li st b enchm ark (Sun). Fig. 13. The l inked li st b enchm ark (Azul). TL2 LSA Lock 3 1 10 15 20 25 30 10 30 50 70 100 Number of threads TL2 LSA Lock 6 6 Throughput (× 10 txs/s) 0 5 1 16 48 96 128 196 256 1 16 48 96 128 196 256 Number of threads Number of threads SL, 16k elements, 0% Fig. 14. The sk ipl is t b enchmark (Sun). updates SL, 16k elements, 20% updates TL2 LSA 0 TL2 LSA 0 Lock 5 Lock 2 1 0 0 4 1 0 5 6 2 0 0 8 2 0 5 1 30 50 70 100 140 180 1 30 50 70 100 140 180 1 3 0 0 0 Number of threads Numb er of threa ds txs/s) Throughput (× 10 3 1 16 32 48 96 Number of threads SL, 16k elements, 50% updates 0 LL, 256 elements, 20% 1000 updates 2000 TL2 3000 LSA Lock 4000 5000 6000 3 SL, 16k elements, 20% updates 0 LL, 256 elements, 5% 1000 updates 2000 TL2 3000 LSA Lock 4000 5000 6000 Throughput (× 10 txs/s) 3 txs/s) Throughput (× 10 SL, 16k elements, 0% updates 0 LL, 256 elements, 0% 0 500 updates 2 0 0 100 4 00 0 150 TL2 6 00 LSA Lock 0 200 8 00 0 250 1 00 0 300 1 10 30 50 70 100 0 00 Number of threads Throughput (× 10 txs/s) TL2 LSA Lock 1000 LL, 256 elements, 20% 2000 updates 3000 TL2 4000 LSA Lock 5000 0 Throughput (× 10 txs/s) 1 16 32 48 96 Number of threads LL, 256 elements, 5% updates 1 10 30 50 70 100 Number of threads TL2 LSA Lock 15 10 5 0 1 16 48 96 128 196 256 Number of threads SL, 16k elements, 50% updates TL2 4 LSA Lock 6 TL2 LSA Lock 5000 0 3 3 txs/s) Throughput (× 10 10000 Throughput (× 10 txs/s) 15000 0 10 00 20 00 30 00 40 00 50 00 Throughput (× 10 txs/s) LL, 256 elements, 0% 20000 updates 3 2 1 1 30 50 70 100 Number of threads0 140 180 6 Throughput (× 10 txs/s) 6 Throughput (× 10 txs/s) 6 txs/s) Throughput (× 10 Fig. 15. The sk ipl is t b enchmark (Azul ). at a reas onable leve l. We adde d a b enchmark with 50% up date s to s how that there is a lim it to their scalability. When we inc re ase the frac tion of up date s to 50% , the STMs in Figure 15 reach a “m elting p oint” much earlie r (at ab out 10 threads versus 100 threads in the 20% b e nchmark) and overall the lo ck wins again b ec aus e the ab ort rate is high and the STMs inc ur and overhead w ithout a gain in paralle lis m. We note that typically se arch s tructure s have ab out 10% up date s. Re al appl icati ons. Our last b enchmarks dem ons trate how simple it is to replace the critic al se ctions with trans actions . The firs t take s a s erial ve rsion of the Lee rout- ing algorithm and dem ons trate s how a s im ple re place me nt of the critic al sec tions by trans actions signific antly improves sc alability. The sec ond take s a non-multi-thre ade d lo ck bas ed a LRU c ache implem entation (Cache4J) and shows that it is s traightforward to replace the c ritical s ec tions, but sc alability is n’t promise d. Circuit routing is the pro ce ss of automatically pro duc ing an interconne ction b e tween e le ctronic com p onents. Lee ’s routing algorithm is an attractive candidate for parallelization s inc e circuits (as shown in [2]) c ons ist of thous ands of routes, e ach of which c an p otentially b e routed concurre ntly. The graph in Figure 16 s how s exe cution time (that is , latency, not throughput) for three diffe re nt b oards with the TL2, LSA, and NOrec algorithm s. As c an b e s ee n, Deuce sc ales well with all algorithms , w ith the overall time de creas ing eve n in the cas e of a large b oard (MemB oard). NOre c exhibits b etter p erformance with a s ingle thre ad due to the lower overhead of its single ow ne rs hip rec ord. With more threads, the p erformance of all algorithms is virtually identic al. This c an b e explained by the natural parallelism and low contention of this b enchmark. Lee-TM, MemBoard Execution time TL2 LSA NOrec 0 5 0 1 0 0 1 5 0 2 1 10 15 20 25 30 0 0 Number of threads 2 5 0 3 0 0 Fig. 16. The Lee R outi ng b enchm ark (Sun). Cache 4J is an LRU lo ck- bas ed cache imple me ntation. The implem entation is based on two inte rnal data struc tures , a tree and a hash-m ap. The tre e m anage s the LRU while the hash-map holds the data. The Cache4J implem entation is base d on a single global lo ck which naturally leads to no sc alability. T he graph in Figure 17 s hows the res ult of replacing the single lo ck with transac tions us ing the LSA algorithm. As can b e se en, Deuce do e sn’t s cale well, with the ove rall throughput slightly decre asing. A quick profiling shows that the fact that Cache 4J is an LRU cache implie s that eve ry ge t op eration also up date d the inte rnal tre e. This alone make s eve ry transac tion an up date transac tion. Yet, the fact that the total throughput re mains almost the same even with 80 threads is encouraging due to the fact that trans actional me mory’s main advantages , b esides scalability, are c o de s implicity and robus tne ss. Finally, we tes ted Deuce on the Vac ation b e nchm ark from STAMP [13], w hich is the mos t wide ly us ed TM b e nchm ark s uite. As STAMP was b e en originally written in C, we use d a Java p ort of the original b e nchmarks adapted for Deuce . The Vacation application m o dels an online travel re se rvation syste m, imple mente d as a s et of tree s that ke ep track of c us tomers and their res ervations . T hreads sp end time e xe cuting Cache4j (LSA) 2 4 6 0% updates 10% updates 8 1 0 1 16 32 48 64 Number 1of threads 2 80 1 4 1 6 Fig. 17. The Cache 4J b enchm ark (Sun). trans actions , and the contention le vel is mo de rate . Figure 18 s how s the e xe cution time of Vac ation with the TL2, LSA, and NOre c algorithms. We can obse rve that b oth TL2 and LSA scale well with the numb er of cores . In contras t, NOrec s how s almos t no s calability, although it p e rforms b e tter than TL2 and LSA unde r low thre ad c ounts. This c an b e e xplaine d by the fac t that (1) NOre c’s single ownership rec ord is p enaliz ed by the s ize o f the trans actions , and (2) value -base d validation is quite exp e ns ive in Java since it require s ac ce ssing me mory via APIs that are s lower than the direct mem ory acc ess es provided by unmanaged language s. STAMP: Vacation TL2 LSA NOrec 0 2 0 4 0 6 0 8 0 1 2 4 8 161 32 0 Number of threads 0 1 2 0 1 4 0 Fig. 18. The Vacati on b enchm ark from the STAMP s uite (Sun). In conclus ion, Deuce shows the typical p e rform anc e patte rns of an STM, go o d sc alability w he n there is a p ote ntial for paralle lis m, and unimpre ss ive p erform anc e when paralle lis m is low. T he res ults, howe ver, are very encouraging c ons ide ring the fact that Deuce , a pure Java library with all the im plied ove rhe ads , shows sc alability eve n at low thread counts. 8 C oncl usio n We intro duce d Deuce , a nove l op e n- sourc e Java fram ework for transac tional mem ory. As we showe d, Deuce is Execution time 3 txs/s) Throughput (× 10 the first e fficie nt fully feature d Java ST M frame work that c an b e adde d to an existing applic ation without changes to its compile r or libraries . I t dem ons trate s that one can de sign an e fficie nt, pure Java ST M without a c ompile r s upp ort, eve n though language supp ort as provided by the com panion front-e nd TMJava provides additional express ivenes s. Though much work obvious ly rem ains to b e done , we b elieve Deuce is ready for use by deve lop ers. It is fre ely dow nloadable from http://code.google.com/p/deuce under the Apache lic ense . 1In conclus ion, Table 2 shows a general overvie w c omparing the two we ll- know n Java ST Ms Atom Javaand DT MS2 vs . Deuce . This comparis on shows that Deuce ranks highe st overall, and provides nove l c apabilities w hich yield much b etter us ability and p erformance than forme rly e xis ted. AtomJava DSTM2 Deuce Locks Ob je ct based Ob je ct based Fiel d bas ed Instrumentation Source Runtim e Runtim e API Non-Intrus ive Intrusi ve Non-Intrus ive Libraries support None None Runtim e & offl ine Fields access Anonym ous class es Re flecti on Low l evel unsafe Execution overhead Medi um High Medi um Extensible Yes Yes Yes Transaction context atomic blo ck Callable ob ject @Atomic annotation Table 2. Com paring Atom Java, DST M2, and Deuce Ackno w ledgemen ts. This pap er was s upp orted in part by grants from Sun Mic ros ystem s, I nte l Corp oration, Microsoft Inc., as well as a grant 06/1344 from the Israe li Science Foundation, Europ ean Union grant FP7- ICT- 2007-1 (pro je ct VELOX), and Swiss National Foundation grant 200021-118043. The TMJava front-e nd has b een develop ed by Derin Harm anc i. Refer ence s 1. B. Al p ern, S. Augart, S. M. Bl ackburn, M. Butrico, A. Co cchi , P. Cheng, J. Dol by, S. Fink , D. Grove, M. Hind, K. S. McKi nley, M. Mergen, J. E . B. Mos s, T. Ngo, and V. Sarkar. T he jikes res earch v irtual m achine pro ject: bui ldi ng an op en-source re search communi ty. IBM Sy st. J. , 44(2): 399–417, 2005. 2. M. Ans ari , C . Kotsel idi s, I. Wats on, C. Ki rkham, M. L uj´an, and K. Jarvis . L ee-tm : A non-tri vi al b enchm ark sui te for transactional mem ory. In Proc. of the International Conference on Alg orithms an d Architectures for Paral lel P rocessing (ICA3P P), pages 196–207, Berl in, Hei delb erg, 2008. Springer-Verlag. 3. W. Binder, J. Hul aas, and P. Moret. Advanced java byteco de ins trum entation. In Proc. of the International Symposium on P rin cipl es and Prac tice of Programming in Jav a (PPP J), pages 135–144, New York, NY, USA, 2007. ACM. 4. B. D. C arls trom , A. McD onal d, H. Chafi, J. Chung, C. C. Minh, C. Kozy rak is , and K. Ol ukotun. The atom os transactional programmi ng l anguage. SIGPLAN Not., 41(6): 1–13, 2006. 1 We did not s how compari son b e nchmark re sul ts w ith Atom Java due technical l im itati ons and bugs found in its current im plem entation, but we have observed on smal l b e nchmark s that Deuce outp erform s Atom Java by a comfortable margi n. 5. L. Dal ess andro, M. Sp ear, and M. Scott. Norec: stream li ning stm by ab oli shing ow nershi p records. In Proc. of the ACM SIGPLAN symposium on Principles an d practice of paral lel prog ramming (PP oP P), pages 67–78, J an 2010. 6. D. Di ce, O. Shalev , and N. Shavi t. Trans acti onal lo cki ng i i. In Proc. of the International Sy mposium on Distributed Computing (DISC), pages 194–208, 2006. 7. T. Ek man and G. Hedi n. T he JastAdd e xtensi ble J ava compi ler. SIGPLAN Not. , 42(10):1–18, 2007. 8. T. Harris and K. Frase r. Language s upp ort for li ghtweight transacti ons. In Proc. of the ACM SIGPLAN Confe rence on Objec t-Oriented Programming S ystems, Lan guag es an d Appl ications (OOPSLA), pages 388–402, New York, NY, USA, 2003. ACM. 9. M. Herl ihy, V. L uchangco, and M. Moir. A flexi ble framework for i mpl eme nting software transacti onal mem ory. In Proc. of the ACM SIGP LAN Conference on Object-O rien ted Programming Systems, Languages and Applications (OOPSLA), pages 253–262, New York, NY, USA, 2006. ACM. 10. M. Herl ihy, V. Luchangco, M. Moi r, and W. N. Scherer I I I. Software transactional me mory for dy nami c-si zed data s tructures . In Proc. of the Ann ual ACM S ymposium on P rin cipl es of Distributed Computin g (PO DC), pages 92–101, Ne w York , NY, USA, 2003. ACM. 11. M. Herli hy and E. Mos s. Transacti onal mem ory: Archi te ctural supp ort for lo ck-free data s tructures. In Proceedings of the Twentieth Annual International Symposium on Compute r Architecture , 1993. 12. B. Hi ndman and D. G rossm an. Atomici ty vi a source-to-source transl ati on. In Proc. of the Work shop on M emory Sy stem Performance and Correctne ss (MSPC), pages 82–91, New York, NY, USA, 2006. ACM. 13. C. C. Minh, J. C hung, C. Kozy raki s, and K. Olukotun. STAMP: Stanford transactional appli cati ons for multi -pro ce ssi ng. In Proceedings of IISW C, Septemb er 2008. 14. Multi ve rse. http: //mul tiverse .co dehaus .org/. 15. Y. Ni , A. Wel c, A. -R . Adl -Tabatabai , M. Bach, S. Berkowi ts , J. Cownie , R. Ge va, S. Kozhukow, R. Narayanaswamy, J. Ol iv ie r, S. Prei s, B. Saha, A. Tal , and X. Ti an. Des ign and im plem entation of trans actional cons tructs for C/C+ +. In Proc. of the ACM SIGPLAN Confe rence on Objec t-Oriented Programming S ystems, Lan guag es an d Appl ications (OOPSLA), pages 195–212, New York, NY, USA, 2008. ACM. 16. N. Nys trom , M. R. Cl ark son, and A. C. Mye rs. Poly glot: An extens ibl e compil er fram ework for java. In Proc. of the Intern ational Conferen ce on Compil er Construction (CC), pages 138–152, 2003. 17. W. P ugh. Sk ip li sts: a probabi li stic alternative to balanced tre es. Commun. ACM, 33(6): 668–676, 1990. 18. W. Pugh. Sk ip l ists : a probabil isti c al te rnati ve to bal anced trees . ACM Transac tions on Database Sy stems, 33(6): 668–676, 1990. 19. T. R iegel , P. Fel b er, and C. Fetzer. A lazy snaps hot al gorithm wi th eager vali dati on. In Proc. of the Inte rn ational S ymposium on Distrib uted Computing (DISC), page s 284–298, Sep 2006. 20. T. R iegel , C. Fetzer, and P. Fel b er. Ti me-based trans acti onal m emory wi th scal abl e ti me bases . In Proc. of the Annual ACM Symposium on Paral le l Algorithms and Architecture s (SPAA) , pages 221–228, New York, NY, USA, 2007. ACM. 21. N. Shavi t and D. Toui tou. Software trans actional m emory. Distrib ut ed Computing, 10(2): 99–116, February 1997. 22. L. Zi arek, A. Wel c, A. -R . Adl -Tabatabai , V. Menon, T. Shp e ism an, and S. Jagannathan. A uniform transactional e xecution envi ronm ent for java. In ECOOP ’08: Proceedings of the 22nd European confe ren ce on Object-Oriented Programming , page s 129–154, Berl in, Heidel b erg, 2008. Spri nger-Ve rlag.