Software Transactional Memory

Nir Shavit*                           Dan Touitou
MIT and Tel-Aviv University           Tel-Aviv University

*Contact Author: E-mail: shanir@theory.lcs.mit.edu

Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee. PODC 95 Ottawa Ontario CA © 1995 ACM 0-89791-710-3/95/08..$3.50

Abstract

As we learn from the literature, flexibility in choosing synchronization operations greatly simplifies the task of designing highly concurrent programs. Unfortunately, existing hardware is inflexible and is at best on the level of a Load_Linked/Store_Conditional operation on a single word. Building on the hardware based transactional synchronization methodology of Herlihy and Moss, we offer software transactional memory (STM), a novel software method for supporting flexible transactional programming of synchronization operations. STM is non-blocking, and can be implemented on existing machines using only a Load_Linked/Store_Conditional operation. We use STM to provide a general highly concurrent method for translating sequential object implementations to lock-free ones based on implementing a k-word compare&swap STM-transaction. Empirical evidence collected on simulated multiprocessor architectures shows that our method always outperforms all the lock-free translation methods in the style of Barnes, and outperforms Herlihy's translation method for sufficiently large numbers of processors. The key to the efficiency of our software-transactional approach is that unlike Barnes style methods, it is not based on a costly "recursive helping" policy.

1 Introduction

A major obstacle on the way to making multiprocessor machines widely acceptable is the difficulty of programmers in designing highly concurrent programs and data structures. Given the growing realization that unpredictable delay is an increasingly serious problem in modern multiprocessor architectures, we argue that conventional techniques for implementing concurrent objects by means of critical sections are unsuitable, since they limit parallelism, increase contention for memory and interconnect, and make the system vulnerable to timing anomalies and processor failures. The key to highly concurrent programming is to decrease the number and size of critical sections a multiprocessor program uses (possibly eliminating critical sections altogether) by constructing classes of implementations that are non-blocking [7, 15, 14]. As we learn from the literature, flexibility in choosing the synchronization operations greatly simplifies the task of designing non-blocking concurrent programs. Examples are the non-blocking data-structures of Massalin and Pu [22], which use a Compare&Swap on two words, Anderson's [2] parallel path compression on lists, which uses a special Splice operation, the counting networks of [5], which use a combination of Fetch&Complement and Fetch&Inc, Israeli and Rappoport's Heap [18], which can be implemented using a three-word Compare&Swap, and many more. Unfortunately, most of the current or soon to be developed architectures support operations on the level of a Load_Linked/Store_Conditional operation for a single word, making most of these highly concurrent algorithms impractical in the near future.

Bershad [7] suggested to overcome the problem of providing efficient programming primitives on existing machines by employing operating system support. Herlihy and Moss [16] have proposed an ingenious hardware solution: transactional memory. By adding a specialized associative cache and making several minor changes to the cache consistency protocols, they are able to support a flexible transactional language for writing synchronization operations. Any synchronization operation can be written as a transaction and executed using an optimistic algorithm built into the consistency protocol. Unfortunately though, this solution is blocking.

This paper proposes to adopt the transactional approach, but not its hardware based implementation. We introduce software transactional memory (STM), a novel design that supports flexible transactional programming of synchronization operations in software. Though we cannot aim for the same overall performance, our software transactional memory has clear advantages in terms of applicability to today's machines, portability among machines, and resiliency in the face of timing anomalies and processor failures.

We focus on implementations of a software transactional memory that support static transactions, that is, transactions which access a pre-determined sequence of locations. This class includes most of the known and proposed synchronization primitives in the literature.

1.1 STM in a nutshell

In a non-faulty environment, the way to ensure the atomicity of operations is usually based on locking or acquiring exclusive ownerships on the memory locations accessed by an operation Op.
If a transaction cannot capture an ownership it fails, and releases the ownerships already acquired. Otherwise, it succeeds in executing Op and frees the ownerships acquired. To guarantee liveness, one must first eliminate deadlocks, which for static transactions is done by acquiring the ownerships needed in some increasing order. In order to continue ensuring liveness in a faulty environment, we must make certain that every transaction completes even if the process which executes it has been delayed, swapped out, or crashed. This is achieved by a "helping" methodology, forcing other transactions which are trying to capture the same location to help the owner of this location to complete its own transaction. The key feature in the transactional approach is that in order to free a location one need only help its single "owner" transaction. Moreover, one can effectively avoid the overhead of coordination among several transactions attempting to help release a location by employing a "reactive" helping policy which we call non-redundant-helping.

1.2 Sequential-to-Lock-free Translation

One can use STM to provide a general highly concurrent method for translating sequential object implementations into non-blocking ones based on the caching approach of [6, 26]. The approach is straightforward: use transactional memory to implement any collection of changes to a shared object, performing them as an atomic k-word Compare&Swap transaction (see Figure 2) on the desired locations. The non-blocking STM implementation guarantees that some transaction will always succeed.

Herlihy, in [15] (referred to in the sequel as Herlihy's method), was the first to offer a general transformation of sequential objects into non-blocking concurrent ones. According to his methodology, updating a data structure is done by first copying it into a newly allocated block of memory, making the changes on the new version and tentatively switching the pointer to the new data structure, all that with the help of Load_Linked/Store_Conditional atomic operations. Unfortunately, Herlihy's method does not provide a suitable solution for large data structures and, like the standard approach of locking the whole object, does not support concurrent updating. Alemany and Felten [4] and LaMarca [20] suggested to improve the efficiency of this general method at the price of losing portability, by using operating system support and making a set of strong assumptions on system operations.

To overcome the limitations of Herlihy's method, Barnes, in [6], introduced his caching method, which avoids copying the whole object and allows concurrent disjoint updating. A similar approach was independently proposed by Turek, Shasha, and Prakash [26]. According to Barnes, a process first "simulates" the execution of the updating in its private memory, i.e., reading a location for the first time is done from the shared memory but writing is done into the private memory. Then, the process uses a non-blocking k-word Read-Modify-Write atomic operation which checks if the values contained in the memory are equivalent to the values read in the cache update. If this is the case, the operation stores the new values in the memory. Otherwise, the process restarts from the beginning. Barnes suggested to implement the k-word Read-Modify-Write by locking in ascending key order the locations involved in the update, and releasing the locks after completing the operation. The key to achieving the non-blocking resilient behavior in the implementation of [6, 26] is the cooperative method: whenever a process needs (depends on) a location already locked by another process, it helps the locking process to complete its own operation, and this is done recursively along the dependency chain. Though Barnes and Turek, Shasha, and Prakash are vague on specific implementation details, a recent paper by Israeli and Rappoport [19] gives, using the cooperative method, a clean and streamlined implementation of a non-blocking k-word Compare&Swap using Load_Linked/Store_Conditional. However, as our empirical results suggest, both the general method and its specific implementation have two major drawbacks which are overcome by our STM based approach:

- The cooperative method has a recursive structure of helping which frequently causes processes to help other processes which access a disjoint part of the data structure.

- Unlike STM's transactional k-word Compare&Swap operations, which mostly fail on the transaction level and are thus not "helped," a high percentage of cooperative k-word Compare&Swap operations fail but generate contention, since they are nevertheless helped by other processes.

Take for example a process P which executes a 2-word Compare&Swap on locations a and b. Assume that some other process Q already owns b. According to the cooperative method, P first acquires a and then helps Q complete its operation before acquiring b and continuing its own operation. However, in many cases P's operation will fail, since Q changed b after P already read it. All the processes waiting for location a will have to first help P, then Q, and again P, when P's operation will likely fail anyway. Moreover, after P has acquired b, all the processes requesting b will also redundantly help P.

On the other hand, if P executes the 2-word Compare&Swap as an STM transaction, P will fail to acquire b, help Q, release a and restart. The processes waiting for a will have to help only P. The processes waiting for b will not have to help P. Finally, if Q hasn't changed b, P will most likely find the value of b in its own cache.
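The caching approach described above (simulate the update privately, then commit with a k-word compare-and-swap, restarting on failure) can be sketched as follows. This is a minimal illustration, not the authors' code: `SharedMemory`, `kword_compare_and_swap` and `cached_update` are hypothetical names, and a Python lock merely stands in for the atomicity that a real non-blocking k-word operation would provide.

```python
import threading

class SharedMemory:
    """Toy shared memory. The lock only models the atomicity of the
    k-word operation; it is an assumption, not a non-blocking protocol."""
    def __init__(self, size):
        self.cells = [0] * size
        self._lock = threading.Lock()

    def kword_compare_and_swap(self, addrs, old, new):
        # Atomically: if every cell still holds its old value, store the new ones.
        with self._lock:
            if any(self.cells[a] != o for a, o in zip(addrs, old)):
                return False
            for a, n in zip(addrs, new):
                self.cells[a] = n
            return True

def cached_update(mem, update):
    """Run update(read, write) against a private cache, then try to commit.
    On a failed commit the whole update is re-run from the beginning."""
    while True:
        read_log, write_log = {}, {}

        def read(addr):
            if addr in write_log:            # private writes are visible locally
                return write_log[addr]
            if addr not in read_log:         # first read comes from shared memory
                read_log[addr] = mem.cells[addr]
            return read_log[addr]

        def write(addr, value):
            if addr not in read_log:         # remember the value we are replacing
                read_log[addr] = mem.cells[addr]
            write_log[addr] = value          # writes go to private memory only

        result = update(read, write)
        addrs = sorted(read_log)             # ascending order, as in the text
        old = [read_log[a] for a in addrs]
        new = [write_log.get(a, read_log[a]) for a in addrs]
        if mem.kword_compare_and_swap(addrs, old, new):
            return result
```

A caller passes a function that uses only the supplied `read` and `write`; if another process changes any location touched by the update before the commit, the commit fails and the update is simply retried.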
1.3 Empirical Results

To make sequential-to-non-blocking translation methods acceptable, one needs to reduce the performance overhead one has to pay when the system is stable (non-faulty). We present (see Section 5) the first experimental comparison of the performance under stable conditions of the translation techniques cited above. We use the well accepted Proteus Parallel Hardware Simulator [8, 9].

We found that on a simulated Alewife [1] cache-coherent distributed shared-memory machine, as the potential for concurrency in accessing the object grows, the STM non-blocking translation method outperforms both Herlihy's method and the cooperative method. Unfortunately, our experiments show that in general STM and other non-blocking techniques are inferior to standard non-resilient lock-based methods such as queue-locks [23]. Results for a shared bus architecture were similar in flavor.

In summary, STM offers a novel software package of flexible coordination-operations for the design of highly concurrent shared objects, which ensures resiliency in faulty runs and improved performance in non-faulty ones. In Section 2 we define transactional memory. In Section 3 we describe our implementation and in Section 4 we provide a sketch of the correctness proof. Finally, in Section 5 we present the empirical performance evaluation.

    Dequeue()
      BeginTransaction
        DeletedItem = Read_transactional(Head)
        if DeletedItem = Null
          ReturnedValue = Empty
        else
          Write_transactional(Head, DeletedItem->Next)
          if DeletedItem->Next = Null
            Write_transactional(Tail, Null)
          ReturnedValue = DeletedItem->Value
      EndTransaction
    end Dequeue

Figure 1: A Non Static Transaction

    k_word_C&S(Size, DataSet[], Old[], New[])
      BeginTransaction
        for i = 1 to Size do
          if Read_transactional(DataSet[i]) ≠ Old[i]
            ReturnedValue = C&S-Failure
            ExitTransaction
        for i = 1 to Size do
          Write_transactional(DataSet[i], New[i])
        ReturnedValue = C&S-Success
      EndTransaction
    end k_word_C&S

Figure 2: A Static Transaction

2 Transactional Memory

We begin by presenting software transactional memory, a variant of the transactional memory of [16]. A transaction is a finite sequence of local and shared memory machine instructions:

Read_transactional - reads the value of a shared location into a local register.

Write_transactional - stores the contents of a local register into a shared location.

The data set of a transaction is the set of shared locations accessed by the Read_transactional and Write_transactional instructions. Any transaction may either fail, or complete successfully, in which case its changes are visible atomically to other processes. For example, dequeuing a value from the head of a doubly linked list as in Figure 1 may be performed as a transaction. If the transaction terminates successfully it returns the dequeued item or an Empty value.

A k-word Compare&Swap transaction as in Figure 2 is a transaction which gets as parameters the data set, its size, and two vectors Old and New of the data set's size. A successful k-word Compare&Swap transaction checks whether the values stored in the memory are equivalent to Old. In that case, the transaction stores the New values into the memory and returns a C&S-Success value; otherwise it returns C&S-Failure.

A software transactional memory (STM) is a shared object which behaves like a memory that supports multiple changes to its addresses by means of transactions. A transaction is a thread of control that applies a finite sequence of primitive operations to that memory. A software transactional memory should satisfy the following standard properties [13]:

Atomicity: transactions appear to execute sequentially, i.e., without interleaving.

Serializability: the order in which the transactions appear to execute is consistent with their real-time order.

A static transaction is a special form of transaction in which the data set is known in advance, and which can thus be thought of as an atomic procedure which gets as parameters the data set and a deterministic transition function that determines the new values to be stored in the data set. The procedure updates the memory and returns an output determined by the previous values of the data set and the inputs. The k-word Compare&Swap transaction in Figure 2 is an example of a static transaction, while the Dequeue procedure in Figure 1 is not. This paper will focus on implementations of a transactional memory that supports static transactions, a class that includes most of the known and proposed synchronization operations in the literature.

An STM implementation is wait-free if any process which repeatedly executes a transaction terminates successfully after a finite number of attempts. It is non-blocking if the repeated execution of some transaction by a process implies that some process (not necessarily the same one, and with a possibly different transaction) will terminate successfully after a finite number of attempts in the whole system. An STM implementation is swap tolerant if it is non-blocking under the assumption that a process cannot be swapped out infinitely many times. The hardware implemented transactional memory of [16] is not non-blocking: a process which is repeatedly swapped out during the execution of its transaction will never terminate successfully, and two processes which try to write two locations in different order (as when updating a doubly linked list) can in theory cause each other to fail forever.
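The view of a static transaction as an atomic procedure over a known data set, with the k-word Compare&Swap of Figure 2 as one instance, can be illustrated by a purely sequential specification. This is a sketch of the intended semantics only; it says nothing about how atomicity is achieved. The names `static_transaction` and `k_word_compare_and_swap` are hypothetical, and returning the previous values is an assumption made for the illustration.

```python
C_AND_S_SUCCESS = "C&S-Success"
C_AND_S_FAILURE = "C&S-Failure"

def static_transaction(memory, data_set, transition):
    """Sequential specification of a static transaction.

    data_set   -- addresses known in advance
    transition -- deterministic function: old values -> new values
    Returns the previous values of the data set.
    """
    old_values = [memory[addr] for addr in data_set]
    new_values = transition(old_values)
    for addr, value in zip(data_set, new_values):
        memory[addr] = value
    return old_values

def k_word_compare_and_swap(memory, data_set, old, new):
    """k-word Compare&Swap expressed as a static transaction."""
    def transition(current):
        # Store New only if every location still holds its Old value.
        return list(new) if current == list(old) else current
    previous = static_transaction(memory, data_set, transition)
    return C_AND_S_SUCCESS if previous == list(old) else C_AND_S_FAILURE
```

For example, with `memory = [5, 7, 9]`, a call on addresses `[0, 2]` with `old = [5, 9]` and `new = [6, 10]` succeeds and updates both cells, while repeating the same call afterwards fails and leaves the memory unchanged. The Dequeue of Figure 1 does not fit this shape, because the locations it touches depend on values read during the transaction.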
3 A Non-blocking Implementation of Static STM

We implement a non-blocking static transactional memory of size M using the following data structures:

Memory[M], a vector which contains the data stored in the transactional memory.

Ownerships[M], a vector which determines, for any cell in Memory[M], which transaction owns it.

Each process j keeps in the shared memory a record, Tran_j, with the following fields:

Size - contains the size of the data set.

Add[] - a vector which contains the data set addresses in increasing order.

OldValues[] - a vector of the data set's size whose cells are initialized to Null at the beginning of every transaction. In the case of a successful transaction this vector will contain the former values stored in the involved locations, from which the output of the transaction is calculated.

The other fields are used in order to synchronize between the owner of the record and the processes which may eventually help its transaction. Version is an integer, initially 0, which determines the instance number of the transaction. This field is incremented every time the process terminates a transaction.(1) The remaining fields, status, stable and AllWritten, are used by the procedures of Figures 3 to 5.

(1) The use of this unbounded field can be avoided if an additional Validate operation is available [18, 19].

    StartTransaction(input, DataSet)
      Initialize(Tran_j, input, DataSet)
      Tran_j->stable = True
      Transaction(Tran_j, Tran_j->version, True)
      Tran_j->stable = False
      Tran_j->version++
      if Tran_j->status = Success then
        return (Success, CalcOutput(Tran_j->OldValues, input))
      else
        return Failure

Figure 3: StartTransaction

    Transaction(tran, version, IsInitiator)
      AcquireOwnerships(tran, version)
      (status, failadd) = LL(tran->status)
      if status = Null then
        if (version ≠ tran->version) then return
        SC(tran->status, (Success, 0))
      (status, failadd) = LL(tran->status)
      if status = Success then
        AgreeOldValues(tran, version)
        NewValues = CalcNewValues(tran->OldValues, tran->input)
        UpdateMemory(tran, version, NewValues)
        ReleaseOwnerships(tran, version)
      else
        ReleaseOwnerships(tran, version)
        if IsInitiator then
          failtran = Ownerships[failadd]
          if failtran = Nobody then
            return
          else
            failversion = failtran->version
            if failtran->stable
              Transaction(failtran, failversion, False)

Figure 4: Transaction

    AcquireOwnerships(tran, version)
      size = tran->size
      for i = 1 to size do
        while true do
          location = tran->Add[i]
          if LL(tran->status) ≠ Null then return
          owner = LL(Ownerships[tran->Add[i]])
          if tran->version ≠ version then return
          if owner = tran then exit while loop
          if owner = Nobody then
            if SC(tran->status, (Null, 0)) then
              if SC(Ownerships[location], tran) then
                exit while loop
          else
            if SC(tran->status, (Failure, location)) then
              return

    ReleaseOwnerships(tran, version)
      size = tran->size
      for i = 1 to size do
        location = tran->Add[i]
        if LL(Ownerships[location]) = tran then
          if tran->version ≠ version then return
          SC(Ownerships[location], Nobody)

    AgreeOldValues(tran, version)
      size = tran->size
      for i = 1 to size do
        location = tran->Add[i]
        if LL(tran->OldValues[location]) = Null then
          if tran->version ≠ version then return
          SC(tran->OldValues[location], Memory[location])

    UpdateMemory(tran, version, newvalues)
      size = tran->size
      for i = 1 to size do
        location = tran->Add[i]
        oldvalue = LL(Memory[location])
        if tran->AllWritten then return
        if version ≠ tran->version then return
        if oldvalue ≠ newvalues[i] then
          SC(Memory[location], newvalues[i])
      if (not LL(tran->AllWritten)) then
        if version ≠ tran->version then return
        SC(tran->AllWritten, True)

Figure 5: Ownerships and Memory access

A process initiates the execution of a transaction by calling the StartTransaction routine of Figure 3. StartTransaction first initializes the process's transaction record and then declares the record as stable, ensuring that any process helping the transaction complete will read a consistent description of the transaction. After executing the transaction, the initiating process checks if the transaction has succeeded, and if so calculates the output from the contents of the OldValues vector.

The procedure Transaction (Figure 4) gets as parameters tran, the address of the transaction's record, version, the record's version number, and a boolean value IsInitiator indicating whether Transaction was called by the initiating process or by a helping process. Transaction first tries to acquire ownership on the data set's locations by calling AcquireOwnerships. If it fails to do so, then upon returning from AcquireOwnerships the status field of the transaction record will be set to (Failure, failadd). If the version field has changed during the call (this can only happen to a helping process), the instance being helped has already terminated, and the helping process returns without changing the status. In case of success, the process writes the old values into the transaction's record, calculates the new values to be stored, writes them to the memory and releases the ownerships. Otherwise, the status field contains the location that caused the failure. The process first releases the ownerships that the transaction already owns and, in the case that it is not a helping process, it helps the transaction which owns the failing location. Helping is performed only if the helped transaction's record is in a stable state.

Since AcquireOwnerships of Figure 5 may be called either by the initiator or by the helping processes, we must ensure that (1) all processes will try to acquire ownership on the same locations (this is done by checking the version between the Load_Linked and Store_Conditional instructions), and (2) from the moment that the status of the transaction becomes fixed, no additional ownerships are allowed for that transaction. The second property is essential for proving not only atomicity but also the non-blocking property. Any process which reads a free location will have to confirm that the transaction status is still undecided before acquiring ownership on it. This is done by writing (with Store_Conditional) the value (Null, 0) in the status field. This prevents any process which read the location in the past, while it was owned by a different transaction, from setting the status to Failure. When executing UpdateMemory, a slow process must not overwrite a location after the ownerships have been released; to synchronize the processes, every process sets the AllWritten field to True after updating the memory and before releasing the ownerships, and checks this field before writing the new values.
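The acquire, fail, release pattern that the Transaction procedure follows can be illustrated with a deliberately simplified, single-threaded sketch. It assumes plain reads and writes in place of Load_Linked/Store_Conditional and leaves out versions, the status field and helping, so it shows only the control flow, not the non-blocking protocol. All names (`acquire_ownerships`, `release_ownerships`, `run_transaction`) are hypothetical.

```python
NOBODY = None  # marker for a location that no transaction owns

def acquire_ownerships(ownerships, tran, data_set):
    """Try to own every address, visiting them in increasing order.
    Returns (True, None) on success or (False, failing_address)."""
    for addr in sorted(data_set):
        owner = ownerships[addr]
        if owner is NOBODY:
            ownerships[addr] = tran
        elif owner != tran:
            return False, addr          # owned by someone else: fail here
    return True, None

def release_ownerships(ownerships, tran, data_set):
    """Free every location in the data set that tran currently owns."""
    for addr in data_set:
        if ownerships[addr] == tran:
            ownerships[addr] = NOBODY

def run_transaction(ownerships, memory, tran, data_set, transition):
    """Returns ("Success", old_values), or ("Failure", blocker) where
    blocker is the transaction a real STM initiator would go on to help."""
    ok, fail_addr = acquire_ownerships(ownerships, tran, data_set)
    if ok:
        addrs = sorted(data_set)
        old = [memory[a] for a in addrs]
        for a, v in zip(addrs, transition(old)):
            memory[a] = v
        release_ownerships(ownerships, tran, data_set)
        return "Success", old
    # Failure: give back what was acquired, then report the owner of the
    # failing location (it is untouched by the release above).
    release_ownerships(ownerships, tran, data_set)
    return "Failure", ownerships[fail_addr]
```

This mirrors the example of Section 1.2: if Q owns location b, a transaction P on {a, b} acquires a, fails on b, releases a and reports Q, so processes waiting for a are not held up by P.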
4 Correctness Proof Outline

Formally, a static transactional memory for n processes that supports k types of transactions is described as an automaton [21] with input actions Request_j(DataSet) and output actions Return_j(FinalStatus, Output), where j ∈ 1..n.

Given a run of the implementation, we define the following. An instance T of a transaction is identified by a transaction record tran and the value version of the record's version field when the transaction started. The executing processes of T are the processes which execute the routine Transaction with the parameters tran and version. The process which called Transaction(tran, version, True) is the initiator of T (it is the process which owns the record), and the processes which execute Transaction(tran, version, False) are the helping processes of T.

Lemma 4.1 The implementation is atomic and serializable.

Sketch of proof: The proof is based on the following invariants:

1. All the executing processes of a transaction T will read the same data set vector and will try to acquire ownership on the same locations.

2. The executing processes of T will never acquire an ownership after the status of T has been set. All the ownerships owned by T will be released before the version field of T's record is incremented by the initiator.

3. All the executing processes of a successful transaction T will update the memory before T's AllWritten field is set to True.

In order to prove the non-blocking property we first define the failing location of a failing transaction T as the location written in T's status field when it was set to Failure, and the failing process as the process which set the status to Failure. While the status of T is undecided, its failing location is undefined.

Claim 4.2 No executing process of a failing transaction T will acquire ownership on the failing location of T or on a higher location.

Proof: (Sketch) Assume the contrary, that some executing process P of T acquired ownership on the failing location. By Lemma 4.1, P acquired the ownership before the status of T was set. The failing process saw the location occupied by another transaction. If P acquired the ownership before the failing process saw the location occupied, the ownership must have been released in between, which by the invariants above happens only after the status has been set. Otherwise, P saw the location free and confirmed the status with a Store_Conditional after the failing process executed its Load_Linked on the status, and therefore the Store_Conditional of the failing process should have failed. Since locations are acquired in increasing order, no higher location is acquired either.

Lemma 4.3 The implementation is non-blocking.

Proof: (Sketch) Assume that there is an infinite schedule in which no transaction terminates successfully. The number of process failures is finite, so from some point on in the computation there must be processes which are "stuck," retrying transactions which fail infinitely often. Since the number of locations is finite, there are locations on which transactions fail infinitely often; consider the highest such location A. The only way a transaction fails on A is when the ownership on A is held by another transaction, and an ownership is released only when the owning transaction is completed. It follows that there are infinitely many transactions which acquired the ownership on A and failed. By Claim 4.2 those transactions have failed on addresses higher than A, a contradiction to the fact that A is the highest.

In any implementation based on the helping paradigm, one of the major overheads is "redundant helping." Redundant helping occurs when a failing transaction "helps" another non-faulty process. Such helping only increases contention and, consequently, may cause the helped process to release its ownerships later than it would have released them if not helped. To avoid redundant helps as much as possible, in the STM algorithm given above a process waits for an interval before it helps, and increases or decreases this interval as a function of the redundant helps it has discovered, so that in a non-faulty run in which no failures occur, helping is largely avoided.
it a prohelps as a 5 An Empirical tion 5.1 Evaluation of Transla- no Doubly Methodology We compared methods ing Colbrook and without and [8]. Our 2048 contention of switching software at Dellarocas, architecture was MIT [1]. of 6 bytes and 4 cycles or wiring in in the us- Brewer, distributed-memory lines cost other architectures by network development with and network developed cache-coherent under a cache cost Weihl of STIVI bus simulator Alewife currently performance 64 processor Proteus of the had the on the both the array cent ain item in version of used a slightly respectively. an item processor Corn pare&Swap stamp. 2 version operation where On pare&Swap may existing ples of enqueue/dequeue 1 tial be lock-free We ous used methods This when the the serve as 64 bits the by using the Alpha a shared 64 size of the for evaluating structures. data structure We bits Each of n processes 10000/n times. In this change the whole object increments variThe and a shared benchmark state, updates the counter in and have no built A resource a few processes to time share a process tries in par- allocation scenario [10]: a set of resources and from time to atomically acquire a subset size s of those resources. This is the typical of a well designed distributed data structure. of space we show only cesses atomically locations increment chosen length 60. highly the benchmark have 5000/n uniformly The at random benchmark concurrent times queue captures and counter the transaction n. We used t ation [11]. a variant In consequently dequeues heap this of used the lier and with the is probably greatest the this built directly cost a memory we the believe the theoretical tations of [17] do not and the 3 The spurious value empty and most trying operation Load-Linked/Store-Conditional have a random s = 2 a vector of the behavior of Proteus), while access to non-blocking failures the it efficient 15, shared the 18] non-blocking raeli and four 1. (which wont. Alpha between [12] 2. 
there the will be achieved only if the size of ia rela- size. we to queue-lock in the STM processes data do set before value which compare says STM methods [23] include solution (the method. backoff manner). Method Compare&Swap cooperative and based All the ear- exclusive Herlihy’s to described based a mutually k-word the Is- imple- non-blocking [3] to reduce contention. leads us to conclude differentiating among parallelism: do not process at The the that there performance joint parts The price the a time is allow oj are of the data update it to the is a least the private pointer). when the only the data the to coopera- access Hedihy’s dis- object is such the process number copy, a failing and the that the reading nature almost updates methods, locations of (reading Fortunately, caching lock-free accesses of the to In both In size protocols performed are local. and processes memory the coherence cached to update: of copying Herlihy’s and structure. number in is at least writing and parallelism allowed concurrent a jailing the cache locking potential of the and failure Both exploit software-transactional methods copy However, in a boolean MCS methods Potential cesses will general is that stored on transaction the methods: and could of benchmarks to be presented factors the its the private price accessed of all ac- of a during execution. PowerPC Load_Linked 3. operations. property the Results object world then implemenor object translation of the update is value use exponential method, theoretical the real existing interfering oMl_J or not. Rappoport’s tive This Compare&Swap to since memory The than Store_Conditional as on times. since a failing is closer and size is n. 
benchmark [6, a heap 5000/n without simplification is accessed method n processes in maximsl in a failing Load-Linked/Store-Conditional allow its since 64-bit Com pare&Swap Load&inked/Store.Conditional Store-Conditional from less proposed into access is value queue software a blocking The data from of the parallelism Compare&Swap only to above structure The three has n pro- The started, nonblocking 5.2 k-word on the We heap implemen- each couof ini- enqueues/dequeues as specialization above. two implementations of a sequential up- of the a queue to the ia equal queue on a heap of size benchmark enqueues is initially 2 Naturally priority index limited empty, by dequeues 5000/n compared 2) structure. A shared the on supports of the a high item executes operations which element and enqueue/dequeue value one Queue process of Every array, index head to contain ia not list. a new item’s aa cells next in each if the as in [24, 25]. Priority the and locations to agree methods behavior For lack which of the in enqueues benchmark the given mentation Allocation tail two of processes, Figure not data are short, allelism. Resource of a queue number implemented (given Ber- of parallelism, Counting data Com - or using data small of the first previous Each queue For updated tively a time list. tail/head other. the 64-bit scheme benchmarks implementing in bits a the each we [7]. synthetic for vary amount size n. update support supports as on the methodology four methods 32 implemented Load-Linked/Store_Conditional shad’s not in the the new the The Instead machines by updating size The of cells the the architectures. was and process contain item does that of next architecture array. is a couple Each tadto the implementation an head index access instructions. modified list the in the and n. An list a memory Alewife Proteus Load_Linked/Store.Conditional the dating with Queue represent that concurrency linked machine Each Linked since current for increases a doubly cycle/packet. 
The potential structure Methods number of is finite. The amount of helping ists only the erative in methods. by other processes: software-transactional In the cooperative Helping and the implementation, excoop- k-word ations only that by all the and so on... the locations. mance factor, terminate the The results 6. and the vertical there architecture, higher the number concurrently, sors, the the number of priority queue and accessed most linearly word updates k-word on the given in Fig- of processors achieved. to the give This since size. method the of the On the bus significantly the update with In the queue the STM Compare&Swap 5.3 local work can performance 7, the STM number declines, as the be performed of a certain of causing that Every theoretical implemented parison them increases smaller too. im- and than better the size than not the methods) (in STM). of the allow chose doubly linked it is limited: the paid usually method, priority object. transactions most for low two in it should ran mark are results given inherent As in processes method a the failed number granularity implies may of of that of performs update the up to the the price Table ~tkc~~~~thod. in all remote Israeli of disjoint and 10. In since of a 2-word the this operations times of the advantage should the the high bench- throughqueue number and that to Israeli priority in Israeli number give highest sequential is the priority queue provides and regular priority algorithm STM for spite the the highlights 1, where entries in all benchmarks for the counter of are of failing Rappoport of successful STM and the other the k-word pure bench- throughput ratio outperforms the coop- outperforms Herlihy’s benchmark. protwo- 4 In of tion fact, since using it avoids 3-word freezing Compare&Swap [18] nodes simplifies the a pro- Rappoport different instead as for As can be seen, method except 2.5 helping.
execution slightly concurrent of the reason We a concurrent 4. in Figure counter Com- . summarize in is all which recursive the operation of the method, the k-word operation. on to ation Compare&Swap erative grows the We marks advantage benchmark structure Rappoport put. based another operation same (in policy for during Compare&Swap the The is use perfor- backoff supported helps an the implementation algorithm it implement without cooperative a process give Our Store_Conditional is since one should we compare a specific a software [18], ways com- non-redundant-helping Rappoport’s whenever method: counter such and in many to get a fair methods the for Compare&Swap k-word There the the compare STM queue method. In order methods without also needs Israeli benchmarks, method results. than Herlihy’s twice the queue object at queue. penalty size: the though methods be improved Therefore, non-blocking with Therefore, and the non-blocking can form. and Compare&Swap accessing levels, Test-and-Test-and-Set. the non-blocking We explicitly queue. number Herlihy’s concurrency in practice. purest of all the the We the increases, Still, all the of method between in their uses a 3-word a grow- does of processors in than comparison when proces- conflicts, structure number A cess, Figure constant higher only k-word methods. Compare&Swap is in because 1 20 is still pare&Swap level. though 8 10 remains work the perfor- methods, benchmark performs 9 contains concurrently cesses. are parallelism parallelism accessed concurrency benchmark, poorly need fails number based caching is a data STM concurrency a failure local the locations the Figure more the for the as the of locations Therefore, in Com- as failing is equivalent beyond k-word k-word degrades. concurrency, number thus them unsuccessful throughput increases, for of throughput the the bus, potential Benchmark mance caching allocation and On a... but helping that not Herlihy’s than of processors proves.
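The k-word Compare&Swap operation that the benchmarks above compare can be stated as a sequential specification: compare k memory locations against k expected values and, only if all of them match, write k new values as one atomic step. The sketch below gives just these semantics, under stated assumptions: the function name and the list-based memory are hypothetical, and atomicity comes from a single global lock, so this is a blocking illustration and not the non-blocking STM, cooperative, or Herlihy-style implementation discussed in the paper.

```python
import threading

# Assumption: one global lock makes the whole compare-and-write step atomic.
_memory_lock = threading.Lock()


def k_word_compare_and_swap(memory, addresses, expected, new_values):
    """Atomically update k words if every word holds its expected value.

    Returns True when all k comparisons succeed and the new values are
    written; returns False and leaves memory unchanged otherwise.
    """
    if not (len(addresses) == len(expected) == len(new_values)):
        raise ValueError("addresses, expected and new_values must have length k")
    with _memory_lock:
        # Compare phase: any mismatch fails the whole operation.
        for address, old in zip(addresses, expected):
            if memory[address] != old:
                return False
        # Update phase: all k words are written together.
        for address, new in zip(addresses, new_values):
            memory[address] = new
        return True
```

For example, with `memory = [0, 0, 0, 0]`, a call on addresses `[0, 2]` expecting `[0, 0]` and writing `[1, 1]` succeeds; repeating the same call then fails because the expected values are stale, and memory is left untouched.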
A and oper- is a crucial benchmark memory resource are ones, a transaction shows is no potential throughput In helped. shows to the locking this of the when axis of updated and operations and counting axis is cruel amount object ing the an transactions, it is not horizontal benchmark by most, and for The concurrently, are in turn method, only as failing failing locations STM in STM location, Compare&Swap that Moreover, Compare&Swap ure k-word same is helped same first the operations In pare&Swap, including by access also Compare&Swap not helped method implementa- it

[Figure 6: Counting Benchmark]
[Figure 7: Resource Allocation Benchmark]
[Figure 8: Priority Queue Benchmark]
[Figure 9: Doubly Linked Queue Benchmark]
[Figure 10: Non-blocking implementations of Israeli & Rappoport's Priority Queue]
[The plots are not recoverable. The figures have panels for the Bus and Alewife architectures, with Processors (10 to 60) on the horizontal axis. Legend entries appearing in the plots: STM, Cooperative method, Herlihy's method, QUEUE spin lock.]

Acknowledgments

We wish to thank Greg Barnes and Maurice Herlihy for their many helpful comments.

References

[1] A. Agarwal et al. The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor. In Proceedings of Workshop on Scalable Shared Memory Multiprocessors, Kluwer Academic Publishers, 1991. An extended version of this paper has been submitted for publication, and appears as MIT/LCS Memo TM-454, 1991.

[2] J. Alemany and E.W. Felten. Performance Issues in Non-Blocking Synchronization on Shared-Memory Multiprocessors. In Proceedings of the 11th ACM Symposium on Principles of Distributed Computing, pages 125-134, August 1992.

[3] T.E. Anderson. The performance of spin lock alternatives for shared memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6-16, January 1990.

[4] R.J. Anderson. Primitives for Asynchronous List Compression. In Proceedings of the 4th ACM Symposium on Parallel Algorithms and Architectures, pages 199-208, 1992.

[5] J. Aspnes, M.P. Herlihy, and N. Shavit. Counting Networks. Journal of the ACM, Vol. 41, No. 5 (September 1994), pp. 1020-1048.

[6] G. Barnes. A Method for Implementing Lock-Free Shared Data Structures. In Proceedings of the 5th ACM Symposium on Parallel Algorithms and Architectures, 1993.

[7] B.N. Bershad. Practical considerations for lock-free concurrent objects. Technical Report CMU-CS-91-183, Carnegie Mellon University, September 1991.

[8] E.A. Brewer, C.N. Dellarocas, A. Colbrook, and W.E. Weihl. Proteus: A High-Performance Parallel-Architecture Simulator. MIT/LCS/TR-516, September 1989.

[9] E.A. Brewer and C.N. Dellarocas. Proteus. User Documentation.

[10] K. Chandy and J. Misra. The Drinking Philosophers Problem. ACM Transactions on Programming Languages and Systems, 6(4):632-646, October 1984.

[11] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to algorithms. MIT Press.

[12] DEC. Alpha system reference manual.

[13] M. Herlihy and J.M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3), pages 463-492, July 1990.

[14] M. Herlihy. Wait-Free Synchronization. ACM Transactions on Programming Languages and Systems, 13(1), pages 124-149, January 1991.

[15] M. Herlihy. A methodology for implementing highly concurrent data objects. ACM Transactions on Programming Languages and Systems, 15(5):745-770, November 1993.

[16] M. Herlihy and J.E.B. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. In 20th Annual Symposium on Computer Architecture, pages 289-300, May 1993.

[17] IBM. Power PC. Reference manual.

[18] A. Israeli and L. Rappoport. Disjoint-Access-Parallel Implementations of Strong Shared Memory. In Proc. of the 13th ACM Symposium on Principles of Distributed Computing, pages 151-160.

[19] A. Israeli and L. Rappoport. Efficient Wait Free Implementation of a Concurrent Priority Queue. In WDAG 1993, Lecture Notes in Computer Science 725, Springer Verlag, pages 1-17.

[20] A. LaMarca. A Performance Evaluation of Lock-Free Synchronization Protocols. In Proc. of the 13th ACM Symposium on Principles of Distributed Computing, pages 130-140.

[21] N. Lynch and M. Tuttle. Hierarchical Correctness Proofs for Distributed Algorithms. In Proceedings of the 6th ACM Symposium on Principles of Distributed Computing, pages 137-151, August 1987.

[22] H. Massalin and C. Pu. A lock-free multiprocessor OS kernel. Technical Report CUCS-005-91, Columbia University, March 1991.

[23] J.M. Mellor-Crummey and M.L. Scott. Synchronization without Contention. In Proceedings of the 4th International Conference on Architecture Support for Programming Languages and Operating Systems, April 1991.

[24] L. Rudolph, M. Slivkin, and E. Upfal. A Simple Load Balancing Scheme for Task Allocation in Parallel Machines. In Proceedings of the 3rd ACM Symposium on Parallel Algorithms and Architectures, pages 237-245, July 1991.

[25] N. Shavit and A. Zemach. Diffracting Trees. In Proceedings of the Annual Symposium on Parallel Algorithms and Architectures (SPAA), June 1994.

[26] J. Turek, D. Shasha, and S. Prakash. Locking without blocking: Making Lock Based Concurrent Data Structure Algorithms Non-blocking. In Proceedings of the 1992 Principles of Database Systems, pages 212-222.

[27] D. Touitou. Lock-Free Programming: A Thesis Proposal. Tel Aviv University, April 1993.

Table 1: Pure implementation throughput ratio: STM / other methods

                                       10 processors              60 processors
                                  Herlihy's   Cooperative    Herlihy's   Cooperative
Benchmark             Machine     method      method         method      method
Counter               Bus          0.34        1.98           0.74        8.44
                      Alewife      0.30        1.92           0.45        7.6
Resource Allocation   Bus          6.07        1.44          58.9         3.36
                      Alewife      2.44        1.75          12.9         7.28
Priority queue        Bus         22.5         1.09          85.61        1.69
                      Alewife     24.14        1.12          59.8         2.35
Doubly linked queue   Bus          0.42        1.26           2.8         2.16
                      Alewife      0.41        1.27           1.1         2.24