(Lightweight) Recoverable Virtual Memory
Robert Grimm
New York University

The Three Questions
- What is the problem?
- What is new or different?
- What are the contributions and limitations?

The Goal
- Simplify the implementation of fault-tolerant apps
  - Protect updates to persistent data structures
- Use transactions
  - Well-understood, general mechanism
- What are transactions? And, what is "acid"?
  - Obviously: Atomicity, Consistency, Isolation, Durability

Let's Build on Camelot
- General transaction facility for Mach
  - Nested and distributed transactions
  - Recoverable virtual memory (RVM)
    - Implemented through Mach's external pager interface
- Coda file system uses RVM
  - File system metadata
  - Persistent data for replica control
  - Internal housekeeping data
  - But not file system data

Paper excerpt: […] commitment strategies. Camelot relies heavily on the external page management and interprocess communication facilities of the Mach operating system [2], which is binary compatible with the 4.3BSD Unix operating system [20]. Figure 1 shows the overall structure of a Camelot node. Each module is implemented as a Mach task and communication between modules is via Mach's interprocess communication facility (IPC).
Figure 1 labels: Recoverable Processes, Camelot system component, Mach Kernel

A Rocky Relationship
- Decreased scalability when compared to AFS
  - High CPU utilization, ~19% due to Camelot [Satya '90]
  - Paging and context switching overheads
- Additional programming constraints
  - All clients must inherit from Disk Manager task
    - Debugging is more difficult
  - Clients must use kernel threads
- Increased complexity
  - Camelot pushes Mach interface, triggers bugs
  - Who's to blame: Coda, Camelot, or Mach?
- What are the lessons?

LRVM Design Methodology
Paper excerpt: Since recoverable virtual memory […] of Camelot we relied on, we sought to […] of this functionality into a realization that […]-use and had few strings attached. […] device driver of a mirrored disk. RVM thus adopts a layered approach to transactional support, as shown in Figure 2.
This approach is simple and enhances flexibility: an application does not have to buy into those aspects of the transactional concept that are irrelevant to it.

- The designer taketh away
  - No nested or distributed transactions
  - No concurrency control
  - No support for recovering from media failures
- Instead, a simple, layered approach

Paper excerpt: […] principle we adopted in designing RVM was to […] over generality. In building a tool that did […], we were heeding Lampson's sound advice […] [19]. We were also being faithful to the […] of keeping building blocks simple. The […] from generality to simplicity allowed us to […] different positions from Camelot in the areas of […], operating system dependence, and […]. […] was to eliminate support for nesting […]. A cost-benefit analysis showed us that […] better provided as an independent layer on […]. While a layered implementation may be less […] than a monolithic one, it has the attractive […].
Figure 2: Layering of Functionality in RVM
Figure 2 labels: Application Code, Nesting, Distribution, Serializability, Permanence, Operating System, media failure
Paper excerpt: 3.2. Operating System Dependence. To make RVM portable, we decided to rely only […]

More on Methodology
- The designer likes portability
  - Only use widely-available virtual memory primitives
    - No external pager, no pin/unpin calls
  - Rely on regular file as backing store
    - Slower startup: no demand paging, rather read from file (?)
    - Possibly slower operation: duplicate paging (VM vs.
      RVM)
- The designer likes libraries
  - No curmudgeonly collection of tasks, just a library
  - Applications and RVM need to trust each other
  - Each application has its own log, not one per system
    - Log-structured file systems and RAID might help

RVM Interface
- Backed by (external data) segments
- Mapping in regions (of segments)
  - Mapped only once per process, must not overlap
  - Must be multiples of page size, page-aligned
- Supported by segment loader and allocator
  - To track mappings and to manage persistent heap

Figure (RVM operations, as far as recoverable from the slide):
  (a) Initialization & Mapping Operations
      initialize(version, options_desc);
      map(region_desc, options_desc);
      unmap(region_desc);
      terminate();
  (b) Transactional Operations
      […]
      end_transaction(tid, commit_mode);
      abort_transaction(tid);
  (c) Log Control Operations
      […]
  (d) Miscellaneous Operations
      query(options_desc, region_desc);
      […]
      create_log(options, log_len, mode);

RVM Implementation
- No undo/redo log, just new values
  - Old values buffered in virtual memory
- Bounds and contents of old-value records determined by set-range operation
- No-restore and no-flush transactions for efficiency (?)

Figure labels: Reverse Displacements, Trans Hdr, Range Hdr, Forward Displacements
This log record has three modification ranges. The bidirectional displacements records allow the log to be read either way.
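The RVM Implementation bullets above can be made concrete with a toy model. The following is a minimal sketch, not RVM's actual code: the `Transaction` class, the `bytearray` standing in for a mapped region, and the Python list standing in for the log are all hypothetical. It assumes that a no-restore transaction is one that promises never to abort, so it can skip the old-value copy.

```python
class Transaction:
    """Toy model of one transaction over a bytearray 'region' (hypothetical)."""

    def __init__(self, region, log, restore=True):
        self.region = region    # stand-in for mapped recoverable memory
        self.log = log          # stand-in for the log: list of (offset, new bytes)
        self.restore = restore  # False models a no-restore transaction (assumption)
        self.ranges = []        # (offset, nbytes, old bytes or None)

    def set_range(self, offset, nbytes):
        # Declare the bytes about to be modified; buffer their old
        # contents in memory unless this is a no-restore transaction.
        old = bytes(self.region[offset:offset + nbytes]) if self.restore else None
        self.ranges.append((offset, nbytes, old))

    def abort(self):
        if not self.restore:
            raise RuntimeError("no-restore transaction cannot be aborted")
        # Undo in reverse order so overlapping ranges restore correctly.
        for offset, nbytes, old in reversed(self.ranges):
            self.region[offset:offset + nbytes] = old
        self.ranges.clear()

    def commit(self):
        # Only new values are appended to the log; old values are dropped.
        for offset, nbytes, _ in self.ranges:
            self.log.append((offset, bytes(self.region[offset:offset + nbytes])))
        self.ranges.clear()
```

A usage example: call `set_range(0, 5)`, overwrite `region[0:5]`, then `abort()` to get the original bytes back; repeat with `commit()` and the log gains one `(offset, new_bytes)` record. Flushing the log to disk is deliberately left out of this sketch.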
Figure 5: Format of a Typical Log Record

Recovery and Log Truncation
- Recovery: traverse log, build in-memory representation, apply modifications, update disk
- Truncation: apply recovery on old epoch of log
  - Significant advantage: same code as for recovery
  - Disadvantages
    - Increases log traffic
    - Degrades forward processing
    - Results in bursty system performance
- Alternative: incremental truncation
  - Not quite ready for prime-time (as of '93)

Optimizations
- Intra-transaction
  - Ignore duplicate set-range calls
  - Coalesce overlapping or adjacent memory ranges
- Inter-transaction
  - For no-flush transactions, discard old log records before a log flush

Experimental Evaluation
- Complexity: source size
  - RVM: 10 KLoC main + 10 KLoC utilities
  - Camelot: 60 KLoC main + 10 KLoC utilities
- Performance: modified TPC-A benchmark
  - Simulates hypothetical bank
  - Accesses in-memory data structures
  - Uses three account access patterns
    - Best case: sequential
    - Worst case: random
    - Between case: localized
      - 70% txns ➙ 5% pgs, 25% ➙ 15%, 5% ➙ 80%

Evaluation: Throughput
This table presents the measured steady-state throughput, in transactions per second, of RVM and Camelot on the benchmark described in Section 7.1.1. The column labelled "Rmem/Pmem" gives the ratio of recoverable to physical memory size. Each data point gives the mean and standard deviation (in parentheses) of the three trials with most consistent results, chosen from a set of five to eight. The experiments were conducted on a DEC 5000/200 with 64MB of main memory and separate disks for the log, external data segment, and paging file. Only one thread was used to run the benchmark. Only processes relevant to the benchmark ran on the machine during the experiments. Transactions were required to be fully atomic and permanent. Inter- and intra-transaction optimizations were enabled in the case of RVM, but not effective for this benchmark.
This version of RVM only supported epoch truncation; we expect incremental truncation to improve performance significantly.
Table 1: Transactional Throughput

[Plots: (a) Best and Worst Cases, (b) Average Case; x-axis: Rmem/Pmem (percent); series: RVM Sequential, Camelot Sequential, RVM Random, Camelot Random]
These plots illustrate the data in Table 1. For clarity, the average case is presented separately from the best and worst cases.
Figure 8: Transactional Throughput

- What metric does the x-axis represent?
- Why does Camelot perform considerably worse?

Paper excerpt: For localized account access, Figure 8(b) shows that throughput drops almost linearly with increasing recoverable memory size. But the drop is relatively slow, and performance remains acceptable even when recoverable memory size approaches physical memory size. RVM's […]
Paper excerpt: […] data in Table 1 indicates that applications with good locality can use up to 40% of physical memory for active recoverable data, while keeping throughput degradation to less than 10%. Applications with poor locality have to restrict active recoverable data to less than 25% for similar […]

Evaluation: CPU Utilization
[Plots: (a) Worst and Best Cases, (b) Average Case; x-axis: Rmem/Pmem (per cent); series: Camelot Random, Camelot Sequential, RVM Random, RVM Sequential]

- How did they arrive at these numbers?
  - I.e., did they repeat the original experiments?
- Why does Camelot require so much more CPU?

These plots depict the measured CPU usage of RVM and Camelot during the experiments described in Section 7.1.2. As in Figure 8, we have separated the average case from the best and worst cases for visual clarity. To save space, we have omitted the table of data (similar to Table 1) on which these plots are based.
Figure 9: Amortized CPU Cost per Transaction

Paper excerpt: Although we were gratified by these results, we were puzzled by Camelot's behavior. For low ratios of recoverable to physical memory we had expected both Camelot's and RVM's throughputs to be independent of the degree of locality in the access pattern. The data shows […]
Paper excerpt: […] feasible because server hardware has changed considerably. Instead of IBM RTs we now use the much faster DECstation 5000/200s. Repeating the original experiment on current hardware is also not possible, because Coda servers now use RVM to the exclusion of […]

Evaluation: Optimizations
- On Coda servers (transactions are flushed)
  - ~21% log size savings
- On Coda clients (transactions are mostly no-flush)
  - 48% – 82% log size savings
  - Arguably more important b/c clients have fewer resources, may run on battery

What Do You Think?

Wait a Minute... Appel & Li: Persistent Stores
- Basic idea: persistent object heap
  - Modifications can be committed or aborted
  - Advantage over traditional DBs: object accesses are (almost) as fast as regular memory accesses
- Implementation strategy
  - Database is memory mapped file
  - Uncommitted writes are temporary
    - Detected by virtual memory system
- RVM: ProtN, Trap, Unprot...

What Do You Think?
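As a discussion aid for the Appel & Li slide, here is a minimal sketch of "database is memory mapped file, uncommitted writes are temporary". It is not Appel & Li's implementation: instead of page protection and traps, it swaps in the operating system's copy-on-write mapping (`mmap.ACCESS_COPY` in Python's standard library), and the helper names `open_store`, `commit`, and `demo` are hypothetical.

```python
import mmap
import os
import tempfile


def open_store(path):
    """Map the file copy-on-write: writes stay private to this process."""
    f = open(path, "r+b")
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    return f, view


def commit(f, view):
    """Make the private modifications durable by writing them back."""
    f.seek(0)
    f.write(view[:])
    f.flush()
    os.fsync(f.fileno())


def demo():
    # Create a tiny "database" file holding one record.
    fd, path = tempfile.mkstemp()
    os.write(fd, b"balance=0100")
    os.close(fd)

    f, view = open_store(path)
    view[8:12] = b"0250"          # uncommitted write, visible only in memory
    with open(path, "rb") as g:
        before = g.read()         # file contents before commit
    commit(f, view)
    with open(path, "rb") as g:
        after = g.read()          # file contents after commit

    view.close()
    f.close()
    os.remove(path)
    return before, after
```

In this sketch an abort is simply closing the mapping without calling `commit`. Writing the whole mapping back is the crudest possible commit; tracking which pages were touched, which is what the virtual memory system provides in the slide's description, would avoid that cost.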