Techniques for Transparent Parallelization of Discrete Event Simulation Models
Alessandro Pellegrini
PhD program in Computer Science and Engineering, Cycle XXVI
September 26th, 2014

Technological Trend
• Implications of Moore's Law have changed since 2003
• 130W is considered an upper bound (the power wall): P = A · C · V² · f
• The Future is Parallel!

Multicore Software Scaling
[Figure]

The Needs
• Multicore / Clusters are a consolidated reality
• Both Scientific & Commodity Users
• Parallel Programming has been a privilege of few experts so far
• Large amount of (sequential) legacy code around

As parallel processors become ubiquitous, the key question is: how do we convert the applications we care about into parallel programs?

Thesis' Goals: Automatic Code Parallelization
"Relieve programmers from the tedious and error-prone manual parallelization process"
• Targeting Event-Driven Programming
• Transparency
◦ programmers fully exploit the semantic power of languages
◦ operations they should know nothing about are automatic
◦ avoid a set of new APIs; rely on classical/standard programming models
• Efficiency
◦ make the parallel programs run fast
◦ specifically rely on non-blocking algorithms and speculation
• Best trade-off between the two

Research Context
• Addressing every aspect of parallel/concurrent programming is non-trivial
• I have concentrated on an instance of Event-Driven Programming:
◦ no limitations to applying the results to other instances
◦ Discrete Event Simulation environments
• an effective test-bed for general event-driven programming
• natural evolution of the work carried out by my research group
• widely applicable
http://www.dis.uniroma1.it/~hpdcs/ROOT-Sim

Event Driven Programming
• The flow of the program is determined by events
◦ sensor outputs
◦ user actions
◦ messages from other programs or threads
• It is based on a main loop divided into two phases:
◦ event selection/detection
◦ event handling
• Based on event handlers
◦ they are essentially asynchronous callbacks
◦ events can be queued if the involved handler is busy at the moment
◦ events' execution is inherently parallel

Discrete Event Simulation (DES)
• A discrete event occurs at an instant in time and marks a change of state in the system
• DES represents the operation of a system as a chronological sequence of events
• Program state is represented using Logical Processes (LPs)
◦ a collection of data structures scattered in memory
◦ each simulation object mapped to one LP
◦ events produce changes in the LPs

Discrete Event Simulation (DES)
• Highly versatile technique
• Allows analyzing complex systems
◦ VHDL
◦ traffic simulation
◦ agent-based simulation
◦ ...
• Systems can be simulated before their construction or operation (what-if analysis)
◦ online reconfiguration of runtime parameters (e.g. on Cloud platforms) [SIMUTOOLS13]
• Interaction of simulated worlds with physical worlds (symbiotic simulation)

PDES Logical Architecture on Multicores
[Figure: LPs running on top of the Discrete Event Runtime Environment (Simulation Kernel), mapped onto the CPUs of a multicore machine]

The Synchronization Problem
[Figure: three LPs (LPi, LPj, LPk) exchanging timestamped messages along their execution timelines; a message with timestamp 12 reaches LPj after it has already executed the event at timestamp 15 — a straggler message]

Optimistic (Speculative) Synchronization: Rollback
[Figure: upon receiving the straggler with timestamp 12, LPj performs a rollback execution, recovering its state at LVT 10; antimessages annihilate the events LPj had sent out speculatively, and their reception forces LPk into a rollback execution recovering its state at LVT 7]

Rollback Supports: State Saving
Snapshot taken before the execution of every simulation event
[Figure: LPi and LPj timelines with snapshots taken before each event; a straggler message with timestamp 12 triggers a rollback execution restoring LPj's simulation state at T = 7]

What if the model...

    void *simulation_states[lps];

    void ev_handl(int id, double time, int ev_type, void *ev_content) {
        switch(ev_type) {
        case INIT:
            simulation_states[id] = malloc(sizeof(some_structure));
            break;
        case EVENT_x:
            ((some_structure *)simulation_states[id])->counter++;
            break;
        <...>
        }
    }

ACT I — Dynamic Memory and Incremental State Saving

What we want...
• Transparency
◦ interception of memory-related operations (no platform APIs)
◦ no application-level procedures for (incremental) log/restore tasks
• Optimism-Aware Runtime Supports
◦ recoverability of generic memory operations: allocation, deallocation, and updating
• Incrementality
◦ cope with the memory "abuse" of speculative rollback-based synchronization schemes
◦ enhance memory locality

Di-DyMeLoR [PADS09], candidate for Best Paper
Di-DyMeLoR: Dirty Dynamic Memory Logger and Restorer
• Lightweight software instrumentation
◦ optimized memory-write access tracing and logging
◦ arbitrary-granularity memory-write tracing
◦ concentration of most of the instrumentation tasks at a pre-running stage:
• no costly runtime dynamic disassembling
• Standard API wrappers
◦ code can call standard malloc services
◦ memory map transparently managed by the simulation platform

Memory Map
• Memory (for each LP) is pre-allocated
• Requests are served on a chunk basis
• Explicit avoidance of per-chunk metadata
◦ block status bitmap: tracks used chunks
◦ dirty bitmap: tracks chunks updated since the last log
[Figure: a malloc_area with its block status bitmap, its dirty bitmap, and the chunks it manages]
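The status/dirty-bitmap bookkeeping just described can be sketched in a few lines of C. This is a minimal illustration, not the actual DyMeLoR data structures: chunk count, chunk size, and all names are assumptions, and the tracing hook that sets the dirty bit would in reality be injected by instrumentation.

```c
#include <assert.h>
#include <string.h>

#define CHUNKS     64   /* illustrative malloc_area geometry */
#define CHUNK_SIZE 32

static unsigned char memory_area[CHUNKS][CHUNK_SIZE];
static unsigned long long status_bitmap; /* bit i set => chunk i allocated */
static unsigned long long dirty_bitmap;  /* bit i set => chunk i written since last log */

/* Write path: the instrumentation-injected hook marks the chunk dirty. */
static void write_chunk(int i, const void *data, int len) {
    memcpy(memory_area[i], data, (size_t)len);
    status_bitmap |= 1ULL << i;
    dirty_bitmap  |= 1ULL << i;
}

/* Incremental log: copy only the chunks dirtied since the last log
 * (a full log would also save the status bitmap), then reset the map. */
static int take_log(unsigned char log[CHUNKS][CHUNK_SIZE]) {
    int copied = 0;
    for (int i = 0; i < CHUNKS; i++)
        if (dirty_bitmap & (1ULL << i)) {
            memcpy(log[i], memory_area[i], CHUNK_SIZE);
            copied++;
        }
    dirty_bitmap = 0;
    return copied; /* number of chunks actually saved */
}
```

Because dirtiness is a single bit per chunk, successive logs of a mostly-idle state copy almost nothing, which is the point of the incremental scheme.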
Hijacker [HPCS13], candidate for Best Paper
• Hijacker is a rule-based static binary instrumentation tool
• It has been specifically tailored to HPC applications
• It tries to hide away all the complexities related to binary manipulation
• It is highly modular: it can be extended to different instruction sets and binary formats
• Instrumentation rules are specified using XML

Internal binary representation
[Figure: the executable is represented as a graph of functions and their instructions (jmp, mov, call) plus data]

Static Binary Instrumentation in Di-DyMeLoR
[Figure: instrumentation process — each original memory update (e.g. mov $3, x) is preceded by a push of a descriptor struct and a call to the tracking routine; indirect branches (e.g. jmp *%eax) are redirected through a branch corrector before the regular jump; the final executable gains a new writable section]

Discussion of the Approach
• Fast O(1) management of memory access tracking on a CISC ISA (x86)
◦ thanks to cached disassembly information
• Enhanced locality
◦ chunks to be logged are contiguous
• Fast log/restore operations
◦ tiny and compact data structures to analyze
• Supporting a free is non-trivial due to the lack of headers
◦ fast software cache for mapping addresses to malloc areas

Log/Restore Operations
• Log operation
◦ only used malloc areas are logged
◦ status and dirty bitmaps are copied to the log buffer
◦ only chunks dirtied since the last log are copied
• Restore operation
◦ the chain of logs is traversed
◦ status/dirty bitmaps are used to know which chunks are present
◦ the operation stops when all the (allocated) chunks are restored

Isn't the cost high?
• Do we really have to pay the cost of the memory-tracking supports?
• No, if we don't gain any benefit from them!
• Hijacker offers code multiversioning facilities
• We have developed an analytic model to determine whether incrementality pays off [TPDS2014]
◦ switching between pure full and incremental modes involves only changing a function pointer
◦ the decision is stable under fluctuations thanks to an integral model

What does the literature propose?
• Large-granularity approaches
◦ memory (page-based) protection solutions [SQ06]
◦ the cost is higher
◦ logs easily become huge
• Small-granularity approaches
◦ instrumentation-based approaches [WP96]
◦ each memory operation is stored
◦ complex reconstruction of a previous state
◦ larger memory footprint
• API-based approaches
◦ specific APIs are used to indicate where the state is located
◦ specific APIs are used to mark modified areas
◦ the modeler has to implement callbacks to generate/restore logs

Test-Bed Scenario and Settings
• NoSQL benchmark [SIMUTOOLS13]
◦ 64 cache servers
◦ 50000 data objects per cache server
◦ read-intensive vs write-intensive phases (5% to 95% of accesses in write mode)
• Run on an HP ProLiant server:
◦ 64-bit NUMA machine
◦ four 2GHz AMD Opteron 6128 processors and 32GB of RAM
◦ each processor has 8 CPU-cores (for a total of 32 CPU-cores)

NoSQL: Execution Time
[Figure: bar chart of overall execution time (seconds) across the compared configurations, in the read-intensive and write-intensive phases]
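The function-pointer mode switch mentioned above can be sketched as follows. The cost constants and the threshold rule are stand-ins for the thesis' analytic/integral model, and both logging routines are dummies: the sketch only shows that both code paths coexist (code multiversioning) and that switching is a single pointer assignment.

```c
#include <assert.h>

/* Stand-in cost proxies; the real routines log the simulation state. */
static int full_log(void)        { return 100; } /* copies everything */
static int incremental_log(void) { return 25;  } /* copies dirty chunks only */

/* The currently active logging mode: switching modes is just
 * overwriting this pointer, with no change to the calling code. */
static int (*log_state)(void) = full_log;

/* Toy decision rule under assumed costs: incremental saving pays off
 * when only a small fraction of the state is dirtied between logs. */
static void choose_mode(double dirty_fraction) {
    log_state = (dirty_fraction < 0.5) ? incremental_log : full_log;
}
```

The callers always invoke `log_state()`, so the switch is transparent to the rest of the platform.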
What if the model...

    void ev_handl(int id, double time, int ev_type, void *ev_content) {
        switch(ev_type) {
        case EVENT:
            printf("I'm executing event %d\n", ev_type);
            break;
        <...>
        }
    }

Non-Rollbackable Operations
[Figure: LP1 produces output via printf() towards the outside world while executing event e1 at T1; a message m2 with T2 < T1 from LP2 later forces a rollback, but the output operation cannot be rolled back]
• It's not a reproducibility problem (as in fault tolerance)
• It's a correctness one (rollbacks, silent execution)

ACT II — Interacting with the Outside World

What do we want...
• What does the literature propose?
1. Ad-hoc output-generation APIs provided by simulation frameworks
2. Temporary suspension of processing activities until output generation is safe (delay until commit)
3. Storing output messages in events, and materializing them during fossil collection (this affects the critical path)
• What we explicitly want:
1. Transparency
2. Move most operations away from the critical path
[Figure: LPx processes I/O while executing the event at T1; a later rollback on LPy turns the time spent on I/O into wasted work]
3. Order output messages on a system-wide basis

Output Messages Management [PADS13]
• We rely on link-time wrappers
◦ printf-family functions are redirected to an internal subsystem
• We base our solution on an output daemon
◦ a user-space process separated from the actual simulation framework
◦ it is not given a dedicated processing unit
• Communication with the kernel is achieved via a logical device
◦ a non-blocking shared-memory buffer, accessed circularly
◦ if it gets filled, a new (double-sized) buffer gets chained
◦ once empty, the older buffer gets destroyed

Non-blocking logical device
• A logical device is a per-thread channel
• The device is written by a thread and read by the output daemon: we can implement a non-blocking access algorithm
[Figure: circular buffer with a read index r updated by the daemon and a write index w updated by the thread]

Output Message Ordering, Commit, and Rollback
• A timestamped output message read from a device is stored into a Calendar Queue
◦ fast O(1) access for output-message insertion and materialization
◦ provides system-wide ordering
• Upon computation of the GVT, the output subsystem is notified of the newly computed value (via a special COMMIT message)
• The Calendar Queue is then queried to retrieve messages falling before that value
• Upon rollback, messages can be extracted from any position of the queue

Output Daemon Wakeup
• All this might require much computing time
◦ we want a timely materialization!
◦ yet we want an efficient simulation!
• Processing time of each type of message: t_o, t_c, t_r
• Mean number of events of each type written to the device in a GVT phase: c̄_o, c̄_c, c̄_r
• Expected execution time to empty the logical device:
    E(T) = Σ_{x ∈ {o,c,r}} t̄_x · c̄_x
• If larger than a compile-time threshold (smaller than the GVT interval), it is forced to that value
• The final value is divided into several time slices and a sleep time is computed, so as to create an activation/deactivation pattern

Test-Bed Scenario and Settings
• Personal Communication System (PCS) benchmark
◦ 1024 wireless cells, each one having 1000 channels
◦ random-walk mobility model
◦ high-fidelity simulation
• fading phenomena explicitly modeled
• explicitly accounts for climatic conditions
◦ simulation statistics printed periodically, with frequency f ∈ [1%, 35%] of total events
◦ that's one output message produced [200, 7000] times per second

Experimental Results
[Figure: throughput (cumulated committed events vs wall-clock time) with the daemon, without a running daemon, without the subsystem, with no printf, and with output stored in the event queue; output-delay plots at 1% (1 print every 70 handoff events) and 35% (1 print every 2 handoff events) printing frequency against the instantaneous-print theoretical bound]

What if the model...

    int processed_events = 0;

    void ev_handl(int id, double time, int ev_type, void *ev_content) {
        switch(ev_type) {
        case EVENT_x:
            processed_events++;
            break;
        <...>
        }
    }

ACT III — Managing Global Variables

Global Variables Management Subsystem (GVMS) [MASCOTS12]
• The programmer can access both the LP's private state and the global portion
• Rely on instrumentation via Hijacker
• Implement shared state as multi-versioned variables
• Propose an extended rollback scheme
• Rely on non-blocking algorithms for data synchronization
• GVMS exposes only two (internal) APIs:
◦ write_glob_var(void *orig_addr, time_type lvt, ...)
◦ void *read_glob_var(void *orig_addr, time_type my_lvt)

Read/Write Detection
• To efficiently support runtime execution, an exact number of multi-versioned global variables must be installed
• At linking time the .symtab section is explored, to find global variables in the executable
• A table of ⟨name, address, size⟩ tuples is built
• At simulation startup, the correct number of multi-versioned variables is installed

Shared Memory-Map Organization
[Figure: contiguous memory layout holding metadata, variables, version nodes, and read lists]
• Preallocated: no malloc invocations during execution
• Contiguous: enhances locality
• Accessed concurrently: CAS-based allocation of nodes

Version Lists
• Multi-versioned variables are implemented as version lists
• Each node represents one variable's value at a certain LVT
• Insert/delete operations are implemented as non-blocking operations by relying on the CAS primitive
[Figure: a new node Tz is linked between Tx and Ty by a CAS on Tx's next pointer]

Synchronization and Rollback
• To strengthen the optimism, we allow interleaved reads and writes on a version list
• We explicitly avoid letting a freshly installed version invalidate any version related to a greater LVT
[Figure: a write at LVT 8 between versions at LVT 6 and 10 invalidates a read at LVT 9 (violation) but not a read at LVT 7]

Synchronization and Rollback
• Processes which read a version node must leave a mark, i.e., visible reads are enforced
• The classical notion of rollback is augmented:
◦ in case of an inconsistent read, a special anti-message is sent to the involved LP
• A ReadList is maintained, to keep track of version reads
• After each write operation, the ReadList of the previous node is checked to see whether an anti-message must be scheduled to some LPs
• When an antimessage is received because of an inconsistent read, the version nodes related to that particular event must be removed
◦ this is done by connecting every node in the message queue with the version nodes installed during the event's execution

Experimental Results
PCS with global variables handling global statistics
[Figure: total execution time (seconds) vs number of worker threads, and throughput (cumulated committed events vs wall-clock time), comparing Message Passing, Shared Memory, and No Shared State configurations]
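The CAS-based version-list machinery above can be sketched in C11. This is a minimal single-variable illustration under stated assumptions: node pooling, read lists, visible reads, and deletion are omitted, and all names are illustrative rather than the actual GVMS code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* One version of a multi-versioned global variable. */
typedef struct vnode {
    double lvt;                     /* virtual time of this version */
    long value;
    _Atomic(struct vnode *) next;   /* list ordered by increasing lvt */
} vnode_t;

/* Non-blocking insert: link the new version in LVT order with one CAS
 * on the predecessor's next pointer; retry if a concurrent writer won. */
static int insert_version(vnode_t *head, vnode_t *n) {
    for (;;) {
        vnode_t *prev = head, *cur = atomic_load(&head->next);
        while (cur && cur->lvt < n->lvt) {   /* find insertion point */
            prev = cur;
            cur = atomic_load(&cur->next);
        }
        atomic_store(&n->next, cur);
        if (atomic_compare_exchange_weak(&prev->next, &cur, n))
            return 1;                        /* linked in atomically */
    }
}

/* Multiversion read: the latest version with lvt <= the reader's LVT. */
static vnode_t *read_version(vnode_t *head, double my_lvt) {
    vnode_t *best = NULL;
    for (vnode_t *c = atomic_load(&head->next); c && c->lvt <= my_lvt;
         c = atomic_load(&c->next))
        best = c;
    return best;
}
```

A read at LVT 9 against versions at LVT 5, 7, and 10 returns the one at 7, which matches the interleaving rule on the slide: a new version must never shadow a version with a greater LVT.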
What if the model...

    void ev_handl(int id, double time, int ev_type, void *ev_content) {
        switch(ev_type) {
        case INIT:
            my_state = malloc(sizeof(state_t));
            my_state->buffer = malloc(1024);
            break;
        case EVENT_x: {
            char *cont = my_state->buffer;
            ScheduleEvent(to, time, EVENT_y, cont, sizeof(char *));
            break;
        }
        case EVENT_y:
            strcpy(ev_content, "Some String!");
            break;
        <...>
        }
    }

ACT IV — Cross-Accessing Logical Processes' States

Step 1: Materializing Cross-State Dependencies [PADS14]
• To transparently detect accesses to other LPs' states we rely on the x86_64 kernel-level memory management architecture
[Figure: 4-level x86_64 address translation — the 48-bit linear address is split into PML4 (bits 47–39), Directory Ptr (38–30), Directory (29–21), and Table (20–12) indices plus a 12-bit offset; CR3 points to the PML4, whose entries lead through the Page-Directory-Pointer Table, Page Directory, and Page Table to the physical address inside a 4-KB page]
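The index extraction behind the translation scheme just shown is plain bit slicing, sketched below. The decomposition itself follows the standard x86_64 4-level paging layout; its relevance here is that every address inside a 1GB-aligned stock shares one PDPT index, which is what makes the stock-to-LP lookup constant-time.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 48-bit canonical linear address into its 4-level walk indices:
 * PML4 (bits 47-39), PDPT (38-30), PD (29-21), PT (20-12), offset (11-0).
 * Each table index is 9 bits wide (512 entries per table). */
static inline unsigned pml4_index(uint64_t va)  { return (va >> 39) & 0x1FF; }
static inline unsigned pdpt_index(uint64_t va)  { return (va >> 30) & 0x1FF; }
static inline unsigned pd_index(uint64_t va)    { return (va >> 21) & 0x1FF; }
static inline unsigned pt_index(uint64_t va)    { return (va >> 12) & 0x1FF; }
static inline unsigned page_offset(uint64_t va) { return va & 0xFFF; }
```

Since one PDPT entry spans 2^30 bytes, a 1GB stock maps exactly onto one PDPT entry: the owning LP of any faulting address can be found by indexing a table with `pdpt_index(va)` (together with the PML4 index).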
Memory Allocation Policy
• LPs use virtual memory according to stocks
• Memory requests are intercepted via malloc wrappers (DyMeLoR)
• Upon the first request, an interval of page-aligned virtual memory addresses is reserved via the mmap POSIX API (a stock)
• This is a set of empty-zero pages: a null byte is written to make the kernel actually allocate the chain of page tables
• One stock gives 1GB of available memory to each LP

Memory Access Management
• An LKM creates a device file accessible via ioctl
• The SET_VM_RANGE command associates stocks with LPs
• A kernel-level map (accessible in constant time) is created:
◦ each stock is logically related to one entry of a PDP page table
◦ the id of the LP which the stock belongs to is registered
• When LPj accesses LPi's state, we can detect it from the memory address
[Figure: the constant-time map, updated via the SET_VM_RANGE ioctl command, logically relates each PDPTE (0-th, 1-st, ...) to a simulation object; for each worker thread WTi a sibling PML4/PDP chain is built, initially with NULL entries, and the SCHEDULE_ON_PGD ioctl command installs the scheduled LP's PDPTE into the sibling chain before CR3 is switched to the sibling PML4]

Cross-State Dependency Materialization
• If other LPs' stocks are accessed, we get a memory fault
• This is the materialization of a Cross-State Dependency
• Yet, this page fault cannot be handled in the traditional way:
◦ memory has already been validated via mmap at simulation startup
◦ the Linux kernel would simply reallocate new pages
◦ for the same virtual page we would have multiple page-table entries!

Step 2: Event and Cross-State Synchronization (ECS)
• At startup we change the IDT to redirect the page-fault handler pointer to a specific ECS handler
• Upon a real segfault, the original handler is called
• Otherwise, the ECS handler pushes control back to user mode to let the PDES platform handle synchronization:
◦ execution goes back into platform mode
◦ CR3 is switched back to the original PML4 table
◦ the simulation kernel can access any memory buffer required for supporting synchronization

Step 2: Event and Cross-State Synchronization (ECS)
• At the end of the event the simulation platform invokes the UNSCHEDULE_ON_PGD command
• This explicitly brings execution back to platform mode
[Figure: SCHEDULE_ON_PGD moves from platform mode (CR3 points to the original PML4) to simulation-object mode (CR3 points to the sibling PML4); UNSCHEDULE_ON_PGD, or a faulting access to a remote stock, moves back to platform mode]
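The detection logic can be mimicked entirely in user space to make the idea concrete. The sketch below is a toy model under loud assumptions: the real mechanism faults through sibling page tables and a kernel-level handler, while here the "fault" is simulated by an address check, and every name (`ecs_access`, `cross_deps`, the 8-LP layout) is illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define STOCK_BITS 30   /* one 1GB-aligned stock per LP, as on the slides */
#define MAX_LPS    8

/* Cross-State Dependency sets: cross_deps[accessor][owner]. */
static int cross_deps[MAX_LPS][MAX_LPS];

/* Stock-to-LP map: with 1GB-aligned stocks this is a constant-time
 * lookup on the high address bits (stand-in for the kernel-level map). */
static int owner_of(uint64_t addr) {
    return (int)((addr >> STOCK_BITS) % MAX_LPS);
}

/* Returns 1 for a local access; 0 when the access "faults" because it
 * targets another LP's stock, materializing a cross-state dependency
 * (at which point the real system would block the LP and start the
 * rendez-vous protocol). */
static int ecs_access(int lp, uint64_t addr) {
    int owner = owner_of(addr);
    if (owner == lp)
        return 1;
    cross_deps[lp][owner] = 1;
    return 0;
}
```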
faulting access to a remote stock • Upon a CR3 switch, the penalty incurred is a flush of the TLB 55 of 66 - Techniques for Transparent Parallelization of Discrete Event Simulation Models ECS System Property When a Cross-State Dependency is materialized at simulation time T , the involved LP observes the state snapshot that would have been observed in a sequential-run. • To support this we introduce: ◦ temporary LP blocking: the execution of an event can be suspended ◦ rendez-vous events: system-level simulation events not causing state updates • Events are “transactified”: read/write operations are serialized without pre-declarations • Bypass of the RPC (copy-restore) model 56 of 66 - Techniques for Transparent Parallelization of Discrete Event Simulation Models ECS System LPx WCT CSDx = {} Cross-State Dependency Set LPy 57 of 66 - Techniques for Transparent Parallelization of Discrete Event Simulation Models WCT ECS System LPx CSDx = {} ex LPy 57 of 66 - Techniques for Transparent Parallelization of Discrete Event Simulation Models WCT WCT ECS System memory fault LPx CSDx = {} ex LPy 57 of 66 - Techniques for Transparent Parallelization of Discrete Event Simulation Models WCT WCT ECS System Generation of a unique ID LPx CSDx = {} ex LPy 57 of 66 - Techniques for Transparent Parallelization of Discrete Event Simulation Models WCT WCT ECS System block state LPx CSDx = {} ex LPy 57 of 66 - Techniques for Transparent Parallelization of Discrete Event Simulation Models WCT WCT ECS System LPx CSDx = {} LPy WCT ex ervx Incorporated in the event queue 57 of 66 - Techniques for Transparent Parallelization of Discrete Event Simulation Models WCT ECS System LPx CSDx = {} LPy WCT ex ervx 57 of 66 - Techniques for Transparent Parallelization of Discrete Event Simulation Models WCT ECS System LPx CSDx = {} LPy ervax ex ervx 57 of 66 - Techniques for Transparent Parallelization of Discrete Event Simulation Models WCT WCT ECS System LPx CSDx = {y} LPy ervax ex ervx 57 of 66 - 
[Animation, continued: LPx moves back to the ready state and event execution resumes; at the end of ex, an unblock event eubx is sent to LPy, closing the rendez-vous.]

Rollback

• Rollback of LPx is managed via the traditional annihilation scheme
• Rollback of LPy must be explicitly notified
◦ A restart event ervrx is sent to LPx
• All other events are not incorporated in the queue
◦ They do not require special care for rollback operations
◦ They are simply discarded if no rendez-vous ID is found

Progress: Deadlock

[Figure: objects x, y, and z issue rendez-vous at times t1, t2, and t3, with the source objects blocked waiting for acks; the rendez-vous at t3 closes the cycle and generates a deadlock]

Progress: Domino Effect

[Figure: (1) a straggler forces a rollback of object x, requiring coasting-forward up to ts(ex) from a log; (2) snapshot reconstruction for the rendez-vous on object y requires coasting-forward up to ts(ex); (3) snapshot reconstruction for the rendez-vous in turn requires coasting-forward from an older log]

Experimental Evaluation: Overhead Assessment

• Personal Communication System benchmark
• 1024 wireless cells, 1000 wireless channels each
• 25%, 50%, and 75% channel utilization factor

[Plot: speedup vs channel utilization factor (0.25, 0.5, 0.75), with and without ad-hoc memory management]
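The deadlock scenario above amounts to a cycle in the wait-for graph of LPs blocked on rendez-vous acks. Under the assumption that the blocking relations are known, a minimal detector might look like:

```python
# Illustrative sketch: detect LPs lying on a wait-for cycle.

def find_cycle(waits_for):
    """Return the set of nodes on a wait-for cycle, or the empty set.

    waits_for maps each blocked LP to the LPs whose ack it awaits.
    A node is deadlocked if it can reach itself through the graph.
    """
    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            n = stack.pop()
            for m in waits_for.get(n, ()):
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    return {n for n in waits_for if n in reachable(n)}

# x waits for y's ack, y waits for z, z waits for x: classic deadlock.
deadlocked = find_cycle({"x": ["y"], "y": ["z"], "z": ["x"]})
```

A real platform would run such a check (or an equivalent distributed scheme) periodically and break the cycle, e.g. by rolling one participant back.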
Experimental Evaluation: Effectiveness Assessment

• NoSQL data-grid simulation
• 2-Phase-Commit (2PC) protocol to ensure transaction atomicity
• Two different implementations:
◦ Not using ECS: the write set is sent via an event
◦ ECS-based: a pointer to the write set is sent
• 64 nodes (degree of replication 2 for each ⟨key, value⟩ pair)
• Closed-system configuration: 64 active concurrent clients continuously issuing transactions
• Amount of keys touched in write mode by transactions varied between 10 and 100

[Plot: overall execution time (seconds) vs average transaction write-set size (10 to 100), comparing ECS, Traditional Parallel, and Serial]

Overview

1. Dynamic Memory and Incremental State Saving
◦ Instrumentation-based interception of memory updates
◦ Efficient log/restore operations relying on compact metadata
◦ Possibility to switch to full mode depending on the operation cost (based on an innovative performance optimization/stability methodology)
2. Interacting with the Outside World
◦ Materialization moved off the critical path
◦ System-wide ordering of output
◦ Activation/deactivation scheme to enhance simulation efficiency

Overview (2)

3. Managing Global Variables
◦ Based on multiversion lists
◦ Non-blocking algorithm to access the list
◦ Reduced rollback impact in case of read operations
4. Cross-Accessing Logical Processes' States
◦ Kernel-level low-cost detection of cross-state accesses
◦ Introduction of execution suspension of LPs
◦ Introduction of a new synchronization protocol (ECS)
• All approaches are fully transparent with respect to the common event-handler programming paradigm

Thanks for your attention. Questions?

pellegrini@dis.uniroma1.it
http://www.dis.uniroma1.it/~pellegrini
http://www.dis.uniroma1.it/~ROOT-Sim

DES Skeleton

procedure Init
    End ← false
    initialize State, Clock
    schedule INIT
end procedure

procedure Simulation-Loop
    while End == false do
        Clock ← next event's time
        process next event
        Update Statistics
    end while
end procedure

1 of 24 - Techniques for Transparent Parallelization of Discrete Event Simulation Models

Di-DyMeLoR Architecture

[Diagram: application-level software issues SetState(), malloc(), free(), realloc(), calloc(), calls to third-party functions, and memory accesses; within the simulation platform, the Memory Map Manager and Allocator, the Third-Party Library Wrapper, and the Update Tracker expose set_current_lp(), dymelor_init(), take_full_log(), take_incremental_log(), state_restore(), and prune_log() towards the Log/Restore and Memory Recovery subsystems]

Restore Operations

[Animation: step-by-step reconstruction of an LP's state from the latest full log plus the subsequent incremental logs]
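The skeleton above translates almost directly into a runnable sequential loop. The following sketch mirrors its structure; the handler signature and the toy workload are illustrative assumptions.

```python
# Minimal sequential discrete event simulation loop, mirroring the
# Init / Simulation-Loop skeleton above.
import heapq

def run_simulation(initial_events, handler, end_time):
    clock = 0.0
    queue = list(initial_events)      # pending (timestamp, event) pairs
    heapq.heapify(queue)
    processed = 0
    while queue:
        ts, event = heapq.heappop(queue)
        if ts > end_time:
            break
        clock = ts                    # Clock <- next event's time
        for new_ts, new_ev in handler(clock, event):
            heapq.heappush(queue, (new_ts, new_ev))   # schedule
        processed += 1                # Update Statistics
    return clock, processed

def handler(clock, event):
    # Toy model: each event schedules a follow-up one time unit later,
    # until logical time 3 is reached.
    return [(clock + 1.0, "tick")] if clock < 3 else []

final_clock, n = run_simulation([(0.0, "init")], handler, end_time=10.0)
```

A Time Warp platform replaces this single loop with one speculative loop per worker thread, plus the rollback machinery discussed throughout the thesis.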
Comparison of Instrumenting Tools

Tool        Static  Dynamic  Features
ATOM        X                Alpha machines only
EEL         X                C++ interfaces to alter code
BIRD        X                Windows only, rewrites first 5 bytes of functions
PEBIL       X                Linux only, relocates the code
Dyninst     X       X        Mostly used for performance profiling and debugging
Pin                 X        Just-in-time instrumentation
DynamoRIO           X        Powerful, multiplatform, must implement a runtime client
Valgrind            X        Powerful, but invasive; emulates the execution of programs
Hijacker    X                XML-based instrumentation, offers built-in features
Rule-based Instrumentation

• Code is described in terms of functions, specific (machine-dependent) instructions, or instruction families
• Instrumentation entails adding code snippets at specific points and injecting completely new functionalities
• Replacing particular instructions with a function call is as simple as:

    <Inject file="mcopy.c"/>
    <Instruction instruction="movs" replace="nop">
        <AddCall where="after" function="mcopy" arguments="target" convention="registers"/>
    </Instruction>

Instruction Families

I_MEMRD: the instruction reads from memory
I_MEMWR: the instruction writes to memory
I_CTRL: the instruction performs checks on data
I_JUMP: the instruction alters the execution flow
I_CALL: the instruction calls a different function
I_RET: the instruction returns from a callee
I_CONDITIONAL: the instruction is executed only if a condition is met
I_STRING: the instruction operates on large amounts of data
I_ALU: the instruction performs some logical/arithmetic operation
I_FPU: the instruction performs some floating-point operation
I_STACK: the instruction works on the stack
I_INDIRECT: the instruction's behavior might depend on some runtime value

Hijacker's Architecture

[Diagram: the front-end's XML parser reads the XML config file; a file loader brings the input relocatable executable into an internal representation via executable-format interpreters and instruction-set disassemblers; the instrumentation rule manager drives the instrumentation engine; the back-end's executable-format generators and instruction-set assemblers emit the output relocatable executable through a file writer]

Full Log Model
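To illustrate how such a rule could drive a rewriting pass, here is a toy interpreter for it. The <Root> wrapper element and the textual instruction stream are assumptions made for the example; Hijacker itself operates on relocatable binaries, not mnemonic strings.

```python
# Toy interpreter for an instrumentation rule in the style shown above.
import xml.etree.ElementTree as ET

RULE = """
<Root>
  <Inject file="mcopy.c"/>
  <Instruction instruction="movs" replace="nop">
    <AddCall where="after" function="mcopy" arguments="target"
             convention="registers"/>
  </Instruction>
</Root>
"""

def apply_rule(rule_xml, instructions):
    """Rewrite a toy instruction stream according to the rule:
    replace matching mnemonics and add the requested call after them."""
    root = ET.fromstring(rule_xml)
    out = []
    for insn in instructions:
        matched = False
        for rule in root.iter("Instruction"):
            if insn == rule.get("instruction"):
                matched = True
                out.append(rule.get("replace"))          # movs -> nop
                call = rule.find("AddCall")
                if call is not None and call.get("where") == "after":
                    out.append("call " + call.get("function"))
        if not matched:
            out.append(insn)
    return out

rewritten = apply_rule(RULE, ["mov", "movs", "add"])
```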
OH_F = (S_F / χ_F) · δ_LB + P_r · (S_F · δ_RB + ((χ_F − 1) / 2) · δ_e)

δ_e: average event execution cost
S_F: average size of a full log
δ_LB: average cost for logging one byte of the state image (including metadata)
δ_RB: average cost for restoring one byte from the log
P_r: rollback probability (frequency of rollback occurrences over event executions)
χ_F: selected log interval when operating in non-incremental mode

Full Log Model

• The non-incremental overhead is minimized for:

χ_F = ⌈ √( 2 · S_F · δ_LB / (P_r · δ_e) ) ⌉

• The optimal checkpointing interval is denoted as χ_F^opt

Incremental Log Model

OH_I = (S_P / χ_I) · δ_LB + ((S_F − S_P) / (χ_I · χ_I,F)) · δ_LB + P_r · (S_F · δ_RB + ((χ_I − 1) / 2) · (δ_e + δ_m)) + δ_m

S_P: average size of an incremental log
χ_I: selected log interval when operating in incremental mode
χ_I,F: interleave step between full and incremental logs
δ_m: per-event cost for running the memory-update tracking module

Incremental Log Model

• The incremental overhead is minimized for:

χ_I = ⌈ √( 2 · S_P · δ_LB / (P_r · (δ_e + δ_m)) ) ⌉

• The optimal checkpointing interval is denoted as χ_I^opt

Integral Model

• The best-suited operating mode is selected using a cost function CF(χ_F^opt, χ_I^opt) defined as:

CF(χ_F^opt, χ_I^opt) = OH_F(χ_F^opt) − OH_I(χ_I^opt)

• The final choice is taken on the result of the integration of this cost function over a multi-dimensional domain defined by the values of the parameters {δ_e, δ_m, δ_LB, δ_RB, P_r, S_F, S_P}.
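The full-log formulas can be checked numerically. The sketch below (parameter values are arbitrary examples) computes the optimal interval and verifies that its overhead does not exceed that of nearby intervals:

```python
# Numerical check of the full-log overhead model and its optimum.
import math

def oh_full(chi, S_F, d_LB, d_RB, d_e, P_r):
    """Per-event overhead of full logging with checkpoint interval chi."""
    return (S_F / chi) * d_LB + P_r * (S_F * d_RB + (chi - 1) / 2 * d_e)

def chi_opt(S_F, d_LB, d_e, P_r):
    """Ceiling of the analytic minimizer of oh_full."""
    return math.ceil(math.sqrt(2 * S_F * d_LB / (P_r * d_e)))

# Arbitrary illustrative parameters: 4 KB state, nanosecond-scale
# per-byte costs, 100-microsecond events, 5% rollback probability.
params = dict(S_F=4096, d_LB=1e-8, d_RB=1e-8, d_e=1e-4, P_r=0.05)
best = chi_opt(params["S_F"], params["d_LB"], params["d_e"], params["P_r"])
around = {c: oh_full(c, **params) for c in (best - 2, best, best + 2)}
```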
PCS: Throughput (variable τ_A)

[Plot: cumulated committed events vs wall-clock time (seconds), comparing Incremental (scattered state), Full (scattered state), Full (contiguous state), and No Monitor (scattered state)]

PCS: Throughput (fading recalculation)

[Plot: cumulated committed events vs wall-clock time (seconds), same four configurations]

PCS: Memory Consumption

[Bar chart: total memory occupancy (MB), split into log metadata, actual state, and log, for Full, Incremental, and No Monitor configurations under each combination of state layout (C: contiguous, S: scattered) and workload (V: variable τ_A, F: fading recalculation)]
PCS: Event Latency

[Plot: event latency (microseconds) vs call inter-arrival frequency (1 to 6), instrumented vs non-instrumented]

Simulation Throughput

[Plot: cumulated committed events vs wall-clock time (seconds), comparing With Daemon, Without Running Daemon, Without Subsystem, No printf, and Output in Event Queue]

Non-Blocking Stack Throughput

[Plot]

Non-Blocking Linked List Throughput

[Plot]

Concurrent Allocator

procedure Allocate
    m ← generate_mark()
    slot ← first_node_free
    while true do
        alloc ← vers[slot].alloc
        if alloc ∨ ¬CAS(vers[slot].alloc, alloc, m) then
            slot ← next slot in circular policy
        else
            break
        end if
    end while
    atomically update first_node_free
    return slot
end procedure

Read Operation

procedure Read(addr, lvt)
    slot ← hash table's entry associated with addr
    if slot ∈ AccessSet then
        version ← AccessSet[slot]
    else
        version ← Find-Node(slot, lvt)
        AccessSet[slot] ← version
    end if
    return vers[version].value
end procedure

Write Operation

procedure Write(addr, lvt, val)
    slot ← hash table's entry associated with addr
    if slot ∈ AccessSet then
        version ← AccessSet[slot]
        vers[version].value ← val
    else
        version ← Insert-Version(slot, lvt, val)
        AccessSet[slot] ← version
    end if
end procedure
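A sequential sketch of the multiversion idea behind Read and Write: each write at logical time lvt creates a new version, and a read at lvt observes the latest version not in its future. The non-blocking CAS-based list is replaced here by plain sorted insertion, and all names are illustrative.

```python
# Simplified sequential analogue of a multiversion global variable.
import bisect

class MultiVersion:
    def __init__(self, initial):
        # Parallel sorted arrays of version timestamps and values;
        # the sentinel -inf holds the initial value.
        self.times = [float("-inf")]
        self.values = [initial]

    def write(self, lvt, value):
        # Insert a new version, keeping timestamps sorted
        # (Insert-Version in the pseudocode above).
        i = bisect.bisect_right(self.times, lvt)
        self.times.insert(i, lvt)
        self.values.insert(i, value)

    def read(self, lvt):
        # Latest version with timestamp <= lvt
        # (Find-Node in the pseudocode above).
        i = bisect.bisect_right(self.times, lvt) - 1
        return self.values[i]

v = MultiVersion(0)
v.write(10, "a")
v.write(20, "b")
```

Keeping old versions around is what reduces the rollback impact of reads: a read at a rolled-back time can simply be re-answered from the list.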
Output Message Rollback

• Upon a rollback, a ROLLBACK message is written to the logical device, piggybacking:
◦ a [from, to] interval
◦ the involved LP
◦ an era, a monotonic counter updated by the simulation kernel upon the execution of a rollback operation, on a per-LP basis
• Calendar Queue buckets are augmented with a Bloom filter storing the eras of the messages they contain
• from and to are mapped to buckets
• A linear search is performed between the two buckets, checking only the ones which are expected to contain an element according to the Bloom filter

Where do we go now?

• Full support for stateful libraries
• Support for any third-party library
• What if the modeler knows something about parallelism?
◦ What does it mean to call, e.g., pthread_create() in a handler?
• Specific optimizations for other programming languages
• Testing, testing, testing!
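A minimal sketch of the Bloom-filtered bucket scan described above. The hash functions, message layout, and integer bucket timestamps are illustrative assumptions.

```python
# Illustrative Bloom-filtered annihilation of output messages.

class Bloom:
    """Tiny Bloom filter over rollback eras (two hash functions)."""
    def __init__(self, bits=64):
        self.bits = bits
        self.field = 0

    def _hashes(self, item):
        h = hash(item)
        return (h % self.bits, (h // self.bits) % self.bits)

    def add(self, item):
        for b in self._hashes(item):
            self.field |= 1 << b

    def might_contain(self, item):
        # May report false positives, never false negatives.
        return all(self.field >> b & 1 for b in self._hashes(item))

def annihilate(buckets, frm, to, lp, era):
    """Scan only the calendar-queue buckets in [frm, to] whose Bloom
    filter may contain the rolled-back era; drop matching messages."""
    dropped = 0
    for ts in range(frm, to + 1):          # assumes integer bucket keys
        bloom, msgs = buckets.get(ts, (None, []))
        if bloom is None or not bloom.might_contain(era):
            continue                       # filter rules this bucket out
        kept = [m for m in msgs if not (m["lp"] == lp and m["era"] == era)]
        dropped += len(msgs) - len(kept)
        buckets[ts] = (bloom, kept)
    return dropped

b = Bloom()
b.add(7)
buckets = {3: (b, [{"lp": "x", "era": 7}, {"lp": "y", "era": 1}])}
dropped = annihilate(buckets, 1, 4, "x", 7)
```

The filter trades a small false-positive rate for skipping most buckets during the linear search, which is what makes the rollback of output messages cheap.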