Techniques for Transparent Parallelization of
Discrete Event Simulation Models
Alessandro Pellegrini
PhD program in Computer Science
and Engineering
Cycle XXVI
September 26th, 2014
Technological Trend
• Implications of Moore's Law have changed since 2003
• 130W is considered an upper bound (the power wall)
P = A · C · V² · f
• The Future is Parallel!
Multicore Software Scaling
The Needs
• Multicore / Clusters are a consolidated reality
• Both Scientific & Commodity Users
• Parallel Programming has been a privilege of few experts so far
• Large amount of (sequential) legacy code around

As parallel processors become ubiquitous, the key question is:
How do we convert the applications we care about into parallel programs?
Thesis' Goals: Automatic Code Parallelization
"Relieve programmers from the tedious and error-prone manual parallelization process"
• Targeting Event Driven Programming
• Transparency
◦ programmers fully exploit the semantic power of languages
◦ operations they should know nothing about are automatic
◦ avoid sets of new APIs, rely on classical/standard programming models
• Efficiency
◦ make the parallel programs run fast
◦ specifically rely on non-blocking algorithms and speculation
• Best trade-off between the two
Research Context
• Addressing every aspect of parallel/concurrent programming is non-trivial
• I have concentrated on an instance of Event Driven Programming:
◦ No limitations to apply the results to other instances
◦ Discrete Event Simulation Environments
• an effective test-bed for general event-driven programming
• natural evolution of the work carried out by my research group
• widely applicable
http://www.dis.uniroma1.it/∼hpdcs/ROOT-Sim
Event Driven Programming
• The flow of the program is determined by events
◦ sensors outputs
◦ user actions
◦ messages from other programs or threads
• It is based on a main loop divided into two phases:
◦ event selection/detection
◦ event handling
• Based on event handlers
◦ they are essentially asynchronous callbacks
◦ events can be queued if the involved handler is busy at the moment
◦ events’ execution is inherently parallel
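A minimal C sketch of such a main loop (types and the helpers next_event/handle_event are illustrative, not the platform's API):

#include <stddef.h>

typedef struct event {
    double time;      /* occurrence time of the event            */
    int    type;      /* event type dispatched by the handler    */
    void  *content;   /* event payload                           */
} event_t;

/* hypothetical queue primitives */
extern event_t *next_event(void);          /* event selection/detection */
extern void     handle_event(event_t *e);  /* event handling (callback) */

void event_loop(void) {
    event_t *e;
    /* the main loop alternates the two phases: selection and handling */
    while ((e = next_event()) != NULL) {
        handle_event(e);   /* asynchronous callback, possibly queued earlier */
    }
}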
Discrete Event Simulation (DES)
• A discrete event occurs at an instant in time and marks a change
of state in the system
• DES represents the operation of a system as a chronological
sequence of events
• Program state is represented using Logical Processes (LPs)
◦ A collection of data structures scattered in memory
◦ Each simulation object mapped to one LP
◦ Events produce changes in the LPs
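Concretely, the entities above can be captured by minimal data types; this C sketch uses names of my own choosing (not the ROOT-Sim ones):

typedef struct des_event {
    double timestamp;     /* instant in time at which the discrete event occurs */
    int    recipient_lp;  /* LP whose state the event changes                   */
    int    type;
    void  *payload;
} des_event_t;

typedef struct logical_process {
    int   id;
    void *state;          /* root of the LP's data structures scattered in memory */
} logical_process_t;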
Discrete Event Simulation (DES)
• Highly versatile technique
• Allows the analysis of complex systems
◦ VHDL
◦ traffic simulation
◦ agent-based simulation
◦ ...
• Systems can be simulated before they are built or become operational (what-if analysis)
◦ online reconfiguration of runtime parameters (e.g. on Cloud platforms) [SIMUTOOLS13]
• Interaction of simulated worlds with physical worlds (symbiotic simulation)
PDES Logical Architecture on Multicores
[Figure: several LPs hosted on top of the Discrete Event Runtime Environment (Simulation Kernel), which in turn runs on the CPUs of the underlying multicore machine.]
The Synchronization Problem
[Figure: three LPs (LPi, LPj, LPk) process events along execution time, with event timestamps such as 5, 10, 20 on LPi, 10, 12, 15, 20 on LPj and 7, 17, 25 on LPk; messages exchanged across LPs may arrive in the simulated past of the receiver, producing a straggler message.]
Optimistic (Speculative) Synchronization: Rollback
[Figure: upon receiving a straggler message with timestamp 12, LPj performs a rollback execution recovering its state at LVT 10; antimessages are sent for the messages it had already produced (e.g. the one at timestamp 17 towards LPk), and their reception in turn triggers a rollback execution recovering LPk's state at LVT 7.]
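A hedged C sketch of the rollback reaction to a straggler, assuming hypothetical helpers (restore_snapshot, send_antimessages_after, reprocess_events) in place of the actual platform internals:

/* hypothetical platform helpers */
extern double restore_snapshot(int lp, double not_after);
extern void   send_antimessages_after(int lp, double lvt);
extern void   reprocess_events(int lp, double from, double to);

/* Invoked when LP lp receives a straggler with timestamp ts < its LVT. */
void on_straggler(int lp, double straggler_ts) {
    /* 1. restore the latest snapshot taken at LVT <= straggler_ts          */
    double restored_lvt = restore_snapshot(lp, straggler_ts);

    /* 2. annihilate the output of rolled-back events via antimessages      */
    send_antimessages_after(lp, restored_lvt);

    /* 3. re-execute (or silently replay) events up to the straggler's time */
    reprocess_events(lp, restored_lvt, straggler_ts);
}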
Rollback Supports: State Saving
[Figure: a snapshot is taken before the execution of every simulation event; when LPj receives a straggler message with timestamp 12, a rollback execution restores the simulation state at T = 7 from the corresponding snapshot.]
What if the model...

void *simulation_states[lps];

void ev_handl(int id, double time, int ev_type, void *ev_content) {
    switch(ev_type) {
        case INIT:
            simulation_states[id] = malloc(sizeof(some_structure));
            break;
        case EVENT_x:
            ((some_structure *)simulation_states[id])->counter++;
            break;
        <...>
    }
}
ACT I
Dynamic Memory and
Incremental State Saving
What we want...
• Transparency
◦ Interception of memory-related operations (no platform APIs)
◦ No application-level procedure for (incremental) log/restore tasks
• Optimism-Aware Runtime Supports
◦ Recoverability of generic memory operations: allocation, deallocation,
and updating
• Incrementality
◦ Cope with memory “abuse” of speculative rollback-based
synchronization schemes
◦ Enhance memory locality
Di-DyMeLoR [PADS09], candidate for Best Paper
Di-DyMeLoR: Dirty Dynamic Memory Logger and Restorer
• Lightweight software instrumentation
◦ Optimized memory-write access tracing and logging
◦ Arbitrary-granularity memory-write tracing
◦ Concentration of most of the instrumentation tasks at a pre-running
stage:
• No costly runtime dynamic disassembling
• Standard API wrappers
◦ Code can call standard malloc services
◦ Memory map transparently managed by the simulation platform
Memory Map
[Figure: per-LP memory map made of malloc_areas, each holding a block status bitmap, a dirty bitmap, and the chunks themselves.]
• Memory (for each LP) is pre-allocated
• Requests are served on a chunk basis
• Explicit avoidance of per-chunk metadata
◦ Block status bitmap: tracks used chunks
◦ Dirty bitmap: tracks updated chunks since last log
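A minimal sketch of the per-area metadata in C; field names and the chunk count are illustrative assumptions, not the actual DyMeLoR layout:

#include <stdint.h>
#include <stddef.h>

#define CHUNKS_PER_AREA 1024   /* illustrative value */

typedef struct malloc_area {
    size_t   chunk_size;                          /* all chunks in an area have the same size */
    int      alloc_chunks;                        /* currently used chunks                    */
    uint64_t status_bitmap[CHUNKS_PER_AREA / 64]; /* 1 = chunk in use                         */
    uint64_t dirty_bitmap[CHUNKS_PER_AREA / 64];  /* 1 = chunk updated since the last log     */
    unsigned char *chunks;                        /* pre-allocated, contiguous chunk storage  */
} malloc_area_t;

/* mark chunk i as dirty upon an intercepted memory write */
static inline void set_dirty(malloc_area_t *a, int i) {
    a->dirty_bitmap[i / 64] |= (uint64_t)1 << (i % 64);
}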
Hijacker [HPCS13], candidate for the Best Paper
• Hijacker is a rule-based static binary instrumentation tool
• It has been specifically tailored to HPC applications
• It tries to hide away all the complexities related to binary
manipulation
• It is highly modular: it can be extended to different instruction sets and binary formats
• Instrumentation rules are specified using XML
Internal binary representation
[Figure: the executable is represented internally as a set of functions, each made of instructions (e.g. jmp, mov, call) plus the associated data.]
Static Binary Instrumentation in Di-DyMeLoR
[Figure: the instrumentation process rewrites the original executable into the final one. A memory update (mov $3, x) is preceded by a call to the tracking routine (push struct; call track); an indirect branch (jmp *%eax) becomes a call to the corrector (push struct; call corrector) followed by a regular jump into a new writeable section, whose target jump is modified at runtime by branch_corrector.]
Discussion of the Approach
• Fast O(1) management of memory access tracking on a CISC ISA
(x86)
◦ thanks to cached disassembly information
• Enhanced locality
◦ chunks to be logged are contiguous
• Fast log/restore operations
◦ tiny and compact data structures to analyze
• Supporting a free is non-trivial due to the lack of headers
◦ fast software cache for mapping addresses to malloc areas
Log/Restore Operations
• Log Operation
◦ Only used malloc areas are logged
◦ Status and dirty bitmap are copied to the log buffer
◦ Only chunks dirtied since the last log are copied
• Restore Operation
◦ The chain of logs is traversed
◦ Status/Dirty bitmaps are used to know which chunks are present
◦ The operation stops when all the (allocated) chunks are restored
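A hedged sketch of the incremental log step for one area, reusing the hypothetical malloc_area_t layout sketched above: only the bitmaps and the chunks dirtied since the last log are appended to the log buffer.

#include <stdint.h>
#include <string.h>

/* Appends the area's bitmaps and dirty chunks to the log buffer and
   returns the updated write position (sketch: error handling omitted). */
unsigned char *log_area_incremental(malloc_area_t *a, unsigned char *log) {
    /* copy status and dirty bitmaps so that restore knows which chunks are present */
    memcpy(log, a->status_bitmap, sizeof(a->status_bitmap));
    log += sizeof(a->status_bitmap);
    memcpy(log, a->dirty_bitmap, sizeof(a->dirty_bitmap));
    log += sizeof(a->dirty_bitmap);

    /* copy only the chunks dirtied since the last log */
    for (int i = 0; i < CHUNKS_PER_AREA; i++) {
        if (a->dirty_bitmap[i / 64] & ((uint64_t)1 << (i % 64))) {
            memcpy(log, a->chunks + (size_t)i * a->chunk_size, a->chunk_size);
            log += a->chunk_size;
        }
    }
    memset(a->dirty_bitmap, 0, sizeof(a->dirty_bitmap)); /* reset for the next interval */
    return log;
}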
Isn't the cost high?
• Do we really have to pay the cost of the memory-tracking supports?
• No, if we don't get any benefit from them!
• Hijacker offers code multiversion facilities
• We have developed an analytic model to determine whether incrementality pays off [TPDS2014]
◦ switching between pure full and incremental modes involves only changing a function pointer
◦ the decision is stable to fluctuations thanks to an integral model
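Switching between the two operating modes can then be as simple as swapping a function pointer; an illustrative sketch (signatures are assumptions):

/* the two checkpointing strategies, with the same signature */
extern void take_full_log(int lp);
extern void take_incremental_log(int lp);

/* the currently selected strategy; the autonomic model updates it */
static void (*take_log)(int lp) = take_full_log;

void select_mode(double overhead_full, double overhead_incremental) {
    /* the analytic/integral model compares the two expected overheads */
    take_log = (overhead_incremental < overhead_full) ? take_incremental_log
                                                      : take_full_log;
}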
What does the literature propose?
• Large-granularity approaches
◦ Memory (page-based) protection solutions [SQ06]
◦ The cost is higher
◦ Logs easily become huge
• Small-granularity approaches
◦ Instrumentation-based approach [WP96]
◦ Each memory operation is stored
◦ Complex reconstruction of a previous state
◦ Larger memory footprint
• API-based approaches
◦ Specific APIs are used to indicate where the state is located
◦ Specific APIs are used to mark modified areas
◦ The modeler has to implement callbacks to generate/restore logs
Test-Bed Scenario and Settings
• NoSQL benchmark [SIMUTOOLS13]
◦ 64 cache servers
◦ 50000 data objects per cache server
◦ read-intensive vs write-intensive phases (5% to 95% accesses in write mode)
• Run on an HP ProLiant server:
◦ 64-bit NUMA machine
◦ four 2GHz AMD Opteron 6128 processors and 32GB of RAM
◦ each processor has 8 CPU-cores (for a total of 32 CPU-cores)
NoSQL: Execution Time
[Figure: overall execution time (seconds) of the NoSQL benchmark during the read-intensive and write-intensive phases, comparing the Serial, No Monitor, Full, and Incremental configurations.]
What if the model...

void ev_handl(int id, double time, int ev_type, void *ev_content) {
    switch(ev_type) {
        case EVENT:
            printf("I'm executing event %d\n", ev_type);
            break;
        <...>
    }
}
Non-Rollbackable Operations
[Figure: LP1 executes event e1 at time T1 and produces output via printf() towards the outside world; a later message m2 carrying event e2 with T2 < T1 forces a rollback inside the simulation framework, but the rollback operation is not possible on the already-emitted output.]
• It's not a reproducibility problem (as in fault tolerance)
• It's a correctness one (rollbacks, silent execution)
ACT II
Interacting with the Outside World
What do we want...
• What does the literature propose?
1. Ad-hoc output-generation APIs provided by simulation frameworks
2. Temporary suspension of processing activities until output generation is safe (delay until commit)
3. Storing output messages in events, and materializing during fossil collection (this affects the critical path)
• What we explicitly want:
1. Transparency
2. Move most operations away from the critical path
[Figure: while LPx is executing an event at T1 and processing the associated I/O, a rollback involving events at later timestamps (T2..T6) turns that I/O processing into wasted work.]
3. Order output messages on a system-wide basis
Output Messages Management [PADS13]
• We rely on link-time wrappers
◦ printf family functions are redirected to an internal subsystem
• We base our solution on an output daemon
◦ a user-space process separated from the actual simulation framework
◦ It is not given a dedicated processing unit
• Communication with the kernel is achieved via a logical device
◦ A non-blocking shared memory buffer, accessed circularly
◦ If it gets filled, a new (double-sized) buffer gets chained
◦ Once empty, the older buffer gets destroyed
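One standard way to realize such link-time wrappers is GNU ld's --wrap option; the sketch below is a generic illustration (deliver_to_output_daemon and current_lvt are hypothetical hooks), not necessarily the platform's exact mechanism. Linking with -Wl,--wrap=printf makes every printf call reach __wrap_printf.

#include <stdarg.h>
#include <stdio.h>

/* hypothetical platform hooks: hand the formatted string to the output daemon */
extern void   deliver_to_output_daemon(const char *msg, double lvt);
extern double current_lvt(void);

int __wrap_printf(const char *fmt, ...) {
    char buf[1024];
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf, sizeof(buf), fmt, ap);   /* format now ...            */
    va_end(ap);
    deliver_to_output_daemon(buf, current_lvt());   /* ... materialize it later  */
    return n;
}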
Non-blocking logical device
• A logical device is a per-thread channel
• The device is written by a thread, and is read by the output daemon: we can implement a non-blocking access algorithm
[Figure: circular buffer with a read index r updated by the daemon and a write index w updated by the thread.]
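With a single writer (the worker thread) and a single reader (the daemon), the channel can be realized as a lock-free circular buffer in which each side updates only its own index; a minimal C11 sketch with illustrative sizes and names:

#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define DEV_SLOTS 4096   /* power of two, so monotonic indices wrap correctly */
#define MSG_SIZE  256

typedef struct logical_device {
    _Atomic size_t r;                    /* read index, updated by the daemon  */
    _Atomic size_t w;                    /* write index, updated by the thread */
    char slots[DEV_SLOTS][MSG_SIZE];
} logical_device_t;

/* producer side (worker thread): returns false if the buffer is full */
bool dev_write(logical_device_t *d, const char *msg) {
    size_t w = atomic_load_explicit(&d->w, memory_order_relaxed);
    size_t r = atomic_load_explicit(&d->r, memory_order_acquire);
    if (w - r == DEV_SLOTS)
        return false;                    /* full: the platform chains a larger buffer */
    strncpy(d->slots[w % DEV_SLOTS], msg, MSG_SIZE - 1);
    d->slots[w % DEV_SLOTS][MSG_SIZE - 1] = '\0';
    atomic_store_explicit(&d->w, w + 1, memory_order_release);
    return true;
}

/* consumer side (output daemon): returns false if the buffer is empty */
bool dev_read(logical_device_t *d, char *out) {
    size_t r = atomic_load_explicit(&d->r, memory_order_relaxed);
    size_t w = atomic_load_explicit(&d->w, memory_order_acquire);
    if (r == w)
        return false;
    memcpy(out, d->slots[r % DEV_SLOTS], MSG_SIZE);
    atomic_store_explicit(&d->r, r + 1, memory_order_release);
    return true;
}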
Output Message Ordering, Commit, and Rollback
• A timestamped output message read from a device is stored into a
Calendar Queue
◦ fast O(1) access for output message insertion and materialization
◦ provides system-wide ordering
• Upon computation of the GVT, the output subsystem is notified
about the newly computed value (via a special COMMIT message)
• The Calendar Queue is then queried to retrieve messages falling
before that value
• Upon rollback, messages can be extracted from any position of the
queue
Output Daemon Wakeup
• All this might require much computing time
◦ We want a timely materialization!
◦ Yet we want an efficient simulation!
• Processing time of each type of message: t_o, t_c, t_r
• Mean number of events written to the device in a GVT phase: c̄_o, c̄_c, c̄_r
• Expected execution time to empty the logical device:
E(T) = Σ_{x ∈ {o, c, r}} t̄_x · c̄_x
• If larger than a compile-time threshold (smaller than the GVT interval), it is forced to that value
• The final value is divided into several time slices and a sleep time is computed, so as to create an activation/deactivation pattern
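A sketch of the wake-up computation; the capping threshold, the number of slices, and the way the sleep time is derived from the remaining budget are assumptions of mine, not the thesis' exact policy:

/* per-message processing times and mean per-GVT-phase counts */
typedef struct { double t_o, t_c, t_r; double c_o, c_c, c_r; } daemon_stats_t;

#define MAX_BUSY_TIME  0.05   /* seconds, assumed to stay below the GVT interval */
#define NUM_SLICES     10

/* returns the sleep time interleaved between the NUM_SLICES work slices */
double compute_sleep_time(const daemon_stats_t *s, double gvt_interval) {
    /* E(T) = sum over message types x of c_x * t_x */
    double expected = s->c_o * s->t_o + s->c_c * s->t_c + s->c_r * s->t_r;
    if (expected > MAX_BUSY_TIME)
        expected = MAX_BUSY_TIME;          /* force to the compile-time threshold */
    /* spread the remaining budget of the GVT phase across the activations */
    return (gvt_interval - expected) / NUM_SLICES;
}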
Test-Bed Scenario and Settings
• Personal Communication System (PCS) benchmark
◦ 1024 wireless cells, each one having 1000 channels
◦ random-walk mobility model
◦ high-fidelity simulation
• fading phenomena explicitly modeled
• explicitly accounts for climatic conditions
◦ simulation statistics printed periodically, with frequency f ∈ [1%, 35%]
of total events
◦ that is, between 200 and 7000 output messages produced per second
Experimental Results
[Figures: (left) throughput at 35% output frequency, i.e. cumulated committed events vs wall-clock time, comparing With Daemon, Without Running Daemon, Without Subsystem, No printf, and Output in Event Queue; (right) output delay at 1% and 35% output frequency, i.e. wall-clock time of output materialization vs wall-clock time of generation, against the theoretical instantaneous print (one print every 70 and every 2 handoff events, respectively).]
What if the model...

int processed_events = 0;

void ev_handl(int id, double time, int ev_type, void *ev_content) {
    switch(ev_type) {
        case EVENT_x:
            processed_events++;
            break;
        <...>
    }
}
ACT III
Managing Global Variables
Global Variables Management Subsystem (GVMS) [MASCOTS12]
• The programmer can access both the LP's private state and the global portion
• Rely on instrumentation via Hijacker
• Implement shared state as multi-versioned variables
• Propose an extended rollback scheme
• Rely on non-blocking algorithms for data synchronization
• GVMS exposes only two (internal) APIs:
◦ write_glob_var(void *orig_addr, time_type lvt, ...)
◦ void *read_glob_var(void *orig_addr, time_type my_lvt)
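Illustratively, after instrumentation an access to a global variable behaves as if it had been rewritten into calls to the two internal APIs; the hand-written equivalent below assumes time_type is a double and is only meant to show the transformation, which actually happens at the binary level:

typedef double time_type;              /* assumption: simulation time as double */

int total_calls;                       /* a global variable in the model        */

/* internal GVMS APIs, signatures as reported on the slide above */
extern void  write_glob_var(void *orig_addr, time_type lvt, ...);
extern void *read_glob_var(void *orig_addr, time_type my_lvt);

void model_code(time_type lvt) {
    /* original code:  total_calls = total_calls + 1;                     */
    /* instrumented equivalent: read the version visible at lvt, then
       install a new version tagged with lvt                              */
    int current = *(int *)read_glob_var(&total_calls, lvt);
    write_glob_var(&total_calls, lvt, current + 1);
}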
Read/Write Detection
• To efficiently support runtime execution, an exact number of
multi-versioned global variables must be installed
• At linking time the .symtab section is explored, to find global
variables in the executable
• A table of ⟨name, address, size⟩ tuples is built
• At simulation startup, the correct number of multi-versioned variables is installed
Shared Memory-Map Organization
[Figure: preallocated memory map divided into three contiguous regions: metadata, variables, and nodes (with the associated read lists).]
• Preallocated: no malloc invocations during execution
• Contiguous: enhances locality
• Accessed concurrently: CAS-based allocation of nodes
Version Lists
• Multi-versioned variables are implemented as version lists
• Each node represents one variable's value at a certain LVT
• Insert/Delete operations are implemented as non-blocking operations by relying on the CAS primitive
[Figure: version list delimited by head (H) and tail (T); a CAS on the predecessor's next pointer atomically installs a new node Ty in the list containing Tx and Tz.]
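A minimal sketch of the non-blocking insertion of a version node, using GCC's __atomic builtins as the CAS primitive; the node layout and the retry policy are illustrative:

#include <stdbool.h>

typedef struct version_node {
    double lvt;                           /* virtual time of this version   */
    long   value;                         /* variable's value at that time  */
    struct version_node *next;
} version_node_t;

/* Links node between prev and its current successor; on a failed CAS the
   caller re-traverses the list and retries.                               */
bool insert_after(version_node_t *prev, version_node_t *node) {
    version_node_t *succ = prev->next;
    node->next = succ;
    /* CAS: succeed only if prev->next still points to succ */
    return __atomic_compare_exchange_n(&prev->next, &succ, node,
                                       false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}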
Synchronization and Rollback
• To strengthen the optimism, we allow interleaved reads and writes on a version list
• We explicitly avoid having a freshly installed version invalidate any version related to a greater LVT
[Figure: a write at LVT = 8 installs a version between the ones at LVT = 6 and LVT = 10; a read at LVT = 7 is unaffected, while a read at LVT = 9 that was served by the LVT = 6 version becomes a violation.]
Synchronization and Rollback
• Processes which read a version node must leave a mark, i.e., visible reads are enforced
• The classical notion of rollback is augmented:
◦ In case of an inconsistent read, a special anti-message is sent to the related LP
• A ReadList is maintained, to keep track of version reads
• After each Write operation, the ReadList of the previous node is checked to see whether an anti-message must be scheduled to some LPs
• When an antimessage is received because of an inconsistent read, the version nodes related to that particular event must be removed
◦ This is done by connecting every node in the message queue with the version nodes installed during an event execution
Experimental Results
PCS with global variables handling global statistics
[Figures: (left) total execution time (seconds) vs number of worker threads and (right) throughput as cumulated committed events vs wall-clock time (seconds), comparing the Message Passing, Shared Memory, and No Shared State configurations.]
What if the model...

void ev_handl(int id, double time, int ev_type, void *ev_content) {
    switch(ev_type) {
        case INIT:
            my_state = malloc(sizeof(state_t));
            my_state->buffer = malloc(1024);
            break;
        case EVENT_x: {
            char *cont = my_state->buffer;
            ScheduleEvent(to, time, EVENT_y, cont, sizeof(char *));
            break;
        }
        case EVENT_y:
            strcpy(ev_content, "Some String!");
            break;
        <...>
    }
}
ACT IV
Cross-Accessing
Logical Processes’ States
Step 1: Materializing Cross-State Dependencies [PADS14]
• To transparently detect accesses to other LPs' states we rely on an x86_64 kernel-level memory management architecture
[Figure: x86_64 4-level paging. The linear address is split into PML4 (bits 47-39), Directory Pointer (38-30), Directory (29-21), Table (20-12), and Offset (11-0) fields; starting from CR3, the PML4E, PDPTE, PDE (with PS=0), and PTE entries lead to the physical address inside a 4-KB page.]
Memory Allocation Policy
• LPs use virtual memory according to stocks
• Memory requests are intercepted via malloc wrappers (DyMeLoR)
• Upon the first request, an interval of page-aligned virtual memory addresses is reserved via the mmap POSIX API (a stock)
• This is a set of empty-zero pages: a null byte is written to make the kernel actually allocate the chain of page tables
• One stock gives 1GB of available memory to each LP
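A sketch of the stock reservation via the POSIX mmap API; whether a single byte or one byte per page is touched is an implementation detail I am assuming here:

#include <sys/mman.h>
#include <stddef.h>

#define STOCK_SIZE (1UL << 30)   /* 1GB of virtual memory per LP */

void *reserve_stock(void) {
    unsigned char *stock = mmap(NULL, STOCK_SIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stock == MAP_FAILED)
        return NULL;
    /* write a null byte so that the kernel materializes the page-table
       chain (PML4E -> PDPTE -> ...) backing this interval               */
    stock[0] = 0;
    return stock;
}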
Memory Access Management
• An LKM creates a device file accessible via ioctl
• The SET_VM_RANGE command associates stocks with LPs
• A kernel-level map (accessible in constant time) is created:
◦ Each stock is logically related to one entry of a PDP page table
◦ The id of the LP which the stock belongs to is registered
• When LP j accesses LP i's state, we can tell this from the memory address alone
[Figure: a constant-time access map, updated via the SET_VM_RANGE ioctl command, associates the PDPT entries (0-th PDPTE, 1-st PDPTE, ...) reached through the PML4/PDP chain with the corresponding simulation objects x and y.]
Memory Access Management
[Figure sequence: initially the CR3 register points to the original PML4/PDP chain, whose 0-th PDPTE maps LPx's stock. A sibling PML4/PDP chain is kept for each worker thread WTi, with its PDPT entries initially set to NULL. When the worker thread issues ioctl(fd, SCHEDULE_ON_PGD, x), the 0-th PDPTE of the sibling chain is filled in for LPx and CR3 is switched to the sibling PML4.]
Cross-State Dependency Materialization
• If other LPs' stocks are accessed, we have a memory fault
• This is the materialization of a Cross-State Dependency
• Yet, this page fault cannot be handled in the traditional way:
◦ Memory has already been validated via mmap at simulation startup
◦ The Linux kernel would simply reallocate new pages
◦ For the same virtual page we would have multiple page table entries!
Step 2: Event and Cross-State Synchronization (ECS)
• At startup we change the IDT table to redirect the page-fault
handler pointer to a specific ECS handler
• Upon a real segfault, the original handler is called
• Otherwise, the ECS handler pushes control back to user mode to
let the PDES platform handle synchronization:
◦ Execution goes back into platform mode
◦ CR3 is switched back to the original PML4 table
◦ The simulation kernel can access any memory buffer required for
supporting synchronization
Step 2: Event and Cross-State Synchronization (ECS)
• At the end of the event the simulation platform invokes the UNSCHEDULE_ON_PGD command
• This explicitly brings execution back to platform mode
[Figure: SCHEDULE_ON_PGD switches from platform mode (CR3 points to the original PML4) to simulation-object mode (CR3 points to the sibling PML4); a faulting access to a remote stock is handled in that mode, and UNSCHEDULE_ON_PGD switches back to platform mode.]
• Upon a CR3 switch, the penalty incurred is a flush of the TLB
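From user space, the interaction with the LKM's device file can be sketched as follows; the command names come from the slides, while their numeric values and the helper function are assumptions:

#include <sys/ioctl.h>

/* command codes exposed by the LKM's device file (values are assumptions) */
#define SET_VM_RANGE       0x1001
#define SCHEDULE_ON_PGD    0x1002
#define UNSCHEDULE_ON_PGD  0x1003

/* Bracket the execution of an event destined to LP x, as on the slides:
   SCHEDULE_ON_PGD installs the sibling PML4 in CR3, UNSCHEDULE_ON_PGD
   brings execution back to platform mode.                               */
void run_event_on_lp(int fd, int x, void (*event_body)(void)) {
    ioctl(fd, SCHEDULE_ON_PGD, x);
    event_body();                 /* faulting accesses to remote stocks trigger ECS */
    ioctl(fd, UNSCHEDULE_ON_PGD);
}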
ECS System
Property
When a Cross-State Dependency is materialized at simulation time T, the involved LP observes the state snapshot that would have been observed in a sequential run.
• To support this we introduce:
◦ temporary LP blocking: the execution of an event can be suspended
◦ rendez-vous events: system-level simulation events not causing state
updates
• Events are “transactified”: read/write operations are serialized
without pre-declarations
• Bypass of the RPC (copy-restore) model
ECS System
[Figure sequence: LPx starts executing event ex while its Cross-State Dependency Set is empty (CSDx = {}). A memory fault towards LPy's stock triggers the generation of a unique rendez-vous ID, LPx enters the block state, and a rendez-vous event ervx is sent to LPy and incorporated in its event queue. When LPy processes it, an acknowledgement ervax is sent back, y is added to the dependency set (CSDx = {y}), LPx returns to the ready state and the execution of ex resumes. At the end of the event, an unblock event eubx is sent to LPy.]
Rollback
• Rollback of LPx is managed via the traditional annihilation scheme
• Rollback of LPy must be explicitly notified
◦ A restart event ervr_x is sent to LPx
• All other events are not incorporated in the queue
◦ They do not require special care for rollback operations
◦ They are simply discarded if no rendez-vous ID is found
Progress: Deadlock
[Figure: objects x, y and z issue rendez-vous requests at times t1, t2, t3 with the source objects blocked waiting for acks; the resulting cycle acts as a deadlock-generating rendez-vous.]
Progress: Domino Effect
[Figure: 1) a straggler forces object x to roll back, requiring a coasting forward up to ts(ex) from its log; 2) snapshot reconstruction for the rendez-vous on object y requires coasting forward up to ts(ex); 3) snapshot reconstruction for the rendez-vous in turn requires coasting forward from an older log.]
Experimental Evaluation: Overhead Assessment
• Personal Communication System Benchmark
• 1024 wireless cells, 1000 wireless channels each
• 25%, 50%, and 75% channel utilization factor
[Figure: speedup vs channel utilization factor (0.25, 0.5, 0.75), comparing the runs with and without the ad-hoc memory management.]
Experimental Evaluation: Effectiveness Assessment
• NoSQL data-grid simulation
• 2-Phase-Commit (2PC) protocol to ensure transaction atomicity
• Two different implementations:
◦ Not using ECS: the write set is sent via an event
◦ ECS-based: a pointer to the write set is sent
• 64 nodes (degree of replication 2 for each ⟨key, value⟩ pair)
• Closed-system configuration: 64 active concurrent clients continuously issuing transactions
• Amount of keys touched in write mode by transactions varied between 10 and 100
Experimental Evaluation: Effectiveness Assessment
[Figure: overall execution time (seconds) vs average transaction write set size (10 to 100), comparing ECS, Traditional Parallel, and Serial.]
Overview
1. Dynamic Memory and Incremental State Saving
◦ Instrumentation-based interception of memory updates
◦ Efficient log/restore operations relying on compact metadata
◦ Possibility to switch to full mode depending on the operation cost
(based on an innovative performance optimization/stability
methodology)
2. Interacting with the Outside World
◦ Materialization moved off from the critical path
◦ System-wide ordering of output
◦ Activation/Deactivation scheme to enhance simulation efficiency
Overview (2)
3. Managing Global Variables
◦ Based on multiversion lists
◦ Non-blocking algorithm to access the list
◦ Reduced rollback impact in case of read operations
4. Cross-Accessing Logical Processes’ States
◦ Kernel-level low-cost detection of cross-state accesses
◦ Introduction of execution suspension of LPs
◦ Introduction of a new synchronization protocol (ECS)
• All approaches are fully transparent with respect to the common event-handler programming paradigm
Thanks for your attention
Questions?
pellegrini@dis.uniroma1.it
http://www.dis.uniroma1.it/∼pellegrini
http://www.dis.uniroma1.it/∼ROOT-Sim
DES Skeleton

procedure Init
    End ← false
    initialize State, Clock
    schedule INIT
end procedure

procedure Simulation-Loop
    while End == false do
        Clock ← next event's time
        process next event
        Update Statistics
    end while
end procedure
Di-DyMeLoR Architecture
[Figure: the Application Level Software issues SetState(), malloc(), free(), realloc(), calloc(), calls to 3rd-party functions, and memory accesses towards the Simulation Platform; there, the Memory Map Manager and Allocator, the Third Party Library Wrapper, and the Update Tracker interact with the Log/Restore Subsystem and the Memory Recovery Subsystem through internal APIs such as set_current_lp(), dymelor_init(), take_full_log(), take_incremental_log(), state_restore(), and prune_log().]
Restore Operations
Comparison of Instrumenting Tools

Tool        Static  Dynamic  Features
ATOM        X       -        Alpha machines only
EEL         X       -        C++ interfaces to alter code
BIRD        X       -        Windows only, rewrites first 5 bytes of functions
PEBIL       X       -        Linux only, relocates the code
Dyninst     X       X        Mostly used for performance profiling and debugging
Pin         -       X        Just-in-time instrumentation
DynamoRIO   -       X        Powerful, multiplatform, must implement a runtime client
Valgrind    -       X        Powerful, but invasive; emulates the execution of programs
Hijacker    X       -        XML-based instrumentation, offers built-in features
Rule-based instrumentation
• Code is described in terms of functions, specific
(machine-dependent) instructions or instruction families
• Instrumentation entails adding code snippets at specific points, and injecting completely new functionalities
• Replacing particular instructions with a function call is as simple as:
<Inject file="mcopy.c"/>
<Instruction instruction="movs" replace="nop">
<AddCall where="after" function="mcopy"
arguments="target" convention="registers"/>
</Instruction>
Instruction Families
• I_MEMRD: the instruction reads from memory
• I_MEMWR: the instruction writes to memory
• I_CTRL: the instruction performs checks on data
• I_JUMP: the instruction alters the execution flow
• I_CALL: the instruction calls a different function
• I_RET: the instruction returns from a callee
• I_CONDITIONAL: the instruction is executed only if a condition is met
• I_STRING: the instruction operates on a large amount of data
• I_ALU: the instruction does some logical/arithmetic operation
• I_FPU: the instruction does some floating-point operation
• I_STACK: the instruction works on the stack
• I_INDIRECT: the instruction's behavior might depend on some runtime value
Hijacker's Architecture
[Figure: the Front-End (XML Parser) reads the XML Config File; the File Loader (Executable Formats Interpreters, Instruction Sets Disassemblers) turns the Input Relocatable Executable into the Internal Executable Representation; the Instrumentation Engine, driven by the Instrumentation Rule Manager, manipulates that representation; the File Writer / Back-End (Executable Formats Generators, Instruction Sets Assemblers) emits the Output Relocatable Executable.]
Full Log Model

OH_F = (S_F / χ_F) · δ_LB + P_r · (S_F · δ_RB + ((χ_F − 1) / 2) · δ_e)

δ_e   average event execution cost
S_F   average size of a full log
δ_LB  average cost for logging one byte of the state image (including metadata)
δ_RB  average cost for restoring one byte from the log
P_r   rollback probability (frequency of rollback occurrences over event executions)
χ_F   selected log interval when operating in non-incremental mode
Full Log Model
• The non-incremental overhead is minimized for:

χ_F = ⌈ √( (2 · δ_LB · S_F) / (P_r · δ_e) ) ⌉

• The optimal checkpointing interval is denoted as χ_F^opt
Incremental Log Model

OH_I = (S_P / χ_I) · δ_LB + ((S_F − S_P) / (χ_I · χ_I,F)) · δ_LB + P_r · (S_F · δ_RB + ((χ_I − 1) / 2) · (δ_e + δ_m)) + δ_m

S_P    average size of an incremental log
χ_I    selected log interval when operating in incremental mode
χ_I,F  interleave step between full and incremental logs
δ_m    per-event cost for running the memory-update tracking module
Incremental Log Model
• The incremental overhead is minimized for:

χ_I = ⌈ √( (2 · δ_LB · S_P) / (P_r · (δ_e + δ_m)) ) ⌉

• The optimal checkpointing interval is denoted as χ_I^opt
Integral Model
• The best-suited operating mode is selected using a cost function CF(χ_F^opt, χ_I^opt) defined as:

CF(χ_F^opt, χ_I^opt) = OH_F(χ_F^opt) − OH_I(χ_I^opt)

• The final choice is taken on the result of the integration of this cost function over a multi-dimensional domain defined by the values of the parameters {δ_e, δ_m, δ_LB, δ_RB, P_r, S_F, S_P}.
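A point-wise C sketch of the two optimal intervals and of the cost comparison (the integration over the multi-dimensional parameter domain is omitted; structure and names are illustrative):

#include <math.h>

typedef struct {
    double delta_e, delta_m, delta_LB, delta_RB; /* per-event / per-byte costs  */
    double Pr;                                   /* rollback probability        */
    double SF, SP;                               /* full / incremental log size */
    double chi_IF;                               /* full/incremental interleave */
} model_params_t;

static double chi_F_opt(const model_params_t *p) {
    return ceil(sqrt((2.0 * p->delta_LB * p->SF) / (p->Pr * p->delta_e)));
}

static double chi_I_opt(const model_params_t *p) {
    return ceil(sqrt((2.0 * p->delta_LB * p->SP) / (p->Pr * (p->delta_e + p->delta_m))));
}

static double OH_F(const model_params_t *p, double chi) {
    return (p->SF / chi) * p->delta_LB
         + p->Pr * (p->SF * p->delta_RB + ((chi - 1.0) / 2.0) * p->delta_e);
}

static double OH_I(const model_params_t *p, double chi) {
    return (p->SP / chi) * p->delta_LB
         + ((p->SF - p->SP) / (chi * p->chi_IF)) * p->delta_LB
         + p->Pr * (p->SF * p->delta_RB + ((chi - 1.0) / 2.0) * (p->delta_e + p->delta_m))
         + p->delta_m;
}

/* positive result => the incremental mode has the lower expected overhead */
double cost_function(const model_params_t *p) {
    return OH_F(p, chi_F_opt(p)) - OH_I(p, chi_I_opt(p));
}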
PCS: Throughput (variable τA)
[Figure: cumulated committed events vs wall-clock time (seconds), comparing Incremental (scattered state), Full (scattered state), No Monitor (scattered state), and Full (contiguous state).]
PCS: Throughput (fading recalculation)
[Figure: cumulated committed events vs wall-clock time (seconds), comparing Incremental (scattered state), Full (scattered state), No Monitor (scattered state), and Full (contiguous state).]
PCS: Memory Consumption
[Figure: memory occupancy (MB), split into log metadata, actual state log, and total, for the Full, Incremental, and No Monitor configurations with contiguous (C) or scattered (S) state, under the variable τA (V) and fading recalculation (F) workloads.]
PCS: Event Latency
[Figure: event latency (microseconds) vs call inter-arrival frequency, comparing the instrumented and non-instrumented versions.]
Simulation Throughput
[Figure: throughput at 1% output frequency, i.e. cumulated committed events vs wall-clock time (seconds), comparing With Daemon, Without Running Daemon, Without Subsystem, No printf, and Output in Event Queue.]
Non-Blocking Stack Throughput
Non-Blocking Linked List Throughput
Concurrent Allocator

procedure Allocate
    m ← generate_mark()
    slot ← first_node_free
    while true do
        alloc ← vers[slot].alloc
        if alloc ∨ ¬CAS(vers[slot].alloc, alloc, m) then
            slot ← next slot in circular policy
        else
            break
        end if
    end while
    atomically update first_node_free
    return slot
end procedure
Read Operation

procedure Read(addr, lvt)
    slot ← hash table's entry associated with addr
    if slot ∈ AccessSet then
        version ← AccessSet[slot]
    else
        version ← Find-Node(slot, lvt)
        AccessSet[slot] ← version
    end if
    return vers[version].value
end procedure
Write Operation

procedure Write(addr, lvt, val)
    slot ← hash table's entry associated with addr
    if slot ∈ AccessSet then
        version ← AccessSet[slot]
        vers[version].value ← val
    else
        version ← Insert-Version(slot, lvt, val)
        AccessSet[slot] ← version
    end if
end procedure
Output Message Rollback
• Upon a rollback, a ROLLBACK message is written to the logical device, piggybacking:
◦ a [from, to] interval
◦ the involved LP
◦ an era, a monotonic counter updated by the simulation kernel upon the execution of a rollback operation on a per-LP basis
• Calendar Queue's buckets are augmented with a Bloom filter, storing the eras of the messages they contain
• from and to are mapped to buckets
• A linear search is performed between the two buckets, checking only the ones which are expected to contain an element by the Bloom filter
Where do we go now?
• Full support for stateful libraries
• Support for any third-party library
• What if the modeler knows something about parallelism?
◦ What does it mean to call, e.g., pthread_create() in a handler?
• Specific optimizations for other programming languages
• Testing, testing, testing!