An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory
Jie Cai and Peter Strazdins
Research School of Computer Science
The Australian National University
ICPP 2012
Pittsburgh, PA, USA
Outline
• Introduction
• Background
• Related Work on Existing Prefetch Techniques
• Stride-augmented Run-length Encoding Method (sRLE)
• Dynamic Region-based Prefetch Technique
• Evaluation Results
• Conclusion
Introduction
• Software Distributed Shared Memory (sDSM) systems provide programming environments that enable the use of shared memory programming models such as OpenMP on clusters.
• sDSM systems inherit the good programmability of shared memory programming models.
  • Removing explicit control of data exchange from the programmer
• However, sDSM suffers from significant system overheads.
  • Prefetch techniques, which fit well with lazy release consistency (LRC), can be used to improve performance.
• Prefetch techniques for sDSM face two major challenges:
  • Applications' dynamic memory access patterns
  • Page misses caused by non-global synchronization operations
Introduction (Cont.)
• In this talk, we address the challenges of prefetch techniques for sDSM systems:
  • Reconstruct page miss records using the stride-augmented run-length encoding (sRLE) method
  • Designed a dynamic region-based prefetch (DReP) technique based on the sRLE'd records to predict and issue prefetches
  • Implemented into the only commercialized sDSM system, Intel Cluster OpenMP (CLOMP)
  • DReP and sRLE with CLOMP are evaluated using the NPB-OMP benchmark suite, LINPACK, and a memory consistency cost micro-benchmark (MCBENCH)
Background (1)
• Fork-join type shared memory programming models:
  • Regions are separated using global synchronizations, e.g. implicit and explicit barriers.
  • Region-executions are multiple executions of the same region when that region is enclosed in a loop.

[Figure: fork-join execution — a single thread forks into Thread0–Thread3 at each parallel region (implicit barriers at the start and end of each parallel region, explicit barriers within), alternating with Sequential Regions #1–#3 and Parallel Regions #1–#3.]
Background (2)
• sDSM memory consistency model:
  • Each process has a local view of the shared pages.
  • Global memory is managed in blocks (pages).
  • The shared pages are kept consistent via mprotect (please refer to the page state machine for details); a minimal sketch of this mechanism follows below.

[Figure: sDSM architecture — the shared memory programming model (OpenMP) runs over a virtual shared memory; each process keeps shared pages in its local memory and exchanges updates with other processes via MPI.]
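The mprotect-based mechanism can be illustrated with a minimal sketch (an assumed structure, not CLOMP's actual code): shared pages start inaccessible, and a SIGSEGV handler identifies the faulting page, records the miss, fetches the data from its home node, and lifts the protection. The helpers `record_page_miss` and `fetch_page_from_home` are hypothetical.

```c
/* Minimal sketch of mprotect-based page-miss detection, as used by
 * page-based sDSMs in general; names and structure are illustrative,
 * not CLOMP's actual implementation. */
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

/* Hypothetical helpers: record the miss for later encoding and fetch the
 * page contents from its home node (e.g. over MPI). */
void record_page_miss(size_t page);
void fetch_page_from_home(size_t page);

static char  *shared_base;    /* start of the virtual shared region */
static size_t shared_pages;   /* number of pages in the region */

static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)si->si_addr;
    if (addr < shared_base ||
        addr >= shared_base + shared_pages * PAGE_SIZE)
        abort();                               /* a real fault, not a DSM page miss */
    size_t page = (size_t)(addr - shared_base) / PAGE_SIZE;
    record_page_miss(page);                    /* feeds the per-region fault list */
    fetch_page_from_home(page);                /* bring the page up to date */
    mprotect(shared_base + page * PAGE_SIZE, PAGE_SIZE,
             PROT_READ | PROT_WRITE);          /* further accesses proceed locally */
}

static void install_dsm_handler(void)
{
    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);
    /* start with every shared page inaccessible so any access traps */
    mprotect(shared_base, shared_pages * PAGE_SIZE, PROT_NONE);
}
```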
Background (3)
• sDSM memory consistency costs:
  • The major sDSM system overhead is the memory consistency cost.
  • MCBENCH is an in-house developed micro-benchmark that measures this cost for different OpenMP implementations, including cluster-enabled OpenMPs (a sketch of the kind of kernel it exercises follows below).
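As a rough illustration of the kind of kernel such a benchmark exercises (this is an assumed sketch, not MCBENCH's source; the parameters a and c are taken from the evaluation setup later in the talk): threads repeatedly write c-byte chunks of an a-byte shared array, and chunk ownership is rotated between iterations so that data written by one thread is re-written by another, forcing consistency actions at each barrier.

```c
/* Hedged sketch of a memory consistency micro-benchmark kernel in the
 * spirit of MCBENCH (not its actual source).  With CHUNK = 4 many threads
 * write within the same page (extreme false sharing); with CHUNK = 4096
 * chunks align with pages (no false sharing). */
#include <omp.h>

#define A_SIZE (4u * 1024 * 1024)   /* a = 4 MB shared array */
#define CHUNK  4u                   /* c = 4 B or 4096u (= 4 KB) */

static char shared_data[A_SIZE];

void consistency_kernel(int iters)
{
    const unsigned nchunks = A_SIZE / CHUNK;
    for (int it = 0; it < iters; it++) {
        #pragma omp parallel
        {
            unsigned tid = (unsigned)omp_get_thread_num();
            unsigned nth = (unsigned)omp_get_num_threads();
            /* shift chunk ownership by one each iteration so every chunk
             * is touched by a different thread than last time */
            for (unsigned k = tid; k < nchunks; k += nth) {
                char *chunk = shared_data + ((k + it) % nchunks) * CHUNK;
                for (unsigned j = 0; j < CHUNK; j++)
                    chunk[j] = (char)it;
            }
        } /* implicit barrier: remotely modified pages must be made consistent */
    }
}
```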
Related Work
• Dynamic Aggregation (C. Amza et al. 1997)
  • Simple assumption of temporal paging behaviour before and after a barrier.
• B+ and Adaptive++ (R. Bianchini et al. 1996 & 1998)
  • B+: simple assumption of temporal paging behaviour before and after a barrier.
  • Adaptive++: assumes that page misses which occurred before a barrier, or even before the previous barrier, will occur again after the barrier.
• Third-order differential finite context method (TODFCM) (E. Speight et al. 2002)
  • Generic technique that prefetches a page when the three previous consecutive misses have been observed before.
Related Work (Cont.)
• Temporal region-based prefetch (TReP) technique (J. Cai et al. 2010)
  • Deployed the idea of regions and region-executions
  • Assumes page misses in the previous region-execution will occur in the current region-execution
  • Considered temporal paging behaviour between consecutive region-executions
• Hybrid region-based prefetch (HReP) technique (J. Cai et al. 2010)
  • Deployed the idea of regions and region-executions
  • Combined TReP and Adaptive++
  • Addressed temporal paging behaviour between consecutive region-executions and spatial paging behaviour within a region-execution
sRLE Method -- Observation
• LINPACK dynamic page access pattern with 4 processes
• Corresponding dynamic page miss pattern
sRLE Method
• Step (a): group sub-lists with a common stride.
• Step (b): encode the sub-lists into the first-level format:
  • (start page, stride, run length)
• Step (c): group consecutive encoded sub-lists with a common stride into the second-level encoding format:
  • (first-level encoded record, stride, run length)
• An ordinary page fault list can thus be converted to 2D fault regions with sRLE (see the encoding sketch below).
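A minimal sketch of the first-level encoding step (an illustrative data layout, not the paper's exact implementation); the second level repeats the same idea over the start pages of consecutive first-level entries.

```c
/* Sketch of first-level sRLE encoding: compress an ordered list of
 * faulting page IDs into (start_page, stride, run_length) entries.
 * For example, the fault list {3,5,7,9, 20,21,22} encodes to
 * {(3,2,4), (20,1,3)}. */
#include <stddef.h>

typedef struct {
    long start_page;   /* first page of the run */
    long stride;       /* distance between consecutive pages in the run */
    long run_len;      /* number of pages in the run */
} srle_entry;

/* Encode 'n' page IDs into 'out' (capacity >= n); returns entries used. */
static size_t srle_encode(const long *pages, size_t n, srle_entry *out)
{
    size_t m = 0;
    size_t i = 0;
    while (i < n) {
        long start  = pages[i];
        long stride = (i + 1 < n) ? pages[i + 1] - pages[i] : 0;
        long len    = 1;
        /* extend the run while the stride stays constant */
        while (i + (size_t)len < n &&
               pages[i + len] - pages[i + len - 1] == stride)
            len++;
        out[m++] = (srle_entry){ start, stride, len };
        i += (size_t)len;
    }
    return m;
}
```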
DReP Technique Designs
• All page fault records (per region) have been encoded twice with the sRLE method.
• Each record contains a list of second-level encoded entries (a possible layout is sketched below).
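A possible in-memory layout of these per-region records, assuming the srle_entry type from the earlier sketch. The field names follow the notation on the next slide (l1_en_col, run_len, p_list, bp_list), but the layout itself is an assumption.

```c
/* Illustrative layout of DReP's per-region history.  Each region keeps the
 * encoded fault lists of its two most recent executions so the prediction
 * step can compare them entry by entry. */
typedef struct {
    srle_entry l1_en_col;   /* first-level encoded column: (start page, stride, run length) */
    long       stride;      /* stride between consecutive first-level entries */
    long       run_len;     /* number of first-level entries in this group */
} srle2_entry;              /* second-level encoded entry */

typedef struct {
    int          region_id;       /* identifier of the parallel/sequential region */
    int          executions;      /* how many times this region has executed */
    srle2_entry *prev;            /* encoded faults of the previous execution (p_list) */
    size_t       prev_len;
    srle2_entry *before_prev;     /* encoded faults of the execution before that (bp_list) */
    size_t       before_prev_len;
} region_record;
```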
DReP Technique Designs (cont.)
• At the beginning of each region-execution, DReP predicts and prefetches pages:
  • If the region has not previously executed twice, no prefetch is issued.
  • Otherwise, every entry of the two most recent records (p_list and bp_list) is compared, and prefetches are issued ONLY for the following three cases (a code sketch of this comparison follows below):
    • Case 1: the entry is common to both lists — prefetch it directly.
    • Case 2: the strides and run lengths are common to both lists — predict a start page, and prefetch with the common stride and run length:
      pred.l1_en_col.start_page = p_list.l1_en_col.start_page + (p_list.l1_en_col.start_page − bp_list.l1_en_col.start_page)
    • Case 3: the strides are common and the run lengths are highly similar in both lists — predict a start page and the run lengths, then prefetch with the common strides:
      pred.l1_en_col.run_len = p_list.l1_en_col.run_len + (p_list.l1_en_col.run_len − bp_list.l1_en_col.run_len)
      pred.run_len = p_list.run_len + (p_list.run_len − bp_list.run_len)
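A sketch of the per-entry comparison, assuming the srle2_entry type from the earlier sketch. The helpers issue_prefetch and similar_run_len are hypothetical (the slide does not give the similarity threshold); the case structure follows the flowchart above.

```c
#include <stdbool.h>

void issue_prefetch(srle2_entry pred);   /* hypothetical: enqueue the predicted prefetch */
bool similar_run_len(long a, long b);    /* hypothetical "highly similar" test */

static void drep_predict(const srle2_entry *p, const srle2_entry *bp)
{
    srle2_entry pred = *p;

    if (p->l1_en_col.start_page == bp->l1_en_col.start_page &&
        p->l1_en_col.stride     == bp->l1_en_col.stride     &&
        p->l1_en_col.run_len    == bp->l1_en_col.run_len    &&
        p->stride == bp->stride && p->run_len == bp->run_len) {
        issue_prefetch(pred);            /* Case 1: entry common to both lists */
    } else if (p->l1_en_col.stride  == bp->l1_en_col.stride  &&
               p->l1_en_col.run_len == bp->l1_en_col.run_len &&
               p->stride == bp->stride && p->run_len == bp->run_len) {
        /* Case 2: strides and run lengths match; extrapolate the start page */
        pred.l1_en_col.start_page = p->l1_en_col.start_page +
            (p->l1_en_col.start_page - bp->l1_en_col.start_page);
        issue_prefetch(pred);
    } else if (p->l1_en_col.stride == bp->l1_en_col.stride &&
               p->stride == bp->stride &&
               similar_run_len(p->l1_en_col.run_len, bp->l1_en_col.run_len)) {
        /* Case 3: strides match and run lengths are similar; extrapolate both
         * the start page and the run lengths */
        pred.l1_en_col.start_page = p->l1_en_col.start_page +
            (p->l1_en_col.start_page - bp->l1_en_col.start_page);
        pred.l1_en_col.run_len = p->l1_en_col.run_len +
            (p->l1_en_col.run_len - bp->l1_en_col.run_len);
        pred.run_len = p->run_len + (p->run_len - bp->run_len);
        issue_prefetch(pred);
    }
    /* otherwise: no prefetch is issued for this entry */
}
```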
DReP Implementation
• DReP has been implemented into the Intel Cluster OpenMP runtime.
  • New region notification user interface (a usage sketch follows below):
    • KMP_USER_NOTIFY_NEW_REGION(1): 1 indicates this is a parallel region
    • KMP_USER_NOTIFY_NEW_REGION(0): 0 indicates this is a sequential region
  • Flush filtering solves the problem of a single page being missed multiple times within one region-execution by removing duplicated records.
  • The message header of the communication layer is enlarged to accommodate 128 page IDs, to leverage network bandwidth.
  • Each process first communicates with its right neighbour, which avoids network congestion.
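Only the two notification calls come from the slide; the declaration and the placement around an OpenMP loop below are assumptions for illustration.

```c
/* Usage sketch of the region notification interface described above. */
#include <omp.h>

/* Provided by the modified CLOMP runtime; the exact signature is assumed. */
void KMP_USER_NOTIFY_NEW_REGION(int is_parallel);

#define N 1024
static double a[N], b[N];

void step(void)
{
    KMP_USER_NOTIFY_NEW_REGION(1);     /* 1: entering a parallel region */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] += b[i];

    KMP_USER_NOTIFY_NEW_REGION(0);     /* 0: entering a sequential region */
    for (int i = 0; i < N; i++)
        b[i] = a[i] * 0.5;
}
```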
DReP Implementation (Cont.)
• DReP has been implemented into the Intel Cluster OpenMP runtime.
  • The page state machine has been updated with two newly introduced page states:
    • Prefetched_diff
    • Prefetched_page
Evaluation
• Experimental setup
  • Software and benchmarks
    • NPB-OMP suite
    • LINPACK OpenMP implementation (n=8196, nb=64)
    • MCBENCH (a = 4MB, c = 4B and 4KB)
  • Hardware platform
    • 8-node Intel cluster
    • Each node consists of 2 Intel E5472 3.0 GHz CPUs
    • 16 GB memory
    • Gigabit Ethernet
    • DDR InfiniBand
Efficiency and Coverage
• Nf: total number of page faults
• Np: number of prefetches
• Nu: number of useful prefetches, Nu = Nf × C
• C = Nu/Nf: coverage
• E = Nu/Np: efficiency
• Bold font represents best results
Efficiency and Coverage (Cont.)
• MCBENCH: DReP vs TReP and HReP
  • c = 4B: extreme false sharing
  • c = 4KB: no false sharing
• Bold font represents best results
Memory Consistency Cost
• Measured using MCBENCH, a = 4MB, c = 4B and 4KB
  • c = 4B: extreme false sharing (cost reduced by ~86%)
  • c = 4KB: no false sharing
Memory Consistency Cost (Cont.)
• LINPACK OpenMP implementation with n=8196 and nb=64
  • DReP is presented as a reduction rate relative to the original CLOMP implementation, i.e. (Orig − DReP)/Orig.
Memory Consistency Cost (Cont.)
• NPB-OMP
  • Rates are presented as an average over each class from A to C.
Overhead Analysis of DReP
• NPB-OMP IS.C
  • Tsegv: total memory consistency cost in seconds for original CLOMP and DReP-enabled CLOMP
  • TMK Comm (% of Tsegv): communication time spent in the DSM layer of CLOMP (TMK)
  • TMK local (% of Tsegv): the local software overhead of the TMK layer
  • DReP Comm (% of Tsegv): communication cost of data prefetching
  • DReP local (% of Tsegv): the local software cost introduced by DReP
  • Communication costs are further broken down into costs for transferring diffs and pages.
Conclusions
• With the assistance of sRLE, DReP accurately analyses the paging behaviour of applications exhibiting both static and dynamic memory access patterns, such as NPB-OMP and LINPACK.
• On average over NPB and LINPACK, DReP improves efficiency by 34% and coverage by 47% relative to existing prefetch techniques; in detail:
  • 55% and 5% better efficiency compared to Adaptive++ and TODFCM; 55% and 44% better coverage compared to Adaptive++ and TODFCM
  • 47% and 30% better efficiency compared to TReP and HReP; and 56% and 34% better coverage compared to TReP and HReP
• DReP dramatically reduces the memory consistency cost: by 86% for the false sharing scenario, and by ~45% and ~38% for LINPACK and NPB on GigE and IB respectively.
• A detailed breakdown analysis showed that DReP introduces only ~2% overhead.
Acknowledgement
• Australian Research Council Grant LP0669726
• ANU CECS Faculty Research Grant
• Intel Corp.
• Sun Microsystems
• NCI National Facility / ANU Supercomputer Facility