NUMA(YEY)
BY JACOB KUGLER
© Copyright 2015 EMC Corporation. All rights reserved.
MOTIVATION
• The next generation of EMC VPLEX hardware is NUMA-based
– What is the expected performance benefit?
– How do we best adapt the code to NUMA?
• Gain experience with NUMA tools
VPLEX OVERVIEW
• A unique virtual storage technology that enables:
– Data mobility and high availability within and between data centers.
– Mission-critical continuous availability between two synchronous sites.
– Distributed RAID1 between 2 sites.
UMA OVERVIEW – CURRENT STATE
Uniform Memory Access
[Diagram: six CPUs (CPU0–CPU5) sharing a single RAM with uniform access latency]
NUMA OVERVIEW – NEXT GENERATION
Non-Uniform Memory Access
[Diagram: two NUMA nodes; NODE0 contains CPU0–CPU2 with its own local RAM, NODE1 contains CPU3–CPU5 with its own local RAM, and the two nodes are connected by an interconnect]
POLICIES
• Allocation of memory on specific nodes
• Binding threads to specific nodes/CPUs
• Can be applied to:
– Process
– Memory area
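As a rough illustration of these two scopes, the sketch below uses the libnuma API (introduced later in this deck); the node number, buffer size, and use of malloc are arbitrary assumptions for a 2-node system, not VPLEX code.

/* Minimal sketch: a NUMA policy applied to the whole process/thread vs. to one
 * memory area. Assumes a 2-node system; build with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    /* Thread/process scope: interleave new memory allocations across all nodes. */
    numa_set_interleave_mask(numa_all_nodes_ptr);

    /* Memory-area scope: bind the pages of one specific buffer to node 0. */
    size_t len = 4UL << 20;                    /* 4 MB, arbitrary */
    char *buf = malloc(len);
    numa_tonode_memory(buf, len, 0);

    free(buf);
    return 0;
}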
DEFAULT POLICY
[Diagram: under the default (local) policy, a thread running on Node 0 allocates from and accesses Node 0's RAM, and a thread running on Node 1 allocates from and accesses Node 1's RAM]
BIND/PREFERRED POLICY
[Diagram: under the bind/preferred policy, memory is allocated on one chosen node (Node 0 here); the thread running on Node 0 accesses it locally, while the thread running on Node 1 accesses it remotely]
INTERLEAVE POLICY
[Diagram: under the interleave policy, pages alternate between Node 0 and Node 1, so each running thread performs a mix of local and remote memory accesses]
NUMACTL
• Command-line tool for running a program under a specific NUMA policy.
• Useful for programs that cannot be modified or recompiled.
NUMACTL EXAMPLES
• numactl --cpubind=0 --membind=0,1 <program>
Run the program on node 0 and allocate memory from nodes 0 and 1.
• numactl --interleave=all <program>
Run the program with memory interleaved across all available nodes.
LIBNUMA
• A library that offers an API for NUMA policy.
• Fine-grained tuning of NUMA policies.
– Changing the policy in one thread does not affect other threads.
LIBNUMA EXAMPLES
• numa_available() – checks whether NUMA is supported on the system.
• numa_run_on_node(int node) – binds the current thread to a specific node.
• numa_max_node() – returns the number of the highest node in the system.
• numa_alloc_interleaved(size_t size) – allocates size bytes of memory, page-interleaved across all available nodes.
• numa_alloc_onnode(size_t size, int node) – allocates memory on a specific node.
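A minimal sketch showing how these calls fit together; the node numbers and sizes are arbitrary, and error handling is kept to the essentials (build with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {                 /* is NUMA supported here? */
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    printf("highest node: %d\n", numa_max_node());

    numa_run_on_node(0);                        /* bind the current thread to node 0 */

    /* 1 MB placed on node 0 (local to the thread we just bound) ... */
    char *local  = numa_alloc_onnode(1 << 20, 0);
    /* ... and 1 MB page-interleaved across all available nodes. */
    char *spread = numa_alloc_interleaved(1 << 20);

    memset(local,  0, 1 << 20);                 /* touch the pages so they are actually placed */
    memset(spread, 0, 1 << 20);

    numa_free(local,  1 << 20);
    numa_free(spread, 1 << 20);
    return 0;
}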
HARDWARE OVERVIEW
[Diagram: two processor nodes connected by the QuickPath Interconnect (QPI); each node has six cores (cpu0–cpu5, two hyper-threads each) with private L1 and L2 caches, a shared L3 cache, and its own local RAM]
HARDWARE OVERVIEW
Processor             Intel Xeon E5-2620
# Cores               6
# Threads             12
QPI speed             8.0 GT/s = 64 GB/s
L1 data cache         32 KB
L1 instruction cache  32 KB
L2 cache              256 KB
L3 cache              15 MB
RAM                   62.5 GB

(GB/s = GT/s × bus width of 8 B; GT/s = gigatransfers per second)
LINUX PERF TOOL
• Command-line profiler
• Based on perf_events
– Hardware events – counted by the CPU
– Software events – counted by the kernel
• perf list – lists the pre-defined events (to be used with -e):
– instructions              [Hardware event]
– context-switches OR cs    [Software event]
– L1-dcache-loads           [Hardware cache event]
– rNNN                      [Raw hardware event descriptor]
PERF STAT
• Keeps a running count of selected events during process execution.
• perf stat [options] -e [list of events] <program> <args>
Examples:
– perf stat -e page-faults my_exec
Counts the page faults that occurred during the execution of my_exec.
– perf stat -a -e instructions,r81d0 sleep 5
System-wide count on all CPUs for 5 seconds; counts instructions and L1 d-cache loads.
CHARACTERIZING OUR SYSTEM
• Linux perf tool
• CPU Performance counters
– L1-dcache-loads
– L1-dcache-stores
• Test: ran IO for 120 seconds
• Result: RD/WR ratio = 2:1
THE SIMULATOR
• Measures performance for different memory allocation policies on a 2-node system.
• Throughput is measured as the time it takes to complete N iterations.
• Threads randomly access a shared memory region.
THE SIMULATOR CONT.
Config file:
• #Threads
• RD/WR ratio – ratio between the number of read and write operations a thread performs
• Policy – local / interleave / remote
• Size – the size of memory to allocate
• #Iterations
• Node0/Node1 – ratio between threads bound to Node 0 and threads bound to Node 1
• RW_SIZE – size of each read or write operation in an iteration
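The simulator's code is not shown in this deck; purely as a hedged sketch of what one worker thread's measurement loop might look like under these parameters (names such as worker and RW_SIZE are illustrative, not the actual implementation):

/* Hedged sketch of one simulator thread, not the actual simulator.
 * The shared region is assumed to be pre-allocated with the configured
 * policy (local / interleave / remote); the RD/WR ratio here is 2:1. */
#include <numa.h>
#include <string.h>

#define RW_SIZE 128                              /* bytes per read/write op */

static void worker(char *shared, size_t size, long iterations, unsigned seed)
{
    char buf[RW_SIZE] = {0};

    for (long i = 0; i < iterations; i++) {
        seed = seed * 1103515245u + 12345u;      /* simple LCG for a random offset */
        size_t off = ((size_t)(seed >> 8) % (size / RW_SIZE)) * RW_SIZE;

        if (i % 3 != 2)                          /* two reads for every write */
            memcpy(buf, shared + off, RW_SIZE);  /* read from the shared region */
        else
            memcpy(shared + off, buf, RW_SIZE);  /* write to the shared region */
    }
}

int main(void)
{
    size_t size = 150UL << 20;                   /* 150 MB, as in the experiments */
    char *shared = numa_alloc_interleaved(size); /* "interleave" policy, for example */

    worker(shared, size, 1000000, 42);           /* single short run for illustration */

    numa_free(shared, size);
    return 0;
}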
EXPERIMENT #1
• Compare the performance of 3 policies:
– Local – threads access memory on the node they run on.
– Remote – threads access memory on a different node from the one they run on.
– Interleave – memory is interleaved across both nodes (threads access both local and remote memory). See the allocation sketch below.
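As a rough sketch (not the simulator's actual code), the three placements could be produced with libnuma like this, for threads bound to run_node on a 2-node system:

#include <numa.h>
#include <stddef.h>
#include <string.h>

/* Allocate the shared region for threads that were bound to 'run_node'
 * with numa_run_on_node(run_node), according to the configured policy. */
static void *alloc_for_policy(const char *policy, size_t size, int run_node)
{
    int other = (run_node == 0) ? 1 : 0;          /* the other node of the pair   */

    if (strcmp(policy, "local") == 0)
        return numa_alloc_onnode(size, run_node); /* same node the threads run on */
    if (strcmp(policy, "remote") == 0)
        return numa_alloc_onnode(size, other);    /* the opposite node            */
    return numa_alloc_interleaved(size);          /* pages spread over both nodes */
}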
EXPERIMENT #1
• 3 policies – local, interleave, remote.
• #Threads varies from 1 to 24 (the maximum number of concurrent threads in the system)
• 2 setups – balanced/unbalanced workload
[Diagram: balanced vs. unbalanced distribution of threads across the two nodes]
EXPERIMENT #1
Configurations:
• #Iterations = 100,000,000
• Data size = 2 * 150 MB
• RD/WR ratio = 2:1
• RW_SIZE = 128 Bytes
RESULTS - BALANCED WORKLOAD
[Chart: time it took until the last thread finished working, for the local, interleave, and remote policies under the balanced workload; labeled deltas: +83%, +69%, -37%, -46%]
RESULTS - UNBALANCED WORKLOAD
[Chart: time it took until the last thread finished working, for the local, interleave, and remote policies under the unbalanced workload; labeled deltas: +87%, +73%, -45%, -35%]
RESULTS - COMPARED
[Chart: balanced and unbalanced workload results compared for the local, remote, and interleave policies]
CONCLUSIONS
• The more concurrent threads there are in the system, the more impact memory locality has on performance.
• In applications where the number of concurrent threads is at most the number of cores in one node, the best option is to bind the process to a node and allocate memory on that same node.
• In applications where the number of concurrent threads is at most the number of cores in the 2-node system, disabling NUMA (interleaving memory) performs about as well as binding the process and allocating memory on the same node.
EXPERIMENT #2
• Local access is significantly faster than remote access.
• Our system uses RW locks to synchronize memory access.
Does maintaining read locality by mirroring the data on both nodes perform better than the current interleave policy?
EXPERIMENT #2
• Purpose: find the RD/WR ratio for which maintaining read locality is better than memory interleaving.
• Setup 1: Interleaving
– Single RW lock
– Data is interleaved across both nodes
• Setup 2: Mirroring the data
– One RW lock per node
– Each read operation accesses local memory only.
– Each write operation updates both the local and the remote copy (see the sketch below).
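A hedged sketch of the mirroring setup, assuming per-node pthread RW locks and one copy of the data per node (names and structure are illustrative, not the simulator's actual code):

#include <numa.h>
#include <pthread.h>
#include <string.h>

#define NODES 2

static char            *copy[NODES];                     /* one data copy per node */
static pthread_rwlock_t lock[NODES] = { PTHREAD_RWLOCK_INITIALIZER,
                                        PTHREAD_RWLOCK_INITIALIZER };

/* Allocate each node's copy on that node (e.g. 150 MB each, as in the experiment). */
static void init_copies(size_t size)
{
    for (int n = 0; n < NODES; n++)
        copy[n] = numa_alloc_onnode(size, n);
}

/* Reads stay local: take the local node's read lock and read the local copy. */
static void read_op(int my_node, size_t off, char *out, size_t len)
{
    pthread_rwlock_rdlock(&lock[my_node]);
    memcpy(out, copy[my_node] + off, len);
    pthread_rwlock_unlock(&lock[my_node]);
}

/* Writes update every copy, taking the write locks in a fixed node order. */
static void write_op(size_t off, const char *in, size_t len)
{
    for (int n = 0; n < NODES; n++) {
        pthread_rwlock_wrlock(&lock[n]);
        memcpy(copy[n] + off, in, len);
        pthread_rwlock_unlock(&lock[n]);
    }
}

The interleaving setup is the simpler baseline: one interleaved region (e.g. via numa_alloc_interleaved) guarded by a single RW lock.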
EXPERIMENT #2
Configurations:
• #Iterations = 25,000,000
• Data size = 2 * 150 MB
• RD/WR ratio = 12:i, for 1 ≤ i ≤ 12
• #Threads = 8; 12
• RW_SIZE = 512; 1024; 2048; 4096 Bytes
RW LOCKS – MIRRORING VS. INTERLEAVING
8 threads – mirroring wins for % write ops in the range:

IO size (B)   % write ops
512           0 – unbounded
1024          0 – 40%
2048          0 – 27%
4096          0 – 12%
RW LOCKS – MIRRORING VS. INTERLEAVING
12 threads – mirroring wins for % write ops in the range:

IO size (B)   % write ops
512           0 – unbounded
1024          0 – 50%
2048          0 – 25%
4096          0 – 7.7%
CONCLUSIONS
• Memory-op size and the percentage of write operations both play a role in deciding which memory allocation policy is better.
• In applications with a small mem-op size (512 B) and up to 50% write operations, mirroring is the better option.
• In applications with a mem-op size of 4 KB or more, mirroring is worse than interleaving the memory and using a single RW lock.
SUMMARY
• Fine-grained memory allocation can lead to a performance improvement for certain workloads.
– More investigation is needed to configure a suitable memory policy that exploits NUMA's capabilities.