NUMA(YEY)
BY JACOB KUGLER
© Copyright 2015 EMC Corporation. All rights reserved.

MOTIVATION
• Next generation of EMC VPLEX hardware is NUMA based
  – What is the expected performance benefit?
  – How do we best adjust the code to NUMA?
• Gain experience with NUMA tools

VPLEX OVERVIEW
• A unique virtual storage technology that enables:
  – Data mobility and high availability within and between data centers.
  – Mission-critical continuous availability between two synchronous sites.
  – Distributed RAID1 between 2 sites.

UMA OVERVIEW – CURRENT STATE
• Uniform Memory Access
• [Diagram: six CPUs (CPU0–CPU5) sharing a single RAM.]

NUMA OVERVIEW – NEXT GENERATION
• Non-Uniform Memory Access
• [Diagram: two nodes (Node 0 and Node 1), each with its own CPUs and its own local RAM.]

POLICIES
• Allocation of memory on specific nodes
• Binding threads to specific nodes/CPUs
• Can be applied to:
  – Process
  – Memory area

DEFAULT POLICY
• [Diagram: a running thread on each node allocates from and accesses its own node's local memory.]

BIND/PREFERRED POLICY
• [Diagram: memory is bound to one node; a thread running on that node accesses it locally, while a thread on the other node accesses it remotely.]

INTERLEAVE POLICY
• [Diagram: a running thread's memory is spread across both nodes, so its accesses are a mix of local and remote.]

NUMACTL
• Command line tool for running a program under a specific NUMA policy.
• Useful for programs that cannot be modified or recompiled.

NUMACTL EXAMPLES
• numactl --cpubind=0 --membind=0,1 <program>
  – Run the program on node 0 and allocate its memory from nodes 0 and 1.
• numactl --interleave=all <program>
  – Run the program with memory interleaved across all available nodes.

LIBNUMA
• A library that offers an API for NUMA policy.
• Fine-grained tuning of NUMA policies.
  – Changing the policy in one thread does not affect other threads.

LIBNUMA EXAMPLES
• numa_available() – checks whether NUMA is supported on the system (returns -1 if it is not).
• numa_run_on_node(int node) – binds the current thread to a specific node.
• numa_max_node() – returns the number of the highest node in the system.
• numa_alloc_interleaved(size_t size) – allocates size bytes of memory, page-interleaved across all available nodes.
• numa_alloc_onnode(size_t size, int node) – allocates memory on a specific node.
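
The deck stops at the API list; the following is a minimal sketch (not from the original slides) of how these calls fit together, assuming a 2-node machine, an arbitrary 150 MB buffer, and node 0 as the target node. Build with something like gcc numa_example.c -lnuma.

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* Bail out if the kernel or hardware has no NUMA support. */
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return EXIT_FAILURE;
        }

        int highest = numa_max_node();      /* highest node number, e.g. 1 on a 2-node box */
        printf("Highest NUMA node: %d\n", highest);

        /* Bind the current thread to node 0 (arbitrary choice for the example). */
        if (numa_run_on_node(0) != 0)
            perror("numa_run_on_node");

        /* 150 MB allocated on node 0 (local policy)... */
        size_t size = 150UL * 1024 * 1024;
        char *local = numa_alloc_onnode(size, 0);
        /* ...and the same amount page-interleaved across all nodes. */
        char *spread = numa_alloc_interleaved(size);
        if (!local || !spread) {
            fprintf(stderr, "libnuma allocation failed\n");
            return EXIT_FAILURE;
        }

        memset(local, 0, size);             /* touch the pages so they are actually placed */
        memset(spread, 0, size);

        numa_free(local, size);
        numa_free(spread, size);
        return EXIT_SUCCESS;
    }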
HARDWARE OVERVIEW
• [Diagram: two nodes connected by the Intel QuickPath Interconnect (QPI). Each node has 6 cores, each with 2 hyper-threads and private L1/L2 caches, a shared L3 cache, and local RAM.]

HARDWARE OVERVIEW
• Processor: Intel Xeon E5-2620
• # Cores: 6
• # Threads: 12
• QPI speed: 8.0 GT/s = 64 GB/s (bus bandwidth in GB/s = gigatransfers per second × 8 B per transfer)
• L1 data cache: 32 KB
• L1 instruction cache: 32 KB
• L2 cache: 256 KB
• L3 cache: 15 MB
• RAM: 62.5 GB

LINUX PERF TOOL
• Command line profiler
• Based on perf_events
  – Hardware events – counted by the CPU
  – Software events – counted by the kernel
• perf list – lists the pre-defined events (to be used with -e), for example:
  – instructions [Hardware event]
  – context-switches OR cs [Software event]
  – L1-dcache-loads [Hardware cache event]
  – rNNN [Raw hardware event descriptor]

PERF STAT
• Keeps a running count of selected events during process execution.
• perf stat [options] -e [list of events] <program> <args>
• Examples:
  – perf stat -e page-faults my_exec
    Counts the page faults that occur during the execution of my_exec.
  – perf stat -a -e instructions,r81d0 sleep 5
    System-wide count on all CPUs for 5 seconds; counts instructions and L1 dcache loads (raw event r81d0).

CHARACTERIZING OUR SYSTEM
• Linux perf tool
• CPU performance counters
  – L1-dcache-loads
  – L1-dcache-stores
• Test: ran IO for 120 seconds
• Result: RD/WR ratio = 2:1

THE SIMULATOR
• Measures performance for different memory allocation policies on a 2-node system.
• Throughput is measured as the time it takes to complete N iterations.
• Threads randomly access a shared memory.

THE SIMULATOR CONT.
Config file:
• #Threads
• RD/WR ratio – ratio between the number of read and write operations a thread performs
• Policy – local / interleave / remote
• Size – the size of memory to allocate
• #Iterations
• Node0/Node1 – ratio between threads bound to Node 0 and threads bound to Node 1
• RW_SIZE – size of the read or write operation in each iteration
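
The simulator's source is not part of the deck; the sketch below is a simplified reconstruction under stated assumptions: a hard-coded config (12 threads split evenly between the two nodes, a 2:1 read/write mix, 128-byte operations, an interleaved shared buffer), workers bound with numa_run_on_node(), and no timing or config-file parsing. Names such as worker_fn are illustrative, not the simulator's actual code. Build with something like gcc -pthread numa_sim.c -lnuma.

    #include <numa.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define N_THREADS   12
    #define ITERATIONS  1000000UL
    #define DATA_SIZE   (150UL * 1024 * 1024)
    #define RW_SIZE     128                     /* bytes per read or write */

    static char *shared;                        /* buffer placed according to the chosen policy */

    struct worker { int node; unsigned seed; };

    static void *worker_fn(void *arg)
    {
        struct worker *w = arg;
        char scratch[RW_SIZE];
        memset(scratch, 0, sizeof scratch);

        numa_run_on_node(w->node);              /* bind this thread to its node */

        for (unsigned long i = 0; i < ITERATIONS; i++) {
            size_t off = (size_t)rand_r(&w->seed) % (DATA_SIZE - RW_SIZE);
            if (i % 3 == 2)                     /* RD/WR = 2:1 -> every third op is a write */
                memcpy(shared + off, scratch, RW_SIZE);
            else
                memcpy(scratch, shared + off, RW_SIZE);
        }
        return NULL;
    }

    int main(void)
    {
        if (numa_available() < 0)
            return EXIT_FAILURE;

        /* Policy "interleave"; "local"/"remote" would use numa_alloc_onnode() instead. */
        shared = numa_alloc_interleaved(DATA_SIZE);
        memset(shared, 0, DATA_SIZE);

        pthread_t tid[N_THREADS];
        struct worker w[N_THREADS];
        for (int i = 0; i < N_THREADS; i++) {
            /* Balanced setup: even/odd threads go to Node 0/Node 1. */
            w[i] = (struct worker){ .node = i % 2, .seed = (unsigned)i + 1 };
            pthread_create(&tid[i], NULL, worker_fn, &w[i]);
        }
        for (int i = 0; i < N_THREADS; i++)
            pthread_join(tid[i], NULL);

        numa_free(shared, DATA_SIZE);
        return EXIT_SUCCESS;
    }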
EXPERIMENT #1
• Compare the performance of 3 policies:
  – Local – threads access memory on the node they run on.
  – Remote – threads access memory on a different node from the one they run on.
  – Interleave – memory is interleaved across the nodes (threads access both local and remote memory).

EXPERIMENT #1
• 3 policies – local, interleave, remote.
• #Threads varies from 1 to 24 (the maximal number of concurrent threads in the system).
• 2 setups – balanced/unbalanced workload.
• [Diagram: balanced vs. unbalanced placement of threads across the two nodes.]

EXPERIMENT #1
Configurations:
• #Iterations = 100,000,000
• Data size = 2 * 150 MB
• RD/WR ratio = 2:1
• RW_SIZE = 128 bytes

RESULTS - BALANCED WORKLOAD
• [Chart: time until the last thread finished working, per policy and thread count; annotations: +83%, +69%, -37%, -46%.]

RESULTS - UNBALANCED WORKLOAD
• [Chart: time until the last thread finished working, per policy and thread count; annotations: +87%, +73%, -45%, -35%.]

RESULTS - COMPARED
• [Chart: local, remote and interleave policies compared for the balanced and unbalanced workloads.]

CONCLUSIONS
• The more concurrent threads in the system, the more impact memory locality has on performance.
• In applications where the number of concurrent threads is at most the number of cores in one node, the best solution is to bind the process and allocate its memory on the same node.
• In applications where the number of concurrent threads is at most the number of cores in the 2-node system, disabling NUMA (interleaving memory) performs about the same as binding the process and allocating memory on the same node.

EXPERIMENT #2
• Local access is significantly faster than remote access.
• Our system uses RW locks to synchronize memory access.
• Does maintaining read locality by mirroring the data on both nodes perform better than the current interleave policy?

EXPERIMENT #2
• Purpose: find the RD/WR ratio for which maintaining read locality is better than memory interleaving.
• Setup 1: Interleaving
  – Single RW lock
  – Data is interleaved across both nodes
• Setup 2: Mirroring the data (see the code sketch at the end of this deck)
  – RW lock per node
  – Each read operation accesses local memory.
  – Each write operation is applied to both the local and the remote copy.

EXPERIMENT #2
Configurations:
• #Iterations = 25,000,000
• Data size = 2 * 150 MB
• RD/WR ratio = 12 : i, where 1 <= i <= 12
• #Threads = 8; 12
• RW_SIZE = 512; 1024; 2048; 4096 bytes

RW LOCKS – MIRRORING VS. INTERLEAVING (8 THREADS)
Percentage of write operations for which mirroring wins, by IO size:
• 512 B: 0% – unbounded (mirroring always wins)
• 1024 B: 0 – 40%
• 2048 B: 0 – 27%
• 4096 B: 0 – 12%

RW LOCKS – MIRRORING VS. INTERLEAVING (12 THREADS)
Percentage of write operations for which mirroring wins, by IO size:
• 512 B: 0% – unbounded (mirroring always wins)
• 1024 B: 0 – 50%
• 2048 B: 0 – 25%
• 4096 B: 0 – 7.7%

CONCLUSIONS
• Memory-operation size and the percentage of write operations both play a role in deciding which memory allocation policy is better.
• In applications with a small memory-operation size (512 B) and up to 50% write operations, mirroring is the better option.
• In applications with a memory-operation size of 4 KB or more, mirroring is worse than interleaving the memory and using a single RW lock.

SUMMARY
• Fine-grained memory allocation can lead to performance improvements for certain workloads.
  – More investigation is needed in order to configure a suitable memory policy that utilizes NUMA's abilities.
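
As referenced in the Experiment #2 setup slide, here is a minimal sketch (not part of the original deck) of the mirroring scheme in Setup 2: one copy of the data and one RW lock per node, reads served from the local copy under the local read lock, writes applied to every copy under all the write locks. The sizes, names, and the use of pthread rwlocks are assumptions for illustration. Build with something like gcc -pthread numa_mirror.c -lnuma.

    #include <numa.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define N_NODES    2
    #define DATA_SIZE  (150UL * 1024 * 1024)
    #define RW_SIZE    512                     /* bytes per memory operation */

    static char *copy[N_NODES];                /* one mirror of the data per node */
    static pthread_rwlock_t lock[N_NODES] = {  /* one RW lock per node */
        PTHREAD_RWLOCK_INITIALIZER, PTHREAD_RWLOCK_INITIALIZER
    };

    /* Read RW_SIZE bytes at off: take the local read lock and read the local mirror. */
    static void mirrored_read(int node, size_t off, char *out)
    {
        pthread_rwlock_rdlock(&lock[node]);
        memcpy(out, copy[node] + off, RW_SIZE);
        pthread_rwlock_unlock(&lock[node]);
    }

    /* Write RW_SIZE bytes at off: take every node's write lock and update every mirror. */
    static void mirrored_write(size_t off, const char *in)
    {
        for (int n = 0; n < N_NODES; n++)
            pthread_rwlock_wrlock(&lock[n]);
        for (int n = 0; n < N_NODES; n++)
            memcpy(copy[n] + off, in, RW_SIZE);
        for (int n = 0; n < N_NODES; n++)
            pthread_rwlock_unlock(&lock[n]);
    }

    int main(void)
    {
        if (numa_available() < 0)
            return EXIT_FAILURE;

        /* Place each mirror on its own node, so reads are always local. */
        for (int n = 0; n < N_NODES; n++) {
            copy[n] = numa_alloc_onnode(DATA_SIZE, n);
            memset(copy[n], 0, DATA_SIZE);
        }

        /* Example: a thread bound to node 1 reads locally and writes to both mirrors. */
        numa_run_on_node(1);
        char buf[RW_SIZE] = {0};
        mirrored_read(1, 4096, buf);
        mirrored_write(4096, buf);

        for (int n = 0; n < N_NODES; n++)
            numa_free(copy[n], DATA_SIZE);
        return EXIT_SUCCESS;
    }

The per-node write locks are taken in a fixed order so concurrent writers cannot deadlock. Every write still has to touch both mirrors and hold both locks, which is consistent with the deck's finding that mirroring loses its advantage as the write percentage and the operation size grow.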