NVMe over Fabrics
Wael Noureddine, Ph.D., Chelsio Communications Inc.
2015 Storage Developer Conference. © Chelsio Communications Inc. All Rights Reserved.

Overview
- Introduction
- NVMe/Fabrics and iWARP
- RDMA Block Device and results
- Closing notes

NVMe Devices
- Accelerating pace of progress
  - Capacity
  - Reliability
  - Efficiency
  - Performance: latency, bandwidth, IOPS (random access)
- Source: Samsung keynote, Flash Memory Summit, Santa Clara, August 2015

A Decade of NIC Evolution
- 2003: 7 Gbps aggregate, PCI-X, 1 physical function, 1 DMA queue, line interrupt, 1 logical NIC, lock-based synchronization
- 2007: 11 Gbps bidirectional, PCIe Gen1, 1 physical function, 8 DMA queues, MSI interrupts, 8 logical NICs, up to 8 cores
- 2013: 55 Gbps bidirectional, PCIe Gen3, 8 physical functions, 64K DMA queues, MSI-X interrupts, 128 logical NICs, lock-free user space I/O

NVMe – Non-Volatile Memory Express
- Modern PCIe interface architecture
  - Benefits from experience with other high performance devices
  - Scales with exponential speed increases
- Standardized for enhanced plug-and-play
  - Register set, feature set, command set
- Parallels to the high performance NIC interface
  - 64K deep queues
  - MSI-X
  - Lock-free operation
  - Security
  - Data integrity protection

iWARP RDMA over Ethernet
(Figure: payload and CQ notifications move between host memory on each end through Unified Wire RNICs, over RDMA/TCP/IP/Ethernet frames, with end-to-end CRC and TCP/IP processed in hardware.)
- IETF standard (2007); open, transparent and mature
- Plug-and-play
  - Regular switches and the same network appliances; any switches and routers
  - Works with DCB but does NOT require it; no need for lossless configuration
  - No network restrictions
- Standard and proven TCP/IP
  - Built-in reliability: congestion control and flow control
  - Natively routable
  - Cost effective
- LAN/data center and WAN/long haul
  - Any architecture, scale, distance, RTT and link speed
- Hardware performance
  - TCP/IP and exception processing in hardware
  - Ultra low latency, high packet rate and bandwidth

NVMe over Fabrics
- Addresses PCIe reach and scalability limits
- Extends native NVMe over a large fabric
- iWARP RDMA over Ethernet is the first proof point
  - Intel NVMe drives, Chelsio 40GbE iWARP
- Compared to local disk
  - Less than 8 usec additional round-trip latency
  - Same IOPS performance

NVMe/Ethernet Fabric with iWARP
(Figure: Windows SMB3 clients, Linux iSCSI clients and NVMe-optimized clients reach NVM storage through any switch/router over an Ethernet front-end fabric; an NVMe/Fabrics appliance front end connects to NVM storage and remote replication over an Ethernet back-end fabric, with T5 Unified Adapters on both sides.)
- T5 supports iSCSI, SMB Direct and NVMe/Fabrics offload simultaneously
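Because iWARP runs over ordinary TCP/IP, a fabric like the one above can be sanity-checked with standard RDMA tooling before any storage protocol is layered on top. The following is an illustrative sketch only, not from the deck: it assumes the rdma-core utilities (ibv_devinfo from libibverbs-utils, rping from librdmacm-utils) are installed and uses 192.168.1.10 as a placeholder for the target's iWARP interface address.

    # On either node: confirm the RNIC is visible to the RDMA stack
    ibv_devinfo

    # On the target side: start an rping responder
    rping -s -a 192.168.1.10 -v

    # On the initiator side: run a short RDMA ping-pong across the fabric
    rping -c -a 192.168.1.10 -C 10 -v

This exchange traverses regular switches and routers with no DCB or lossless configuration, which is the plug-and-play point made above.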
T5 Storage Protocol Support
(Figure: T5 hardware provides RDMA, iSCSI, FCoE and TCP offload with T10-DIX; above it sit the NIC, RDMA, iSCSI and FCoE drivers over a lower layer driver, serving SMB/NBD/NFS/RPC through the network driver, SMB Direct through NDK, NVMe/Fabrics and RBD through the RDMA driver, and iSER/iSCSI/FCoE through the OS block IO stack.)

NVMe/Fabrics Layering
(Figure: on the initiator, the Linux file system and block IO layers feed the kernel NVMe device driver and the NVMe/Fabrics layer, which uses RDMA verbs and the RDMA driver over the T5 NIC's RDMA and TCP offload; on the target, the kernel NVMe/Fabrics layer uses RDMA verbs and the RDMA driver over the T5 NIC and hands IOs to the NVMe device driver and NVMe drive.)

RBD Layering
(Figure: on the initiator, the Linux file system and block IO layers sit on the RDMA block driver, which uses RDMA verbs and the RDMA driver over the T5 NIC; on the target, the RDMA block driver resubmits IOs through the Linux block IO layer to the local block device driver and its NVMe or SAS drive.)

Generic RDMA Block Device (RBD)
- Open source Chelsio Linux driver
  - Part of Chelsio's Unified Wire software
  - Virtualizes block devices over an RDMA connection
  - Managed via the rbdctl command
- Target
  - Claims the physical device using blkdev_get_by_path()
  - Submits IOs arriving from the initiator via submit_bio()
- Initiator
  - Registers as a BIO device driver
  - Registers connected target devices as BIO devices using add_disk()
  - Provides a character device command interface for adding/removing devices
  - A simple command, rbdctl, is used to connect, log in, and register the target devices
- Current RBD protocol limits
  - Maximum RBD IO size is 128KB
  - Up to 256 outstanding IOs in flight
  - Initiator physical/logical sector size is 4KB
- Examples (a combined setup walk-through is sketched after the latency results below)
  - Target: load the rbdt module
  - Initiator: load the rbdi module
  - Add a new target: rbdctl -n -a <ipaddr> -d /dev/nvme0n1
  - Remove an RBD device: rbdctl -r -d /dev/rbdi0
  - List connected devices: rbdctl -l

RBD vs. NBD Throughput – 40GbE
(Figure: throughput in Gbps vs. I/O size from 4K to 1M for RBD_Read, NBD_Read, RBD_Write and NBD_Write.)
- Test setup: Intel Xeon E5-1660 v2 6-core processor at 3.70GHz (HT enabled) with 64 GB RAM, Chelsio T580-CR, RHEL 6.5 (kernel 3.18.14), standard 1500B MTU, four RAMdisks on the target
- fio --name=<test_type> --iodepth=32 --rw=<test_type> --size=800m --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --group_reporting --ioengine=libaio --numjobs=4 --bs=<block_size> --runtime=30 --time_based --filename=/dev/rbdi0

RBD vs. NBD Latency – 40GbE
(Figure: latency in microseconds vs. I/O size from 4K to 128K for RBD_Read, NBD_Read, RBD_Write and NBD_Write.)
- Test setup: Intel Xeon E5-1660 v2 6-core processor at 3.70GHz (HT enabled) with 64 GB RAM, Chelsio T580-CR, RHEL 6.5 (kernel 3.18.14), standard 1500B MTU, one RAMdisk on the target
- fio --name=<test_type> --iodepth=1 --rw=<test_type> --size=800m --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --group_reporting --ioengine=libaio --numjobs=1 --bs=<block_size> --runtime=30 --time_based --filename=/dev/rbdi0
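Pulling the rbdctl examples and the latency fio command together, a plausible end-to-end run on an initiator/target pair might look like the sketch below. The rbdctl options and fio parameters are taken from the slides; the use of modprobe for "load the rbdt/rbdi module", the <target_ip> placeholder, and the 4k block size are assumptions added for illustration.

    # Target: load the RBD target module (module name as given on the RBD slide)
    modprobe rbdt

    # Initiator: load the RBD initiator module and attach the remote NVMe namespace
    modprobe rbdi
    rbdctl -n -a <target_ip> -d /dev/nvme0n1     # exposes the target drive as /dev/rbdi0
    rbdctl -l                                    # confirm the device is connected

    # Initiator: run the slide's random-read latency test against the RBD device
    fio --name=randread --rw=randread --iodepth=1 --numjobs=1 --bs=4k \
        --size=800m --direct=1 --invalidate=1 --fsync_on_close=1 \
        --norandommap --group_reporting --ioengine=libaio \
        --runtime=30 --time_based --filename=/dev/rbdi0

    # Detach when done
    rbdctl -r -d /dev/rbdi0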
SSD Latency Results
(Figure: random read latency in microseconds vs. I/O size from 4KB to 256KB, local NVMe SSD vs. RBD.)
- Test setup: Supermicro motherboard, 16 cores (Intel Xeon E5-2687W 0 @ 3.10GHz), 64GB memory, Chelsio T580-CR, RHEL 6.5 with a linux-3.18.14 kernel, standard 1500B MTU, through an Ethernet switch, NVMe SSD target
- fio-2.1.10, 1 thread, direct IO, 30 second runs, libaio IO engine, IO depth = 1, random read, write, and read-write

SSD Bandwidth Results
(Figure: random read throughput in Gbps vs. I/O size from 4KB to 256KB with one thread, local NVMe SSD vs. RBD.)
- Test setup: Supermicro motherboard, 16 cores (Intel Xeon E5-2687W 0 @ 3.10GHz), 64GB memory, Chelsio T580-CR, RHEL 6.5 with a linux-3.18.14 kernel, standard 1500B MTU, through an Ethernet switch, SSD target
- fio-2.1.10, 1 thread, direct IO, 30 second runs, libaio IO engine, IO depth = 32, random read, write, and read-write

Local vs. RBD Bandwidth – 2 SSDs
(Figure: read and write throughput in Gbps vs. I/O size from 4KB to 256KB, local vs. RBD, with two NVMe SSDs.)
- Test setup: Intel Xeon E5-2687W 0 @ 3.10GHz, 2x6 cores, 64GB RAM, 2x NVMe SSDs, Chelsio T580-CR, CentOS Linux 7.1.1503 with kernel 3.17.8, standard 1500B MTU, through a 40Gb switch
- fio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --group_reporting --name=foo --bs=$2 --size=800m --numjobs=8 --ioengine=libaio --iodepth=16 --rw=randwrite --runtime=30 --time_based --filename_format=/dev/rbdi\$jobnum --directory=/
- The $2 and $jobnum references indicate a wrapper script; a hypothetical sketch follows the conclusions below

RDMA over Fabrics Evolution
- Yesterday: RNIC and SSDs are separate PCIe devices; data is DMAed to and from host memory and the CPU handles command transfer
- Today: peer-to-peer DMA between RNIC and SSDs across the PCIe bus; no host memory use
- Tomorrow: network-integrated NVMe storage; no CPU or host memory involvement

Conclusions
- NVMe over Fabrics enables scalable, high performance, high efficiency fabrics that leverage disruptive solid-state storage
- Standard interfaces and middleware: ease of use and widespread support
- iWARP RDMA is ideal for NVMe/Fabrics
  - Familiar and proven foundation
  - Plug-and-play: works with DCB but does not require it; no DCB tax or complexity
  - Scalability, reliability, routability: any switches and any routers over any distance
  - Scales from 40 to 100Gbps and beyond
- NVMe/Fabrics devices with built-in processing power
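The two-SSD fio command quoted above references $2 and escapes $jobnum, which suggests it was run from a small wrapper script that takes the block size as its second argument and lets each fio job target its own /dev/rbdi<N> device. A hypothetical sketch of such a wrapper follows; the script name run_rbd_bw.sh, the parameterized --rw argument, and the block-size sweep are assumptions, while the fio options themselves are copied from the slide.

    #!/bin/bash
    # run_rbd_bw.sh (hypothetical): $1 = fio rw mode, $2 = block size.
    # Assumes the /dev/rbdi<N> devices are already connected with rbdctl
    # as shown in the setup walk-through earlier.
    fio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --group_reporting \
        --name=foo --bs=$2 --size=800m --numjobs=8 --ioengine=libaio --iodepth=16 \
        --rw=$1 --runtime=30 --time_based \
        --filename_format=/dev/rbdi\$jobnum --directory=/

A sweep over the I/O sizes plotted on the charts would then be a simple loop around it:

    for bs in 4k 8k 16k 32k 64k 128k 256k; do ./run_rbd_bw.sh randwrite $bs; done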