The ‘zero-copy’ initiative

A look at the ‘zero-copy’ concept and an x86 Linux implementation for the case of outgoing packets

From Wikipedia, the free encyclopedia:

Zero-copy is an adjective that refers to computer operations in which the CPU does not perform the task of copying data from one area of memory to another.

The availability of zero-copy versions of operating system elements such as device drivers, file systems and network protocol stacks greatly increases the performance of many applications, since using a CPU that is capable of complex operations just to make copies of data can be a great waste of resources.

Zero-copy also reduces the number of context-switches from User space to Kernel space and vice-versa. Several OS like Linux support zero copying of files through specific API's like sendfile, sendfile64, etc.

Techniques for creating zero-copy software include the use of DMA-based copying, and memory-mapping through an MMU. These features require specific hardware support and usually involve particular memory alignment requirements.

Zero-copy protocols are especially important for high-speed networks, as memory copies would cause a serious workload for the host cpu. Still, such protocols have some initial overhead so that avoiding programmed IO (PIO) there only makes sense for large messages.
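The Wikipedia excerpt names ‘sendfile’ as one of the zero-copy APIs that Linux offers to applications. As a minimal sketch, assuming a Linux system where both descriptors are acceptable to sendfile(2), a helper along the following lines moves data from one file descriptor to another without passing it through a user-space buffer. The helper name ‘send_fd_contents’ is hypothetical and not part of our driver code.

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to move up to 'count' bytes from
 * in_fd to out_fd.  The data never enters a user-space buffer.
 * Returns the number of bytes moved, or -1 on error. */
ssize_t send_fd_contents(int out_fd, int in_fd, size_t count)
{
    size_t sent = 0;

    while (sent < count) {
        /* A NULL offset means: use (and advance) in_fd's file position. */
        ssize_t n = sendfile(out_fd, in_fd, NULL, count - sent);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            break;              /* end of input reached early */
        sent += (size_t)n;
    }
    return (ssize_t)sent;
}
```

An application would open the source file, obtain a destination descriptor (typically a socket), and call the helper with the number of bytes it wants transferred.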
Application source-code

    #include <stdio.h>      // for printf(), perror()
    #include <stdlib.h>     // for exit()
    #include <string.h>     // for strlen()
    #include <fcntl.h>      // for open(), O_RDWR
    #include <unistd.h>     // for write()

    char message[] = "This is a test of network-packet transmission \n";

    int main( void )
    {
        // open the network interface's device-file
        int fd = open( "/dev/nic", O_RDWR );
        if ( fd < 0 ) { perror( "/dev/nic" ); exit(1); }

        // hand the message to the driver for transmission
        int msglen = strlen( message );
        int nbytes = write( fd, message, msglen );
        if ( nbytes < 0 ) { perror( "write" ); exit(1); }

        printf( "Transmitted %d bytes \n", nbytes );
    }

Transmit operation

[Diagram: in user space, the application program passes its user data-buffer to write() in the runtime library. In kernel space, the Linux OS kernel's file subsystem calls the nic device-driver's my_write(), which uses copy_from_user() to copy the data into a kernel packet buffer. The hardware then fetches the packet buffer by DMA.]

We want to eliminate this copying-operation

Our driver’s packet-layout

[Diagram: the packet-buffer in kernel-space holds, in order: destn-address | source-address | TYPE/LENGTH | count | -- data --]

Format for Legacy Transmit-Descriptor (16 bytes), fields in order:

    base-address (64-bits) | PacketLength | CSO | cmd | status | CSS | special

Can zero-copy be transparent?

• We would like to implement the zero-copy concept in our ‘nic2.c’ character driver in such a manner that no changes would be required to an ‘application’ program’s code
• We will show how to do this for ‘outgoing’ packets (i.e., by modifying ‘my_write()’), but achieving zero-copy with ‘incoming’ packets would be a lot more complicated!

TX Descriptor’s CMD byte

Command-Byte Format

    bit:    7     6     5     4     3     2     1      0
           IDE   VLE    0     0     RS    IC   IFCS   EOP

    EOP = End-Of-Packet (1=yes, 0=no)
    RS = Report Status (1=yes, 0=no)
    VLE = VLAN-tag Enable

Key question: What will the NIC do if we don’t set the EOP-bit in a TX Descriptor?
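The descriptor format and the command-byte bits described above can be written down as C declarations. This is only a sketch under stated assumptions: the slides give the 64-bit base-address and the 16-byte total, while the widths chosen here for the remaining fields (a 16-bit length, four single-byte fields, a 16-bit ‘special’) are an assumption that happens to add up to 16 bytes. The names ‘struct tx_descriptor’, ‘tx_descriptor_fill()’ and the ‘TXCMD_’ constants are hypothetical.

```c
#include <stdint.h>

/* CMD-byte bits, at the positions shown in the Command-Byte Format */
#define TXCMD_EOP   0x01    /* End-Of-Packet                  */
#define TXCMD_IFCS  0x02
#define TXCMD_IC    0x04
#define TXCMD_RS    0x08    /* Report Status                  */
#define TXCMD_VLE   0x40    /* VLAN-tag Enable                */
#define TXCMD_IDE   0x80

/* Legacy Transmit-Descriptor; field widths other than the 64-bit
 * base-address are assumed (see the lead-in above). */
struct tx_descriptor {
    uint64_t base_address;   /* physical address of the buffer */
    uint16_t packet_length;  /* number of bytes in that buffer */
    uint8_t  cso;
    uint8_t  cmd;            /* TXCMD_* bits                   */
    uint8_t  status;
    uint8_t  css;
    uint16_t special;
};

_Static_assert(sizeof(struct tx_descriptor) == 16,
               "a legacy TX descriptor occupies 16 bytes");

/* Hypothetical helper: describe one buffer and clear the other fields. */
static inline void tx_descriptor_fill(struct tx_descriptor *d,
                                      uint64_t phys_addr,
                                      uint16_t length, uint8_t cmd)
{
    d->base_address  = phys_addr;
    d->packet_length = length;
    d->cso     = 0;
    d->cmd     = cmd;
    d->status  = 0;          /* the NIC writes status back here */
    d->css     = 0;
    d->special = 0;
}
```

With declarations like these, leaving TXCMD_EOP out of the ‘cmd’ argument is a one-character decision in the driver, which is what makes the key question worth asking.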
Splitting our packet-layout

[Diagram: the packet-buffer in kernel-space is viewed as two parts: a header of HDR bytes (destn-address, source-address, TYPE/LENGTH, count) followed by LEN bytes of data.]

Format for Legacy Transmit-Descriptor Pair:

    Descriptor 1:  base-address (64-bits) | PacketLength (=HDR) | CSO | cmd | status | CSS | special    EOP=0
    Descriptor 2:  base-address (64-bits) | PacketLength (=LEN) | CSO | cmd | status | CSS | special    EOP=1

Splitting our packet-buffer

[Diagram: the HDR-byte header (destn-address, source-address, TYPE/LENGTH, count) stays in a packet-buffer in kernel-space, while the LEN bytes of data stay in a packet-buffer in user-space.]

Format for Legacy Transmit-Descriptor Pair:

    Descriptor 1:  base-address (64-bits) | PacketLength (=HDR) | CSO | cmd | status | CSS | special    EOP=0
    Descriptor 2:  base-address (64-bits) | PacketLength (=LEN) | CSO | cmd | status | CSS | special    EOP=1

Two physical packet-buffers comprise one logical packet that gets transmitted!

Transmitting a ‘split-packet’

The 82573L controller ‘merges’ the contents of these separate buffers into just a single ethernet-packet

[Diagram: the NIC hardware uses DMA twice: once to fetch the packet-header buffer from the device-driver module in kernel-space, and once to fetch the packet-data buffer from the application-program in user-space.]

The ‘virt_to_phys()’ macro

• Linux provides a convenient macro which kernel-module code can employ to obtain the physical-address for a memory-region from its virtual-address, but it only works for addresses that aren’t in ‘high’ memory
• For ‘normal’ memory-regions, conversion between ‘virtual’ and ‘physical’ addresses amounts to a simple addition/subtraction

Linux memory-mapping

[Diagram: the CPU’s virtual address-space (user space below, kernel space above) shown beside physical RAM. The first 896-MB of physical RAM has a persistent mapping into kernel space; the HMA above it is reached only through transient mappings.]

There is more physical RAM in our classroom’s systems than can be ‘mapped’ into the available address-range for kernel virtual addresses

Two-Level Translation Scheme

[Diagram: CR3 points to the PAGE DIRECTORY, whose entries point to PAGE TABLES, whose entries point to PAGE FRAMES.]

Linear to Physical

[Diagram: a linear address is split into dir-index, table-index and offset. CR3 locates the page directory, dir-index selects a page table, table-index selects a page frame, and offset selects the byte within that frame in the physical address-space.]

Address-translation

• The CPU examines any virtual address it encounters, subdividing it into three fields

    bits 31-22   index into page-directory (10-bits)   This field selects one of the 1024 array-entries in the Page-Directory
    bits 21-12   index into page-table (10-bits)       This field selects one of the 1024 array-entries in that Page-Table
    bits 11-0    offset into page-frame (12-bits)      This field provides the offset to one of the 4096 bytes in that Page-Frame

Format of a Page-Table entry

    bits:   31 ... 12                 11-9    8   7   6   5    4     3    2   1   0
            PAGE-FRAME BASE ADDRESS   AVAIL   0   0   D   A   PCD   PWT   U   W   P

    LEGEND
    P = Present (1=yes, 0=no)
    W = Writable (1=yes, 0=no)
    U = User (1=yes, 0=no)
    A = Accessed (1=yes, 0=no)
    D = Dirty (1=yes, 0=no)
    PWT = Page Write-Through (1=yes, 0=no)
    PCD = Page Cache-Disable (1=yes, 0=no)

Finding the user-buffer’s PFN

• To program the ‘base-address’ field in the second TX-Descriptor, our driver’s ‘write()’ function will need to know which physical Page-Frame the application’s buffer lies in
• And its PFN (Page-Frame Number) can be found from its virtual address by ‘walking-the-cpu-page-tables’, even when Linux puts some page-tables in ‘high’ memory

Performing ‘virt_to_phys()’

    ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
    {
        unsigned int  _cr3, *pgdir, *pgtbl, pfn_pgtbl, pfn_frame;
        unsigned int  dindex, pindex, offset;

        // take apart the virtual-address of the user's 'buf' variable
        dindex = ((unsigned long)buf >> 22) & 0x3FF;  // pgdir-index (10-bits)
        pindex = ((unsigned long)buf >> 12) & 0x3FF;  // pgtbl-index (10-bits)
        offset = ((unsigned long)buf >>  0) & 0xFFF;  // frame-offset (12-bits)

        // then walk the CPU's paging-tables to get buf's physical-address
        asm( " mov %%cr3, %%eax \n mov %%eax, %0 " : "=m"(_cr3) : : "ax" );
        pgdir = (unsigned int *)phys_to_virt( _cr3 & ~0xFFF );
        pfn_pgtbl = (pgdir[ dindex ] >> 12);
        // the page-table may lie in 'high' memory, so map it temporarily
        pgtbl = (unsigned int *)kmap( &mem_map[ pfn_pgtbl ] );
        pfn_frame = (pgtbl[ pindex ] >> 12);
        kunmap( &mem_map[ pfn_pgtbl ] );

        // the second descriptor points directly at the user's data
        txring[ txtail + 1 ].base_address = (pfn_frame << 12) + offset;
        // ...
    }

Can’t cross a ‘page-boundary’

• In order for the NIC to fetch the user’s data using its Bus-Master DMA capability, the buffer ‘buf’ needs to reside in a physically contiguous memory-region
• But we can’t be sure Linux will have set up the CPU’s page-tables that way, unless the ‘buf’ is confined to a single page-frame

Truncate ‘len’ if necessary

    ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
    {
        // 'offset' is buf's offset within its page-frame (low 12 bits of buf)
        if ( offset + len > PAGE_SIZE ) len = PAGE_SIZE - offset;
        // ...
    }

[Diagram: ‘buf’ begins ‘offset’ bytes into one PAGE_SIZE frame and its ‘len’ bytes run past the end of that frame into the next one.]

‘zerocopy.c’

• We created this modification of our ‘nic2.c’ device-driver so its ‘my_write()’ function lets an application perform transmissions without performing a memory-to-memory copy-operation (i.e., ‘copy_from_user()’)
• It is not so easy to implement ‘zero-copy’ for receiving packets. Can you say why?

Website article

• We’ve posted a link on our CS686 website to a frequently cited research-article about the various issues that arise when trying to implement the ‘zero-copy’ concept for the case of ‘incoming’ network-packets: The Need for Asynchronous, Zero-Copy Network I/O, by Ulrich Drepper, Red Hat, Inc.
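The address arithmetic that ‘my_write()’ relies on (splitting a 32-bit linear address into its three fields, rebuilding a physical address from a PFN and an offset, and truncating ‘len’ at the page boundary) can be tried out in an ordinary user-space program, away from the kernel. The sketch below assumes 4096-byte pages and the two-level x86 scheme described in these slides; the names ‘split_linear()’, ‘frame_to_phys()’, ‘clamp_to_page()’ and ‘ZC_PAGE_SIZE’ are hypothetical and do not appear in ‘zerocopy.c’.

```c
#include <stddef.h>
#include <stdint.h>

#define ZC_PAGE_SIZE 4096u   /* assumed page size (12 offset bits) */

/* The three fields of a 32-bit linear address. */
struct linear_parts {
    uint32_t dindex;   /* bits 31-22: index into page-directory */
    uint32_t pindex;   /* bits 21-12: index into page-table     */
    uint32_t offset;   /* bits 11-0 : offset into page-frame    */
};

static inline struct linear_parts split_linear(uint32_t va)
{
    struct linear_parts p;
    p.dindex = (va >> 22) & 0x3FF;
    p.pindex = (va >> 12) & 0x3FF;
    p.offset = va & 0xFFF;
    return p;
}

/* Rebuild a physical address from a Page-Frame Number and an offset. */
static inline uint32_t frame_to_phys(uint32_t pfn, uint32_t offset)
{
    return (pfn << 12) + offset;
}

/* Shorten 'len' so the buffer does not run past the end of its page. */
static inline size_t clamp_to_page(uint32_t offset, size_t len)
{
    if (offset + len > ZC_PAGE_SIZE)
        len = ZC_PAGE_SIZE - offset;
    return len;
}
```

For example, the address 0x0804A123 splits into directory index 0x20, table index 0x4A and offset 0x123, and a 5000-byte write starting at that offset would be truncated to 3805 bytes, leaving the application to issue another ‘write()’ for the remainder.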