The ‘zero-copy’ initiative
A look at the ‘zero-copy’ concept
and an x86 Linux implementation
for the case of outgoing packets
From Wikipedia, the free encyclopedia:
Zero-copy is an adjective that refers to computer operations in which the
CPU does not perform the task of copying data from one area of memory
to another.
The availability of zero-copy versions of operating system elements such
as device drivers, file systems and network protocol stacks greatly increases
the performance of many applications, since using a CPU that is capable of
complex operations just to make copies of data can be a great waste of
resources. Zero-copy also reduces the number of context-switches from
user space to kernel space and vice versa. Several operating systems, such as
Linux, support zero copying of files through specific APIs such as sendfile and sendfile64.
Techniques for creating zero-copy software include the use of DMA-based
copying, and memory-mapping through an MMU. These features require
specific hardware support and usually involve particular memory alignment
requirements.
Zero-copy protocols are especially important for high-speed networks, as
memory copies would cause a serious workload for the host CPU. Still, such
protocols have some initial overhead, so avoiding programmed I/O (PIO)
only makes sense for large messages.
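Application source-code
#include <stdio.h>      // for printf(), perror()
#include <stdlib.h>     // for exit()
#include <string.h>     // for strlen()
#include <fcntl.h>      // for open(), O_RDWR
#include <unistd.h>     // for write()

char message[] = "This is a test of network-packet transmission \n";

int main( void )
{
    int fd = open( "/dev/nic", O_RDWR );
    if ( fd < 0 ) { perror( "/dev/nic" ); exit(1); }
    int msglen = strlen( message );
    int nbytes = write( fd, message, msglen );
    if ( nbytes < 0 ) { perror( "write" ); exit(1); }
    printf( "Transmitted %d bytes \n", nbytes );
}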
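Transmit operation
[Diagram: in user space, the application program passes its user data-buffer
to write() in the runtime library. In kernel space, the Linux OS kernel's file
subsystem calls my_write() in the nic device-driver, which uses copy_from_user()
to copy the data into a packet buffer. The hardware then fetches the packet
buffer by DMA.]
We want to eliminate this copying-operation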
Our driver’s packet-layout
packet-buffer in kernel-space:
  | destn-address | source-address | TYPE/LENGTH | count | -- data -- data -- data -- |
Format for Legacy Transmit-Descriptor (16 bytes):
  | base-address (64-bits) |
  | PacketLength | CSO | cmd | status | CSS | special |
Can zero-copy be transparent?
• We would like to implement the zero-copy
concept in our ‘nic2.c’ character driver in
such a manner that no changes would be
required to an ‘application’ program’s code
• We will show how to do this for ‘outgoing’
packets (i.e., by modifying ‘my_write()’),
but achieving zero-copy with ‘incoming’
packets would be a lot more complicated!
TX Descriptor’s CMD byte
Command-Byte Format
  bit:    7     6     5     4     3     2     1      0
  name:  IDE   VLE    0     0     RS    IC   IFCS   EOP
EOP = End-Of-Packet (1=yes, 0=no)
RS = Report Status (1=yes, 0=no)
VLE = VLAN-tag Enable
Key question: What will the NIC do if we don’t set the EOP-bit in a TX Descriptor?
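The command-byte layout above can be written as C bit-masks. In this sketch the constant names and the helper 'make_cmd()' are our own (hypothetical), and setting RS only on a packet's final descriptor is an assumption about how a driver might use these bits, not something the slides prescribe.

```c
#include <stdint.h>

/* Bit-masks for the TX descriptor's command byte (names are illustrative). */
enum {
    CMD_EOP  = 1 << 0,    /* End-Of-Packet          */
    CMD_IFCS = 1 << 1,
    CMD_IC   = 1 << 2,
    CMD_RS   = 1 << 3,    /* Report Status          */
    CMD_VLE  = 1 << 6,    /* VLAN-tag Enable        */
    CMD_IDE  = 1 << 7
};

/* Build the command byte for one descriptor of a packet: only the
   packet's final descriptor gets the EOP-bit (and, here, the RS-bit). */
uint8_t make_cmd( int is_last_descriptor )
{
    uint8_t cmd = 0;

    if ( is_last_descriptor ) cmd |= CMD_EOP | CMD_RS;
    return cmd;
}
```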
Splitting our packet-layout
packet-buffer in kernel-space:
  | destn-address | source-address | TYPE/LENGTH | count | -- data -- data -- data -- |
  HDR = length of the header portion, LEN = length of the data portion
Format for Legacy Transmit-Descriptor Pair:
  1st: | base-address (64-bits) | PacketLength (=HDR) | CSO | cmd (EOP=0) | status | CSS | special |
  2nd: | base-address (64-bits) | PacketLength (=LEN) | CSO | cmd (EOP=1) | status | CSS | special |
Splitting our packet-buffer
packet-buffer in kernel-space (HDR bytes):
  | destn-address | source-address | TYPE/LENGTH | count |
packet-buffer in user-space (LEN bytes):
  | -- data -- data -- data -- |
Format for Legacy Transmit-Descriptor Pair:
  1st: | base-address (64-bits) | PacketLength (=HDR) | CSO | cmd (EOP=0) | status | CSS | special |
  2nd: | base-address (64-bits) | PacketLength (=LEN) | CSO | cmd (EOP=1) | status | CSS | special |
Two physical packet-buffers comprise one logical packet that gets transmitted!
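Here is a sketch of how a driver might fill in such a descriptor pair. The structure, the constants, and the function 'fill_descriptor_pair()' are our own illustrative names, and the physical addresses are simply passed in as numbers, so this compiles and runs in user space without touching any hardware.

```c
#include <stdint.h>
#include <string.h>

typedef struct {                  /* illustrative 16-byte descriptor */
    uint64_t base_address;
    uint16_t packet_length;
    uint8_t  cso, cmd, status, css;
    uint16_t special;
} tx_descriptor;

#define CMD_EOP 0x01              /* End-Of-Packet */
#define CMD_RS  0x08              /* Report Status */

/* Describe one logical packet using two physical buffers:
   pair[0] covers the header (EOP=0), pair[1] covers the data (EOP=1). */
void fill_descriptor_pair( tx_descriptor pair[ 2 ],
                           uint64_t hdr_phys,  uint16_t hdr_len,
                           uint64_t data_phys, uint16_t data_len )
{
    memset( pair, 0, 2 * sizeof( tx_descriptor ) );

    pair[ 0 ].base_address  = hdr_phys;
    pair[ 0 ].packet_length = hdr_len;
    pair[ 0 ].cmd           = 0;                  /* more of this packet follows */

    pair[ 1 ].base_address  = data_phys;
    pair[ 1 ].packet_length = data_len;
    pair[ 1 ].cmd           = CMD_EOP | CMD_RS;   /* this buffer ends the packet */
}
```

Leaving the EOP-bit clear in the first descriptor is what tells the NIC to keep going and treat the next descriptor's buffer as part of the same packet.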
Transmitting a ‘split-packet’
The 82573L controller ‘merges’ the
contents of these separate buffers
into just a single ethernet-packet
[Diagram: the NIC hardware uses DMA twice, once to fetch the packet-data buffer
from the application-program in user-space, and once to fetch the packet-header
buffer from the device-driver module in kernel-space.]
The ‘virt_to_phys()’ macro
• Linux provides a convenient macro which
kernel-module code can employ to obtain
the physical-address for a memory-region
from its virtual-address – but it only works
for addresses that aren’t in ‘high’ memory
• For ‘normal’ memory-regions, conversion
between ‘virtual’ and ‘physical’ addresses
amounts to a simple addition/subtraction
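The 'simple addition/subtraction' can be sketched in ordinary user-space C. We assume here the customary 32-bit x86 Linux arrangement in which 'normal' memory is mapped starting at virtual address 0xC0000000. The function names are our own, chosen so they do not clash with the kernel's real macros.

```c
#include <stdint.h>

/* Assumed start of the kernel's persistent mapping (32-bit x86 default). */
#define SKETCH_PAGE_OFFSET 0xC0000000u

/* Kernel virtual-address to physical-address: subtract the fixed offset. */
uint32_t sketch_virt_to_phys( uint32_t vaddr )
{
    return vaddr - SKETCH_PAGE_OFFSET;
}

/* Physical-address to kernel virtual-address: add the fixed offset. */
uint32_t sketch_phys_to_virt( uint32_t paddr )
{
    return paddr + SKETCH_PAGE_OFFSET;
}
```

This arithmetic is meaningless for a user-space address such as the application's 'buf', which is why the driver cannot simply apply the macro to it.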
Linux memory-mapping
[Diagram: the CPU’s virtual address-space is divided into user space and
kernel space. The first 896-MB of physical RAM has a persistent mapping
into kernel space; the HMA above it is reached through transient mappings.]
There is more physical RAM
in our classroom’s systems
than can be ‘mapped’ into
the available address-range
for kernel virtual addresses
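Two-Level Translation Scheme
[Diagram: CR3 points to the PAGE DIRECTORY, whose entries point to
PAGE TABLES, whose entries point to PAGE FRAMES.]
Linear to Physical
[Diagram: a linear address consists of a dir-index, a table-index, and an
offset. CR3 locates the page directory, the dir-index selects a page table,
the table-index selects a page frame in the physical address-space, and the
offset selects a byte within that page frame.]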
Address-translation
• The CPU examines any virtual address it
encounters, subdividing it into three fields
  bits 31-22 (10-bits): index into page-directory
    This field selects one of the 1024 array-entries in the Page-Directory
  bits 21-12 (10-bits): index into page-table
    This field selects one of the 1024 array-entries in that Page-Table
  bits 11-0 (12-bits): offset into page-frame
    This field provides the offset to one of the 4096 bytes in that Page-Frame
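The subdivision described above is just shifting and masking. A small sketch follows; the structure and function names are our own.

```c
#include <stdint.h>

typedef struct {
    unsigned dindex;    /* bits 31-22: index into page-directory */
    unsigned pindex;    /* bits 21-12: index into page-table     */
    unsigned offset;    /* bits 11-0:  offset into page-frame    */
} va_fields;

/* Take apart a 32-bit virtual address into its three fields. */
va_fields split_virtual_address( uint32_t va )
{
    va_fields f;

    f.dindex = ( va >> 22 ) & 0x3FF;
    f.pindex = ( va >> 12 ) & 0x3FF;
    f.offset = va & 0xFFF;
    return f;
}
```

For example, the address 0x08049F14 splits into directory-index 32, table-index 73, and offset 0xF14.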
Format of a Page-Table entry
  bits 31-12: PAGE-FRAME BASE ADDRESS
  bits 11-9:  AVAIL
  bit 8: 0    bit 7: 0
  bit 6: D    bit 5: A    bit 4: PCD    bit 3: PWT
  bit 2: U    bit 1: W    bit 0: P
LEGEND
P = Present (1=yes, 0=no)
W = Writable (1 = yes, 0 = no)
U = User (1 = yes, 0 = no)
A = Accessed (1 = yes, 0 = no)
D = Dirty (1 = yes, 0 = no)
PWT = Page Write-Through (1=yes, 0 = no)
PCD = Page Cache-Disable (1 = yes, 0 = no)
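A page-table entry can be decoded with a few masks that mirror the legend. This is a sketch with our own (hypothetical) names; a real kernel module would normally use the kernel's own page-table accessors instead.

```c
#include <stdint.h>

/* Flag bits of a 32-bit page-table entry (names are illustrative). */
#define PTE_P    0x001u    /* Present            */
#define PTE_W    0x002u    /* Writable           */
#define PTE_U    0x004u    /* User               */
#define PTE_PWT  0x008u    /* Page Write-Through */
#define PTE_PCD  0x010u    /* Page Cache-Disable */
#define PTE_A    0x020u    /* Accessed           */
#define PTE_D    0x040u    /* Dirty              */

/* The upper 20 bits hold the Page-Frame Number (PFN). */
uint32_t pte_frame_number( uint32_t pte ) { return pte >> 12; }

/* Nonzero if every flag in 'mask' is set in the entry. */
int pte_has_flags( uint32_t pte, uint32_t mask ) { return ( pte & mask ) == mask; }
```

For example, the entry 0x1234F067 names page-frame 0x1234F and has its P, W, U, A and D bits set.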
Finding the user-buffer’s PFN
• To program the ‘base-address’ field in the
second TX-Descriptor, our driver’s ‘write()’
function will need to know which physical
Page-Frame the application’s buffer lies in
• And its PFN (Page-Frame Number) can be
found from its virtual address by ‘walking
the CPU’s page-tables’, even when Linux
puts some page-tables in ‘high’ memory
Performing ‘virt_to_phys()’
ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
    unsigned int _cr3, *pgdir, *pgtbl, pfn_pgtbl, pfn_frame;
    unsigned int dindex, pindex, offset;

    // take apart the virtual-address of the user's 'buf' variable
    dindex = ((unsigned long)buf >> 22) & 0x3FF;    // pgdir-index (10-bits)
    pindex = ((unsigned long)buf >> 12) & 0x3FF;    // pgtbl-index (10-bits)
    offset = ((unsigned long)buf >> 0) & 0xFFF;     // frame-offset (12-bits)

    // then walk the CPU's paging-tables to get buf's physical-address
    asm( " mov %%cr3, %%eax \n mov %%eax, %0 " : "=m"(_cr3) : : "ax" );
    pgdir = (unsigned int *)phys_to_virt( _cr3 & ~0xFFF );
    pfn_pgtbl = (pgdir[ dindex ] >> 12);
    pgtbl = (unsigned int *)kmap( &mem_map[ pfn_pgtbl ] );  // page-table may be in 'high' memory
    pfn_frame = (pgtbl[ pindex ] >> 12);
    kunmap( &mem_map[ pfn_pgtbl ] );
    txring[ txtail + 1 ].base_address = (pfn_frame << 12) + offset;
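The same table-walk can be tried out safely in user space by simulating physical memory with an array. Everything in this sketch is hypothetical scaffolding: 'physmem' stands in for RAM, 'cr3' is just a number we choose, and the present-bit checks are an addition of ours that the driver code above does not perform.

```c
#include <stdint.h>

#define NOT_PRESENT 0xFFFFFFFFu    /* returned when an entry's P-bit is clear */

/* Translate 'va' by walking simulated two-level page-tables.
   'physmem' is simulated RAM viewed as 32-bit words, so the word at
   physical address 'pa' is physmem[ pa / 4 ]. */
uint32_t sim_translate( const uint32_t *physmem, uint32_t cr3, uint32_t va )
{
    uint32_t dindex = ( va >> 22 ) & 0x3FF;
    uint32_t pindex = ( va >> 12 ) & 0x3FF;
    uint32_t offset = va & 0xFFF;

    /* CR3's upper 20 bits give the page-directory's physical address */
    const uint32_t *pgdir = physmem + ( cr3 & ~0xFFFu ) / 4;
    uint32_t pde = pgdir[ dindex ];
    if ( !( pde & 1 ) ) return NOT_PRESENT;

    /* the directory-entry's upper 20 bits locate the page-table */
    const uint32_t *pgtbl = physmem + ( pde & ~0xFFFu ) / 4;
    uint32_t pte = pgtbl[ pindex ];
    if ( !( pte & 1 ) ) return NOT_PRESENT;

    /* the table-entry's upper 20 bits are the frame's base-address */
    return ( pte & ~0xFFFu ) + offset;
}
```

With the page-directory placed in simulated frame 0 and one page-table in frame 1, setting directory-entry 32 to 0x1001 and table-entry 73 to 0x1234F001 makes virtual address 0x08049F14 translate to physical address 0x1234FF14.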
Can’t cross a ‘page-boundary’
• In order for the NIC to fetch the user’s data
using its Bus-Master DMA capability, the
buffer ‘buf’ needs to reside in
a physically contiguous memory-region
• But we can’t be sure Linux will have set up
the CPU’s page-tables that way, unless
the ‘buf’ is confined to a single page-frame
Truncate ‘len’ if necessary
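ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
    if ( offset + len > PAGE_SIZE ) len = PAGE_SIZE - offset;
[Diagram: ‘buf’ begins at ‘offset’ within one PAGE_SIZE page-frame and extends
for ‘len’ bytes; any part that would spill into the next page-frame is cut off.]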
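The truncation rule is easy to test on its own in user space. In this sketch the function name and the constant SKETCH_PAGE_SIZE are our own, and we assume 4096-byte pages as in the address-translation slides.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u    /* assumed page-frame size */

/* Return how many of 'len' bytes, starting at virtual address 'vaddr',
   lie within the page-frame that contains 'vaddr'. */
size_t clamp_to_page( unsigned long vaddr, size_t len )
{
    size_t offset = vaddr & ( SKETCH_PAGE_SIZE - 1 );

    if ( offset + len > SKETCH_PAGE_SIZE ) len = SKETCH_PAGE_SIZE - offset;
    return len;
}
```

For a buffer at 0x08049F14 the offset is 0xF14 (3860), so a 4096-byte request is cut to 236 bytes. Since write() may legitimately report fewer bytes than were requested, an application that checks the return-value can call write() again for the remainder.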
‘zerocopy.c’
• We created this modification of our ‘nic2.c’
device-driver so its ‘my_write()’ function
lets an application perform transmissions
without performing a memory-to-memory
copy-operation (i.e., ‘copy_from_user()’)
• It is not so easy to implement ‘zero-copy’
for receiving packets – can you say why?
Website article
• We’ve posted a link on our CS686 website
to a frequently cited research-article about
the various issues that arise when trying to
implement the ‘zero-copy’ concept for the
case of ‘incoming’ network-packets:
The Need for Asynchronous, Zero-Copy Network I/O,
by Ulrich Drepper, Red Hat, Inc.