OS support for (flexible) high-speed networking
Herbert Bos
Vrije Universiteit Amsterdam
[Legend: u = uspace, k = kspace, n = nspace]

• monitoring
• network processors
• intrusion detection
• multiple concurrent apps
• packet processing (NATs, firewalls, etc.)
• optical infrastructure

we’re hurting because of
● hardware problems
● software problems
● language problems
● lack of flexibility
● multiple apps problem
● network speed may grow too fast anyway
[Figure: CPU, MEM, NIC]

What are we building?
[Figure labels: flows; different barriers]

    if (pkt.proto == TCP) then
        if (http) then
            scan webattacks;
        else if (smtp)
            scan spam;
            scan viruses
        fi
    else if (pkt.proto == UDP) then
        if (NFS traffic)
            mem = statistics (pkt);
        else
            scan SLAMMER;
        fi

• reconfigure when load changes
• dynamic optical infrastructure:
  - configure at startup time
  - construct datapaths

reduce copying
● FFPF avoids both ‘horizontal’ and ‘vertical’ copies
[Figure: Application A and Application B in userspace (U) sharing one ‘filter’ in the kernel (K)]
- no ‘vertical’ copies
- no ‘horizontal’ copies within flow group
- more than ‘just filtering’ in kernel (e.g., statistics)

[Figure labels: minimise copying, context switching; different copying regimes; endianness; new languages; control; dataplane; different buffer regimes; flowgraph = DAG; combine multiple requests]

Software structure
[Figure: layered software stack, top to bottom: applications; FFPF toolkit, libpcap and MAPI interfaces; FFPF in userspace; userspace-kernel boundary; FFPF in kernelspace; host-NIC boundary; FFPF on ixp2xxx]

current status
● working for Linux PC with
  – NICs
  – IXP1200s and IXP2400s (plugged in)
  – remote IXP2850 (with poetic license)
● rudimentary support for distributed packet processor
● speed comparable to tailored solutions

Thank you!
http://ffpf.sourceforge.net/

bridging barriers

Three flavours
Three flavours of packet processing on IXP & host
• Regular
  - copy only when needed
  - may be slow depending on access pattern
• Zero copy
  - never copy
  - may be slow depending on access pattern
• Copy once
  - copy always

combining filters

    (dev,expr=/dev/ixp0) > [ [ (BPF,expr="…") > (PKTCOUNT) ] | (FPL2,expr="…") ] > (PKTCOUNT)

Example 1: writing applications

    #include <stdio.h>
    #include <string.h>
    /* FFPF toolkit declarations are not shown on the slide */

    int main(int argc, char** argv) {
        void *opaque_handle;
        int count, returnval;

        if (argc < 2)
            return -1;

        /* pass the string to the subsystem. It initializes and
           returns an opaque pointer */
        opaque_handle = ffpf_open(argv[1], strlen(argv[1]));

        if (start_listener(asynchronous_listener, NULL)) {
            fprintf(stderr, "spawn failed\n");
            ffpf_close(opaque_handle);
            return -1;
        }

        count = ffpf_process(FFPF_PROCESS_FOREVER);
        stop_listener(1);
        returnval = ffpf_close(opaque_handle);
        printf("processed %d packets\n", count);
        return returnval;
    }

Example 1: writing applications

    /* pkt, exported and export_len are declared elsewhere (not shown on the slide) */
    void* asynchronous_listener(void* arg) {
        int i;
        while (!should_stop()) {
            sleep(5);
            for (i = 0; i < export_len; i++) {
                printf("flowgrabber %d: read %d, written %d\n", i,
                       get_stats_read(exported[i]),
                       get_stats_processed(exported[i]));
                /* drain every packet currently available for this flowgrabber */
                while ((pkt = get_packet_next(exported[i]))) {
                    dump_pkt(pkt);
                }
            }
        }
        return NULL;
    }

Example 2: FPL-2
• new pkt processing language: FPL-2
• for IXP, kernel and userspace
• simple, efficient and flexible
• simple example: filter all webtraffic

    IF ( (PKT.IP_PROTO == PROTO_TCP) && (PKT.TCP_PORT == 80)) THEN RETURN 1;

• more complex example: count pkts in all TCP flows

    IF (PKT.IP_PROTO == PROTO_TCP) THEN
        R[0] = Hash[ 14, 12, 1024];
        M[ R[0] ]++;
    FI

FPL-2
● all common arithmetic and bitwise operations
● all common logical ops
● all common integer types
  – for packet
  – for buffer (useful for keeping state!)
● statements
  – Hash
  – External functions
    ● to call hardware implementations
    ● to call fast C implementations
  – If … then … else
  – For … break; … Rof
  – Return

Example application: dynamic ports

    // R[0] stores no. of dynports found (initially 0)
    IF (PKT.IP_PROTO==PROTO_TCP) THEN
        IF (PKT.TCP_DPORT==554) THEN
            M[R[0]]=EXTERN("GetDynTCPDPortFromRTSP",0,0);
            R[0]++;
        ELSE
            // compare pkt’s dst port to all ports in array – if match, return pkt
            FOR (R[1]=0; R[1] < R[0]; R[1]++)
                IF (PKT.TCP_DPORT == M[ R[1] ] ) THEN
                    RETURN TRUE;
                FI
            ROF
        FI
    FI
    RETURN FALSE;

efficient
● reduced copying and context switches
● sharing data
● flowgraphs: sharing computations
● “push filtering tasks as far down the processing hierarchy as possible”
[Figure: userspace, kernel and network card layers, annotated with ‘?’ and ‘x’ marks]

Network Monitoring
● Increasingly important
  – traffic characterisation, security, traffic engineering, SLAs, billing, etc.
● Existing solutions:
  – designed for slow networks or traffic engineering/QoS
  – not very flexible
● We’re hurting because of
  – hardware (bus, memory)
  – software (copies, context switches)
[Figure: spread of SAPPHIRE in 30 minutes]

demand for solution:
- scales to high link rates
- scales in no. of apps
- flexible

- process at lowest possible level
- minimise copying
- minimise context switching
- freedom at the bottom

generalised notion of flow
Flow: “a stream of packets that match arbitrary user criteria”
[Figure: Flowgraph. Labels: eth0; IP; TCP; UDP; TCP SYN; HTTP; RTSP; RTP; “contains worm”; UDP with CodeRed; bytecount; U; UID 0]

Extensible

    (device,eth0) | (device,eth1) -> (sampler,2) -> (FPL-2,"..") | (BPF,"..") -> (bytecount)
    (device,eth0) -> (sampler,2) -> (BPF,"..") -> (packetcount)

✔ modular framework
✔ language agnostic
✔ plug-in filters

Buffers
● PacketBuf
  – circular buffer with N fixed-size slots
  – large enough to hold packet
● IndexBuf
  – circular buffer with N slots
  – contains classification result + pointer
[Figure: circular buffer animation in three steps; slots marked O or X, writer position W, reader positions R and R1]

Buffer management
what to do if writer catches up with slowest reader?
● slow reader preference
  – drop new packets (traditional way of dealing with this)
  – overall speed determined by slowest reader
● fast reader preference
  – overwrite existing packets
  – application responsible for keeping up
    ● can check that packets have been overwritten
    ● different drop rates for different apps
[Figure: circular buffer; slots marked O, writer position W, reader position R1]

Languages
● FFPF is language neutral
● Currently we support:
  – BPF
  – C
  – OKE Cyclone
  – FPL

    IF (PKT.IP_PROTO == PROTO_TCP) THEN
        // reg.0 = hash over flow fields
        R[0] = Hash (14,12,1024)
        // increment pkt counter at this
        // location in MBuf
        MEM[ R[0] ]++
    FI

• simple to use
• compiles to optimised native code
• resource limited (e.g., restricted FOR loop)
• access to persistent storage (scratch memory)
• calls to external functions (e.g., fast C functions or hardware assists)
• compiler for uspace, kspace, and nspace (ixp1200)

packet sources
● currently three kinds implemented
  - netfilter
  - netif_rx()
  - IXP1200
● implementation on IXPs: NIC-FIX
  - bottom of the processing hierarchy
  - eliminates mem & bus bottlenecks
[Figure labels: uspace; kspace; nspace]

Network Processors
“programmable NIC”
• zero copy
• copy once
• on-demand copy

Performance results
pkt loss: FFPF: < 0.5%; LSF: 2-3%

Performance results
pkt loss: LSF: 64-75%; FFPF: 10-15%

Performance
[Figure: Copy Strategies. Bar chart of packets processed (in %, 0 to 100) for regular copy, copy once and zero copy; series: reference, drop, accept]

Summary
● concept of ‘flow’ generalised
● copying and context switching minimised
● processing in kernel/NIC: complex programs + ‘pipes’
● FPL: FFPF Packet Languages: fast + flexible
● persistent storage: flow-specific state
● authorisation + third-party code: any user
● flow groups: applications sharing packet buffers
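The two buffer-management policies from the Buffers and Buffer management slides can be made concrete with a small sketch. This is not FFPF's implementation: the names (`flow_buffers`, `fb_write`, `fb_read`), the sizes `N_SLOTS` and `SLOT_SIZE`, and the use of ever-increasing sequence counters for W and R are assumptions made for illustration, and the code is single-threaded with no synchronisation. It shows a PacketBuf/IndexBuf pair where slow reader preference drops new packets when the slowest reader is a full buffer behind, while fast reader preference overwrites and lets each reader detect how many packets it lost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_SLOTS   8   /* hypothetical N; the slides do not give a value */
#define SLOT_SIZE 64  /* hypothetical fixed slot size in bytes */

enum policy { SLOW_READER_PREFERENCE, FAST_READER_PREFERENCE };

/* One IndexBuf entry: classification result + "pointer" into PacketBuf. */
struct index_entry {
    uint32_t result;
    uint32_t slot;
};

struct flow_buffers {
    uint8_t packetbuf[N_SLOTS][SLOT_SIZE];  /* PacketBuf */
    struct index_entry indexbuf[N_SLOTS];   /* IndexBuf  */
    uint64_t written;                       /* W: total packets written so far */
    enum policy policy;
};

/* R: sequence number of the next packet this reader wants. */
struct reader {
    uint64_t next;
};

/* Store one packet. slowest_next is the position of the slowest reader.
 * Returns false if the packet was dropped (slow reader preference only). */
bool fb_write(struct flow_buffers *fb, uint64_t slowest_next,
              const void *pkt, size_t len, uint32_t result)
{
    assert(len <= SLOT_SIZE);  /* slots must be large enough to hold the packet */

    if (fb->policy == SLOW_READER_PREFERENCE &&
        fb->written - slowest_next >= N_SLOTS)
        return false;          /* writer caught up: drop the new packet */

    /* Under fast reader preference we fall through and overwrite. */
    uint32_t slot = (uint32_t)(fb->written % N_SLOTS);
    memcpy(fb->packetbuf[slot], pkt, len);
    fb->indexbuf[slot].result = result;
    fb->indexbuf[slot].slot = slot;
    fb->written++;
    return true;
}

/* Fetch the next index entry for reader r. *lost receives the number of
 * packets that were overwritten before this reader got to them.
 * Returns false if no unread packet is available. */
bool fb_read(const struct flow_buffers *fb, struct reader *r,
             struct index_entry *out, uint64_t *lost)
{
    *lost = 0;
    if (fb->written - r->next > N_SLOTS) {
        /* The writer lapped this reader: skip to the oldest surviving packet. */
        *lost = fb->written - r->next - N_SLOTS;
        r->next += *lost;
    }
    if (r->next == fb->written)
        return false;
    *out = fb->indexbuf[r->next % N_SLOTS];
    r->next++;
    return true;
}
```

With fast reader preference, a reader that falls more than `N_SLOTS` packets behind simply sees a non-zero `lost` count on its next read, which is one way an application could "check that packets have been overwritten"; each reader then has its own drop rate, independent of the others.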
microbenchmarks

we’re hurting because of
● hardware problems
  • heterogeneous hardware
  • programmable cards, routers, etc.
● software problems
  • kernel: too little, too often
  • vertical + horizontal copies
  • context switching
● language problems
  • popular ones: not expressive
  • expressive ones: not popular
  • same for speed
● lack of flexibility
  • no way to mix and match
  • no easy way to combine
  • build on abstractions that are too hi-level
  • hard to use as ‘generic’ packet processor
● multiple apps problem
● network speed may grow too fast anyway
  • 10-40 Gbps: what will you do in 10 ns?
  • beyond capacity of single processor
[Figure: CPU, MEM, NIC]
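The "what will you do in 10 ns?" question can be checked with back-of-the-envelope arithmetic. The slide does not say how the figure was derived, so the sketch below rests on stated assumptions: back-to-back minimum-size 64-byte Ethernet frames, plus 20 bytes of per-frame wire overhead (8 bytes of preamble and start-of-frame delimiter, 12 bytes of inter-frame gap). The function names and the 3 GHz clock in the usage note are hypothetical.

```c
#include <assert.h>

/* Preamble + start-of-frame delimiter (8 bytes) and inter-frame gap (12 bytes). */
#define ETH_WIRE_OVERHEAD_BYTES 20

/* Time available per packet, in nanoseconds, when frames of frame_bytes
 * arrive back to back on a link of link_gbps gigabits per second.
 * Bits divided by Gbit/s gives nanoseconds directly. */
double packet_budget_ns(double link_gbps, unsigned frame_bytes)
{
    assert(link_gbps > 0.0);
    double bits_on_wire = 8.0 * (frame_bytes + ETH_WIRE_OVERHEAD_BYTES);
    return bits_on_wire / link_gbps;
}

/* Clock cycles a processor running at cpu_ghz has within budget_ns. */
double cycles_per_packet(double budget_ns, double cpu_ghz)
{
    return budget_ns * cpu_ghz;
}
```

Under these assumptions, `packet_budget_ns(10.0, 64)` is 67.2 ns and `packet_budget_ns(40.0, 64)` is 16.8 ns, which at a hypothetical 3 GHz clock is roughly 50 cycles per packet. That is the same order of magnitude as the slide's 10 ns and less than the cost of a few cache misses, which is consistent with the deck's emphasis on avoiding copies and context switches.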