High Performance Logging System for Embedded UNIX and GNU/Linux Applications IEEE RTCSA 2013 (8/21/13) Cisco Systems Jaein Jeong Introduction - Embedded UNIX in many places Traditional UNIX Logging System App log Process App log Process syslogd … App log Process syslog File System USER KERNEL Buffer 2 / 25 Problem Statement - Apps slow down w. large amount of logging • Long latency to logging daemon • Inefficiency of unbuffered writes to flash FS • Long latency even with output buffering App log Process App log Process syslogd … App log Process Flash Logger syslog Flash File System USER KERNEL Buffer Named pipe 3 / 25 Our Approach • Faster Message Transfer • Compatibility with Existing Logging Apps • Destination-Aware Message Formatting 4 / 25 Organization • Related Work for UNIX Logging Systems • Background – Cisco UCS and Virtual Interface Card (VIC) – Evolution of VIC Logging System • Design Requirements and Implementation • Evaluation and Optimization • Conclusion 5 / 25 Related Work - Logging Methods for UNIX Apps • Not designed for embedded/flash logging – Slow msg passing (msg copying over kernel) – Unbuffered message writes Rsyslog • An extension used in latest distros • Multi-threading. Syslog-ng • An extension based on nsyslogd • Reliable transport, encryption, and richer set of information and filtering Syslog • Introduced in early 80’s • Still most notable one 6 / 25 Background - Cisco UCS and Virtual Interface Card Cisco UCS server Cisco UCS datacenter server system 10GBASE-KR Unified Network Fabric, 1 to Each Fabric Extender Cisco UCS Virtual Interface Card (VIC) Mgmt CPU MIPS proc core (500MHz, MIPS 24Kc) VIC ASIC Mgmt CPU FCPU 0 Embedded Linux (Linux kernel 2.6.23-rc5) FCPU 1 128 Programmable Virtual Interfaces Ethernet NICs Fibre Channel HBAs 7 / 25 Background - Evolution of VIC Logging System App Process log App Process log logd … App Process log Flash Logd – a simple logging daemon Buffered • Logging from syslogd Switch JFFS2 System Process … App Process log Flash Flash Logger syslog JFFS2 USER KERNEL log App Process log Switch System Process System Process syslogd … System Process … log Flash syslog Switch syslogd log App Process App Process Multiple Processes System Process • Different Severity Levels System App Process log • Process Formatting and flash writing … App Process Switch Unbuffered syslogd JFFS2 USER KERNEL Buffer • Improves Forwards serious flash write msgs performance to switches • Functional, of unbufferedbut syslogd with worse write • performance Still suffers long latency 8 / 25 Buffer Named pipe Organization • Related Work for UNIX Logging Systems • Background – Cisco UCS and Virtual Interface Card (VIC) – Evolution of VIC Logging System • Design Requirements & Implementation • Evaluation & Optimization • Conclusion 9 / 25 Design Requirements - Faster Message Transfer • Avoid kernel-to-user space msg copying Switch Syslogd Logging App Process log App Process log Switch System Process System Process syslogd … App Process log KERNEL … System Process App Process Switch Flash App Process System Process log mqlogd log dequeue … JFFS2 USER Switch System Process App Process Flash Logger syslog Mqlogd Logging Named pipe KERNEL System Process enqueue log Flash Memory Mapped File Flash Logger JFFS2 USER Buffer … 10 / 25 Named pipe Design Requirements - Faster Message Transfer • Reduce message copying from 4 to 2 3 Syslogd copy 1 App locallocal copy 4 Write to named pipe 2 kernel buffer 2’ Write directly from shared memory to named pipe 1’ to shared memory Switch Syslogd Logging App Process log App Process log System Process System Process 1 syslogd … App Process Switch 2 3 log syslog 4 Switch Switch System Process … App Process System Process App Process Flash App Process Flash Logger System Process log log … JFFS2 USER KERNEL Mqlogd Logging 1’ mqlogd dequeue enqueue 2’ log Named pipe KERNEL System Process Flash Memory Mapped File Flash Logger JFFS2 USER Buffer … 11 / 25 Named pipe Design Requirements - Compatibility with Existing Logging Apps • Thru Logging API – Replace syslog() with share memory lib calls Logging Client app1 mcp app2 fls • Direct Syslog Calls – Server receives msgs through UDP Unix socket Logging Client … klogd fls xinetd … Logging API : log_info(), log_error(), … syslog() library call syslog() call Shared library Memory Logging UDP UnixLibrary Socket UDP Unix Socket Logging Server (mqlogd) (Syslogd) Logging Server (mqlogd) (Syslogd) 12 / 25 Design Requirements - Destination-Aware Message Formatting • Syslogd – Working but limited – Redundant – Coarse time granularity (in seconds) • Mqlogd – Destination-aware formatting with space saving – Uses system supported timing (in micro-seconds) 13 / 25 Implementation - Shared Memory and Circular Queue Queue Memory Layout Header Entry Circular Queue Header Non-Header Entry Non-Header Entry Notification Disable Flag Non-Header Entry … Logging Event Logging Enqueue Client … Logging Client • Notification Mechanism Shared Memory Dequeue Logging Server Notification – Write-and-select – Signal • Locking Mechanism – Semaphore lock – Pthread lock 14 / 25 Organization • Related Work for UNIX Logging Systems • Background – Cisco UCS and Virtual Interface Card (VIC) – Evolution of VIC Logging System • Design Requirements & Implementation • Evaluation & Optimization • Conclusion 15 / 25 Evaluation • Metrics – Request Latency – Request Drop Rate • Parameters – Number of clients – Number of iterations (Depth of queue size) – Locking mechanism – Notification mechanism 16 / 25 Performance Results - Performance compared to syslogd • Avg Latency: >10x speed-up • Min Latency: >20x speed-up • Max Latency: >2x speed-up Maximum Minimum Average Request RequestLatency Latency- -11Client Client Latency (us) 800 50000 40000 600 30000 400 20000 200 10000 syslogd mqlogd (select, semaphore) mqlogd (signal, semaphore) mqlogd (select, pthread) 0 0 100 100 1000 1000 5000 5000 10000 10000 Number NumberofofIterations Iterations 50000 50000 mqlogd (signal, pthread) 17 / 25 Performance Results - Effect of Queue Size • No drops within queue size (e.g. 10000) • Queue size should be larger than max expected burst size Request Drop Rate - 1 Client 100% Percent 80% 60% mqlogd (select, semaphore) 40% mqlogd (signal, semaphore) 20% mqlogd (select, pthread) 0% mqlogd (signal, pthread) 100 1000 5000 10000 Number of Iterations 50000 18 / 25 Performance Results - Effect of Multiple Clients • Avg request latency increases proportionally • With 2 clients, request starts to drop with smaller number of iterations Avg Request Request Drop Latency Rate - -11and and22Clients Clients (us) Latency Percent 2000.0 100% 80% 1500.0 60% 1000.0 40% 500.0 20% syslogd (1 client) syslogd (2 mqlogd (select, clients) 1 client) mqlogd (select, 2 1 clients) client) 0% 0.0 mqlogd (select, 2 clients) 100 100 1000 1000 5000 5000 10000 10000 Number Numberof ofIterations Iterations 50000 19 / 25 Performance Results - Effect of Notification Mechanisms • Makes little difference Maximum Minimum Average Request RequestLatency Latency- -11Client Client Latency (us) 100 25 20000 20 80 15000 15 60 10000 10 40 5000 20 5 mqlogd (select, semaphore) mqlogd (signal, semaphore) 00 0 100 100 100 1000 1000 1000 5000 5000 5000 10000 10000 10000 Number Number Number of ofof Iterations Iterations Iterations 50000 50000 20 / 25 Performance Results - Effect of Lock Mechanisms • Pthread mutex is 40% faster than semaphore. • Semaphore is used for our production code due to a limitation of pthread mutex lock (Linux kernel 2.6.23-rc5).. Maximum Minimum Average Request RequestLatency Latency- -11Client Client Latency (us) 250 150 20000 200 15000 100 150 10000 100 50 5000 50 mqlogd (select, semaphore) mqlogd (select, pthread) 0 0 100 100 1000 1000 5000 5000 10000 10000 Number NumberofofIterations Iterations 50000 50000 21 / 25 Performance Results - Effect of Client Interface Type • Logging using UNIX socket interface – Backward compatibility is no faster – About the same level as syslogd. – For compatibility, not for general use. Average Request Latency - 1 Client Latency (us) 1000 800 600 syslogd 400 mqlogd (select, semaphore) 200 mqlogd (Unix socket) 0 100 1000 5000 10000 Number of Iterations 50000 22 / 25 Optimization - Effects of deferred notification • Sends one notification for a batch of msgs • Measured time for host-to-adapter commands (capability & macaddr) with and w.o. logging • 2x speed-up in latency Latency for 'capability' command Latency for 'macaddr' command 1000 1800 1600 1400 1200 1000 800 600 400 200 0 800 logging time (us) msg xfer time (us) logging time (us) 600 400 msgxfer xfer msg time(us) (us) time 200 0 write-and-select deferred syslogd write-and-select deferred syslogd 23 / 25 Future Works • Reduce kernel msg copying even further • Improve performance with faster lock • Avoid loss of serious messages App log Process App log Process enqueue App log Process … Memory Mapped File mqlogd dequeue Memory Mapped File Flash Logger File System USER KERNEL Named pipe 24 / 25 Conclusion • • • • Logging system for embedded UNIX apps Up to 100x speed-up in latency, 10x throughput Backward Compatibility Commercially used in Cisco UCS Virtual Interface Cards 25 / 25