High Performance Logging System for Embedded UNIX and GNU/Linux Applications

Jaein Jeong
Cisco Systems
San Jose, California 95134, USA
jajeong@cisco.com

Abstract—We present a high performance logging system for embedded UNIX and GNU/Linux applications. Compared to the standard UNIX and GNU/Linux logging method, syslog, our method has two orders of magnitude lower latency and an order of magnitude higher message throughput. This speed-up is mainly due to the use of a memory-mapped file as the means of interprocess communication, fewer memory copies and the batching of output messages in the logging daemon. In addition, our logging system also accepts syslog messages, providing compatibility with existing applications. Our logging system is in production use in the Cisco UCS Virtual Interface Card.

I. INTRODUCTION

Logging is a helpful tool for analyzing the behavior of computer systems and applications. By analyzing a sequence of events in a log, one can see whether computer systems or applications behave properly. In production environments, logs are even more important and may be the only way to analyze system problems. In this paper, we focus on logging systems for embedded UNIX applications (we will use UNIX to refer to all UNIX-like systems including GNU/Linux).

Traditional UNIX applications log messages by writing them over a socket to a system daemon, which then writes the messages to log files and/or forwards them to peers running on remote hosts. The system daemon is syslogd and the application library interface is syslog. There are two performance problems with this design. The first is the time it takes for an application to transmit a message to syslogd. The second is the inefficiency of syslogd writing messages to a file one at a time using unbuffered writes. These unbuffered writes are not a big problem on a conventional file system using a hard disk (the file system buffer cache reduces disk I/O and hard drives are fast compared to flash memory), but are very slow when writing to flash memory using a flash memory file system. The first of these problems increases application response times when there are many messages to log. The second reduces the number of messages that can be logged per second by at least an order of magnitude, and also increases the application response time, as no new messages may be transmitted to syslogd while it is writing the previous message.

In order to address these performance problems, we developed a logging method that uses shared memory to transfer messages to the system logging daemon and briefly buffers received messages before writing them to permanent storage. In our logging method, application processes create log messages by writing into a shared memory segment. A write lock is used to ensure applications don't conflict with one another as they write into the shared memory. A logging daemon retrieves messages from the shared memory segment and dispatches them to one or more destinations, such as permanent local storage, a remote syslogd daemon, or a dynamically connectable network client. We note that our logging method drops incoming messages when the shared memory fills up during a burst of incoming messages. This trade-off was made to meet real-time requirements for applications.

After a message is written to the shared memory, the logging daemon must be notified. This is done by writing to a pipe. The logging daemon waits on the receiving end of the pipe and wakes up when data (i.e. a message notification) is available.
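To make the critical path concrete, the following is a minimal sketch of the client-side write, assuming a simplified queue of fixed-size slots and a process-shared POSIX semaphore as the write lock; all names (log_shm_t, log_message) are illustrative, and the production layout and locking are described in Section V.

```c
/* Minimal sketch of the client write path: lock, copy into the
 * memory-mapped queue, unlock, and notify the daemon via a pipe.
 * All names and the slot layout here are illustrative. */
#include <semaphore.h>
#include <string.h>
#include <unistd.h>

#define SLOTS   10240   /* queue size used in the evaluation (Section VI) */
#define SLOT_SZ 48      /* block size used by the implementation (Section V) */

typedef struct {
    sem_t    lock;             /* initialized with sem_init(&lock, 1, 1) */
    volatile int daemon_awake; /* set by the daemon while it is draining */
    unsigned head, tail;
    char     slots[SLOTS][SLOT_SZ];
} log_shm_t;

/* shm points into the memory-mapped file; notify_fd is the write end of
 * the pipe the daemon waits on. Both are set up at initialization time. */
int log_message(log_shm_t *shm, int notify_fd, const char *msg)
{
    sem_wait(&shm->lock);
    unsigned next = (shm->tail + 1) % SLOTS;
    if (next == shm->head) {            /* queue full: drop (Section I) */
        sem_post(&shm->lock);
        return -1;
    }
    strncpy(shm->slots[shm->tail], msg, SLOT_SZ - 1);
    shm->slots[shm->tail][SLOT_SZ - 1] = '\0';
    shm->tail = next;
    int need_notify = !shm->daemon_awake;
    sem_post(&shm->lock);

    if (need_notify)
        write(notify_fd, "x", 1);       /* wake the daemon via the pipe */
    return 0;
}
```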
Event-driven applications may defer notification until their event handler completes. This has two positive effects, the most important of which is reducing the notification overhead from once per message to once per handler invocation. It also defers the cost of notification until the handler has completed its work. Finally, when the logging daemon is running, it sets a flag in the shared memory area indicating that (since it is already running) it does not need to be notified.

As the logging daemon processes each message, those that go to a network destination (either a remote syslogd or a dynamic log viewer) are sent to the destination immediately. Those to be stored locally are formatted into their final text representation and written via a pipe to a buffering process, flashlogger. After the first message that arrives when its buffer is empty, flashlogger waits for up to 0.5 seconds for additional messages. After the wait period, or sooner if its buffer fills up, flashlogger writes its buffered messages to flash and resumes waiting for additional messages. This buffering reduces the number of writes to local storage and enables a much higher overall throughput than obtained by writing each message to storage individually.

In addition to forwarding to remote syslogd processes, the logging daemon also accepts syslog messages from local applications (local only). This provides full compatibility for existing applications. Messages received via the syslog interface are dispatched the same way (i.e. according to their priority and possibly source) as messages received via the shared memory segment.

In our embedded environment with flash storage, our logging system is almost two orders of magnitude faster for an application than syslog. This is because logging a message only involves user-space memory copying without a context switch. The maximum message throughput (how many messages can be saved per second before having to drop or throttle messages) with the buffering is over an order of magnitude greater.

The rest of this paper is organized as follows. In Sections II and III, we review previous work and the evolution of logging systems on which our work is based. We describe the design principles and the implementation details for our logging method in Sections IV and V. We evaluate the performance in Section VI, describe a further optimization in Section VII and conclude this paper in Section VIII.

II. RELATED WORK

Introduced in BSD in the early 1980s, syslog [1], [2] is still the most notable logging method for UNIX applications due to its simplicity, flexibility and public availability. This is exemplified by a number of applications that use or extend it [3]–[9]. In order to provide more features, a number of extensions of syslog have been implemented. Syslog-ng [10] and rsyslog [11] are two of the most popular ones [12]–[14]. Syslog-ng grew out of another syslog extension, nsyslogd [15], and was created to support the original BSD syslog protocol [1]. Its goals are to provide additional features such as reliable transport, encryption, and a richer set of information and filtering. Over time, it has evolved to support the latest syslog protocol [2] and is used by several UNIX distributions. Rsyslog [11] is another well-known extension of syslog and is used in many of the latest Linux distributions. Rsyslog extends syslog not only in features but also in performance with a multi-threaded implementation.
By using threads, it avoids the problem of blocking the acceptance of new messages while writing previously received messages. However, these existing systems are not designed for embedded/flash logging; they use the same fundamental design in their message handling and thus have the same limitations when used for embedded flash logging as mentioned in Section I: (i) slow message passing due to message copying over the kernel to user space boundary, and (ii) unbuffered writes of messages to output devices. In the following sections, we describe our shared-memory logging methodology and show how it addresses the performance problems of existing logging systems.

III. CISCO UCS VIRTUAL INTERFACE CARD AND EVOLUTION OF LOGGING SYSTEM

In this section, we introduce the Cisco UCS Virtual Interface Card and explain how its logging system evolved based on design requirements and experience using it.

A. Cisco UCS Virtual Interface Card

Cisco Unified Computing System (UCS) [16] is a data-center server system that enables large deployments of virtualized servers with scalable performance through hardware-accelerated I/O virtualization. Cisco UCS mainly consists of the following components: servers, virtual interface cards and switches. A UCS server provides computation capability, and it can be used as a computation-intensive single server or host a number of virtualized servers. A UCS virtual interface card [17], or VIC, is a high-speed network interface card for a UCS server, and one of its key features is support for virtual network interfaces (VNICs). To the server, a VNIC is seen as a separate device with a unique PCI bus ID and separate resources. To the network, its traffic is seen as network frames carrying a special tag, called a VN-Tag. Thus, a VNIC is treated by a server in the same way as other PCI devices and does not cause any translation overhead. A VIC is implemented by an ASIC that consists of a management processor, two Fibre Channel processors, VNIC resources and resource mappings. The management processor, which the firmware runs on and is the target of our logging system, has a 500 MHz MIPS 24Kc processor core and runs Linux kernel 2.6.23-rc5.

B. Evolution of VIC Logging System

• logd - A Simple Logging Daemon: We initially wrote logd, a simple logging daemon that accepts messages with different severity levels from multiple processes through IPC calls, formats each message and writes the formatted output message to flash storage using stdio functions. Although it is very simple, logd achieved its goals with reasonable performance. Since writes through stdio functions are buffered, logd was able to write messages reasonably quickly most of the time.

• Unbuffered syslogd: logd served its purpose well enough at first, but there came a need to support forwarding serious errors to syslogd on the UCS switches. Instead of extending logd with such a feature, we decided to use syslogd, which of course already has the feature. Using syslogd achieved the immediate goal, but its write performance was much worse because syslogd writes log messages without buffering them.

• Buffered syslogd: In order to address the write performance problem of syslogd, we introduced a buffering process, called flashlogger, between syslogd and the flash storage. flashlogger has an adjustable wait time (currently half a second) to collect messages before it writes out what it has collected. However, if its buffer fills up before the wait time has passed, it goes ahead and writes out the buffer immediately (a sketch of this buffering loop follows the list).
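The buffering loop at the heart of flashlogger can be sketched as follows; the fd names and buffer size are illustrative, and the 0.5-second wait corresponds to the adjustable wait time described above.

```c
/* Sketch of the flashlogger buffering loop: messages arrive on in_fd (a
 * pipe) and are written to the log file on flash (out_fd) either when
 * 0.5 s has passed since the first buffered message or when the buffer
 * fills up, whichever comes first. */
#include <sys/select.h>
#include <unistd.h>

#define BUF_SZ (64 * 1024)   /* illustrative buffer size */

void flashlogger_loop(int in_fd, int out_fd)
{
    char buf[BUF_SZ];
    size_t used = 0;

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(in_fd, &rfds);

        /* Block indefinitely while empty; once a first message has been
         * buffered, wait at most 0.5 s for additional messages. */
        struct timeval tv = { 0, 500000 };
        int n = select(in_fd + 1, &rfds, NULL, NULL, used ? &tv : NULL);

        if (n > 0) {
            ssize_t r = read(in_fd, buf + used, BUF_SZ - used);
            if (r <= 0)
                break;                  /* writer closed the pipe */
            used += (size_t)r;
        }
        /* Flush on timeout (n == 0) or when the buffer is full. */
        if (used > 0 && (n == 0 || used == BUF_SZ)) {
            write(out_fd, buf, used);   /* one large write instead of many */
            used = 0;
        }
    }
    if (used > 0)
        write(out_fd, buf, used);       /* flush whatever remains */
}
```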
This combination of syslogd and flashlogger was used as the logging system for VIC until we developed the new logging system introduced in this paper.

IV. DESIGN REQUIREMENTS

A. Faster Message Transfer

A representation of syslogd is shown in Figure 1(a). The round-trip time between a user application process and the syslogd process is typically around 700 us on average and has been observed as high as 13,000 us under heavy load. This is not a small number for a modern computing system. Accessing syslogd takes this long because it requires passing data between user and kernel space and context switches between the application and syslogd.

[Fig. 1. Data Flow of Log Messages: (a) Typical UNIX Logging — application and system processes send messages through a kernel socket to syslogd, which writes them through flashlogger to the JFFS2 flash file system and forwards them to the switches; (b) Shared Memory Logging — processes enqueue messages in a user-space memory-mapped file, from which mqlogd dequeues them and forwards them to the switches or, via a named pipe, to flashlogger and the JFFS2 flash file system.]

In order to address the long round-trip problem of syslogd, we designed a logging scheme that does not require message transfer between user and kernel spaces in its critical data path, as illustrated in Figure 1(b). With this scheme, an application or system process uses the same logging API to log a message. The message is copied to shared memory by the log functions and forwarded to the flash memory or a remote host by a message collector, mqlogd. Since the shared memory is allocated in user space, message transfer can be very fast once the shared memory section is initialized. Thus, a user or system application is finished logging right after it writes a message to the shared memory instead of waiting for a reply from the message collector, and this improves latency.

The faster message transfer of the shared memory logging is also attributable to the number of times a message is copied in memory, which is a significant factor for message transfer time. With syslogd, a message is copied four times in memory: first, it is formatted into a user buffer; second, it is written from the user buffer to a kernel socket; third, it is read from the kernel socket to a syslogd user buffer; and fourth, it is written from the syslogd user buffer to a file. With the shared memory logging, the number reduces to two. A process that logs a message formats the message directly into the shared memory, and the logging daemon reads the shared memory directly, so the first three copies of the syslogd path are reduced to a single copy.

B. Compatibility with Existing Logging Applications

VIC applications use the following logging macros: log_info, log_debug, log_warn, and log_error. These macros format the input strings and previously resulted in syslog calls. Applications that use these logging macros were converted to shared-memory-based logging without modification of application code by changing the syslog calls to message copies into the shared memory. However, there are some applications that directly call syslog without using the logging macros. Therefore, the logging daemon of our logging system also accepts messages intended for syslogd. This allows existing applications that use syslogd to be used with our logging system without modifications. One such application is klogd.
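As an illustration of this conversion, the macros can keep their printf-like interface while being retargeted at a single point; shm_log_write() below is a hypothetical stand-in for the shared-memory client call, not the actual VIC code.

```c
/* Illustrative sketch: the logging macros format the message once in user
 * space, then copy it into the shared-memory queue instead of passing it
 * to syslog(). shm_log_write() is hypothetical. */
#include <stdarg.h>
#include <stdio.h>

enum log_level { LOG_LVL_DEBUG, LOG_LVL_INFO, LOG_LVL_WARN, LOG_LVL_ERROR };

extern int shm_log_write(enum log_level lvl, const char *msg);

static void log_write(enum log_level lvl, const char *fmt, ...)
{
    char msg[1024];
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(msg, sizeof msg, fmt, ap);  /* format once, in user space */
    va_end(ap);
    /* Previously: syslog(priority, "%s", msg); now a shared-memory copy. */
    shm_log_write(lvl, msg);
}

#define log_debug(...) log_write(LOG_LVL_DEBUG, __VA_ARGS__)
#define log_info(...)  log_write(LOG_LVL_INFO,  __VA_ARGS__)
#define log_warn(...)  log_write(LOG_LVL_WARN,  __VA_ARGS__)
#define log_error(...) log_write(LOG_LVL_ERROR, __VA_ARGS__)
```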
C. Output Flash File System

The syslogd program for VIC used the flashlogger program to buffer messages, and our logging system takes the same approach.

D. Support of Destination-Aware Message Formatting

One of the advantages of syslogd is that it allows log messages to be sent to multiple destinations of different types, where a destination may be a simple file, a named pipe or a socket on a remote host. However, there is a drawback: syslogd uses the same data representation, shown in Figure 2, for all destinations. This representation has two problems. The first problem is inefficiency. The string representation of date and time requires more space than a binary or simpler text representation. In addition, the host name is redundant for flash logs because all of the events that are stored in the flash memory are local and are from the same host. The second problem is loss of precision. The message format specifies the time of a log message at a granularity of seconds, while the system supports timing in milliseconds. Logging data is more useful with more precise timings.

Format: DateStr TimeStr Host app.module[pid] : Message
Example: Mar 9 15:53:23 simserv01 log.logtest[16665]: This is a message
Fig. 2. Message format for syslogd

Our logging system formats messages differently depending on the destination. For remote forwarding, it outputs log messages in the same way as the current Cisco version of syslogd does, for compatibility. For writing to flashlogger, it changes the message format to use a higher precision timestamp while omitting the hostname.

[Fig. 3. Program Modules for the mqlogd Logging Service: client utilities (e.g., logperf using the shared-memory API, or syslog() clients using a UDP UNIX socket) hand messages to the mq library (shared memory management, queue management); the mqlogd server receives them, performs message formatting, and dispatches them to output utilities — local logging via a named pipe to flashlogger, remote forwarding via a UDP IP socket to syslogd, and dynamic remote forwarding via a TCP IP socket to a client such as netcat.]

V. IMPLEMENTATION

A. Platform

While our logging method can be used for any UNIX system that does logging, it is especially designed for the embedded firmware of the Cisco UCS Virtual Interface Card [17], where low-latency logging is desirable and the storage medium log files are written to is slow.

B. Modular Description

Our logging method consists of a core library, server modules, client modules and output utilities, as illustrated in Figure 3. The module mq, which is in charge of managing the memory-mapped file, the circular queue, and message formatting, is built as a library that can be used by both the logging daemon and clients. The logging daemon, mqlogd, accepts logging messages either through the shared memory or a UDP UNIX socket, and it outputs messages through a named pipe or a UDP IP socket. The client modules receive logging requests in a printf-like format and insert them into the shared memory. As for the client utilities, we consider two different kinds depending on how they submit logging requests: one uses shared memory and the other uses the syslog library call and a UDP UNIX socket. Output utilities are connected to the server through a named pipe, a UDP IP socket, or a TCP IP socket. The named pipe interface is for a program such as flashlogger that stores logging messages locally.
The UDP IP socket interface is used to forward messages to the syslogd process on the switch. The server also supports dynamic remote forwarding through a TCP IP socket interface. This interface is different from the UDP IP socket interface in that it dynamically maintains forwarding destinations, such as netcat on a remote host.

[Fig. 4. Memory Layout of the Circular Queue: the memory-mapped file is an array of N fixed-size blocks at offsets 0, BlockSize, 2 * BlockSize, ..., (N-1) * BlockSize; the first block is the header entry and the remaining blocks are data entries. The header entry (48 B) holds the queue bookkeeping: flag (1 B), unused (3 B), head (4 B), tail (4 B), reader mutex (20 B), PID (4 B), remainder unused. The first data entry of a message (48 B) holds: prog name (16 B), logging ID (16 B), log level (2 B), msg length (2 B), PID (4 B), timestamp (8 B). Continuation data entries (48 B) hold only the message string.]

C. Shared Memory and Circular Queue Management

1) Data Representation: The shared memory is used as an array of fixed-sized blocks that are managed as a circular queue, as illustrated in Figure 4. The first block in the memory-mapped file is used as the header entry, which contains all the bookkeeping information for the circular queue and the logging service. The rest of the blocks are data entries, which contain the information for each logging message. Just after each message header, the message uses as many blocks as it needs for the message string. The length of the message string is specified in the message length field in the first data entry. Since data entries are consecutive, we can handle multiple segments of message string fields like a single field by pointing to the starting address of the first message string field. However, there is an exception to this. When a sequence of data entries rolls around the end of the memory-mapped file, we cannot treat the message string as consecutive memory. In that case, the message string must be handled as two pieces: the first one is from the beginning of the message string to the end of the memory-mapped file, and the second one is from the beginning of the memory-mapped file to the location where the last data entry ends. Thus, a message string can be handled as one or two consecutive byte arrays.

2) Notification Mechanism: The queue management module notifies the logging daemon when a client writes a logging message in the queue or when the queue is full. It implements two different methods for client-to-server notification: one using write-and-select (the default) and the other with a signal. Each time a client writes a logging message to the memory-mapped file, it checks whether it should notify the logging daemon. If the daemon has set the notify flag in the header entry, the client does not send a notification. If the daemon has cleared the notify flag, the client sends a data-available command, which makes the daemon retrieve all the logging messages in the queue. This notification batching mechanism suppresses redundant notifications and helps improve the latency and throughput for message passing and dispatch. The client also sends a queue-full notice when it detects that the queue is full. It then drops the message. This makes the daemon output a queue-full message to the destinations in order to mark that there was a drop.

3) Locking Mechanism: The queue management module uses a locking mechanism in order to maintain the consistency of the circular queue. It implements two different mechanisms: a semaphore lock and a pthread mutex lock. The semaphore lock is a set of lock functions based on the Linux semaphore API. While it is not very fast, due to a kernel access for each lock or unlock operation, it is widely supported. The pthread mutex lock uses the Linux pthread API. While it is faster than the semaphore lock, it is unreliable for inter-process use in our current version of Linux. Thus, the deployed queue management module uses the semaphore lock.
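To make the data representation concrete, the following sketch shows one possible C rendering of the 48-byte data entry and of the two-piece copy used when a message string wraps around the end of the memory-mapped file; field names follow Figure 4, but the exact production definitions may differ.

```c
/* Sketch of a 48-byte data entry and the wrap-around read. */
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define BLOCK_SZ 48

typedef struct {          /* first data entry of a message (48 bytes) */
    char     prog_name[16];
    char     logging_id[16];
    uint16_t log_lvl;
    uint16_t msg_len;     /* length of the message string */
    uint32_t pid;
    uint64_t timestamp;
} log_entry_t;            /* continuation blocks hold only string bytes */

/* Copy out a message string that may wrap past the end of the mapped
 * data area. base/size describe the data area; off is where the string
 * starts; len comes from msg_len. */
void read_wrapped(const char *base, size_t size, size_t off,
                  char *out, size_t len)
{
    if (off + len <= size) {
        memcpy(out, base + off, len);            /* one consecutive piece */
    } else {
        size_t first = size - off;               /* up to the end of file */
        memcpy(out, base + off, first);
        memcpy(out + first, base, len - first);  /* rest from the start */
    }
}
```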
D. Client Logging Interface

Client applications in the original logging system first initialized application-wide logging objects before they called one of the logging macros (log_info, log_debug, log_warn, and log_error). This two-step approach fits well with shared-memory-based logging and allowed existing applications to be converted to the new logging without modification of application code.

In order to handle syslog messages, the logging daemon waits on the UNIX socket at /dev/log, adding received messages to the circular queue. This functionality provides the same application interface as syslog, allowing syslog clients to be used with the shared memory logging without any modification.

E. Server Output Interface

After receiving each logging message, the logging daemon formats it and outputs the formatted message to flashlogger for the internal flash storage. For a logging message with a high priority level (i.e. error or warn), the logging daemon also outputs the formatted message to a UDP socket on remote hosts. By default, the daemon sends the formatted message string to UDP port 514 of the two switches to which a UCS Virtual Interface Card is attached. Since UDP port 514 is a designated port for syslogd, this makes the syslogd processes on the switches receive the message.

In order to support dynamic remote forwarding, the logging daemon listens on a TCP port. When a client on a remote host makes a TCP connection to this port, the daemon adds the file descriptor of the TCP connection to a pool of active TCP connections. For each incoming logging message, the daemon formats the message and sends it to the clients in the pool. When a remote client disconnects the TCP connection, any attempt to send a message to the client generates a file error. The daemon detects this as an inactive TCP connection and removes the file descriptor from the pool.

The dynamic remote forwarding may look similar to the UDP port forwarding, but these two methods are designed for different purposes. The UDP port forwarding sends high-priority messages to the switches and is always enabled. The TCP dynamic remote forwarding, which is designed for easy debugging, sends messages of all priority levels and is enabled only on explicit request to avoid unnecessary overhead.
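A sketch of the connection pool for dynamic remote forwarding follows; detecting a disconnected viewer by a failed send is the mechanism described above, while the names and pool size are illustrative.

```c
/* Sketch of the dynamic remote forwarding pool: each formatted message is
 * sent to every accepted TCP connection, and a connection whose send fails
 * is treated as inactive and removed from the pool. */
#include <sys/socket.h>
#include <unistd.h>
#include <stddef.h>

#define MAX_VIEWERS 8

static int viewers[MAX_VIEWERS];
static int nviewers;

void forward_to_viewers(const char *line, size_t len)
{
    for (int i = 0; i < nviewers; ) {
        /* MSG_NOSIGNAL: get an error instead of SIGPIPE on a closed peer. */
        if (send(viewers[i], line, len, MSG_NOSIGNAL) < 0) {
            close(viewers[i]);                 /* inactive connection */
            viewers[i] = viewers[--nviewers];  /* drop it from the pool */
        } else {
            i++;
        }
    }
}
```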
VI. EVALUATION

A. Experiment Set-up

1) Metrics: For evaluation, we measured the performance of our shared memory-based service (mqlogd) and syslogd for two metrics – request latency and drop rate:

• Request Latency: We measured statistics (average, minimum, and maximum) for the request latency by repeating a logging message multiple times for each set of parameters. Among these statistics, the average and maximum request times are the more interesting because they show the expected and worst-case service times.

• Request Drop Rate: We also measured the request drop rate. syslogd has a zero drop rate at the sacrifice of latency, which is an order of magnitude higher. In contrast, the mqlogd logging service drops requests under high load. By measuring the request drop rate for various sets of parameters, we will show the conditions where the mqlogd logging service achieves lower request latency while maintaining a zero or acceptably low drop rate.

2) Parameters: For performance comparison of mqlogd and syslogd, we varied the number of clients and the number of iterations for both logging services:

• Number of clients - 1 and 2: Changing the number of clients gives the lower bound for performance (a single client) and shows system performance as contention increases (two clients). No more than 2 clients were measured because most of the time there is only one client logging messages.

• Number of iterations - 100, 1000, 5000, 10000, and 50000: Currently, the queue size of the system is set to 10240 items. In this test, we see how our logging service module behaves with different numbers of requests.

For all of the measured configurations, five runs were performed.

For evaluating the most suitable design out of a few design variations, we measured the performance while varying the following parameters – locking mechanism and notification mechanism:

• Locking Mechanism - semaphore lock and pthread mutex lock: The purpose of this test is to see how the semaphore lock performs compared to the pthread mutex lock and how it performs as we change the other parameters.

• Notification Mechanism - signal vs. write-and-select: While both signals and write-and-select are legitimate ways of notifying the server, signals may have side effects when used with the VIC library code. Thus, the purpose of this test is to see how write-and-select performs relative to signal, which has a smaller latency.

B. Performance Results

As for the average request latency, mqlogd showed more than a 10-times speed-up over syslogd for all configurations.

1) Effects of Notification and Lock Mechanisms: For notification mechanisms, the measurements showed little difference between the implementation with write-and-select and the one with signal. In addition, signals cause problems with the VIC library. This made us use write-and-select. For lock mechanisms, the implementation with the pthread mutex lock was about 40% faster than the one with the semaphore lock. This implies that we can easily get an additional 40% improvement by replacing calls to the semaphore lock with the pthread mutex lock when a working version is available. As for minimum request latency, mqlogd showed more than a 20-times speed-up for all its configurations: a 25-times speed-up for the semaphore lock implementations and a 40-times speed-up for the pthread mutex lock implementations compared to syslogd. As for maximum request latency, mqlogd showed more than a 2x speed-up for all its configurations: by a factor of 2.3 for the semaphore lock implementations and 4.0 for the pthread mutex lock implementations compared to syslogd.

2) Effects of Queue Size: While syslogd did not drop requests, mqlogd dropped logging messages for bursts of 10000 or more messages (comparable to the queue size of 10240), and the drops became more severe as there were more requests. We can see that the queue size needs to be larger than the maximum expected size of a burst to avoid drops.

3) Effects of Multiple Clients: We measured the request latency and the request drop rate for both one and two clients. As for the average request latency, we observed that having two clients increases it by about a factor of two. This ratio gets higher when the number of requests is larger than the size of the queue, as in the case of 50000 iterations. This shows that setting the queue size appropriately is important for keeping performance bounded. We also notice that with two clients, requests start to drop at a smaller number of iterations than with a single client.
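For context, the latency figures that follow were gathered by a benchmark client (logperf in Figure 3); a measurement loop of roughly the following shape suffices, where log_message() stands for the client call under test and its signature is hypothetical.

```c
/* Sketch of a logperf-style latency measurement loop. */
#include <stdio.h>
#include <sys/time.h>

extern int log_message(const char *msg);   /* hypothetical client call */

static long usec_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000000L + tv.tv_usec;
}

void measure(int iterations)
{
    long total = 0, min = -1, max = 0;
    for (int i = 0; i < iterations; i++) {
        long t0 = usec_now();
        log_message("logtest: this is a message");
        long dt = usec_now() - t0;
        total += dt;
        if (min < 0 || dt < min) min = dt;
        if (dt > max) max = dt;
    }
    printf("avg %.1f us, min %ld us, max %ld us\n",
           (double)total / iterations, min, max);
}
```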
Fig. 5. Average Request Latency for a Single Client (chart omitted; latency in us)

# - Iterations               100     1000    5000    10000   50000
syslogd                      710.6   741.0   748.0   739.8   716.0
mqlogd (select, semaphore)   50.4    49.8    42.2    45.8    72.6
mqlogd (signal, semaphore)   90.4    49.2    46.0    50.4    85.2
mqlogd (select, pthread)     191.4   27.4    23.4    27.8    78.8
mqlogd (signal, pthread)     69.2    26.2    23.4    31.2    94.2

Speed-up over syslogd        100     1000    5000    10000   50000
mqlogd (select, semaphore)   14.1    14.9    17.7    16.2    9.9
mqlogd (signal, semaphore)   7.9     15.1    16.3    14.7    8.4
mqlogd (select, pthread)     3.7     27.0    32.0    26.6    9.1
mqlogd (signal, pthread)     10.3    28.3    32.0    23.7    7.6

4) Effects of Client Interface Type: We evaluated the performance of our logging system using the UNIX socket interface and compared it with the performance with shared memory. For reference, we compared both with the performance of syslogd. The experiment results are shown in Figure 11. We find that the average latency with the UNIX socket is more than an order of magnitude longer than that with the shared memory and about the same as that of syslogd. While we provide the UNIX socket interface for compatibility reasons, the use of the UNIX socket interface (e.g., syslog) should be minimized to achieve the best performance.

Fig. 11. Average Request Latency for a Single Client with Different Client Interfaces (chart omitted; latency in us)

# - Iterations               100     1000    5000    10000   50000
syslogd                      701.8   717.8   707.2   714.4   691.2
mqlogd (select, semaphore)   50.4    49.8    42.2    45.8    72.6
mqlogd (UNIX socket)         743.6   767.6   754.6   747.6   734.8

Speed-up over syslogd        100     1000    5000    10000   50000
mqlogd (select, semaphore)   14.1    14.9    17.7    16.2    9.9
mqlogd (UNIX socket)         1.0     1.0     1.0     1.0     1.0

VII. OPTIMIZATION

We have shown that mqlogd provides a logging service that is at least 10 times faster than syslogd by using user-space shared memory as the means of inter-process communication. In this section, we discuss a notification mechanism that can improve this speed-up even further.

Fig. 6. Minimum Request Latency for a Single Client (chart omitted; latency in us)

# - Iterations               100     1000    5000    10000   50000
syslogd                      575.2   574.6   560.8   478.2   302.4
mqlogd (select, semaphore)   23.0    19.0    19.0    18.8    18.6
mqlogd (signal, semaphore)   23.4    21.0    21.0    20.4    20.2
mqlogd (select, pthread)     122.4   10.0    10.0    10.0    10.0
mqlogd (signal, pthread)     16.4    10.0    10.2    10.0    10.0

Speed-up over syslogd        100     1000    5000    10000   50000
mqlogd (select, semaphore)   25.0    30.2    29.5    25.4    16.3
mqlogd (signal, semaphore)   24.6    27.4    26.7    23.4    15.0
mqlogd (select, pthread)     4.7     57.5    56.1    47.8    30.2
mqlogd (signal, pthread)     35.1    57.5    55.0    47.8    30.2

A. Deferred Notification Mechanism

In order to improve the processing time of logging requests, we introduce the deferred notification mechanism. With deferred notification, a logging client sends one notification for a batch of messages rather than sending a separate notification for each message. Deferred notification is expected to give a smaller latency than per-message notification, but it has a restriction in that it requires an application to determine an appropriate time to send notifications. This is easy for most applications in VIC because they are event driven, and the completion of an event handler is a very suitable time to send a notification.
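In an event-driven client, deferred notification reduces to setting a pending flag on each log call and issuing a single pipe write when the handler returns; a minimal sketch, with hypothetical names:

```c
/* Sketch of deferred notification: log calls only enqueue and mark that
 * messages are pending; one wakeup is sent per event-handler invocation. */
#include <unistd.h>

extern void enqueue_to_shared_memory(const char *msg); /* hypothetical */

static int pending;     /* messages queued since the last notification */
static int notify_fd;   /* write end of the daemon's pipe */

void log_deferred(const char *msg)
{
    enqueue_to_shared_memory(msg);  /* queue insert only, no wakeup */
    pending = 1;
}

/* Called by the event loop after each handler completes. */
void flush_notification(void)
{
    if (pending) {
        write(notify_fd, "x", 1);   /* one wakeup for the whole batch */
        pending = 0;
    }
}
```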
Fig. 7. Maximum Request Latency for a Single Client (chart omitted; latency in us)

# - Iterations               100     1000    5000    10000   50000
syslogd                      9129    11977   14366   24143   41100
mqlogd (select, semaphore)   1984    11814   12234   12251   15844
mqlogd (signal, semaphore)   1976    9796    12635   13813   15060
mqlogd (select, pthread)     696     7440    12223   14516   14980
mqlogd (signal, pthread)     789     8159    11850   11415   13927

Speed-up over syslogd        100     1000    5000    10000   50000
mqlogd (select, semaphore)   4.6     1.0     1.2     2.0     2.6
mqlogd (signal, semaphore)   4.6     1.2     1.1     1.7     2.7
mqlogd (select, pthread)     13.1    1.6     1.2     1.7     2.7
mqlogd (signal, pthread)     11.6    1.5     1.2     2.1     3.0

B. Host-Side Round-Trip Time Measurement

In this section, we compare the performance of the deferred notification mechanism by measuring the round-trip time from the host. To make this measurement, we execute a program called devcmd on the host. The devcmd program sends queries to the main firmware application, which may produce logging messages, and measures the total latency from the perspective of the host. Using devcmd is useful because the latency on the host side is the latency a user of VIC (i.e., a network driver) will experience.

The devcmd program allows a user to specify a command to issue, and each command results in a different action in firmware. Firmware processes the received command and logs messages when the minimum threshold for logging is set to the debug level. When the minimum threshold is set to the info level, firmware processes a received command without logging messages. The difference in the round-trip time between the two modes is the logging time for the specified command.

Fig. 8. Request Drop Rate for a Single Client (chart omitted; drop rate)

# - Iterations               100     1000    5000    10000   50000
mqlogd (select, semaphore)   0.0%    0.0%    0.0%    21.2%   80.0%
mqlogd (signal, semaphore)   0.0%    0.0%    0.0%    27.7%   83.1%
mqlogd (select, pthread)     0.0%    0.0%    0.0%    35.0%   86.7%
mqlogd (signal, pthread)     0.0%    0.0%    0.0%    39.3%   87.6%

The parameters we considered are summarized as follows:

• Notification mechanisms: write-and-select, deferred and syslogd

• Commands: macaddr (gets the MAC address) and capability (checks availability of a given command)

• Minimum threshold for logging: info and debug

Figure 12 shows the latency for the capability command. Figure 12(a) is the graph of total latency for all combinations of notification mechanisms and minimum threshold levels. As expected, the total latencies for the three notification mechanisms are very close when there is no logging. For total latencies with a minimum logging threshold of debug, we see differences among the notification mechanisms. Since the difference in latency between the debug level and the info level is the time to process logging messages, we can quantify the cost of each notification mechanism. Figure 12(b) shows that mqlogd with the deferred notification mechanism achieves more than a 2x speed-up compared to the one with write-and-select.
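The differencing behind these numbers can be expressed directly; run_devcmd() below is a hypothetical helper that returns the average host-side round-trip time for a command at a given logging threshold.

```c
/* Sketch of the host-side differencing: the logging cost of a command is
 * the round-trip time with logging (debug threshold) minus the round-trip
 * time without (info threshold). run_devcmd() is hypothetical. */
#include <stdio.h>

extern double run_devcmd(const char *cmd, const char *threshold); /* avg us */

int main(void)
{
    const char *cmds[] = { "capability", "macaddr" };
    for (int i = 0; i < 2; i++) {
        double with_log    = run_devcmd(cmds[i], "debug");
        double without_log = run_devcmd(cmds[i], "info");
        printf("%s: logging latency = %.1f us\n",
               cmds[i], with_log - without_log);
    }
    return 0;
}
```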
Figure 13 shows the latency for the macaddr command. We see that the ratio among the notification mechanisms is similar to that in Figure 12; mqlogd with the deferred notification mechanism achieves about a 2x speed-up compared to the one with write-and-select. The difference is that the absolute latency values for macaddr are smaller than those for capability because macaddr is a trivial command – it is essentially retrieving 4 bytes.

Fig. 9. Average Request Latency for Two Clients (chart omitted; latency in us)

# - Iterations               100     1000    5000    10000   50000
syslogd (1 client)           710.6   741.0   748.0   739.8   716.0
syslogd (2 clients)          1376.8  1433.0  1458.2  1400.4  1439.6
mqlogd (select, 1 client)    50.4    49.8    42.2    45.8    72.6
mqlogd (select, 2 clients)   139.4   73.6    72.2    81.8    206.4
mqlogd (signal, 1 client)    90.4    49.2    46.0    50.4    85.2
mqlogd (signal, 2 clients)   111.0   82.2    63.4    84.2    196.0

Ratio (2 clients / 1 client) 100     1000    5000    10000   50000
syslogd                      1.9     1.9     1.9     1.9     2.0
mqlogd (select)              2.8     1.5     1.7     1.8     2.8
mqlogd (signal)              1.2     1.7     1.4     1.7     2.3

Fig. 10. Request Drop Rate for Two Clients (chart omitted; drop rate)

# - Iterations               100     1000    5000    10000   50000
mqlogd (select, 1 client)    0.0%    0.0%    0.0%    21.2%   80.0%
mqlogd (select, 2 clients)   0.0%    0.0%    10.0%   45.7%   91.9%

[Fig. 12. Latency for the Capability Command: (a) Total Latency, (b) Logging Latency — bar charts (us) for write-and-select, deferred and syslogd at the debug and info thresholds; chart data omitted.]

[Fig. 13. Latency for the Macaddr Command: (a) Total Latency, (b) Logging Latency — bar charts (us) for write-and-select, deferred and syslogd at the debug and info thresholds; chart data omitted.]

VIII. CONCLUSION

We presented a logging system for embedded UNIX applications that provides faster logging times than syslogd, while providing application-level interface compatibility. It achieves almost two orders of magnitude speed-up in latency and an order of magnitude improvement in message throughput. This speed-up is attributed to the use of user-level shared memory as the means of inter-process message communication. It also supports the syslog interface, allowing system applications that use syslog to interoperate without additional modifications while achieving about the same level of performance as syslogd.

ACKNOWLEDGMENTS

I would like to thank my colleagues at Cisco, Jim Orosz, Brad Smith, Puneet Shenoy and James Dong, for fruitful discussions and for helping edit and proofread this paper.

REFERENCES

[1] C. Lonvick, "The BSD Syslog Protocol," RFC 3164, 2001. http://tools.ietf.org/html/rfc3164
[2] R. Gerhards, "The Syslog Protocol," RFC 5424, 2009. http://tools.ietf.org/html/rfc5424
[3] T. Qiu, Z. Ge, D. Pei, J. Wang, and J. Xu, "What happened in my network: mining network events from router syslogs," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), ACM, November 2010.
[4] K. Slavicek, J. Ledvinka, M. Javornik, and O. Dostal, "Mathematical processing of syslog messages from routers and switches," in Proceedings of the 4th International Conference on Information and Automation for Sustainability (ICIAFS 2008), IEEE, December 2008.
[5] M. Bing and C. Erickson, "Extending UNIX system logging with SHARP," in Proceedings of the 14th Systems Administration Conference (LISA 2000), USENIX, December 2000.
[6] M. Hutter, A. Szekely, and J. Wolkerstorfer, "Embedded system management using WBEM," in IFIP/IEEE International Symposium on Integrated Network Management (IM '09), IEEE, June 2009.
[7] T. Park and I. Ra, "Design and evaluation of a network forensic logging system," in Third International Conference on Convergence and Hybrid Information Technology (ICCIT '08), vol. 2, pp. 1125–1130, IEEE, 2008.
[8] M. Roesch et al., "Snort - lightweight intrusion detection for networks," in Proceedings of the 13th USENIX Conference on System Administration (LISA '99), pp. 229–238, Seattle, Washington, 1999.
[9] M. Schütte, "Implementation of IETF syslog protocols in the NetBSD syslogd," 2009.
[10] B. Scheidler, "syslog-ng." http://www.balabit.com/network-security/syslog-ng
[11] R. Gerhards, "rsyslog." http://www.rsyslog.com
[12] A. Tomono, M. Uehara, M. Murakami, and M. Yamagiwa, "A log management system for internal control," in International Conference on Network-Based Information Systems (NBIS '09), pp. 432–439, IEEE, 2009.
[13] F. Nikolaidis, L. Brarda, J.-C. Garnier, and N. Neufeld, "A universal logging system for LHCb online," in International Conference on Computing in High Energy and Nuclear Physics (CHEP 2010), October 2010.
[14] J. Garnier, "LHCb online log analysis and maintenance system," in Proceedings of the 13th International Conference on Accelerator and Large Experimental Physics Control Systems (ICALEPCS 2011), October 2011.
[15] D. Reed, "nsyslogd." http://coombs.anu.edu.au/~avalon/nsyslog.html
[16] S. Gai, T. Salli, and R. Andersson, Project California: a Data Center Virtualization Server - UCS (Unified Computing System). lulu.com, March 2009.
[17] Cisco Systems, "Cisco UCS M81KR Virtual Interface Card Data Sheet." http://www.cisco.com/en/US/prod/collateral/ps10265/ps10280/data_sheet_c78-525049.html