On distributing network intensive workloads in symmetric multiprocessing systems with emphasis on high performance

Ivan Voras
Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia

Abstract – In a technological landscape that is quickly moving toward dense multi-CPU and multi-core symmetric multiprocessing computer systems, where multithreading is an increasingly popular application design decision, it is important to choose a model for distributing tasks across multiple threads that will result in the best efficiency for the application and the system as a whole. This paper describes the creation, implementation and evaluation of various models of distributing tasks to CPU threads and investigates their characteristics for use in modern high-performance network servers. The results presented here comprise a roadmap of models for building multithreaded server applications for modern server hardware and advanced operating systems. The focus of this work and its practical result is a high-performance memory database designed for caching web application data. The primary goals of this database are speed and efficiency while running on contemporary SMP systems with several CPU cores (four and more). A secondary goal is support for simple metadata structures associated with cached data that can aid in efficient use of the cache. Due to these goals, some data structures and algorithms normally associated with this field of computing needed to be adapted to the new environment.

Keywords – Multitasking, multithreading, multiprocessing, SMP, SMT, threading models, concurrency control, processor scheduling, data structures, web cache, cache database, cache daemon, memory database

1. Introduction

Modern server applications cannot avoid taking note of recent developments in computer architecture that have resulted in the widespread presence of multi-core CPUs even in low-end equipment [1]. Different equipment manufacturers use different approaches [2-4], with the same end result of having multiple hardware processing units available to the software. This is not a completely recent development [5], but it has lately received a spotlight treatment because of the proliferation of multi-core CPUs in all environments. This has resulted in changed requirements for high-performance server software. The prevailing and recommended model for building high-performance network servers was for a long time based on asynchronous IO and event-driven architectures in various forms [6-8]. The shortcomings of this model grew as the number of CPU cores available in recent servers increased. Despite the complexities implicit in creating multithreaded programs, multithreading was inevitably accepted as it allows full use of the hardware's multithreading capabilities [9], though recently the pure multithreaded model has been combined with the multiprocessing model for increased performance or convenience of implementation [10]. This work studies the different ways tasks in a database-like network server can be assigned to CPU threads. To aid the study of the effects the choice of threading model has on the application and its performance, we have created an application, a cache database for web applications, which, besides serving the investigation of the subject matter, is also immediately usable in our web application projects.
We observe that, as the number of web applications created in scripting languages and rapid prototyping frameworks continues to grow, the importance of off-application cache engines is rapidly increasing [11]. Our own experience in building high-performance web applications has yielded an outline of a specification for a cache database that would be best suited for dynamic web applications written in an inherently stateless environment such as the PHP language and framework¹.

¹ Because of the stateless nature of HTTP, most languages and frameworks widely used to build web applications are stateless, with various workarounds such as HTTP cookies and server-side sessions that are more or less integrated into the frameworks.

2. Requirements for the multithreaded cache database

The basic form of a cache database is a store of key-value pairs (a dictionary), where both the key and the value are more or less opaque binary strings. The keys are treated as addresses by which the values are stored and accessed. Because of the simplicity of this model, it can be implemented efficiently and it is suitable for fast cache databases. Thus our first requirement is that the cache database is to implement key-value storage with opaque binary strings for keys and values.

However, the simplistic approach of key-value databases can be limiting and inflexible if complex data needs to be stored. In our experience, one of the most valuable features a cache database could have is the ability to group cached keys by some criterion, so that multiple keys belonging to the same group can be fetched and expired together, saving communication round-trips and removing grouping logic [11] from the application. To make the cache database more versatile, it is to implement simple metadata for stored records in the form of typed numeric tags.

Today's pervasiveness of multi-core CPUs and servers with multiple CPU sockets has resulted in a significantly different environment from what was common when single-core, single-CPU computers were dominant. Algorithms and structures that were efficient in the old environment are sometimes suboptimal in the new one. Thus, the cache database is to be optimized in its architecture and algorithms for multi-processor servers with symmetric multi-processing (SMP), with the goal of efficient operation on 4-core and larger systems.

Finally, to ensure portability and applicability of the cache database on multiple operating systems, its implementation is to follow the POSIX specification.

2.1. Rationale and discussion

Cache databases should be fast. To aid in the study of threading models under realistic usage patterns, the created cache database should be usable in production in web applications. The target web applications for the cache database are those written in relatively slow interpreted languages and dynamic frameworks such as PHP, Python and Ruby. A common use case for cache databases in such web applications is storing the results of complex calculations, both internal (such as generating HTML content from pre-defined templates) and external (such as data from SQL databases). The usage of dedicated cache databases pays off if the cost of storing (and especially retrieving) data to (and from) the cache database is lower than the cost of performing the operation that generates the original data. This cost might be in IO and memory allocation, but we observe that the more likely cost is in CPU load.
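To make the two storage requirements stated above concrete (opaque binary key-value pairs plus typed numeric tags), the following is a minimal sketch of how one cached record might be represented in C. It is illustrative only; the field and type names are assumptions and do not necessarily correspond to the actual mdcached sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of one metadata tag: a typed numeric value that can
 * be used for grouping and for simple range queries. */
struct cache_tag {
    uint32_t type;            /* tag type, e.g. an application-defined group */
    int64_t  data;            /* tag value, usable in numeric comparisons */
    struct cache_tag *next;   /* a record may carry several tags */
};

/* Hypothetical sketch of one cached record: both key and value are opaque
 * binary strings, with an optional list of tags attached. */
struct cache_record {
    void   *key;              /* opaque binary string used as the address */
    size_t  key_len;
    void   *value;            /* opaque binary payload being cached */
    size_t  value_len;
    struct cache_tag *tags;   /* NULL when the record is untagged */
};
```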
Caching often-generated data instead of generating it repeatedly (which is a common case in web applications, where the same content is presented to a large number of users) can obviously dramatically improve the application's performance. Though they are efficient, we have observed that pure key-value cache databases face a problem when the application needs to atomically retrieve or expire multiple records at once. While this can be solved by keeping track of groups of records in the application or (to a lesser extent) by folding group qualifiers into key names, we have observed that the added complexity of bookkeeping this information in a slow language (e.g. PHP) distracts from the simplicity of using the cache engine and can even lead to slowdowns. Thus we added the requirement for metadata in the form of numeric "tags", which implement a system of high-level types for the data and which can be used to query the data. Adding tags to key-value pairs enables the application to offload group operations to the cache database where the data is stored.

With the recent trend of increasing the number of CPU cores in computers, especially with multi-core CPUs [12], we decided the cache database must be designed from the start to make use of multiprocessing capabilities. The goal is to explore efficient algorithms that should be employed to achieve the best performance on servers with 4 and more CPU cores. Due to the read-mostly nature of its target implementation environment, the cache database should never allow readers (clients that only fetch data) to block other readers.

Multi-platform usability is non-critical; it is an added bonus that will make the result of the project usable on other platforms and increase the number of possible implementations.

3. Implementation overview

We have implemented the cache database in the C programming language, using only the standard or widespread library functions present in modern Unix-like systems, to increase its portability across POSIX-like operating systems. POSIX Threads (pthreads) were used to achieve concurrency on multi-processor hardware. The database is implemented as a Unix server daemon, i.e. a background process that serves network requests. We have named this database mdcached (the name comes from multi-domain cache database daemon). Its architecture can be roughly divided into three modules: the network interface, the data processing module and the data storage module. The network interface and the data processing modules can be executed in an arbitrary number of CPU threads. We have also implemented special support for "threadless" operation, in which the network interface code calls the data processing code as a function call in the same thread, eliminating multi-threading synchronization overheads. The following sections describe the implementation details of each of the modules.

3.1. Network interface

The network interface module is responsible for accepting and processing commands from clients. It uses the standard BSD "sockets" API in non-blocking mode. By default, the daemon creates both a "Local Unix" socket and a TCP socket. The network interface module handles all incoming data asynchronously, dispatching complete requests to data processing threads. The network protocol (for both "Local Unix" and TCP connections) is binary, designed to minimize protocol parsing and the number of system calls necessary to process a request.
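The paper does not specify the wire format, but a binary protocol of the kind described above typically starts each request with a small fixed-size header carrying an operation code and explicit lengths, so the server can read a whole request with very few system calls and without any text parsing. The following is a purely hypothetical sketch of such a header; all field names and sizes are assumptions, not the actual mdcached protocol.

```c
#include <stdint.h>

/* Hypothetical fixed-size request header for a length-prefixed binary
 * protocol.  The key, value and tag entries would follow the header in
 * the byte stream, so their sizes are known before they are read. */
struct request_header {
    uint8_t  opcode;      /* e.g. GET, PUT, DELETE or a tag query */
    uint8_t  flags;       /* reserved for per-request options */
    uint16_t tag_count;   /* number of metadata tags following the value */
    uint32_t key_len;     /* length of the key that follows the header */
    uint32_t value_len;   /* length of the value that follows the key */
} __attribute__((packed));
```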
The number of threads available for executing the network interface module is adjustable via command-line arguments.

3.2. Data processing

The data processing module is responsible for receiving prepared data from the network module, interpreting the commands and options received from the clients, executing the specified operations and preparing the response data, which is handed back to the network module. The data processing threads communicate with the network interface using a protected (thread-safe) job queue. The number of threads is adjustable via command-line arguments.

3.3. Data storage

Data storage is, in the narrowest sense, the most important part of the cache database with regards to its performance and capabilities. There are two large data structures implemented in the cache database. The first is a hash table of static size whose elements (buckets) are roots of red-black trees [17] which contain key-value pairs. These elements are protected by reader-writer locks (also called "shared-exclusive" locks). The hash table is populated by hashing the key portion of the pair. This mixing of data structures ensures a very high level of concurrency in accessing the data. Reader-writer locks per hash bucket allow for the highly desirable behaviour that readers (clients that only read data) never block other readers, greatly increasing performance for typical cache usage. The intention behind the design of this data structure was that, using a reasonably well distributed hash function, high concurrency of writers can also be achieved (up to the number of hash buckets). An illustration of the data storage organisation is presented in Fig. 1.

Fig. 1. Illustration of the key-value data structure used as the primary data storage pool

The second big data structure indexes the metadata tags for key-value records. It is a red-black tree of tag types whose elements are again red-black trees containing tag data (also an integer value) with pointers to the key-value records to which they are attached. The purpose of this organization is to enable queries on the data tags of the form "find all records of the given tag type" and "find all records of the given type whose tag data conforms to a simple numerical comparison operation (less than, greater than)". The tree of tag types is protected by a reader-writer lock and each of the tag data trees is protected by its own reader-writer lock, as illustrated in Fig. 2. Again, readers never block other readers, and locking for write operations is localised in a way that allows concurrent access for tag queries.

Fig. 2. Illustration of the tag tree structure

Records without metadata tags don't have any influence on or connection with the tag trees.

3.4. Rationale and discussion

We elected to implement both asynchronous network access and multithreading to achieve the maximum performance for the cache database [13], with the option of changing the number of threads dedicated to each module. In this way we can achieve hybrid models such as the combination of the pure multi-process architecture (MP) and the event-driven architecture, which is sometimes called asymmetric multi-process event-driven (AMPED) [14]. Our implementation tries hard to avoid unnecessary system calls, context switches and memory reallocation [15], [16]. The implementation avoids most protocol parsing overheads by using a binary protocol which includes data structure sizes and count fields in command packet headers.
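The data storage organisation from Section 3.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual mdcached source: the tree operations are only declared, the hash function is a placeholder, and error handling is omitted. What it shows is the locking discipline, namely that readers take shared locks and writers exclusively lock at most one bucket.

```c
#include <pthread.h>
#include <stddef.h>

#define N_BUCKETS 256                  /* fixed at program start */

struct node;                           /* node of a balanced tree holding one record */

/* Hypothetical tree helpers, elided here; a real implementation would use
 * the red-black tree operations mentioned in the text [17]. */
struct node *tree_find(struct node *root, const void *key, size_t len);
void tree_insert(struct node **root, const void *key, size_t klen,
                 const void *val, size_t vlen);

struct bucket {
    pthread_rwlock_t lock;             /* shared for readers, exclusive for writers */
    struct node     *tree_root;        /* records whose keys hash into this bucket */
};

static struct bucket table[N_BUCKETS];

void cache_init(void)
{
    for (int i = 0; i < N_BUCKETS; i++) {
        pthread_rwlock_init(&table[i].lock, NULL);
        table[i].tree_root = NULL;
    }
}

/* Placeholder hash; any reasonably well distributed function will do. */
static unsigned hash_key(const unsigned char *key, size_t len)
{
    unsigned h = 2166136261u;          /* FNV-1a style mixing */
    for (size_t i = 0; i < len; i++)
        h = (h ^ key[i]) * 16777619u;
    return h % N_BUCKETS;
}

/* Readers take a shared lock, so they never block one another.  A real
 * implementation would copy the value out while still holding the lock. */
int cache_lookup(const void *key, size_t len)
{
    struct bucket *b = &table[hash_key(key, len)];
    pthread_rwlock_rdlock(&b->lock);
    int found = tree_find(b->tree_root, key, len) != NULL;
    pthread_rwlock_unlock(&b->lock);
    return found;
}

/* Writers exclusively lock at most one bucket, leaving all others free. */
void cache_store(const void *key, size_t klen, const void *val, size_t vlen)
{
    struct bucket *b = &table[hash_key(key, klen)];
    pthread_rwlock_wrlock(&b->lock);
    tree_insert(&b->tree_root, key, klen, val, vlen);
    pthread_rwlock_unlock(&b->lock);
}
```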
Since the number of clients in the intended usage (a web cache database for dynamic web applications) is relatively low (in the order of hundreds to thousands), we have avoided the explicit connection scheduling described in [18], [20]. We have opted for a thread-pool design (in which a fixed number of worker threads perform work on behalf of many independent connections) to allow the implementation of different multithreading models, as well as to allow the administrator to tune the number of worker threads (via command-line arguments) and thus the acceptable CPU load on the server. We have also implemented a special "threadless" mode in which there are no worker threads, but the network code makes a direct function call into the protocol parser, effectively making the daemon single process event driven (SPED). This mode cannot make use of multiple CPUs, but is included for comparison with the other models.

As discussed in [21] and [22], the use of multiprocessing and the relatively high standards we have set for concurrency of the requests have resulted in a need for a careful choice of the structures and algorithms used for data storage. Traditional structures and algorithms used in caches, such as LRU and splay trees, are not directly usable in high-concurrency environments. LRU and its derivatives need to maintain a global queue of objects, the maintenance of which needs to happen on every read access to a tracked object, which effectively serializes read operations. Splay trees radically change with every access and thus need to be exclusively locked for every access, serializing both readers and writers much more severely than LRU. The design of the metadata tag structures was driven by the same concerns, but also by the need to make certain query operations efficient (ranged comparison and grouping, i.e. less-than or greater-than). We have decided to allow the flexibility of queries on both the type and the value parts of metadata tags, and thus we implemented binary trees, which are effective for the purpose of ordering data and accessing data in a certain order.

In order to maximize concurrency (minimize exclusive locking) and to limit the in-memory working set used during transactions (as discussed in [16]), we have elected to use a combination of data structures, specifically a hash table and binary search trees, for the principal data storage structure. Each bucket of the hash table contains one binary search tree holding the elements that hash to the bucket and a shared-exclusive locking object (pthread rwlock), thus setting a hard limit to the granularity of concurrency: write operations (and other operations requiring exclusive access to data) exclusively lock at most one bucket (one binary tree). Read operations acquire shared locks and do not block one another. The hash table is the principal source of writer concurrency. Given a uniform distribution of the hash function and significantly more hash buckets than there are worker threads (e.g. 256 vs. 4), the probability of threads blocking on data access is negligibly small, which is confirmed by our simulations. To increase overall performance and reduce the total number of locks, the size of the hash table is determined and fixed at program start and the table itself is not protected by locks. The garbage collector (which is implemented naively instead of as an LRU-like mechanism) operates when the exclusive lock is already acquired (probabilistically, during write operations) and operates per hash bucket.
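As a back-of-envelope illustration of the claim above (this calculation is ours and is not taken from the paper's simulations), assume that T worker threads are simultaneously performing write operations and that the hash function spreads keys uniformly over B buckets. The probability that all writers land in distinct buckets, so that no writer blocks another, is

```latex
\[
P_{\text{no collision}}(T, B) \;=\; \prod_{i=0}^{T-1}\left(1 - \frac{i}{B}\right),
\qquad
P_{\text{no collision}}(4, 256) \;=\; 1 \cdot \frac{255}{256} \cdot \frac{254}{256} \cdot \frac{253}{256} \;\approx\; 0.977 .
\]
```

So even if all four worker threads in the quoted example were writing at the same instant, the chance that any two of them contend for the same bucket lock would be only about 2.3%, and in the read-mostly target workload the effective contention is lower still.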
The consequence of operating per hash bucket is lower flexibility and accuracy in keeping track of the total size of allocated memory.

3.5. Simulations

A GPSS simulation was created to aid in understanding the performance and behaviour of the key-value store (the hash table containing binary search trees). The simulation models the behaviour of the system with a tunable number of worker threads and hash buckets. The simulated subsystems are: a task generator, worker threads, lock acquisition and release according to pthread rwlock semantics (with increased writer priority to avoid writer starvation), and the hash buckets. The task generator attempts to saturate the system. The timings used in the model are approximate and thus we are only interested in trends and proportions in the results.

Fig. 3, 4 and 5 illustrate the percentage of "fast" lock acquisitions in the simulations, where "fast" is either an uncontested lock acquisition or a contested acquisition where the total time spent waiting on the lock is deemed insignificant (less than 5% of the average time spent holding the lock, i.e. processing the task). The graphs in Fig. 3-5 are plotted with the number of hash buckets on the X-axis and the percentage of fast lock acquisitions on the Y-axis. The results cover the timings for both shared and exclusive locks, obtained during the same simulations. The simulations were run for two cases: with 64 worker threads (which could today realistically be run on consumer-grade products such as the UltraSPARC T2 from Sun Microsystems [4]) and with 8 worker threads (which can be run on readily available servers on the common x86 platform [1], [12]). The individual figures show the simulated system's behaviour with a varied ratio of reader to writer tasks.

Fig. 3. Cache behaviour with 90% readers and 10% writers (percentage of fast lock acquisitions vs. number of hash table buckets, 4 to 256, for 64 and 8 worker threads, shared and exclusive locks)

Fig. 4. Cache behaviour with 80% readers and 20% writers (same axes and series as Fig. 3)

Fig. 5. Cache behaviour with 50% readers and 50% writers (same axes and series as Fig. 3)

As predicted by the model, Fig. 3 shows how a high ratio of hash buckets to threads and a high ratio of readers to writers make almost all lock acquisitions fast. Since the number of hash table buckets can always be made higher than the number of available CPU cores, and the intended implementation environment for the cache database emphasises read transactions over write transactions, we expect that this scenario will be the most important one in the cache database's usage.

Trends in the simulated system continue in Fig. 4, with the expected effects of having a larger number of exclusive locks in the system. We observe that this load marks the boundary where having the same number of hash buckets and worker threads makes 90% of shared lock acquisitions fast.

In the situation presented in Fig. 5, the abundance of exclusive locks in the system introduces significant increases in the time spent waiting for hash bucket locks (as expected). Both kinds of locks are acquired with noticeable delays and the number of fast lock acquisitions falls appropriately. However, this does not present a significant problem if the number of hash buckets is high enough.

The simulation results emphasise the lock contention, showing that equal relative performance can be achieved with the same ratio of worker threads to hash table buckets, and paint an optimistic picture when the number of hash buckets is high. From these results, we have set the default number of buckets used by the program to 256, as that is clearly adequate for today's hardware.

The graphs do not show the number of tasks dropped by the worker threads due to timeouts (simulated by the length of the queue in the task generator). Both types of simulated systems were subjected to the same load, and the length of the task queue in systems with 8 worker threads was from 1.5 to 13 times as large as the same length in systems with 64 worker threads, leading to the conclusion that the number of timeouts with 8 worker threads would be significantly higher than with 64 worker threads. This, coupled with simulated inefficiencies (additional delays) when lock contention between worker threads is high, can have the effect of favouring systems with a lower number of worker threads (lock acquisition is faster because the contention is lower, but on the other hand less actual work is being done).

4. Multithreading models

In this work the term "multithreading model" refers to a distinct way the tasks in an IO-driven network server are distributed to CPU threads. It concerns itself with the general notion of threads dedicated to certain classes of tasks, rather than with the quantity of threads or the tasks present at any one point (except for some special or degenerate cases like 0 or 1). We observe that in a typical network server (and in particular in mdcached) there are three classes of tasks that lend themselves to being parallelized or delegated to different CPU threads:

1. Accepting new client connections and closing or garbage collecting existing connections.
2. Network communication with accepted client connections. This class corresponds to the network interface module of the cache database.
3. Payload work that the server does on behalf of the client. This class corresponds to the data processing module.

Using these three classes of tasks, there are several models recognized in practice and literature [13]. At one end of the spectrum is the single process event driven (SPED) model, where all three classes of tasks are always performed by a single thread (or a process), using kernel-provided event notification mechanisms to manage multiple clients at the same time, but performing each individual task sequentially. On the opposite side of SPED is the staged event-driven architecture (SEDA), where every task class is implemented as a thread or a thread pool, and a response to a request made by the client might involve multiple different threads. In our implementation, SEDA is realised as the connection-accepting thread, threads running the network interface module code (simply, "network threads") and data processing threads. Between the extremes are the asymmetric multi-process event-driven (AMPED) model, where the network operations are processed in a way similar to SPED (i.e. in a single thread) while the payload work (in this case data processing) is delegated to separate threads to avoid running long-running operations directly in the network loop, and the symmetric multi-process event driven (SYMPED) model, which uses multiple SPED-like threads.
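To make the SPED end of this spectrum concrete, the following is a minimal sketch of a single-threaded event loop using the kqueue facility [23] that the server relies on for network IO. It is an illustration only, not mdcached source: the request handler is merely declared and error handling is omitted.

```c
/* Minimal single-threaded (SPED-style) event loop sketch using kqueue,
 * as available on FreeBSD [23].  Connection acceptance, network IO and
 * payload work all run sequentially in this one thread. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <unistd.h>

void handle_request(int fd);   /* hypothetical: read request, do payload work, reply */

void sped_loop(int listen_fd)
{
    int kq = kqueue();
    struct kevent ch, evs[64];

    /* Watch the listening socket for incoming connections. */
    EV_SET(&ch, listen_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
    kevent(kq, &ch, 1, NULL, 0, NULL);

    for (;;) {
        int n = kevent(kq, NULL, 0, evs, 64, NULL);   /* wait for readiness events */
        for (int i = 0; i < n; i++) {
            int fd = (int)evs[i].ident;
            if (fd == listen_fd) {
                /* Task class 1: accept a new client and start watching it. */
                int cfd = accept(listen_fd, NULL, NULL);
                EV_SET(&ch, cfd, EVFILT_READ, EV_ADD, 0, 0, NULL);
                kevent(kq, &ch, 1, NULL, 0, NULL);
            } else if (evs[i].flags & EV_EOF) {
                close(fd);                            /* client went away */
            } else {
                /* Task classes 2 and 3: read the request and do the payload
                 * work inline, sequentially, in this same thread. */
                handle_request(fd);
            }
        }
    }
}
```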
A special class of AMPED is the thread-per-connection model (usually called simply the multithreaded or multiprocessing – MT or MP – model), where connection acceptance and garbage collecting are implemented in a single thread which delegates both the network communication and the payload work to a single thread for each connected client (this model will not be specially investigated here as there is a large body of work covering it). Each of the multithreaded models can also be implemented as a multiprocessing model, and in fact some of them are better known in this variant (the process- or thread-per-connection model is well known and often used in Unix environments). In this work we limit ourselves only to the multithreaded variants.

Our experience with [11] led us to be particularly interested in certain edge cases, which we investigate in this work. In particular, we are interested in reducing the unwanted effects of inter-thread communication and limiting context switching between threads. We have implemented one important variation of the SEDA model where the number of worker threads exactly equals the number of network threads and each network thread always communicates with the same worker thread, avoiding some of the inter-thread locking of communication queues. We call this model SEDA-S (for symmetric).

4.1. Multithreading models and the test implementation

We have evaluated the described multithreading models during and after the implementation of the cache database. Since it uses aggressive techniques to achieve highly parallel operation and concurrent access to shared data, the biggest influence on the overall performance of this server is observed to be the choice of the multithreading model used to schedule the workload. This behaviour has made it a great platform for experimenting with and evaluating multithreading models.

The classes of tasks described earlier are implemented as follows in the mdcached server:

1. New connection handling and connection expiring is always allocated to a single thread. The reasons for this are that connection acceptance is inherently not parallelizable and that previous research [13] has indicated that this aspect of network servers cannot be significantly improved.
2. The code path implementing communication between the server and the clients can be run in one or more threads. Data asynchronously received from the remote client is only checked for validity in this code path and passed to the next code path for further processing.
3. The data processing code path, which implements the payload work, can be run in one or more threads. The generated response data is immediately passed to the network stack for asynchronous sending. If the response is too large to be accepted by the network stack at once, the unsent data is handled by the network code path.

The described division into code paths allows us to implement the SPED model by executing the network and the payload code paths inside the connection handling loop, the SEDA model by executing all three code paths in separate threads (even several instances of the same code path in multiple threads), the AMPED model by executing single instances of the connection handling and network threads and multiple payload threads, and the SYMPED model by executing a single connection handling thread and multiple network threads which directly call into the payload handling code path (no separate payload threads).
Table 1 shows the mapping between threads and code paths in mdcached used to implement the described multithreading models.

Model    | Connection acceptance | Network communication | Payload work
---------+-----------------------+-----------------------+---------------------
SPED     | 1 thread              | in connection thread  | in connection thread
SEDA     | 1 thread              | N1 threads            | N2 threads
SEDA-S   | 1 thread              | N threads             | N threads
AMPED    | 1 thread              | 1 thread              | N threads
SYMPED   | 1 thread              | N threads             | in network thread

Table 1. Mapping of code paths and threads in mdcached to multithreading models

Table 1 also describes the exact models that are used in this research.

4.2. Operational characteristics of mdcached

The communication between the various threads within mdcached is implemented using asynchronous queues in which job descriptors (describing work to be executed) are exchanged. Within a class of threads (connection, network, data processing threads) there is no further division of priority or preference. Jobs that are passed from one thread class to another (e.g. a client request from a network thread to a payload thread) are enqueued on the receiving threads' event queues in a round-robin fashion. Network IO operations are always performed by responding to events (event-driven architecture) received from the operating system kernel. The network protocol is binary, optimized for performance and for the avoidance of system (kernel) calls in the data-receiving code path. The primary concern in the design of the protocol and the database operation was to achieve as many transactions per second as possible. In this work we evaluate the multithreading models implemented in the cache database server by looking at the number of client-server transactions per second processed by the server.

Since mdcached is a pure memory database server, it avoids certain classes of behaviour present in servers that interact with a larger environment. In particular, it avoids waiting for disk IO, which would have been a concern for a disk-based database, a web server or a similar application. We observe that the mdcached server operation is highly dependent on system characteristics such as memory bandwidth and operating system performance in the areas of system calls, inter-process communication and general context switching. To ensure maximum performance we use advanced operating system support such as kqueues [23] for network IO and techniques such as lock coalescing (locking inter-thread communication queues for multiple enqueue/dequeue operations instead of for a single operation) to avoid lock contention and context switches.

5. Methods of evaluation of multithreading models in the cache database

We evaluate the multithreading models by configuring mdcached for each of them individually. Each model has a unique configuration mapping code paths to threads that is applied to the mdcached server at start-up and which remains in effect while the process is active. As well as the threading model, the number of threads is also varied (where that makes sense for the model under evaluation) to present a clearer picture of its performance characteristics. The server is accessed by a specially created benchmark client that always uses the same multithreading model (SYMPED) and number of threads but varies the number of simultaneous client connections it establishes. Each client-server transaction involves the client sending a request to the server and the server sending a response to the request.
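As an illustration of what is being counted, the following is a minimal sketch of a client-side transaction loop over a Unix stream socket. It is not the actual benchmark client; the request framing helpers are hypothetical and error handling is reduced to early exits.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helpers: build a request into buf and return its length;
 * read one complete framed response from the socket (0 on success). */
size_t build_get_request(uint8_t *buf, const char *key);
int    read_full_response(int fd);

long run_transactions(const char *socket_path, const char *key, long count)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un sa;
    memset(&sa, 0, sizeof(sa));
    sa.sun_family = AF_UNIX;
    strncpy(sa.sun_path, socket_path, sizeof(sa.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0)
        return -1;

    uint8_t buf[4096];
    long done = 0;
    for (long i = 0; i < count; i++) {
        size_t len = build_get_request(buf, key);
        if (write(fd, buf, len) != (ssize_t)len)   /* request: one half of the transaction */
            break;
        if (read_full_response(fd) != 0)           /* response: the other half */
            break;
        done++;                                    /* one completed transaction */
    }
    close(fd);
    return done;   /* dividing by elapsed time gives transactions per second */
}
```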
For testing purposes the data set used is deliberately small, consisting of 30,000 records with record sizes ranging from 10 bytes to 10³ bytes in a logarithmic distribution. To emphasise the goal of maximizing the number of transactions per second and to eliminate external influences such as network latency and network hardware issues, both the server and the client are run on the same machine, communicating over Unix sockets. The test server is based on Intel's 5000X server platform with two 4-core Xeon 5405 CPUs (for a total of 8 CPU cores in the system). The benchmark client always starts 4 threads, meaning it can use at most 4 CPU cores. We have observed that the best results are achieved when the number of server threads and the number of benchmark client threads together approximates the total number of CPU cores available in the system. The operating system used on the server is the development version of FreeBSD, 8-CURRENT, retrieved on September 15, 2008. The operating system is fully multithreaded and supports parallel operation of applications and of its network stack.

During test runs of the benchmarks we have observed that the results vary between 100,000 and 500,000 transactions per second, with the standard deviation usually being less than 5,000 transactions per second. All tests were performed with previous "warm-up" runs, and the reported results are averages of 5 runs for each configuration of the benchmark client and the server. We have determined that for the purpose of this work it is sufficient to report results rounded to the nearest 5,000 transactions per second, which is enough to show the differences between the multithreading models. We have also observed that the most interesting characteristics of the client-server system as a whole are visible when the number of clients is between 40 and 120. This is where most models reach their peak performance.

6. Evaluation Results

We have collected a large body of results with different combinations of multithreading models and numbers of clients. Presented here are comparisons within the multithreading models, in order of increasing performance.

The SPED model, presented in Fig. 6, quickly reaches its peak performance, after which it does not scale further with the increasing number of clients. The performance achieved with the SPED model can be used as a baseline result, achieved without the use of multithreading or multiple CPU cores. This result is the best that can be achieved with an event-driven architecture.

Fig. 6. Performance of the SPED multithreading model, depending on the number of simultaneous clients (the vertical axis in Fig. 6-12 shows thousands of transactions per second)

Semantically on the opposite side, but with surprisingly little performance benefit in our implementation, is the SEDA model. Since some variations of the SEDA model overlap with other models presented here, Fig. 7 shows only the case where there are 2 network threads receiving requests from the clients and passing them to 4 payload threads.

Fig. 7. Performance of the SEDA multithreading model with two network threads and four payload threads, depending on the number of simultaneous clients

Surprisingly, the SEDA model in our implementation achieves performance only comparable to the SPED model, despite the fact that it should be able to use much more CPU resources than the SPED model. Probing the system during the benchmark runs suggests that the majority of time is spent in the kernel, managing inter-thread communication and context switches, which hinders the global performance of the client-server system. Further investigation is needed to reveal whether a better distribution of the code paths to CPU threads would yield better performance, but a conclusion suggests itself that, for this implementation, the chosen SEDA model presents a sort of worst-case scenario with regards to efficiency, tying up multiple CPU cores but not achieving the expected performance.

The performance characteristics of the AMPED multithreading model are presented in Fig. 8. For AMPED we present two basic variations of the model: one without a separate network thread, running SPED-like network communication in the main network loop (the connection loop), and one with a single network thread and a varying number of payload threads. System probing during benchmark runs of this multithreading model reveals that the single network thread is saturated with the work it needs to process, and increasing the number of worker threads achieves little except increasing the number of context switches and the amount of inter-thread communication. The best performance for this model was achieved with one network thread and two payload threads.

Fig. 8. Performance of the AMPED multithreading model with zero or one network threads and between 1 and 4 payload threads, depending on the number of simultaneous clients

In the increasing order of achieved performance, the SEDA-S model, whose performance is presented in Fig. 9, yields results slightly better than the AMPED model. This model uses multiple network and payload threads, with the number of network threads equalling the number of payload threads, and with each network thread communicating with a single payload thread. The performance of SEDA-S is comparatively high, scaling up with the number of threads up to 3 network threads and 3 worker threads, after which it falls as the number of threads increases and the adverse effects of inter-thread communication and context switches grow.

Fig. 9. Performance of the SEDA-S multithreading model with the number of network threads equalling the number of payload threads, from 1 to 4 threads each, depending on the number of simultaneous clients

The highest performance is achieved with the SYMPED model, whose benchmark results are presented in Fig. 10. The best result achieved with the SYMPED model is 54% better than the next best result, indicating that it is a qualitatively better model for the purpose of this work. Investigation of the system behaviour during the benchmark runs suggests that the primary reason for this is that there is no inter-thread communication overhead in passing jobs between the network thread(s) and the payload thread(s). In other words, all communication and data structures belonging to a single network request are handled within a single CPU thread.

Fig. 10. Performance of the SYMPED multithreading model with 1 to 5 network threads, depending on the number of simultaneous clients

Additionally, the SYMPED model exhibits very regular scaling as the number of threads allocated to the server increases, as presented in Fig. 11. The performance increases linearly with the number of server threads, up to 4 threads. We believe that this limitation is a consequence of running both the server and the client on the same computer system and that the performance would continue to rise if the server and the clients were separated onto their own systems. Investigation into this possibility will be a part of our future research. Note that the presented results involve bidirectional communication, so that the 450,000 achieved transactions per second translate into 900,000 individual network operations per second.

Fig. 11. Scalability of the SYMPED model depending on the number of server threads (1 to 5 network threads)

Finally, we present a comparison between the presented models in Fig. 12, choosing the best performance from each model irrespective of the number of threads used in the model.

Fig. 12. Comparison of the multithreading models at their peak performance, depending on the number of simultaneous clients

The SYMPED model exhibits the best scalability and the best overall performance in the implemented cache database. This result will be used in future studies of network server optimization for the purpose of investigating network intensive workloads. We wish to emphasize that the achieved maximum numbers of transactions per second presented in this work come from both the client and the server processes running on the same system, and that in the ideal case, if the operating system and the network conditions allow, the potential absolute maximum could be twice as many transactions per second, close to 1,000,000 transactions/s on the same hardware. Performance measurements involving the client and the server on separate computers have shown that there currently exists a significant bottleneck in FreeBSD's TCP network stack that hinders SMP operation. Future work might include improving it.

7. Conclusion

We have investigated the performance characteristics of five multithreading models for IO-driven network applications. The models differ in the ways they assign classes of tasks commonly found in network applications to CPU threads. These models are: single-process event-driven (SPED), staged event-driven architecture (SEDA), asymmetric multi-process event-driven (AMPED), symmetric multi-process event-driven (SYMPED) and a variation of the SEDA model we call SEDA-S. We have evaluated the multithreading models as applied to a memory database we created, called mdcached, which allows selecting the configuration of threads and code paths at program start. For the purpose of this work we have configured the server and the benchmark client to maximize the number of transactions per second.
Our investigation results in quantifying the differences between the multithreading models. Their relative performance is apparent in their performance curves and in the maximum performance that can be achieved with each of them. The best performing among the evaluated multithreading models was SYMPED, which, for the implementation on which it was tested, offers the best scalability and the best overall performance result. We are surprised by the low performance of the SEDA model and will investigate it in future work. Another candidate for future research is the limits of scalability of the SYMPED model on more complex databases and server hardware.

The described work was done mostly in the years 2007 and 2008, during the author's postgraduate studies at the Faculty of Electrical Engineering and Computing of the University of Zagreb, Croatia.

8. References

[1] Intel Corp., "Intel Multi-Core Technology". [Online] Available: http://www.intel.com/multicore/index.htm [Accessed: September 15, 2008]
[2] Intel Corp., "Teraflops Research Chip". [Online] Available: http://techresearch.intel.com/articles/Tera-Scale/1449.htm [Accessed: September 15, 2008]
[3] Sun Microsystems Inc., "Multithreaded Application Acceleration with Chip Multithreading (CMT), Multicore/Multithread UltraSPARC Processors", white paper. [Online] Available: http://www.sun.com/products/microelectronics/cmt_wp.pdf [Accessed: September 15, 2008]
[4] Sun Microsystems Inc., "OpenSPARC T2 Core Microarchitecture Specification". [Online] Available: http://opensparc.sunsource.net/specs/UST1UASuppl-current-draft-P-EXT.pdf, Jan 2007.
[5] G. V. Wilson, "The History of the Development of Parallel Computing". [Online] Available: http://ei.cs.vt.edu/~history/Parallel.html [Accessed: September 15, 2008]
[6] F. Dabek, N. Zeldovich, F. Kaashoek, D. Mazieres, R. Morris, "Event-driven programming for robust software", in Proceedings of the 10th ACM SIGOPS European Workshop, 2002.
[7] J. K. Ousterhout, "Why Threads Are a Bad Idea (for most purposes)", presentation given at the 1996 USENIX Technical Conference. [Online] Available: http://home.pacbell.net/ouster/threads.ppt [Accessed: September 16, 2008]
[8] G. Banga, J. C. Mogul, P. Druschel, "A scalable and explicit event delivery mechanism for UNIX", in Proceedings of the 1999 USENIX Annual Technical Conference, USENIX Association, Berkeley, CA, 1999.
[9] R. von Behren, J. Condit, E. Brewer, "Why Events Are a Bad Idea (for high-concurrency servers)", in Proceedings of the 9th Conference on Hot Topics in Operating Systems (HotOS IX), Hawaii, 2003.
[10] P. Li and S. Zdancewic, "Combining events and threads for scalable network services: implementation and evaluation of monadic, application-level concurrency primitives", SIGPLAN Notices 42, 2007.
[11] I. Voras, M. Žagar, "Web-enabling cache daemon for complex data", in Proceedings of the ITI 2008 30th International Conference on Information Technology Interfaces, Zagreb, 2008.
[12] AMD, "Multi-core processors – the next evolution in computing", white paper, 2005.
[13] D. Pariag, T. Brecht, A. Harji, P. Buhr, A. Shukla, and D. R. Cheriton, "Comparing the performance of web server architectures", SIGOPS Operating Systems Review 41, 3, 2007.
[14] T. Brecht, D. Pariag, and L. Gammo, "accept()able strategies for improving web server performance", in Proceedings of the USENIX Annual Technical Conference, 2004.
[15] A. Ailamaki, D. J. DeWitt, M. D. Hill, D. A. Wood, "DBMSs on a Modern Processor: Where Does Time Go?", in Proceedings of the 25th International Conference on Very Large Data Bases, 1999.
[16] L. K. McDowell, S. J. Eggers and S. D. Gribble, "Improving server software support for simultaneous multithreaded processors", in Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2003.
[17] R. Bayer, "Symmetric Binary B-Trees: Data Structures and Maintenance Algorithms", Acta Informatica, 1:290-306, 1972.
[18] M. Crovella, R. Frangioso, and M. Harchol-Balter, "Connection scheduling in web servers", in Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, 1999.
[19] M. Welsh, D. E. Culler, and E. A. Brewer, "SEDA: An architecture for well-conditioned, scalable Internet services", in Proceedings of the 18th ACM Symposium on Operating Systems Principles, Banff, 2001.
[20] Y. Ruan and V. Pai, "The origins of network server latency & the myth of connection scheduling", SIGMETRICS Performance Evaluation Review 32, 1, 2004.
[21] J. Zhou, J. Cieslewicz, K. Ross, and M. Shah, "Improving database performance on simultaneous multithreading processors", in Proceedings of the 31st International Conference on Very Large Data Bases, 2005.
[22] P. Garcia and H. F. Korth, "Database hash-join algorithms on multithreaded computer architectures", in Proceedings of the 3rd Conference on Computing Frontiers, 2006.
[23] J. Lemon, "Kqueue: A generic and scalable event notification facility", presented at BSDCon, Ottawa, Canada, 2000. [Online] Available: http://people.freebsd.org/~jlemon/papers/kqueue.pdf [Accessed: September 29, 2008]

Contact address:
Ivan Voras
University of Zagreb
Faculty of Electrical Engineering and Computing
Unska 3
Zagreb, Croatia
e-mail: ivan.voras@fer.hr

I. Voras was born in Slavonski Brod, Croatia. He received a Dipl.ing. degree in Computer Engineering (2006) from the Faculty of Electrical Engineering and Computing (FER) at the University of Zagreb, Croatia. Since 2006 he has been employed by the Faculty as an Internet Services Architect and is a graduate (PhD) student at the same Faculty, where he has participated in research projects at the Department of Control and Computer Engineering. His current research interests are in the fields of distributed systems and network communications, with a special interest in performance optimizations. He is an active member of several open source projects and a regular contributor to the FreeBSD project, an international project developing an open-source Unix-like operating system.