On distributing network intensive workloads in symmetric multiprocessing systems with emphasis on high performance

Ivan Voras
Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia
Abstract – In a technological landscape that is quickly moving toward dense multi-CPU and multi-core symmetric multiprocessing computer systems, where multithreading is an increasingly popular application design decision, it is important to choose a proper model for distributing tasks across multiple threads that will result in the best efficiency for the application and the system as a whole. This paper describes the creation, implementation and evaluation of various models of distributing tasks to CPU threads and investigates their characteristics for use in modern high-performance network servers. The results presented here comprise a roadmap of models for building multithreaded server applications for modern server hardware and advanced operating systems. The focus of this work and its practical result is a high-performance memory database designed for caching web application data. The primary goals of this database are speed and efficiency while running on contemporary SMP systems with several CPU cores (four and more). A secondary goal is support for simple metadata structures associated with cached data that can aid in efficient use of the cache. Due to these goals, some data structures and algorithms normally associated with this field of computing needed to be adapted to the new environment.
Keywords – Multitasking, multithreading,
multiprocessing, SMP, SMT, threading models,
concurrency control, processor scheduling, data
structures, web cache, cache database, cache
daemon, memory database
1. Introduction
Modern server applications cannot avoid taking note of recent developments in computer architecture that have resulted in the wide-spread presence of multi-core CPUs even in low-end equipment [1]. Different equipment manufacturers use different approaches [2-4] with the same end result of having multiple hardware processing units available to the software. This is not a completely recent development [5] but it has recently received a spotlight treatment because of the proliferation of multi-core CPUs in all environments. This has resulted in changed requirements for high-performance server software.
The prevailing and recommended model for building high-performance network servers has for a long time been based on asynchronous IO and event-driven architectures in various forms [6-8]. The shortcomings of this model grew as the number of CPU cores available in recent servers increased. Despite the complexities implicit in creating multithreaded programs, multithreading was inevitably accepted as it allows full use of the hardware's possibilities for parallel execution [9], though recently the pure multithreaded model is combined with the multiprocessing model for increased performance or convenience of implementation [10].
This work studies the different ways in which tasks in a database-like network server can be assigned to CPU threads. To aid the study of the effects that the choice of threading model has on an application and its performance, we have created a cache database for web applications that, besides serving as a vehicle for investigating the subject matter, is also immediately usable in our web application projects.
We observe that, as the number of web applications created in scripting languages and rapid prototyping frameworks continues to grow, the importance of off-application cache engines is rapidly increasing [11]. Our own experience in building high-performance web applications has yielded an outline of a specification for a cache database that would be best suited for dynamic web applications written in an inherently stateless environment like the PHP language and framework1.
2. Requirements for the multithreaded cache database
The basic form of such a database is a store of key-value pairs (a dictionary), where both the key and the value are more or less opaque binary strings. The keys are treated as addresses by which the values are stored and accessed. Because of the simplicity of this model, it can be implemented efficiently and it is suitable for fast cache databases. Thus our first requirement is that the cache database is to implement key-value based storage with opaque binary strings for keys and values.
However, the simplistic approach of key-value databases can be limiting and inflexible if complex data needs to be stored. In our experience one of the most valuable features a cache database could have is the ability to group cached keys by some criterion, so that multiple keys belonging to the same group can be fetched and expired together, saving communication round-trips and removing grouping logic [11] from the application. To make the cache database more versatile, the cache database is to implement simple metadata for stored records in the form of typed numeric tags.
1 Because of the stateless nature of HTTP, most languages and frameworks widely used to build web applications are stateless, with various workarounds like HTTP cookies and server-side sessions that are more or less integrated into frameworks.
Today's pervasiveness of multi-core CPUs and
servers with multiple CPU sockets has resulted
in a significantly different environment than
what was common when single-core single-CPU
computers were dominant. Algorithms and
structures that were efficient in the old
environment are sometimes suboptimal in the
new. Thus, the cache database is to be optimized in
its architecture and algorithms for multi-processor
servers with symmetric multi-processing (SMP),
with the goal of efficient operation on 4-core and
larger systems.
Finally, to ensure portability and applicability
of the cache database on multiple operating
systems, its implementation is to follow the
POSIX specification.
2.1. Rationale and discussion
Cache databases should be fast. To aid in the
study of threading models in realistic usage
patterns, the created cache database should be
usable in production in web applications. The
target web applications for the cache database
are those written in relatively slow interpreted
languages and dynamic frameworks such as
PHP, Python and Ruby. A common use case for
cache databases in such web applications is for
storing results of complex calculations, both
internal (such as generating HTML content from
pre-defined templates) and external (such as
data from SQL databases). The usage of
dedicated cache databases pays off if the cost of
storing (and especially retrieving) data to (and
from) the cache database is lower than the cost
of performing the operation that generates the
original data. This cost might be in IO and
memory allocation but we observe that the more
likely cost is in CPU load.
Caching often-generated data instead of
generating it repeatedly (which is a common
case in web applications, where the same
content is presented to a large number of users)
can obviously dramatically improve the
application's performance.
Though they are efficient, we have observed
that pure key-value cache databases face a
problem when the application needs to
atomically retrieve or expire multiple records at
once. While this can be solved by keeping track
of groups of records in the application or (to a lesser extent) folding group qualifiers into key names, we have observed that the added complexity of bookkeeping this information in a slow language (e.g. PHP) distracts from the simplicity of using the cache engine and can even lead to slowdowns. Thus we added the requirement for metadata in the form of numeric "tags" which implement a system of high-level types for the data and which can be used to query the data. Adding tags to key-value pairs would enable the application to offload group operations to the cache database where the data is stored.
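To make the idea concrete, a record with such tags could be laid out roughly as follows (a hypothetical C sketch for illustration only, not the actual mdcached record layout):

```c
/* Hypothetical layout of a cached record carrying typed numeric tags. */
#include <stddef.h>
#include <stdint.h>

struct cache_tag {
    uint32_t type;   /* high-level "type" of the tag, e.g. an application-chosen group id */
    int64_t  data;   /* numeric tag data, usable in <, > and = queries                     */
};

struct cache_record {
    char             *key;       /* opaque binary string used as the address  */
    size_t            key_len;
    char             *value;     /* opaque binary payload                     */
    size_t            value_len;
    struct cache_tag *tags;      /* zero or more tags attached to the record  */
    size_t            ntags;
};
```

A request such as "expire every record carrying a tag of type 7" can then be answered entirely inside the daemon, instead of in application code written in a slow language.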
With the recent trend of increasing the
number of CPU cores in computers, especially
with multi-core CPUs [12], we decided the cache
database must be designed from the start to
make use of multiprocessing capabilities. The
goal is to explore efficient algorithms that
should be employed to achieve the best
performance on servers with 4 and more CPU
cores. Due to the read-mostly nature of its target implementation environment, the cache database should never allow readers (clients that only fetch data) to block other readers. Multi-platform usability is not critical; it is an added bonus that will make the result of the project usable on other platforms and increase the number of possible implementations.
3. Implementation overview
We have implemented the cache database in the
C programming language, using only the
standard or widespread library functions present in modern Unix-like systems, to increase its portability across POSIX-like operating systems.
POSIX Threads (pthreads) were used to achieve
concurrency on multi-processor hardware.
The specific implementation of the database is a Unix server daemon, i.e. a background process that serves network requests. We have named this database mdcached (the name comes from multi-domain cache database daemon). Its architecture can be roughly divided into three modules: the network interface, the data processing module and the data storage module. The network interface and the data processing modules can be executed in an arbitrary number of CPU threads. We have also implemented special support for "threadless" operation, in which the network interface code calls the data processing code as a function call in the same thread, eliminating multi-threading synchronization overheads.
The following sections will describe the
implementation details of each of the modules.
3.1. Network interface
The network interface module is responsible for
accepting and processing commands from
clients. It uses the standard BSD "sockets" API in
non-blocking mode. By default, the daemon
creates both a "Local Unix" socket and a TCP
socket. The network interface module handles
all incoming data asynchronously, dispatching
complete requests to data processing threads.
The network protocol (for both "Local Unix"
and TCP connections) is binary, designed to
minimize protocol parsing and the number of
system calls necessary to process a request.
The number of threads available for executing
the network interface module is adjustable via
command-line arguments.
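As an illustration of what such a binary protocol header can look like (a hypothetical layout sketched for this discussion, not the actual mdcached wire format), a fixed-size header carrying explicit length fields lets the receiver learn the size of the whole request up front and read it with a minimal number of system calls:

```c
#include <stdint.h>

/* Hypothetical fixed-size request header. Because key_len, value_len and
 * tag_count are known after reading the header, the daemon can issue a
 * single further read for the rest of the request instead of parsing a
 * textual protocol byte by byte. */
struct request_header {
    uint8_t  opcode;      /* e.g. GET, PUT, TAG_QUERY (illustrative values)   */
    uint8_t  flags;
    uint16_t tag_count;   /* number of fixed-size tag entries that follow     */
    uint32_t key_len;     /* length of the opaque key, in bytes               */
    uint32_t value_len;   /* length of the opaque value, 0 for read requests  */
} __attribute__((packed));
```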
3.2. Data processing
The data processing module is responsible for
receiving prepared data from the network
module, interpreting the commands and options
received by the clients, executing the specified
operations and preparing the response data,
which is handed back to the network module.
The data processing threads communicate
with the network interface using a protected
(thread-safe) job queue. The number of threads
is adjustable via command-line arguments.
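A minimal sketch of such a protected job queue is shown below (hypothetical code using POSIX threads and a simple singly linked list of job descriptors; the actual mdcached queue is more elaborate):

```c
#include <pthread.h>
#include <stddef.h>

struct job {
    struct job *next;
    int         client_fd;   /* connection the request arrived on                  */
    void       *request;     /* complete request handed over by the network module */
};

struct job_queue {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    struct job     *head, *tail;
};

void jq_init(struct job_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->head = q->tail = NULL;
}

/* Called by network threads to hand a complete request to the data processing threads. */
void jq_push(struct job_queue *q, struct job *j)
{
    j->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL)
        q->tail->next = j;
    else
        q->head = j;
    q->tail = j;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Called by data processing threads; blocks until a job is available. */
struct job *jq_pop(struct job_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)
        pthread_cond_wait(&q->nonempty, &q->lock);
    struct job *j = q->head;
    q->head = j->next;
    if (q->head == NULL)
        q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return j;
}
```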
3.3. Data storage
Data storage is, in the narrowest sense, the most important part of the cache database with regard to its performance and capabilities.
There are two large data structures
implemented in the cache database.
The first is a hash table of static size whose
elements (buckets) are roots of red-black trees
[17] which contain key-value pairs. These
elements are protected by reader-writer locks
(also called "shared-exclusive" locks). The hash
table is populated by hashing the key portion of
the pair. This mixing of data structures ensures a
very high level of concurrency in accessing the
data. Reader-writer locks per hash buckets allow
for the highly desirable behaviour that readers
(clients that only read data) never block other
readers, greatly increasing performance for
usual cache usage. The intention behind the
design of this data structure was that, using a
reasonably well distributed hash function, high
concurrency of writers can also be achieved (up
to the number of hash buckets). An illustration
of the data storage organisation is presented in
Fig. 1.
Fig. 1. Illustration of the key-value data structure used as the primary data storage pool
The second big data structure indexes the metadata tags for key-value records. It is a red-black tree of tag types whose elements are again red-black trees containing tag data (also an integer value) with pointers to the key-value records to which they are attached. The purpose of this organization is to enable queries on the data tags of the form "find all records of the given tag type" and "find all records of the given type whose tag data conforms to a simple numerical comparison operation (less than, greater than)". The tree of tag types is protected by a reader-writer lock and each of the tag data trees is protected by its own reader-writer lock, as illustrated in Fig. 2. Again, readers never block other readers and locking for write operations is localised in a way that allows concurrent access for tag queries.
Fig. 2. Illustration of the tag tree structure
Records without metadata tags don't have any
influence on or connection with the tag trees.
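The following sketch shows the general shape of the primary data storage structure (hypothetical code for illustration, assuming the BSD <sys/tree.h> red-black tree macros and an FNV-1a hash as a stand-in for the real hash function; it is not the actual mdcached source):

```c
#include <sys/types.h>
#include <sys/tree.h>      /* BSD red-black tree macros */
#include <pthread.h>
#include <stdint.h>
#include <string.h>

struct kv_record {
    RB_ENTRY(kv_record) link;
    char   *key;           /* opaque binary key   */
    size_t  key_len;
    char   *value;         /* opaque binary value */
    size_t  value_len;
};

/* Keys are compared as opaque binary strings: by content, then by length. */
static int kv_cmp(struct kv_record *a, struct kv_record *b)
{
    size_t n = a->key_len < b->key_len ? a->key_len : b->key_len;
    int c = memcmp(a->key, b->key, n);
    if (c != 0)
        return c;
    return (a->key_len > b->key_len) - (a->key_len < b->key_len);
}

RB_HEAD(kv_tree, kv_record);
RB_PROTOTYPE(kv_tree, kv_record, link, kv_cmp)
RB_GENERATE(kv_tree, kv_record, link, kv_cmp)

/* One hash bucket: a red-black tree of records guarded by a shared-exclusive lock. */
struct kv_bucket {
    pthread_rwlock_t lock;
    struct kv_tree   tree;
};

#define NBUCKETS 256       /* fixed at program start; the table itself needs no lock */
static struct kv_bucket buckets[NBUCKETS];

static uint32_t kv_hash(const char *key, size_t len)   /* FNV-1a, illustrative only */
{
    uint32_t h = 2166136261u;
    while (len-- > 0)
        h = (h ^ (uint8_t)*key++) * 16777619u;
    return h;
}

void kv_store_init(void)
{
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_rwlock_init(&buckets[i].lock, NULL);
        RB_INIT(&buckets[i].tree);
    }
}

/* Read path: a shared lock on exactly one bucket, so readers never block readers;
 * write operations take the exclusive lock of at most one bucket. */
ssize_t kv_get(const char *key, size_t key_len, char *out, size_t out_max)
{
    struct kv_record needle = { .key = (char *)key, .key_len = key_len };
    struct kv_bucket *b = &buckets[kv_hash(key, key_len) % NBUCKETS];
    ssize_t n = -1;

    pthread_rwlock_rdlock(&b->lock);
    struct kv_record *r = RB_FIND(kv_tree, &b->tree, &needle);
    if (r != NULL && r->value_len <= out_max) {
        memcpy(out, r->value, r->value_len);   /* copy out while the lock is held */
        n = (ssize_t)r->value_len;
    }
    pthread_rwlock_unlock(&b->lock);
    return n;
}
```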
3.4. Rationale and discussion
We elected to implement both asynchronous network access and multithreading to achieve the maximum performance for the cache database [13], with the option of changing the number of threads dedicated to each module. In this way we can achieve hybrid models such as the combination of the pure multi-process architecture (MP) and the event-driven architecture, which is sometimes called asymmetric multi-process event-driven (AMPED) [14]. Our implementation tries hard to avoid unnecessary system calls, context switches and memory reallocation [15], [16]. The implementation avoids most protocol parsing overheads by using a binary protocol which includes data structure sizes and count fields in command packet headers.
Since the number of clients in the intended usage (web cache database for dynamic web applications) is relatively low (on the order of hundreds to thousands), we have avoided the explicit connection scheduling described in [18]-[20].
We have opted for a thread-pool design (in which a fixed number of worker threads perform work on behalf of many independent connections) to allow the implementation of different multithreading models, as well as to allow the administrator to tune the number of worker threads (via command line arguments) and thus the acceptable CPU load on the server. We have also implemented a special "threadless" mode in which there are no worker threads, but the network code makes a direct function call into the protocol parser, effectively making the daemon single process event driven (SPED). This mode cannot make use of multiple CPUs, but is included for comparison with the other models.
As discussed in [21] and [22], the use of multiprocessing and the relatively high standards we have set for concurrency of the requests have resulted in a need for careful choice of the structures and algorithms used for data storage. Traditional structures and algorithms used in caches, such as LRU and splay trees, are not directly usable in high-concurrency environments. LRU and its derivatives need to maintain a global queue of objects, the maintenance of which needs to happen on every read access to a tracked object, which effectively serializes read operations. Splay trees change radically with every access and thus need to be exclusively locked for every access, serializing both readers and writers much more severely than LRU.
The design of the metadata tag structures was driven by the same concerns, but also by the need to make certain query operations efficient (ranged comparison and grouping, i.e. less-than or greater-than). We have decided to allow the flexibility of queries on both the type and the value parts of metadata tags, and thus we implemented binary trees, which are effective for the purpose of ordering data and accessing data in a certain order.
In order to maximize concurrency (minimize
exclusive locking) and to limit the in-memory
working set used during transactions (as
discussed in [16]), we have elected to use a
combination of data structures, specifically a
hash table and binary search trees, for the
principal data storage structure. Each bucket of
the hash table contains one binary search tree
holding elements that hash to the bucket and a
shared-exclusive locking object (pthread rwlock),
thus setting a hard limit to the granularity of
concurrency: write operations (and other
operations requiring exclusive access to data)
exclusively lock at most one bucket (one binary
tree). Read operations acquire shared locks and
do not block one another. The hash table is the
principal source of writer concurrency. Given a uniform distribution of the hash function and significantly more hash buckets than there are worker threads (e.g. 256 vs. 4), the probability of threads blocking on data access is negligibly small, which is confirmed by our simulations.
To increase overall performance and reduce the total number of locks, the size of the hash table is determined and fixed at program start, and the table itself is not protected by locks. The garbage collector (which is implemented naively instead of as an LRU-like mechanism) operates when the exclusive lock is already acquired (probabilistically, during write operations) and operates per hash bucket. The consequence of operating per hash bucket is lower flexibility and accuracy in keeping track of the total size of allocated memory.
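As a back-of-the-envelope check of the claim above (a simplified calculation, assuming a uniform hash and that all k worker threads touch the table at the same instant), the probability that no two of the k concurrently active operations land in the same one of B buckets is

P(\text{no collision}) = \prod_{i=1}^{k-1}\left(1 - \frac{i}{B}\right)

which for k = 4 threads and B = 256 buckets gives (1 - 1/256)(1 - 2/256)(1 - 3/256) ≈ 0.977. Even in this pessimistic setting only about 2% of instants see any bucket shared at all, and actual blocking additionally requires that at least one of the colliding operations holds the exclusive lock.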
3.5. Simulations
A GPSS simulation was created to aid in understanding the performance and behaviour of the key-value store (the hash table containing binary search trees).
This simulation models the behaviour of the system with a tunable number of worker threads and hash buckets. The simulated subsystems are: a task generator, worker threads, lock acquisition and release according to pthread rwlock semantics (with increased writer priority to avoid writer starvation), and the hash buckets. The task generator attempts to saturate the system. The timings used in the model are approximate and thus we are only interested in trends and proportions in the results. Figs. 3, 4 and 5 illustrate the percentage of "fast" lock acquisitions from the simulations, where "fast" is either an uncontested lock acquisition or a contested acquisition where the total time spent waiting on a lock is deemed insignificant (less than 5% of the average time spent holding the lock, i.e. processing the task).
Graphs in Figs. 3-5 are plotted with the number of hash buckets on the X-axis and the percentage of fast lock acquisitions on the Y-axis. Results cover the timings for both shared locks and exclusive locks, obtained during the same simulations. The simulations were run for two cases: with 64 worker threads (which could today realistically be run on consumer-grade products such as the UltraSPARC T2 from Sun Microsystems [4]) and with 8 worker threads (which can be run on readily available servers on the common x86 platform [1], [12]). Individual figures show the simulated system's behaviour with a varied ratio of reader to writer tasks.
Fig. 3. Cache behaviour with 90% readers and 10% writers

As predicted by the model, Fig. 3 shows how a high ratio of hash buckets to threads, and a high ratio of readers to writers, makes almost all lock acquisitions fast. Since the number of hash table buckets can always be made higher than the number of available CPU cores, and the intended implementation environment for the cache database emphasises read transactions over write transactions, we expect that this scenario will be the most important one in the cache database usage.
Fig. 4. Cache behaviour with 80% readers and 20% writers
Trends in the simulated system continue in Fig. 4, with the expected effects of having a larger number of exclusive locks in the system. We observe that this load marks the boundary where having the same number of hash buckets and worker threads makes 90% of shared lock acquisitions fast.

Fig. 5. Cache behaviour with 50% readers and 50% writers

In the situation presented in Fig. 5 the abundance of exclusive locks in the system introduces significant increases in time spent waiting for hash bucket locks (as expected). Both kinds of locks are acquired with noticeable delays and the number of fast lock acquisitions falls appropriately. However, this does not present a significant problem if the number of hash buckets is high enough.
The simulation results emphasise the lock
contention, showing that equal relative
performance can be achieved with the same
ratio of worker threads and hash table buckets,
and show an optimistic picture when the
number of hash buckets is high. From these
results, we have set the default number of
buckets used by the program to 256, as that is
clearly adequate for today's hardware.
The graphs do not show the number of tasks dropped by the worker threads due to timeouts (simulated by the length of the queue in the task generator). Both types of simulated systems were subjected to the same load, and the length of the task queue in systems with 8 worker threads was from 1.5 to 13 times as large as in systems with 64 worker threads, leading to the conclusion that the number of timeouts for 8 worker threads would be significantly higher than for 64 worker threads. This, coupled with simulated inefficiencies (additional delays) when lock contention between worker threads is high, can have the effect of favouring systems with a lower number of worker threads (lock acquisition is faster because the contention is lower, but on the other hand less actual work is being done).
4. Multithreading models
In this work the term “multithreading model” refers to a distinct way the tasks in an IO-driven network server are distributed to CPU threads. It concerns itself with the general notion of threads dedicated to certain classes of tasks, rather than with the quantity of threads or the tasks present at any one point (except for some special or degenerate cases like 0 or 1).
We observe that in a typical network server (and in particular in mdcached) there are three classes of tasks that lend themselves to being parallelized or delegated to different CPU threads:
1. Accepting new client connections and closing or garbage collecting existing connections.
2. Network communication with accepted client connections. This class corresponds to the network interface module of the cache database.
3. Payload work that the server does on behalf of the client. This class corresponds to the data processing module.
Using these three classes of tasks, there are several models recognized in practice and in the literature [13]. At one end of the spectrum is the single process event driven (SPED) model, where all three classes of tasks are always performed by a single thread (or a process), using kernel-provided event notification mechanisms to manage multiple clients at the same time, but performing each individual task sequentially. On the opposite side of SPED is the staged event-driven architecture (SEDA), where every task class is implemented as a thread or a thread pool, and a response to a request made by the client might involve multiple different threads. In our implementation SEDA is implemented as the connection accepting thread, the threads which run code from the network interface module (or simply “network threads”), and the data processing threads.
Between the extremes are the asymmetric multi-process event-driven (AMPED) model, where the network operations are processed in a way similar to SPED (i.e. in a single thread) while the payload work (in this case data processing) is delegated to separate threads to avoid running long-term operations directly in the network loop, and the symmetric multi-process event driven (SYMPED) model, which uses multiple SPED-like threads. A special case of AMPED is the thread-per-connection model (usually called simply the multithreaded or multiprocessing – MT or MP – model), where connection acceptance and garbage collecting is implemented in a single thread which delegates both the network communication and payload work to a single thread for each connected client (this model will not be specially investigated here as there is a large body of work covering it). Each of the multithreaded models can also be implemented as a multiprocessing model, and in fact some of them are better known in that variant (the process- or thread-per-connection model is well known and often used in Unix environments). In this work we limit ourselves to the multithreaded variants.
Our experience with [11] led us to be particularly interested in certain edge cases, which we investigate in this work. In particular, we are interested in reducing unwanted effects of inter-thread communication and limiting context switching between threads. We have implemented one important variation of the SEDA model where the number of worker threads exactly equals the number of network threads and each network thread always communicates with the same worker thread, avoiding some of the inter-thread locking of communication queues. We call this model SEDA-S (for symmetric).
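A minimal sketch of this pairing, reusing the hypothetical job queue sketched in Section 3.2, could look like the following (illustrative only; the actual dispatch in mdcached is part of the network module):

```c
#define NWORKERS 4                               /* set from the command line */
static struct job_queue worker_queues[NWORKERS]; /* one queue per payload thread */

/* SEDA:   jobs are spread round-robin over all worker queues (shared state,
 *         hence extra locking between network threads).
 * SEDA-S: network thread i always pushes to queue i, so no two network
 *         threads ever touch the same queue. */
static void dispatch_seda_s(int net_thread_idx, struct job *j)
{
    jq_push(&worker_queues[net_thread_idx], j);
}
```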
4.1. Multithreading models and the test implementation
We have evaluated the described multithreading models during and after the implementation of the cache database. Since the database uses aggressive techniques to achieve highly parallel operation and concurrent access to shared data, the biggest influence on the overall performance of this server is observed to be the choice of the multithreading model used to schedule the workload. This behaviour has made it a great platform for experimenting with and evaluating multithreading models.
The classes of tasks described earlier are implemented as follows in the mdcached server:
1. New connection handling and connection expiring is always allocated to a single thread. The reason for this is that connection acceptance is inherently not parallelizable and that previous research [13] has indicated that this aspect of network servers cannot be significantly improved.
2. The code path implementing communication between the server and the clients can be run in one or more threads. Data asynchronously received from the remote client is only checked for validity in this code path and passed to the next code path for further processing.
3. The data processing code path, which implements payload work, can be run in one or more threads. The generated response data is immediately passed to the network stack for asynchronous sending. If the response is too large to be accepted by the network stack at once, the unsent data is handled by the network code path.
The described division into code paths allows
us to implement the SPED model by executing
the network and the payload code paths inside
the connection handling loop, the SEDA model
by executing all three code paths in separate
threads (even several instances of the same code
paths in multiple threads), the AMPED model
by executing single instances of the connection
handling and network threads and multiple
payload threads, and the SYMPED model by
executing a single connection handling thread
and multiple network threads which directly
call into the payload handling code path (no
separate payload threads). Table 1 shows the
mapping between threads and code paths in mdcached used to implement the described multithreading models.
Model      Connection acceptance   Network communication   Payload work
SPED       1 thread                In connection thread    In connection thread
SEDA       1 thread                N1 threads              N2 threads
SEDA-S     1 thread                N threads               N threads
AMPED      1 thread                1 thread                N threads
SYMPED     1 thread                N threads               In network thread

Table 1. Mapping of code paths and threads in mdcached to multithreading models
Table 1 also describes the exact models that
are used in this research.
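The same mapping can be read as a small decision rule over the two thread counts chosen at start-up. A hypothetical helper mirroring Table 1 (illustrative only; mdcached does not necessarily contain such a function) might look like this:

```c
/* Classify a thread configuration into the multithreading model it yields,
 * following Table 1. The connection acceptance thread always exists and is
 * therefore not a parameter. (A 1 network / 1 payload configuration can be
 * read as either AMPED or SEDA-S.) */
const char *model_name(int net_threads, int payload_threads)
{
    if (net_threads == 0 && payload_threads == 0)
        return "SPED";      /* everything runs in the connection thread          */
    if (payload_threads == 0)
        return "SYMPED";    /* network threads call the payload code directly    */
    if (net_threads == 1)
        return "AMPED";     /* one network thread feeding N payload threads      */
    if (net_threads == payload_threads)
        return "SEDA-S";    /* network thread i paired with payload thread i     */
    return "SEDA";          /* independent pools of network and payload threads  */
}
```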
4.2. Operational characteristics of mdcached
The communication between various threads
within mdcached is implemented using
asynchronous queues in which job descriptors
(describing work to be executed) are exchanged.
Within a class of threads (connection, network,
data processing threads) there is no further
division of priority or preference. Jobs that are
passed from one thread class to the other (e.g. a
client request from a network thread to a
payload thread) are enqueued on the receiving
threads' event queues in a round-robin fashion.
Network IO operations are always performed by
responding to events (event-driven architecture)
received from the operating system kernel. The network protocol is binary, optimized for performance and for avoiding system (kernel) calls in the data receiving code path. The primary concern in the design of the protocol and the database operation was to achieve as many transactions per second as possible.
In this work we evaluate the multithreading
models implemented in the database cache
server by looking at the number of client-server
transactions per second processed by the server.
Since mdcached is a pure memory database
server it avoids certain classes of behaviour
present in servers that interact with a larger
environment. In particular, it avoids waiting for
disk IO, which would have been a concern for a
disk-based database, a web server or a similar
application. We observe that the mdcached server
operation is highly dependent on system characteristics such as memory bandwidth and operating system performance in the areas of system calls, inter-process communication and general context switching. To ensure maximum performance we use advanced operating system support such as kqueues [23] for network IO, and algorithms such as lock coalescing (locking inter-thread communication queues for multiple enqueue/dequeue operations instead of a single operation) to avoid lock contention and context switches.
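Lock coalescing on such a queue can be sketched as a batch dequeue, again building on the hypothetical job queue from Section 3.2 (illustrative only):

```c
/* Drain every queued job under a single lock acquisition instead of locking
 * once per job; the caller then walks the returned list without holding the
 * lock, which reduces lock traffic and context switches under load. */
struct job *jq_pop_all(struct job_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)
        pthread_cond_wait(&q->nonempty, &q->lock);
    struct job *batch = q->head;       /* take the whole linked list at once */
    q->head = q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return batch;
}
```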
5. Methods of evaluation of multithreading models in the cache database
We evaluate the multithreading models by configuring mdcached for each of them individually. Each model has a unique configuration mapping code paths to threads that is applied to the mdcached server at start-up and which remains in effect while the process is active. In addition to the threading model, the number of threads is also varied (where this makes sense for the model under evaluation) to present a clearer picture of its performance characteristics.
The server is accessed by a specially created
benchmark client that always uses the same
multithreading model (SYMPED) and number
of threads but varies the number of
simultaneous client connections it establishes.
Each client-server transaction involves the client
sending a request to the server and the server
sending a response to the request. For testing
purposes the data set used is deliberately small, consisting of 30,000 records with record sizes ranging from 10 bytes to 10³ bytes in a logarithmic distribution.
To emphasise the goal of maximizing the
number of transactions per second achieved and
to eliminate external influences like network
latency and network hardware issues, both the
server and the client are run on the same server,
communicating over Unix sockets. The test
server is based on Intel's 5000X server platform
with two 4-core Xeon 5405 CPUs (for a total of 8
CPU cores in the system). The benchmark client
always starts 4 threads, meaning it can use up to
at most 4 CPU cores. We have observed that the
best results are achieved when the number of
server threads and the number of benchmark
client threads together is approximate to the
total number of CPU cores available in the
system. The operating system used on the server
is the development version of FreeBSD, 8-CURRENT, retrieved on September 15, 2008. The
operating system is fully multithreaded and
supports parallel operation of applications and
its network stack.
During test runs of the benchmarks we have
observed that the results vary between 100,000
transactions per second and 500,000 transactions
per second, with standard deviation being
usually less than 5,000 transactions per second.
All tests were performed with preceding “warm-up” runs, and the reported results are averages
of 5 runs within each configuration of
benchmark client and the server. We've
determined that for the purpose of this work it
is sufficient to report results rounded to the
nearest 5,000 transactions per second, which is
enough to show the difference between the
multithreading models. We have also observed
that the most interesting characteristics of the
client-server system as a whole are visible when
the number of clients is between 40 and 120. This is where most models reach their peak performance.
6. Evaluation Results
We have collected a large body of results with
different combinations of multithreading
models and number of clients. Presented here
are comparisons within multithreading models,
in order of increasing performance.
The SPED model, presented in Fig. 6, quickly
reaches its peak performance at which it doesn't
scale further with the increasing number of
clients. The performance achieved with the
SPED model can be used as a baseline result,
achieved without the use of multithreading or
multiple CPU cores. This result is the best that can be achieved with an event-driven architecture.

Fig. 6. Performance of the SPED multithreading model, depending on the number of simultaneous clients.
Semantically on the opposite side, but with surprisingly little performance benefit in our implementation, is the SEDA model. Since some variations of the SEDA model overlap with other models presented here, Fig. 7 shows only the case where there are 2 network threads receiving requests from the clients and passing them to 4 payload threads.
Fig. 7. Performance of the SEDA multithreading model with two network threads and four payload threads, depending on the number of simultaneous clients.
Surprisingly, the SEDA model in our implementation achieves performance comparable to the SPED model, despite the fact that it should allow the usage of much more CPU resources than the SPED model. Probing the system during the benchmark runs suggests that the majority of time is spent in the kernel, managing inter-thread communication and context switches, which hinders global performance of the client-server system. Further investigation is needed to reveal whether a better distribution of the code paths to CPU threads would yield better performance, but a conclusion suggests itself that, for this implementation, the chosen SEDA model presents a sort of worst-case scenario with regards to efficiency, tying up multiple CPU cores but not achieving the expected performance.
Fig. 8. Performance of the AMPED multithreading model with zero or one network threads and between 1 and 4 payload threads, depending on the number of simultaneous clients.
The performance characteristics of the AMPED multithreading model are presented in Fig. 8. For AMPED we present two basic variations of the model: one without a separate network thread, running SPED-like network communication in the main network loop (the connection loop), and one with a single network thread and a varying number of payload threads. System probing during benchmark runs of this multithreading model reveals that the single network thread is saturated with the work it needs to process, and increasing the number of worker threads achieves little except increasing the number of context switches and the amount of inter-thread communication. The best performance for this model was achieved with one network thread and two payload threads.
In the increasing order of achieved performance, the SEDA-S model, whose performance is presented in Fig. 9, yields results slightly better than the AMPED model. This model uses multiple network and payload threads, with the number of network threads equalling the number of payload threads, and where each network thread communicates with a single payload thread. The performance of SEDA-S is comparatively high, scaling up with the number of threads up to 3 network threads and 3 worker threads, after which it falls as the number of threads increases and so do the adverse effects of inter-thread communication and context switches.
Fig. 9. Performance of the SEDA-S multithreading model with the number of network threads equalling the number of payload threads, from 1 to 4 threads each, depending on the number of simultaneous clients.

Fig. 10. Performance of the SYMPED multithreading model with 1 to 5 network threads, depending on the number of simultaneous clients.
The highest performance is achieved with the SYMPED model, whose benchmark results are presented in Fig. 10. The best result achieved with the SYMPED model is 54% better than the next best result, indicating that it is a qualitatively better model for the purpose of this work. Investigation of the system behaviour during the benchmark runs suggests that the primary reason for this is that there is no inter-thread communication overhead in passing jobs between the network thread(s) and the payload thread(s). In other words, all communication and data structures from a single network request are handled within a single CPU thread.
Additionally, the SYMPED model exhibits
very regular scaling as the number of threads
allocated to the server increases, as presented by
Fig. 11. The performance increases linearly as
the number of server threads increases, up to 4
threads. We believe that this limitation is a
consequence of running both the server and the
client on the same computer system and that the
performance would continue to rise if the server
and the clients are separated into their own
systems. Investigation into this possibility will
be a part of our future research.
Note that the presented results involve
bidirectional communication, so that the 450,000
achieved transactions per second translate to
900,000
individual network operations per
second.
Fig. 11. Scalability of the SYMPED model depending on the number of server threads.
Finally, we present a comparison between the presented models in Fig. 12, choosing the best performance from each model irrespective of the number of threads used in the model.
Fig. 12. Comparison of multithreading models at their peak performance, depending on the number of simultaneous clients.
The SYMPED model exhibits the best
scalability and best overall performance in the
implemented cache database. This result will be
used in future studies of network server
optimization for the purpose of investigating
network intensive workloads.
We wish to emphasize that the achieved maximum numbers of transactions per second presented in this work come from both the client and the server processes running on the same system, and that in the ideal case, if the operating system and the network conditions allow, the potential absolute maximum could be twice as many transactions per second, close to 1,000,000 transactions/s on the same hardware. Performance measurements involving the client and the server on separate computers have shown that there currently exists a significant bottleneck in FreeBSD's TCP network stack that hinders SMP operation. Future work might include improving it.
7. Conclusion
We have investigated the performance
characteristics of five multithreading models for
IO driven network applications. The models
differ in the ways they assign classes of tasks
commonly found in network applications to
CPU threads. These models are: Single-process
event-driven (SPED), Staged event-driven
architecture (SEDA), Asymmetric multi-process
event-driven (AMPED), Symmetric multi-process event-driven (SYMPED), and a variation of the SEDA model we call SEDA-S. We have evaluated the multithreading models as applied to a memory database called mdcached that we created, which allows selecting the configuration of threads and their code paths at program start. For the purpose of this work we have configured the server and the benchmark client to maximize the number of transactions per second.
Our investigation results in quantifying the differences between the multithreading models. Their relative performance is apparent in their performance curves and in the maximum performance that can be achieved with each of them. The best performing among the evaluated multithreading models was SYMPED, which, for the implementation on which it was tested, offers the best scalability and the best overall performance. We are surprised by the low performance of the SEDA model and will investigate it in future work. Another candidate for future research is the limits of scalability of the SYMPED model on more complex databases and server hardware.
The described work was done mostly in the years 2007 and 2008, during the author's postgraduate studies at the Faculty of Electrical Engineering and Computing of the University of Zagreb, Croatia.
8. References
[1] Intel Corp., "Intel® Multi-Core Technology" [Online]. Available: http://www.intel.com/multicore/index.htm [Accessed: September 15, 2008]
[2] Intel Corp., "Teraflops Research Chip" [Online]. Available: http://techresearch.intel.com/articles/Tera-Scale/1449.htm [Accessed: September 15, 2008]
[3] Sun Microsystems Inc., "Multithreaded Application Acceleration with Chip Multithreading (CMT), Multicore/Multithread UltraSPARC® Processors", White paper [Online]. Available: http://www.sun.com/products/microelectronics/cmt_wp.pdf [Accessed: September 15, 2008]
[4] OpenSPARC T2 Core Microarchitecture Specification, Sun Microsystems, Inc., http://opensparc.sunsource.net/specs/UST1UASuppl-current-draft-P-EXT.pdf, Jan 2007.
[5] G. V. Wilson, "The History of the Development of Parallel Computing" [Online]. Available: http://ei.cs.vt.edu/~history/Parallel.html [Accessed: September 15, 2008]
[6] F. Dabek, N. Zeldovich, F. Kaashoek, D. Mazieres, R. Morris, "Event-driven programming for robust software", Proceedings of the 10th ACM SIGOPS European Workshop, 2002.
[7] J. K. Ousterhout, "Why Threads are a Bad Idea (for most purposes)", presentation given at the 1996 USENIX Technical Conference [Online]. Available: http://home.pacbell.net/ouster/threads.ppt [Accessed: September 16, 2008]
[8] G. Banga, J. C. Mogul, P. Druschel, "A scalable and explicit event delivery mechanism for UNIX", Proceedings of the 1999 USENIX Annual Technical Conference, USENIX Association, Berkeley, CA, 1999.
[9] R. von Behren, J. Condit, E. Brewer, "Why Events Are A Bad Idea (for high-concurrency servers)", Proceedings of the 9th Conference on Hot Topics in Operating Systems, Volume 9, Hawaii, 2003.
[10] P. Li and S. Zdancewic, "Combining events and threads for scalable network services: implementation and evaluation of monadic, application-level concurrency primitives", SIGPLAN Notices 42, 2007.
[11] I. Voras, M. Žagar, "Web-enabling cache daemon for complex data", Proceedings of the ITI 2008 30th International Conference on Information Technology Interfaces, Zagreb, 2008.
[12] Multi-core processors – the next evolution in computing, AMD White Paper, 2005.
[13] D. Pariag, T. Brecht, A. Harji, P. Buhr, A. Shukla, and D. R. Cheriton, "Comparing the performance of web server architectures", SIGOPS Operating Systems Review 41, 3, 2007.
[14] T. Brecht, D. Pariag, and L. Gammo, "Acceptable strategies for improving web server performance", in Proceedings of the USENIX Annual Technical Conference, 2004.
[15] A. Ailamaki, D. J. DeWitt, M. D. Hill, D. A. Wood, "DBMSs on a Modern Processor: Where Does Time Go?", in Proceedings of the 25th International Conference on Very Large Data Bases, 1999.
[16] L. K. McDowell, S. J. Eggers and S. D. Gribble, "Improving server software support for simultaneous multithreaded processors", in Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2003.
[17] R. Bayer, "Symmetric Binary B-Trees: Data Structures and Maintenance Algorithms", Acta Informatica, 1:290-306, 1972.
[18] M. Crovella, R. Frangioso, and M. Harchol-Balter, "Connection scheduling in web servers", in Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, 1999.
[19] M. Welsh, D. E. Culler, and E. A. Brewer, "SEDA: An architecture for well-conditioned, scalable Internet services", in Proceedings of the 18th ACM Symposium on Operating System Principles, Banff, 2001.
[20] Y. Ruan and V. Pai, "The origins of network server latency & the myth of connection scheduling", SIGMETRICS Performance Evaluation Review 32, 1, 2004.
[21] J. Zhou, J. Cieslewicz, K. Ross, and M. Shah, "Improving database performance on simultaneous multithreading processors", in Proceedings of the 31st International Conference on Very Large Data Bases, 2005.
[22] P. Garcia and H. F. Korth, "Database hash-join algorithms on multithreaded computer architectures", in Proceedings of the 3rd Conference on Computing Frontiers, 2006.
[23] J. Lemon, "Kqueue: A generic and scalable event notification facility", presented at BSDCon, Ottawa, CA, 2000 [Online]. Available: http://people.freebsd.org/~jlemon/papers/kqueue.pdf [Accessed: September 29, 2008]
Contact address:
Ivan Voras
University of Zagreb
Faculty of Electrical Engineering and Computing
Unska 3
Zagreb
Croatia
e-mail: ivan.voras@fer.hr
I. Voras was born in Slavonski Brod, Croatia. He
received Dipl.ing. in Computer Engineering (2006) from
the Faculty of Electrical Engineering and Computing
(FER) at the University of Zagreb, Croatia. Since 2006 he
has been employed by the Faculty as an Internet Services
Architect and is a graduate student (PhD) at the same
Faculty, where he has participated in research projects at
the Department of Control and Computer Engineering.
His current research interests are in the fields of
distributed systems and network communications, with
a special interest in performance optimizations. He is an
active member of several Open source projects and is a
regular contributor to the FreeBSD project, an
international project developing an open-source Unix-like operating system.