Optimising Intensive Interprocess Communication in a Parallelised Telecommunication Traffic Simulator

Mikhail Chalabine, PELAB, Linköping University, Sweden
Christoph Kessler, PELAB, Linköping University, Sweden
Staffan Wiklund, Ericsson AB, Sweden
Keywords: intensive, telecommunication, parallel, interprocess, optimisation.
Abstract
This paper focuses on the efficient user-level handling of intensive interprocess communication in distributed parallel applications characterised by a high rate of data exchange. A common feature of such systems is that the parallelisation strategy targeting the largest parallelisable fraction also results in the highest rate of interprocess communication among the possible parallelisation strategies. An example of such applications is the class of telecommunication traffic simulators, where this partition-communication phenomenon reveals itself in the strong data interdependencies among the major parallelisable tasks, namely the encoding of messages, the decoding of messages, and the interpretation of messages.
1 Introduction
When building a parallel software architecture, one attempts to balance functional and domain decomposition so as to minimise the resulting interprocess communication and to maximise the utilisation of each processor unit available in the system. The efficiency of interprocess communication in shared memory multiprocessor servers is generally limited by two factors: the performance of the kernel-based synchronised data passing mechanisms, and the algorithmic solutions at the user level. Various optimisation attempts try to overcome these limitations by, for example, reducing communication latency in highly concurrent programs [9], by developing a user-level interprocess communication layer for contemporary operating systems, or, closer to our ideas, by dealing with shared memory architectures, as in the case of the User-Level Remote Procedure Call [2].
Despite many successful approaches to bypassing the kernel level, for many classes of applications, such as telecommunication traffic simulators, user-level optimisation remains vital, as the exploitation of massive parallelism is unavoidable in this area given the throughput requirements of third-generation mobile communication devices [3]. This paper adds to the discussion on finding an optimal parallelisation strategy for interpretation-based telecommunication traffic simulators.
Amdahl's law [1] states that the maximum possible speedup is generally limited by the fraction of parallelisable code. Accordingly, for interpretation-based telecommunication traffic simulators, we pinpoint the decoding of messages, the encoding of messages, and the interpretation of messages by execution of action code¹ as the main candidates for parallelisation [3]. Once parallelised, however, these tasks involve a large amount of interprocess communication, where no speedup can be achieved because of the related overhead, as the kernel invocation performed by existing communication abstractions magnifies synchronisation delays to an unreasonable level [4]. While the traditional way is simply to discard parallelisation in such cases, we propose to tackle this problem by an algorithmic user-level optimisation. We support our discussion with empirical measurements obtained in the parallelisation of the Test and Simulation System (TSS), a telecommunication traffic simulator developed by Ericsson AB, Sweden [11]. As the major result, with a minimal interprocess communication rate of 20000 msg./s and with a runtime restriction on the message size, we achieve an absolute speedup of 1.3, with 1.5 being the upper bound predicted by Amdahl's law for a parallelisable fraction of 0.3. In these tests our environment parameterisation technique secures at least a 4% performance improvement, given a 25% divergence from the non-optimised parameterisation (see Section 4.3.2). Furthermore, the access aggregation technique (see Section 4.3.1) scales memory access efficiency by a factor of $n$, where $n$ is the block size of the segmented storage.
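
For reference, Amdahl's law can be stated in its standard textbook form, with $p$ the parallelisable fraction and $k$ the number of processors:

    \[
    S(k) = \frac{1}{(1 - p) + p/k},
    \qquad
    \lim_{k \to \infty} S(k) = \frac{1}{1 - p}.
    \]

For $p$ around 0.3 the limit is $1/(1-p) \approx 1.4$ to $1.5$, independent of the number of processors, which is the upper bound referred to above.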
The remainder of this paper is organised as follows. In Section 2, we define a telecommunication traffic simulator and introduce the Test and Simulation System. In Section 3, we define a mathematical model as background for our discussion and elaborate on the problem formulation. In Section 4, we introduce the different approaches that we have investigated and sketch the implementation ideas and the optimisation methods. In Section 5, we present the results. Section 6 discusses future work and concludes.

¹ The studied simulator uses a byte code interpretation technique to handle protocol messages. Building hardcoded solutions is an alternative to this.
2 TSS

Definition 2.1 A telecommunication traffic simulator is a tool

1. capable of generating formatted data flows specific to the set of nodes that comprise the given telecommunication network, and

2. containing mechanisms for the analysis of data exchanges between the real and the simulated nodes.

Figure 1: A test case with two test programs simulating a Mobile Switching Center (MSC) and Base Transceiver Stations with Mobile Stations attached. The Base Station Controller (BSC) here is the System Under Test.

Consequently, the Test and Simulation System is a telecommunication traffic simulator for mobile telecommunication components for different types of mobile networks (GSM 900/1800/1900, European and American GSM, TDMA and PDC (Japanese), and 3G); see [10]. In the GSM (Global System for Mobile Communications) case, for instance, this set of nodes consists of Mobile Stations, Base Transceiver Stations, Base Station Controllers and Mobile Switching Centres [3]. Thus, for GSM, TSS can functionally represent any of the main intrinsic nodes, emulating, for instance, ten thousand mobile stations executing simultaneous calls. The real node is called the System Under Test (SUT). The simulator is controlled by a set of instructions written in a high-level programming language, which forms an event-driven Test Program (TP) describing the behaviour of the simulated node. Note that there can be several test programs running simultaneously and that a test program can be active within several instances [3]. Such a set of test programs and test program instances is said to form a test case (see Figure 1).

The decoding and encoding components in the interpreter handle traffic exchanges; the other part of the interpreter organises the general execution stream. The interpreter can be described by a number of finite state machines with events as the basic control mechanism. When executed, such events force the system to undergo state changes and produce actions that can generate new events and new actions, and so on. The system is run on its own hardware under the VxWorks real-time operating system. Currently the software comprises about 3.5 million lines of code. In our tests we used a test skeleton operating under the Solaris operating system to isolate the tests from the delays of external communications and hardware setups.

We identify two parallelisation strategies [3]. The first one parallelises the incorporated automata by building a distributed architecture. The other one concentrates on the decoding and interpretation of protocol primitives, as motivated by Amdahl's law. We discuss both strategies based on a general mathematical model. See [3] for an extended introduction to the topic.

3 Architecture

In [3] we described the TSS logical software architecture and identified the Interpreter, belonging to the Execution function block in the TSS software, as the main performance bottleneck (see Figure 2).

Figure 2: TSS Logical Software Architecture.

The functional architecture of the Interpreter is as follows. Consider an action space $A$ representing the set of all possible actions taken in a test program. Consider also an instance space $I$ comprising all test program instances in a test case, and a state space $S$ being the set of all reachable states a test program can take residence in. Then an observable output, generated by the system in response to a received protocol primitive, is formalised through an event (event action), an element of the event set $E$. We refer to such a formalisation process, triggered by a protocol primitive, as decoding.

A received protocol primitive in TSS is handled as follows. As the first step, the interpreter performs decoding of the pair (event, test program instance) embedded in the message, and then routes the event to the addressed instance for execution. We note here that one could view decoding as a reversible function $d \colon P \to E \times I$, where $P$ is the set of all adopted protocol primitives and $E$ is the collection of all events defined in a test program. In more detail, this can be written as

    $d(p) = (e, i), \quad p \in P,\ e \in E,\ i \in I. \qquad (1)$

In fact, TSS implements a state matrix that provides a mapping $\delta \colon S \times E \to S \times 2^A$. Thus, for the system being present in state $s \in S$, in response to an event $e \in E$ it determines an action code that takes the system to a new state $s' \in S$, producing a set of actions $a \subseteq A$. This reflects the second step performed by the interpreter (interpretation of messages) in traffic handling. Thus we have defined a deterministic finite state transducer (DFST) [7], and shown that the interpreter is essentially defined by the functions $d$ and $\delta$.

Parallelisation of the decoding functionality, that is, of the function $d$, means that we separate the identification of the addressed instance and of the event to be executed in that instance from the actual execution. The general functional decomposition in this case is shown in Figure 3.

Figure 3: The general view of functional and domain decomposition. DCR stands for Decoder, with functionality defined by $d$. SCH stands for Scheduler (Executor), which executes the received event in the addressed test program instance.

Another possible parallelisation strategy is to distribute the functionality of the transduction function $\delta$ over a number of parallel streams. This implies a partitioning of the transducer such that each of the streams takes control over a set of states and over the transitions among them. This method can possibly provide reasonable improvements but requires the creation of a partitioning algorithm. In this paper we focus on the first strategy, as motivated by Amdahl's law.
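
To make the pair of functions $d$ and $\delta$ concrete, the following C fragment sketches the interpretation step as a table-driven transducer. All types and names (decoded_t, transition_t, delta, interpret) are illustrative assumptions, not the TSS implementation.

    enum { NUM_STATES = 32, NUM_EVENTS = 64 };     /* illustrative sizes */

    typedef struct { int event, instance; } decoded_t;  /* d(p) = (e, i) */

    typedef struct { int state; } tp_instance_t;        /* holds s */

    typedef struct {
        int next_state;                      /* s' */
        void (*action)(tp_instance_t *);     /* action code producing a */
    } transition_t;

    /* The state matrix: delta[s][e] yields (s', action code). */
    transition_t delta[NUM_STATES][NUM_EVENTS];

    /* Second step of traffic handling: interpret event e in instance tp. */
    void interpret(tp_instance_t *tp, int e)
    {
        transition_t t = delta[tp->state][e];
        tp->state = t.next_state;            /* transducer moves to s' */
        t.action(tp);                        /* execute the action code */
    }

Decoding (the function $d$) would produce a decoded_t, which a router uses to deliver the event to the addressed instance before interpret is called.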
4 Implementations

Our approach to functional decomposition implies that for every received protocol primitive the decoder identifies an addressed instance and an event. Our measurements [3] show that decoding accounts for 30% of the computational load. By pipelining the decoder and the rest of the interpreter, this 30% could be performed in parallel. Subsequently, this implies that for every activation of $d$ the results of the computation are to be transferred to the corresponding scheduler, which creates an intensive data flow in the system, caused by the high data rate required to perform adequate load tests on existing network components [3].

We have tested two applicable technical platforms for parallelisation: Posix threads and Solaris operating system processes. The general structural decision implies at least one decoder and one executor in the system, interconnected by a communication channel, which corresponds to a producer-consumer model. In the case of the process-based implementation we adopted a shared memory multiprocessor architecture (a 4-processor Sun server, 450 MHz) [3].
4.1 Threaded Implementation

The functional decomposition implies two parallel tasks, namely decoding and execution. The communication channel is implemented as a dynamically sized queue of elements interconnected as a doubly-linked list. Both decoder and executor operate on the same address space (see Figure 3). The threaded solution gives very promising results with a partitioning granularity of size two. Unfortunately, the current implementation of the Interpreter in TSS does not allow simply increasing the number of threaded decoders and executors because of possible deadlocks on memory access. These may arise due to the incorporated principles of obtaining and releasing memory [3]. We propose a solution for this problem in Section 6. Comparative results are given in Figure 7. For implementation details see [3].
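
For illustration, a minimal sketch of such a channel: a dynamically sized, doubly-linked queue guarded by a Posix mutex and condition variable. The types and names are ours, not those of the TSS code.

    #include <pthread.h>
    #include <stddef.h>

    typedef struct elem {
        int event, instance;               /* decoded (event, instance) pair */
        struct elem *prev, *next;
    } elem_t;

    typedef struct {
        elem_t *head, *tail;               /* doubly-linked, dynamically sized */
        pthread_mutex_t lock;
        pthread_cond_t nonempty;
    } queue_t;

    void queue_init(queue_t *q)
    {
        q->head = q->tail = NULL;
        pthread_mutex_init(&q->lock, NULL);
        pthread_cond_init(&q->nonempty, NULL);
    }

    void enqueue(queue_t *q, elem_t *e)    /* decoder side */
    {
        pthread_mutex_lock(&q->lock);
        e->next = NULL;
        e->prev = q->tail;
        if (q->tail) q->tail->next = e; else q->head = e;
        q->tail = e;
        pthread_cond_signal(&q->nonempty); /* wake a waiting executor */
        pthread_mutex_unlock(&q->lock);
    }

    elem_t *dequeue(queue_t *q)            /* executor side */
    {
        pthread_mutex_lock(&q->lock);
        while (q->head == NULL)            /* wait until the decoder produces */
            pthread_cond_wait(&q->nonempty, &q->lock);
        elem_t *e = q->head;
        q->head = e->next;
        if (q->head) q->head->prev = NULL; else q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        return e;
    }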
4.2 Process Based Implementation

The process-based implementation allows more producers and consumers in the system, compared to the threaded implementation.
Figure 4: Here the Decoder (DCR-1) performs only partial decoding of incoming messages. The execution unit
(Executor-1) completes the decoding of the addressed
test program instance and executes the decoded event
there.
Two cases were studied for
Solaris operating system processes. The first one is an
analogue to the threaded solution: we again built the system with a partition granularity of size 2 (see Figure 4).
The other solution is an extension to a one-to-many architecture with multiple executors (see Figure 5). We
use semaphores for synchronisation and shared memory
to implement the communication channel. For the latter there exist a number of alternatives, such as message passing and pipes. In our case, however, the use
of queues in shared memory is expected to give better
control over the data flows [3]. Based on that we developed a framework to reduce the synchronisation delays,
as discussed in the following subsection.
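
A minimal sketch of the process-based channel in terms of the System V IPC primitives available on Solaris; identifiers are illustrative, and semaphore value initialisation (semctl) is omitted for brevity.

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/sem.h>

    #define QUEUE_BYTES 65536   /* illustrative size of the shared queue */

    /* Create (or attach to) the shared segment and its semaphore. */
    int make_channel(key_t key, void **mem, int *semid)
    {
        int shmid = shmget(key, QUEUE_BYTES, IPC_CREAT | 0600);
        if (shmid < 0) return -1;
        *mem = shmat(shmid, NULL, 0);   /* map the queue into this process */
        *semid = semget(key, 1, IPC_CREAT | 0600);
        return (*mem == (void *)-1 || *semid < 0) ? -1 : 0;
    }

    void sem_p(int semid)               /* acquire (P operation) */
    {
        struct sembuf op = { 0, -1, 0 };
        semop(semid, &op, 1);
    }

    void sem_v(int semid)               /* release (V operation) */
    {
        struct sembuf op = { 0, 1, 0 };
        semop(semid, &op, 1);
    }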
Figure 5: General one-to-many architecture based on processes. The decoder (DCR) handles decoding of incoming traffic and passes decoded data to the correct executor that handles the addressed instance.

4.3 Optimisation Methods

We build a queue in shared memory between the producer (decoder) and the consumers (executors). To reduce the synchronisation overhead we aggregate access to this queue in bunches of $n$ elements.

4.3.1 Memory Access Method

Let us assume a general case where a system consists of one decoder and $N$ executors. For every executor a group of shared memory segments is allocated. Such a segment contains $n$ elements, each forming a basic transportation entity implementing the data flow. Let us call such an array a Decoding Queue. The size of the group is the total number of segments; let us refer to it as $m$ in the general case. Consequently, a group of decoding queues is said to form a Decoding Queue Collection of size $m$. The structures discussed are shown in Figure 6.

This allows us to utilise one semaphore per decoding queue collection for bulk access to the elements of a decoding queue. This means that, in the best case, access to the $n$ elements within a decoding queue is handled by one atomic semaphore operation, where the best case is as follows. Let $\tau$ denote the time overhead required to lock a single element in shared memory. Let $f$ be the number of failed requests for the semaphore performed by the executor until access is granted. Such requests could, for instance, check whether a decoding queue is available for service (availability check). Let us also assume that the time for an availability check is equal to the time of a locking operation ($\tau$ seconds).²

Then the time-efficiency of memory access synchronisation in terms of locking can be presented as the ratio of the time required to process the $n$ accesses individually to the time of a single lock operation increased by the $f$ availability requests:

    $E = \frac{n\tau}{\tau + f\tau} = \frac{n}{f + 1} \qquad (2)$

Thus, the efficiency can be maximised by minimising the number of unsuccessful availability checks. Equation (2) shows that in the best case ($f = 0$) the efficiency of locking is $n$ times higher than with one-by-one locking, since $n$ elements are then handled at the cost of one semaphore operation. The idea is that an executor could read from decoding queue $q_i$, belonging to decoding queue collection $Q$, while the decoder fills decoding queue $q_{i+1}$ from the same collection $Q$.

A minimisation of the number of unsuccessful availability requests performed in the system could be achieved by identifying the right size $n$ of the decoding queue. Furthermore, this could be extended to take other parameters into account, such as the decoding queue collection size $m$ and the dimension of the architecture, which is basically the number of executors running in the system. Another possibility is to adopt local queues in the decoder and the scheduler and to pass a group of elements through the shared memory at a time. This also could be handled at the cost of one semaphore operation. However, in comparison, the adopted solution allows decoding to be performed directly into the shared memory, which is more efficient than copying elements and keeping additional data structures for local queues on both sides.

² This assumption naturally follows from our implementation, where an availability check is a semaphore operation itself.
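
The following sketch illustrates the aggregation scheme, assuming the sem_p/sem_v helpers from the sketch in Section 4.2 and a semaphore that counts filled decoding queues; it is an illustration of the idea, not the TSS implementation.

    /* From the sketch in Section 4.2: */
    void sem_p(int semid);
    void sem_v(int semid);

    #define QUEUE_LEN 64          /* n: elements per decoding queue    */
    #define COLLECTION_LEN 4      /* m: decoding queues per collection */

    typedef struct { int event, instance; } elem_t;

    typedef struct {
        elem_t slots[QUEUE_LEN];
    } decoding_queue_t;

    typedef struct {
        decoding_queue_t q[COLLECTION_LEN];   /* lives in shared memory */
    } collection_t;

    /* The decoder performs sem_v(semid) after filling q[i], so the
     * semaphore counts filled queues. A single sem_p() by the executor
     * then pays one semaphore operation for all QUEUE_LEN elements,
     * while the decoder keeps filling the next queue in the collection. */
    void drain_next(collection_t *c, int semid, void (*execute)(elem_t *))
    {
        static int next = 0;                 /* next queue to drain     */
        sem_p(semid);                        /* one wait per n elements */
        decoding_queue_t *dq = &c->q[next];
        for (int k = 0; k < QUEUE_LEN; k++)  /* lock-free bulk read     */
            execute(&dq->slots[k]);
        next = (next + 1) % COLLECTION_LEN;
    }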
4.3.2 Modelling Resource Allocation

Our goal is to optimise the retrieval of messages by the scheduler. Consequently, whatever the number of available processors, from the point of view of timing we would like a scheduler to start processing messages as soon as all the initialisations are done. Thus, we require at least one decoding queue to be filled during the time the related scheduler is started up. We build our model for the case of a (1, 1) architecture and try to find the size of the decoding queue that answers this purpose. In this setting our model can be considered an analogue of the general transportation problem [5] (Figure 6). The difference is that instead of finding the flow of least cost that ships from supply sources to consumer destinations, we are looking for ways to optimise the operations so that the consumer is always kept busy and has components to handle from the very first request it makes. This basic modelling provides us with the means to adapt to the granularity and the extent of the architecture and to proceed with further environment parameterisation.

Figure 6: Modelling resource allocation. We determine for which size of a decoding queue in a decoding queue collection the synchronisation delays are minimal.

Let $T_s$ reflect the time required to start up a process. Consequently, let $T_s^{sch}$ and $T_s^{dcr}$ be the times required to start up a scheduler and a decoder, respectively. Let $T(n)$ designate the time to handle $n$ elements. In accordance with this, $T^{sch}(n)$ is the time to read and process $n$ elements from the decoding queue, and $T^{dcr}(n)$ is the time to fill a decoding queue.

Let us assume that the time to append one element to the decoding queue is constant and independent of the current system state; let us refer to it as $t_a$. Let also $t_{prod}$ and $t_{cons}$ reflect the time required to produce one element by the decoder and to consume one element by the scheduler. From this, the following relation holds:

    $T^{dcr}(n) = n\,(t_a + t_{prod}) \qquad (3)$

Our goal is to find the size $n$ of the decoding queue that provides elements to the scheduling process upon its startup. Consequently, the following relation should be fulfilled:

    $T_s^{sch} \ge T_s^{dcr} + T^{dcr}(n) + T^{sem} \qquad (4)$

where $T^{sem}$ is the time to perform an atomic semaphore operation. Hence,

    $T^{dcr}(n) \le T_s^{sch} - T^{sem} - T_s^{dcr} \qquad (5)$

From Equations 3 and 5 we arrive at

    $n \le \frac{T_s^{sch} - T^{sem} - T_s^{dcr}}{t_a + t_{prod}} \qquad (6)$

Let us provide an example, based on test measurements of these parameters for the (1, 1) process-based implementation. Note that these values were obtained in a highly loaded system; for a dedicated system the figures may change accordingly. Note also that the measurements are based on average values obtained within a sequence of tests. In all of these, the first operation on the first message was found to be much heavier than the others, probably because of a cache miss. We account for this type of delay in the measurements. Substituting the measured values into (6) then yields the concrete, state-specific bound on the decoding queue size (7).

Figure 7: Time to handle a given number of messages. Case 1 corresponds to the sequential case. Case 2 represents the (1, 2) process-based implementation. Case 3 corresponds to the (1, 1) threaded implementation. Case 4 corresponds to the (1, 1) process-based implementation. Case 5 reflects the (1, 2) process-based implementation with optimisations applied.

Thus we have obtained a state-specific size for a decoding queue. It gives the approximate order of magnitude based on the measured parameters and can serve as a starting point for the specification of the rest of the environment. As an example, with the queue size chosen by this method, the performance gain is 4%.
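
As an illustration of how Equation (6) is applied in practice, the fragment below computes the bound on $n$; all numeric values are hypothetical placeholders, since the measured figures are specific to our test system.

    #include <stdio.h>

    int main(void)
    {
        double T_s_sch = 40e-3;  /* hypothetical scheduler startup time, s */
        double T_s_dcr = 10e-3;  /* hypothetical decoder startup time, s   */
        double T_sem   = 5e-6;   /* hypothetical semaphore operation, s    */
        double t_a     = 2e-6;   /* hypothetical append time per element   */
        double t_prod  = 28e-6;  /* hypothetical production time, s        */

        /* n <= (T_s_sch - T_sem - T_s_dcr) / (t_a + t_prod),   Eq. (6) */
        double n = (T_s_sch - T_sem - T_s_dcr) / (t_a + t_prod);
        printf("decoding queue size n <= %.0f\n", n);
        return 0;
    }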
5 Results
We can achieve up to 30% increased performance at the current level of investigation; see Figure 7. The tests show stable improvement and identify a (1, 2) architecture (1 decoder and 2 executors) as the optimum. With the optimisations we propose, we achieve a well-balanced execution with negligible synchronisation costs (see Case 5 in Figure 7).

In general, the threaded implementation shows the best efficiency at the lowest implementation cost due to the more efficient memory access, processor utilisation, and speed of synchronisation. However, no straightforward possibility has been found to increase the granularity of the thread-based parallelisation because of the limitations of the studied implementation. The absolute speed-up in the threaded case on 2 CPUs is 1.4, that is, an absolute efficiency of 0.7. For the optimal process-based architecture the absolute speed-up is 1.3 on 3 CPUs, with an absolute efficiency of 0.4. Note that, given the parallelisable fraction of 30%, Amdahl's law identifies a maximum possible speed-up of 1.5.
6 Future Work
One way to improve the performance is to trade off general efficiency against replicated computations. This implies a new agglomeration schema with partial replication (namely, of the decoding work) based on the implemented threaded solution. The parallel software architecture would then consist of $N$ decoder-executor pairs, each being a clone of the threaded implementation discussed in Section 4.1. The decoder part is replicated and the executor part is distributed across the processes. Every clone (process) would thus decode all messages but interpret only those events that are directed to its local test program instances. An additional step is to implement an efficient data retrieval mechanism; broadcasting of messages could be a solution.
References

[1] G.M. Amdahl: Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20). AFIPS Press, Reston, Va., 1967, pp. 483-485.

[2] B.N. Bershad, T.E. Anderson, E.D. Lazowska, H.M. Levy: User-Level Interprocess Communication for Shared Memory Multiprocessors. ACM Transactions on Computer Systems, vol. 9, no. 2, May 1991, pp. 175-198.

[3] M. Chalabine: Parallelisation of the Interpreter in a Test System for Mobile Telecommunication Components. MSc thesis, Department of Computer Science, Linköping University, 2002.

[4] E.W. Dijkstra: Notes on Structured Programming. In Structured Programming by Dahl, Dijkstra, and Hoare, Academic Press, 1972.

[5] T. Dean, L. Greenwald: A formal description of the transportation problem. Technical Report CS-92-14, Department of Computer Science, Brown University, March 1992.

[6] I. Foster: Designing and Building Parallel Programs. Addison-Wesley, 1995.

[7] H.R. Lewis, C.H. Papadimitriou: Elements of the Theory of Computation. Prentice-Hall, 1981.

[8] J. Misra: Distributed discrete-event simulation. ACM Computing Surveys, vol. 18, no. 1, 1986, pp. 39-65.

[9] E. Stenman, K. Sagonas: On Reducing Interprocess Communication Overhead in Concurrent Programs. ACM SIGPLAN Erlang Workshop, 2002.

[10] http://www.gsmworld.com

[11] Ericsson AB, Center for Radio Network Control, Simulations Products and Solutions, Box 1248, 581 12 Linköping, Sweden.