Optimising Intensive Interprocess Communication in a Parallelised Telecommunication Traffic Simulator

Mikhail Chalabine, PELAB, Linköping University, Sweden
Christoph Kessler, PELAB, Linköping University, Sweden
Staffan Wiklund, Ericsson AB, Sweden

Keywords: intensive, telecommunication, parallel, interprocess, optimisation.

Abstract

This paper focuses on efficient user-level handling of intensive interprocess communication in distributed parallel applications that are characterised by a high rate of data exchange. A common feature of such systems is that a parallelisation strategy focusing on the largest parallelisable fraction also results in the highest possible rate of interprocess communication, compared to other parallelisation strategies. An example of such applications is the class of telecommunication traffic simulators, where this partition-communication phenomenon reveals itself due to the strong data interdependencies among the major parallelisable tasks, namely the encoding, decoding, and interpretation of messages.

1 Introduction

When building a parallel software architecture one attempts to balance functional and domain decomposition so as to minimise the resulting interprocess communication and to maximise the utilisation of every processor unit available in the system. The efficiency of interprocess communication in shared memory multiprocessor servers is generally limited by two factors: the performance of the kernel-based synchronised data passing mechanisms, and the algorithmic solutions at the user level. Various optimisation attempts try to overcome these limitations by, for example, reducing communication latency in highly concurrent programs [9], by developing a user-level interprocess communication layer for contemporary operating systems, or, closer to our ideas, by targeting shared memory architectures, as in the case of User-Level Remote Procedure Call [2]. Despite many successful approaches to bypassing the kernel level, for many classes of applications, such as telecommunication traffic simulators, user-level optimisation remains vital, as the exploitation of massive parallelism is unavoidable in this area, given the throughput requirements of third-generation mobile communication devices [3].

This paper adds to the discussion on finding an optimal parallelisation strategy for interpretation-based telecommunication traffic simulators. Amdahl's law [1] says that the maximum possible speedup is generally limited by the amount of parallelisable code. Accordingly, for interpretation-based telecommunication traffic simulators, we pinpoint the decoding of messages, the encoding of messages, and the interpretation of messages by execution of an action code¹ as the main candidates for parallelisation [3]. Once parallelised, however, these tasks involve a large amount of interprocess communication, where no speedup can be achieved because of the related overhead, as the kernel invocation performed by existing communication abstractions magnifies synchronisation delays to an unreasonable level [4]. While the traditional way out is simply to discard parallelisation in such cases, we propose to tackle this problem by an algorithmic user-level optimisation. We support our discussion with empirical measurements obtained in the parallelisation of the Test and Simulation System (TSS), a telecommunication traffic simulator developed by Ericsson AB, Sweden [11].
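For later reference, the bound in question can be written out as follows; the notation (f for the parallelisable fraction, N for the number of processors) is ours, introduced here only for illustration:

% Amdahl's law: speedup on N processors when a fraction f of the
% work is parallelisable and the remaining 1 - f stays sequential.
S(N) = \frac{1}{(1 - f) + f/N},
\qquad
S_{\max} = \lim_{N \to \infty} S(N) = \frac{1}{1 - f}.

For the decoding fraction measured later in this paper (about 30%), the asymptotic bound is thus roughly 1.4-1.5.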
As the major result, with a minimal interprocess communication rate of 20000 msg./s and with a runtime restriction on the message size, we achieve an absolute speedup of 1.3, with 1.5 being the upper bound predicted by Amdahl's law for a parallelisable fraction of 0.3. In these tests our environment parameterisation technique secures at least 4% in performance improvement, given 25% in divergence from the non-optimised parameterisation (see Section 4.3.2). Furthermore, the access aggregation technique (see Section 4.3.1) scales memory access efficiency by a factor of n, where n is the block size of the segmented storage.

The remainder of this paper is organised as follows. In Section 2, we define a telecommunication traffic simulator and introduce the Test and Simulation System. In Section 3, we define a mathematical model as background for our discussion and elaborate on the problem formulation. In Section 4, we introduce the different approaches that we have investigated, and sketch the implementation ideas and the optimisation methods. In Section 5, we present the results. Section 6 discusses future work and concludes.

¹ The studied simulator uses a byte code interpretation technique to handle protocol messages. Building hardcoded solutions is an alternative to this.

2 TSS

Definition 2.1 A telecommunication traffic simulator is a tool
1. capable of generating formatted data flows specific to the set of nodes that comprise the given telecommunication network, and
2. containing mechanisms for the analysis of data exchanges between the real and simulated nodes.

Figure 1: A test case with two test programs simulating a Mobile Switching Center (MSC) and Base Transceiver Stations with Mobile Stations attached. The Base Station Controller (BSC) here is the System Under Test.

Consequently, the Test and Simulation System (TSS) is a telecommunication traffic simulator for mobile telecommunication components for different types of mobile networks (GSM 900/1800/1900, European & American GSM, TDMA & PDC (Japanese), 3G); see [10]. In the GSM (Global System for Mobile Communications) case, for instance, the set of nodes consists of Mobile Stations, Base Transceiver Stations, Base Station Controllers and Mobile Switching Centres [3]. Thus, for GSM, TSS can functionally represent any of the main intrinsic nodes, emulating, for instance, ten thousand mobile stations executing simultaneous calls. The real node is called the System Under Test (SUT). The simulator is controlled by a set of instructions written in a high-level programming language, which forms an event-driven Test Program (TP) describing the behaviour of the simulated node.
Note that there can be several test programs that run simultaneously, and that a test program can be active within several instances [3]. Such a set of test programs and test program instances is said to form a test case (see Figure 1).

The decoding and encoding components in the interpreter handle traffic exchanges; the other part of the interpreter organises the general execution stream. The interpreter can be described by a number of finite state machines with events as the basic control mechanism. When executed, such events force the system to undergo state changes and produce actions that can generate new events and new actions, and so on. The system is run on its own hardware under the VxWorks real-time operating system. Currently the software comprises about 3.5 million lines of code. In our tests we used a test skeleton operating under the Solaris operating system to free the tests from the delays of external communications and hardware setups.

We identify two parallelisation strategies [3]. The first one parallelises the incorporated automata by building a distributed architecture. The other one concentrates on the decoding and interpretation of protocol primitives, as motivated by Amdahl's law. We discuss both strategies based on a general mathematical model. See [3] for an extended introduction to the topic.

3 Architecture

In [3] we described the TSS logical software architecture and identified the Interpreter, belonging to the Execution function block in the TSS software, as the main performance bottleneck (see Figure 2).

Figure 2: TSS Logical Software Architecture.

The functional architecture of the Interpreter is as follows. Consider an action space A representing the set of all possible actions taken in a test program. Consider also an instance space I comprising all test program instances in a test case, and a state space S being the set of all reachable states a test program can take residence in. Then an observable output, generated by the system in response to a received protocol primitive, is formalised through an event (event action), an element of the event set E, the collection of all events defined in a test program. We will refer to such a formalisation process, caused in particular by a protocol primitive, as decoding.

A received protocol primitive in TSS is handled as follows. As the first step, the interpreter performs decoding of the pair (event, test program instance) embedded in the message, and then routes the event to the addressed instance for execution. We note here that one could view decoding as a reversible function d : P -> E, where P is the set of all adopted protocol primitives and E is the collection of all events defined in a test program. In more detail, taking the addressed instance into account, this can be rewritten as follows:

    d : P \to E \times I, \quad d(p) = (e, i), \quad e \in E, \; i \in I    (1)

In fact, TSS implements a state matrix that provides a mapping t : S \times E \to S \times 2^{A}. Thus, in response to the system being present in state s, an event e determines an action code that takes the system to a new state s', producing a set of actions A' \subseteq A. This reflects the second step performed by the interpreter (interpretation of messages) in traffic handling. Thus we have defined a deterministic finite state transducer (DFST) [7], and shown that the interpreter is essentially defined by the functions d and t.

Figure 3: The general view on functional and domain decomposition. DCR stands for Decoder, with functionality defined by d. SCH stands for Scheduler (Executor), which executes the received event in the addressed test program instance.

Parallelisation of the decoding functionality, that is, of the function d, means that we separate the identification of the addressed instance and of the event to be executed in that instance from the actual execution. The general functional decomposition in this case is shown in Figure 3.
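To make the two functions concrete, here is a minimal C sketch of the interpreter core as we have formalised it. All type and function names (protocol_primitive_t, decode, transduce, and so on), as well as the stubbed message layout, are our illustrative assumptions, not TSS code:

#include <stddef.h>

/* Illustrative types for the spaces P, E, I, S and A. */
typedef struct { const unsigned char *bytes; size_t len; } protocol_primitive_t;
typedef int event_t;     /* element of E */
typedef int instance_t;  /* element of I */
typedef int state_t;     /* element of S */
typedef struct { int codes[8]; int count; } action_set_t; /* subset of A */

/* Decoding d : P -> E x I (Equation 1). Stubbed here; a real decoder
 * parses the primitive's actual byte layout. */
static void decode(const protocol_primitive_t *p, event_t *e, instance_t *i)
{
    *e = p->bytes[0];   /* event id, hypothetical layout    */
    *i = p->bytes[1];   /* addressed test program instance  */
}

/* Transduction t : S x E -> S x 2^A (the state matrix). Stubbed as a
 * trivial transition; a real DFST consults the state matrix. */
static state_t transduce(state_t s, event_t e, action_set_t *out)
{
    out->count = 0;     /* a real transducer emits actions here */
    return s + e;       /* illustrative next state              */
}

/* One traffic-handling step of the sequential interpreter. */
void handle_primitive(const protocol_primitive_t *p, state_t *states)
{
    event_t e;
    instance_t i;
    action_set_t actions;

    decode(p, &e, &i);                             /* step 1: decoding       */
    states[i] = transduce(states[i], e, &actions); /* step 2: interpretation */
    /* ...generated actions may in turn emit new events... */
}

Parallelising d then amounts to running decode and transduce in separate threads or processes, with the (event, instance) pairs as the traffic between them.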
Another possible parallelisation strategy is to distribute the functionality of the transduction function t over a number of parallel streams. This implies partitioning the transducer such that each of the streams takes control over a set of states and over the transitions among them. This method could provide reasonable improvements but requires the creation of a partitioning algorithm. In this paper we focus on the first strategy, as motivated by Amdahl's law.

4 Implementations

Our approach to functional decomposition implies that for every received protocol primitive the decoder identifies an addressed instance and an event. Our measurements [3] show that decoding accounts for 30% of the computational load. By pipelining the decoder and the rest of the interpreter, this 30% could be performed in parallel. Subsequently, this implies that for every activation of d the results of the computation have to be transferred to the corresponding scheduler, which creates an intensive data flow in the system, caused by the high data rate required to perform adequate load tests on existing network components [3].

We have tested two applicable technical platforms for parallelisation: Posix threads and Solaris operating system processes. The general structural decision implies at least one decoder and one executor in the system, interconnected by a communication channel, which corresponds to a producer-consumer model. In the case of the process-based implementation we adopted a shared memory multiprocessor architecture (4-processor Sun server, 450 MHz) [3].

4.1 Threaded Implementation

The functional decomposition implies two parallel tasks, namely decoding and execution. The communication channel is implemented as a dynamically sized queue of elements interconnected as a doubly-linked list; a minimal sketch is given below. Both decoder and executor operate on the same address space (see Figure 3). The threaded solution gives very promising results with a partitioning granularity of size two. Unfortunately, the current implementation of the Interpreter in TSS does not allow a simple increase in the number of threaded decoders and executors because of possible deadlocks on memory access. These may arise due to the incorporated principles of obtaining and releasing memory [3]. We do, however, propose a solution for this problem in Section 6. Comparative results are given in Figure 7. For implementation details see [3].
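A minimal sketch of such a channel, assuming a pthreads mutex/condition-variable discipline; the node layout and all names are our illustration, not the TSS source:

#include <pthread.h>

/* One decoded (event, instance) pair travelling from decoder to executor. */
typedef struct qnode {
    int event;
    int instance;
    struct qnode *prev, *next;   /* doubly-linked, as in Section 4.1 */
} qnode_t;

typedef struct {
    qnode_t *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} channel_t;

/* Decoder side: append a decoded element and wake the executor. */
void channel_put(channel_t *ch, qnode_t *n)
{
    pthread_mutex_lock(&ch->lock);
    n->next = NULL;
    n->prev = ch->tail;
    if (ch->tail) ch->tail->next = n; else ch->head = n;
    ch->tail = n;
    pthread_cond_signal(&ch->nonempty);
    pthread_mutex_unlock(&ch->lock);
}

/* Executor side: block until an element is available, then take it. */
qnode_t *channel_get(channel_t *ch)
{
    pthread_mutex_lock(&ch->lock);
    while (ch->head == NULL)
        pthread_cond_wait(&ch->nonempty, &ch->lock);
    qnode_t *n = ch->head;
    ch->head = n->next;
    if (ch->head) ch->head->prev = NULL; else ch->tail = NULL;
    pthread_mutex_unlock(&ch->lock);
    return n;
}

With one decoder thread calling channel_put and one executor thread calling channel_get, this reproduces the (1, 1) producer-consumer pattern; note that every element costs one lock acquisition on each side, which is exactly the overhead that Section 4.3 amortises.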
4.2 Process-Based Implementation

The process-based implementation allows more producers and consumers in the system than the threaded implementation does. Two cases were studied for Solaris operating system processes. The first one is analogous to the threaded solution: we again built the system with a partition granularity of size 2 (see Figure 4). The other solution is an extension to a one-to-many architecture with multiple executors (see Figure 5). We use semaphores for synchronisation and shared memory to implement the communication channel. For the latter there exist a number of alternatives, such as message passing and pipes. In our case, however, the use of queues in shared memory is expected to give better control over the data flows [3]. Based on that, we developed a framework to reduce the synchronisation delays, as discussed in the following subsection.

Figure 4: Here the Decoder (DCR-1) performs only partial decoding of incoming messages. The execution unit (Executor-1) completes the decoding of the addressed test program instance and executes the decoded event there.

4.3 Optimisation Methods

We build a queue in shared memory between the producer (decoder) and the consumers (executors). To reduce the synchronisation overhead we aggregate access to this queue in bunches of elements.

4.3.1 Memory Access Method

Let us assume a general case where a system consists of one decoder and k executors. For every executor a group of shared memory segments is allocated. Such a segment contains n elements, each forming a basic transportation entity implementing the data flow. Let us call such an array a Decoding Queue. The size of the group is the total number of segments; let us refer to it as m in the general case. Consequently, a group of decoding queues is said to form a Decoding Queue Collection of size m. The structures discussed are shown in Figure 6.

This allows us to utilise one semaphore per decoding queue collection for bulk access to the elements of a decoding queue. This means that, in the best case, access to the n elements within a decoding queue is handled by one atomic semaphore operation, where the best case is as follows. Let t_{lock} denote the time overhead required to lock a single element in shared memory. Let f be the number of failed requests for the semaphore performed by the executor until access is granted. Such requests could, for instance, check whether a decoding queue is available for service (availability check). Let us also assume that the time for an availability check is equal to the time of a locking operation (t_{lock} seconds).² Then the time-efficiency of memory access synchronisation in terms of locking can be presented as the ratio of the time required to process n individual accesses to the time of a single lock operation increased by the number of availability requests:

    E = \frac{n \, t_{lock}}{t_{lock} (f + 1)} = \frac{n}{f + 1}    (2)

Thus the efficiency can be maximised by minimising the number of unsuccessful availability checks. Equation (2) shows that in the best case, f = 0, the efficiency of locking is n times higher than in one-by-one locking, which corresponds to the case when n elements are handled at the cost of one semaphore operation. The idea is that an executor could read from decoding queue q_i, belonging to decoding queue collection Q, while the decoder fills in decoding queue q_{i+1} from the same collection Q. A minimisation of the number of unsuccessful availability requests performed in the system could be achieved by identifying the right size n of the decoding queue. Furthermore, this could be extended to take into account other parameters, such as the decoding queue collection size m and the dimension of the architecture, which is basically the number of executors running in the system. Another possibility is to adopt local queues in the decoder and the scheduler and to pass a group of elements through the shared memory at a time. This could also be handled at the cost of one semaphore operation. In comparison, however, the adopted solution allows decoding to be performed directly into the shared memory, which is more efficient than copying elements and keeping additional data structures for local queues on both sides.

² This assumption naturally follows from our implementation, where an availability check is a semaphore operation itself.
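The aggregation can be sketched as follows with POSIX semaphores; this is a simplified single-collection layout, with all names, the fixed sizes, and the polling availability check being our assumptions (semaphore initialisation with sem_init and the decoder's sem_post side are omitted):

#include <semaphore.h>

#define N_ELEMENTS 64          /* n: block size of a decoding queue       */
#define M_QUEUES    4          /* m: queues per decoding queue collection */

typedef struct { int event; int instance; } element_t;

typedef struct {
    element_t slots[M_QUEUES][N_ELEMENTS];
    sem_t full[M_QUEUES];      /* posted by the decoder when queue j is filled */
} collection_t;                /* semaphores per collection, not per element   */

/* Executor: one successful semaphore operation grants n elements at once,
 * giving the n/(f+1) efficiency of Equation (2). sem_trywait models the
 * availability check; each failure increments f. */
int take_queue(collection_t *c, int *failed_checks)
{
    for (;;) {
        for (int j = 0; j < M_QUEUES; j++) {
            if (sem_trywait(&c->full[j]) == 0)
                return j;      /* queue j is now owned by this executor */
            (*failed_checks)++;
        }
    }
}

In the best case the decoder has already posted full[j] when the executor arrives, so the first check succeeds (f = 0) and all n elements of queue j are then read without further semaphore traffic.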
4.3.2 Modelling Resource Allocation

Our goal is to optimise the retrieval of messages by the scheduler. Consequently, whatever the number of available processors, from the point of view of timing we would like a scheduler to start processing messages as soon as all the initialisations are done. Thus, we require at least one decoding queue to be filled during the time the related scheduler is loaded.

Figure 5: General one-to-many architecture based on processes. The decoder (DCR) handles the decoding of incoming traffic and passes the decoded data to the correct executor, which handles the addressed instance.

Figure 6: Modelling resource allocation. We determine for which size n of a decoding queue in a decoding queue collection the synchronisation delays are minimal.

We will build our model for the case of a (1, 1) architecture and try to find the size of the decoding queue that answers the purpose. In this case our model can be considered an analogue of the general transportation problem [5] (Figure 6). The difference is that, instead of finding the flow of least cost that ships from supply sources to consumer destinations, we are looking for ways to optimise the operations so that the consumer is always kept busy and has components to handle from the very first request it makes. This basic modelling provides us with the means to adapt to the granularity and the extent of the architecture and to proceed with further environment parameterisation.

Let T_{start} reflect the time required to start up a process. Consequently, let T^{sch}_{start} and T^{dcr}_{start} be the times required to start up a scheduler and a decoder, respectively. Let T^{n} designate the time to handle n elements. In accordance with this, T^{n}_{sch} is the time to read and process n elements from the decoding queue and T^{n}_{dcr} is the time to fill a decoding queue.

Let us assume that the time to append one element to the decoding queue is constant and independent of the current system state; let us refer to it as t_{app}. Let also t_{prod} and t_{cons} reflect the times required to produce one element by the decoder and to consume one element by the scheduler, respectively. From this, the following relation holds:

    T^{n}_{dcr} = n \, (t_{prod} + t_{app})    (3)

Our goal is to find the size n for the decoding queue that provides elements to the scheduling process upon its startup. Consequently, the following relation should be fulfilled:

    T^{dcr}_{start} + T^{n}_{dcr} + t_{sem} = T^{sch}_{start}    (4)

where t_{sem} is the time to perform an atomic semaphore operation. Hence,

    T^{n}_{dcr} = T^{sch}_{start} - t_{sem} - T^{dcr}_{start}    (5)

From Equations 3 and 5 we arrive at

    n = \frac{T^{sch}_{start} - t_{sem} - T^{dcr}_{start}}{t_{prod} + t_{app}}    (6)

Let us provide an example, based on the test measurements for the (1, 1) process-based implementation. Note that these values were obtained in a highly loaded system; for a dedicated system the figures may change accordingly. Note also that the measurements are based on average values obtained within a sequence of tests. In all of these, the first operation on the first message was found to be much heavier than the others, probably owing to a cache miss; we account for this type of delay in the measurements. Substituting the measured startup, semaphore, production and append times into (6) yields a state-specific size for the decoding queue (Equation 7). It gives the approximate order of magnitude based on the measured parameters, and it can serve as a starting point for the specification of the rest of the environment. As an example, with n chosen by this method the performance gain is 4%.
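Equation (6) is straightforward to evaluate once the five times have been measured; a small sketch, where the numeric values below are purely hypothetical placeholders and not the measurements of this paper:

#include <math.h>
#include <stdio.h>

/* Equation (6): decoding queue size n such that one queue is filled
 * while the scheduler starts up. All times in microseconds. */
static double queue_size(double t_start_sch,  /* scheduler startup  */
                         double t_start_dcr,  /* decoder startup    */
                         double t_sem,        /* semaphore op       */
                         double t_prod,       /* produce 1 element  */
                         double t_app)        /* append 1 element   */
{
    return (t_start_sch - t_sem - t_start_dcr) / (t_prod + t_app);
}

int main(void)
{
    /* Hypothetical illustration values only. */
    double n = queue_size(20000.0, 5000.0, 50.0, 30.0, 10.0);
    printf("decoding queue size n = %.0f elements\n", ceil(n));
    return 0;
}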
Figure 7: Time to handle the test messages. Case 1 corresponds to the sequential case. Case 2 represents the (1, 2) process-based implementation. Case 3 corresponds to the (1, 1) threaded implementation. Case 4 corresponds to the (1, 1) process-based implementation. Case 5 reflects the (1, 2) process-based implementation with the optimisations applied.

5 Results

We can achieve up to 30% increased performance at the current level of investigation; see Figure 7. The tests show a stable improvement and identify a (1, 2) architecture (1 decoder and 2 executors) as the optimum. With the optimisations we propose, we achieve a well-balanced execution with negligible synchronisation costs (see Case 5 in Figure 7).

In general, the threaded implementation shows the best efficiency at the lowest implementation cost, owing to more efficient memory access, processor utilisation, and speed of synchronisation. However, no straightforward possibility has been found to increase the granularity of the thread-based parallelisation because of the limitations of the studied implementation. The absolute speedup in the threaded case on 2 CPUs is 1.4, that is, an absolute efficiency of 0.7. For the optimal process-based architecture the absolute speedup is 1.3 on 3 CPUs, with an absolute efficiency of 0.4. Note that, given the parallelisable fraction of 30%, Amdahl's law identifies a maximum possible speedup of 1.5.

6 Future Work

One way to improve the performance is to trade off general efficiency against replicated computations. This implies a new agglomeration schema with partial replication (namely, of the decoding work) based on the implemented threaded solution. The parallel software architecture would then consist of k decoder-executor pairs, each being a clone of the threaded implementation discussed in Section 4.1. The decoder part is replicated and the executor part is distributed across the processes. Every clone (process) would thus decode all messages but interpret only those events that are directed to its local test program instances. An additional step is to implement an efficient data retrieval mechanism; broadcasting of messages could be a solution.

References

[1] G.M. Amdahl: Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20), AFIPS Press, Reston, Va., 1967, pp. 483-485.

[2] B.N. Bershad, T.E. Anderson, E.D. Lazowska, H.M. Levy: User-Level Interprocess Communication for Shared Memory Multiprocessors. ACM Transactions on Computer Systems, vol. 9, no. 2, May 1991, pp. 175-198.

[3] M. Chalabine: Parallelisation of the Interpreter in a Test System for Mobile Telecommunication Components. MSc thesis, Department of Computer Science, Linköping University, 2002.
[4] E.W. Dijkstra: Notes on Structured Programming. In Structured Programming by Dahl, Dijkstra, and Hoare, Academic Press, 1972.

[5] T. Dean, L. Greenwald: A Formal Description of the Transportation Problem. Technical Report CS-92-14, Department of Computer Science, Brown University, March 1992.

[6] I. Foster: Designing and Building Parallel Programs. Addison-Wesley, 1995.

[7] H.R. Lewis, C.H. Papadimitriou: Elements of the Theory of Computation. Prentice-Hall, 1981.

[8] J. Misra: Distributed Discrete-Event Simulation. ACM Computing Surveys, vol. 18, no. 1, 1986, pp. 39-65.

[9] E. Stenman, K. Sagonas: On Reducing Interprocess Communication Overhead in Concurrent Programs. ACM SIGPLAN Erlang Workshop, 2002.

[10] http://www.gsmworld.com

[11] Ericsson AB, Center for Radio Network Control, Simulations Products and Solutions, Box 1248, 581 12 Linköping, Sweden.