Measuring Media Gateway Software Efficiency Using Performance Monitor Counters
Mikko Viitanen
S-38.310 Thesis seminar on networking technology
Helsinki University of Technology
12.05.2005

Basic Information
• Thesis written at Oy L M Ericsson Ab, Finland
• Supervisor: Professor Jörg Ott
• Instructors: M. Sc. Stefan Blomqvist, M. Sc. Dietmar Fiedler

Contents
• Background
• Problem Description
• Objectives
• Scope
• Performance Monitor Counters
• Memory Hierarchy
• Measurement Environment
• Results
• Future Work

Background (1/2)
• The Universal Mobile Telecommunications System (UMTS) is a third generation mobile network standard specified by the 3rd Generation Partnership Project (3GPP)
• The UMTS network is based on GSM and GPRS
• UMTS specifications and features are grouped into releases
– Releases enable vendors to build interoperable networks

Background (2/2)
• Release 4 introduced the layered network architecture
– The Mobile services Switching Centre (MSC) was divided into the MSC server and the Circuit-Switched Media Gateway (CSMGW).
– The MSC server handles the call control.
– The Media Gateway (MGW) handles the media and the bearer control.

Problem Description
• The Media Gateway is a real-time multiprocessor system
• A common problem in complex systems is how to verify and measure software performance
• Performance monitor counters offer a way to monitor code efficiency at the processor level
• The following problems are dealt with in this thesis:
– What kinds of efficiency problems can be found by using the performance monitor counters?
– What kinds of programming methods should be used to reach better results than before?

Objectives
• The purpose is to obtain results that can be used to find efficiency problems in the MGW's software
• Find ways to improve system performance

Scope
• The MGW's software is introduced
• The software development tools used in MGW software development are presented
• Overall software performance issues are discussed
• The Performance Monitor Counters measurement method is explained

Performance Monitor Counters (1/2)
• Performance Monitor Counters (included in many PowerPC family processors) are special registers used for performance measurement.
• The measurements are made at runtime. The processor increments the registers when monitored events occur.
• Because the method uses special resources built into the processor that work in parallel with the others, it does not affect system performance, which is why it can provide very realistic results.

Performance Monitor Counters (2/2)
The following events can be measured:
• Completed instructions per processor clock cycle
• Memory hierarchy behavior (e.g. cache misses)
• Usage of different execution units
• Types of instructions dispatched
• Branch predictions
• etc.

Memory Hierarchy
• Fetching data from different parts of the memory system requires different amounts of time (cycles):
– Registers: 0 cycles
– L1 cache: 7 cycles
– L2 cache: 18 cycles
– Main memory: 70 cycles
Source for estimations: IBM PowerPC 740 / PowerPC 750 RISC Microprocessor User's Manual.

Measurement Environment (1/2)
• The first M-MGW (a complete node) is the System Under Test (SUT). The second M-MGW is a dummy that is not connected to any access networks; it only answers the SUT's requests.
• Several Catapult DCT2000s initiate all the traffic (acting as UTRAN/GERAN simulators).
• UPLoad generates user plane traffic according to the Q.AAL2 signaling received from the Catapults.
• The MSC server is a real node, which is controlled by the Catapults. It manages both M-MGWs.
• TTCN is used to initialize the PMC measurement procedure by activating the PMC registers and specifying the measured events.

Measurement Environment (2/2)
[Figure: measurement environment diagram. Elements: three Catapult DCT2000s, MSC server, UPLoad, M-MGW (system under test), second M-MGW, TTCN. Interface labels: RANAP, GCP, Q.AAL2, AAL2, Nb (ATM, TDM), PMC control.]

Results (1/3)
• L2 instruction cache misses have a quite severe effect on IPC (instructions per clock cycle)
– Most probably the main reason for the large delay is that when an L2 instruction cache miss occurs, the processor cannot execute the following instructions, because the missing instruction can affect them. The processor has to wait until the missing instruction is available.
[Figure: IPC (%), axis 0 to 90, plotted against L2 instruction misses (%), axis 0 to 35.]

Results (2/3)
• Different amounts of load have a fairly small effect on the IPC values in general. However, some measurement points are strongly affected when the load increases.
• What do the points that got much better IPC values under high load have in common? They all contain data structure operations, such as searches, additions and removals. When the system is under high load, the number of elements in these data structures is considerable, and managing the data structures can be done efficiently from the processor's point of view.

Results (3/3)
• The amount of code in an operation has an effect on the IPC value. The lengths of the measured pieces of code differ considerably.
– The use of complicated state machines is the main reason for low IPC values in short operations. When code is generated from a state machine with small pieces of code, the program is very fragmented (contains numerous small blocks).
[Figure: IPC (%), axis 0 to 100, plotted against processor cycles, axis 0 to 120000.]

Future Work
• Topics for future work:
– Comparing the results with other pieces of software that are implemented using different development tools.
– The comparison can also be made using different processors. For instance, a similar processor with double-sized L1 and L2 caches would surely give different results.

Thank you!
Questions or comments?
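
As a supplement to the Performance Monitor Counters and Memory Hierarchy slides, here is a minimal Python sketch of how two counter readings could be turned into an IPC figure, and how the listed latencies translate into an expected cost per memory access. It is an illustration only, not the thesis's measurement code: `CounterSample`, `ipc` and `average_access_cycles` are hypothetical names, the counters are assumed not to wrap between the two readings, and the latencies are taken in the order they appear on the Memory Hierarchy slide. The slides plot IPC in percent without stating the normalisation, so the sketch returns a plain ratio.

```python
from dataclasses import dataclass

# Fetch latencies in cycles, in the order listed on the Memory Hierarchy
# slide (assumption: the figures map one-to-one onto the listed levels).
L1_CYCLES = 7
L2_CYCLES = 18
MEMORY_CYCLES = 70


@dataclass
class CounterSample:
    """Hypothetical snapshot of two performance monitor counters."""
    cycles: int        # processor clock cycles counted so far
    instructions: int  # completed instructions counted so far


def ipc(start: CounterSample, end: CounterSample) -> float:
    """Completed instructions per clock cycle between two snapshots."""
    elapsed = end.cycles - start.cycles
    if elapsed <= 0:
        raise ValueError("end sample must be taken after start sample")
    return (end.instructions - start.instructions) / elapsed


def average_access_cycles(l1_hit_rate: float, l2_hit_rate: float) -> float:
    """Expected cycles per fetch.

    l2_hit_rate is the fraction of L1 misses that hit in L2; the rest
    are served from main memory.
    """
    l1_miss = 1.0 - l1_hit_rate
    return (l1_hit_rate * L1_CYCLES
            + l1_miss * l2_hit_rate * L2_CYCLES
            + l1_miss * (1.0 - l2_hit_rate) * MEMORY_CYCLES)


if __name__ == "__main__":
    before = CounterSample(cycles=1000, instructions=400)
    after = CounterSample(cycles=3000, instructions=1400)
    print("IPC:", ipc(before, after))
    print("cycles per fetch:", average_access_cycles(0.9, 0.5))
```

Under this simple model, a fetch that misses both caches costs ten times as much as an L1 hit, which is consistent with the observation on the Results slides that L2 instruction misses pull IPC down sharply.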