PERFORMANCE ANALYSIS OF A HARDWARE QUEUE IN SIMICS

A Project Presented to the faculty of the Computer Engineering Program, California State University, Sacramento, submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in Computer Engineering

by Mukta Siddharth Jain

SUMMER 2012

© 2012 Mukta Siddharth Jain. ALL RIGHTS RESERVED.

Approved by: Nikrouz Faroughi, Ph.D., Committee Chair; Behnam Arad, Ph.D., Second Reader

Student: Mukta Siddharth Jain. I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project. Suresh Vadhva, Ph.D., Graduate Coordinator, Computer Engineering Program

Abstract of PERFORMANCE ANALYSIS OF A HARDWARE QUEUE IN SIMICS by Mukta Siddharth Jain

A hardware queue is one way to facilitate inter-processor communication. Such a queue is used in a simulated dual-core system in this project. Simulation data is gathered and analyzed to compare the performance of a dual-core system that uses a queue with that of a dual-core system that does not. The systems are simulated using Wind River Simics, a full-system simulator capable of functionally simulating a number of platforms based on different architectures. In this study, a software-controlled queue is modeled in the Python scripting language and used in the Simics simulation environment. A producer thread (written in C) communicates its computed data either through memory or via a queue to a consumer thread, also written in C. The threads are executed in the Simics simulation environment. The performance data and its analysis are reported for different values of computational delay and different queue sizes.

The results indicate that a hardware queue will, on average, increase performance as long as computational delays are small. The queue becomes more efficient as the number of data items communicated via the queue increases.

ACKNOWLEDGEMENTS

I would like to thank Dr. Nikrouz Faroughi for his constant guidance and encouragement. Thank you for taking time from your busy schedule and patiently reviewing my work and report multiple times. I would also like to thank Dr. Behnam Arad, who was kind enough to be my second reader. To my husband Siddharth, who always stood by me and was my pillar of support. I could not have completed this project without him. To my mother Medha and father Sanjiv, who have always taught me to be patient, to persevere and to work hard in all my endeavors. To my cousin Vrushali, who encouraged me to pursue higher education in the United States. To my brother Kaushik, who in his own way gave me hope even though he was far away in India. To my mother-in-law Rita, father-in-law Susheel and brother-in-law Saurabh for their encouragement and support during the stressful times. Lastly, special thanks to my friends Arti and Priyanka for all their help and support. Thank you for being there for me.

TABLE OF CONTENTS

Acknowledgments
List of Tables
List of Figures
List of Equations

Chapter
1. INTRODUCTION
   1.1 Literature Review
      1.1.1 Hardware queues in configurable processors
      1.1.2 JPEG encoding with a queue
   1.2 Objective and scope of the project
   1.3 Introduction to Simics
   1.4 Simulation Environment
   1.5 Project Overview
2. QUEUE SIMULATION
   2.1 Software queue model
   2.2 Target C programs
   2.3 Cache hierarchy in Simics simulations
3. SIMULATION DATA AND ANALYSIS
   3.1 Speed-up Calculation
   3.2 Simulation Data and Analysis
4. CONCLUSION
Appendix A. Simics Simulation Steps
Appendix B. Source Code Listing
Appendix C. Simulation Data
Bibliography

LIST OF TABLES

1. List of queue interface parameters
2. Cache and memory latencies
3. Read and write transaction penalties
4. Code segments in memory-based and queue-based programs
5. Simulation data measurements
6. Measurements for speed-up calculation
7. Queue overhead for no caches and 0 memory latency
8. Speed-up for D = 50, N = 100K and different queue sizes
9. Speed-up for D = 0, Q = 1K and different values of N
10. Speed-up for D = 50, Q = 1K and different values of N
11. Speed-up for N = 100K, Q = 1K and different values of D
12. Simulation data with cache and memory latencies, Part I
13. Simulation data with cache and memory latencies, Part II
14. Simulation data without caches and with 0 memory latency

LIST OF FIGURES

1. Multi-processor system with a queue
2. Multi-processor system without a queue
3. Hardware queue and its interface ports
4. Simics simulation without a simulated hardware queue
5. Simics simulation with a simulated hardware queue
6. Objdump of a delay for-loop used to simulate a computational delay
7. Cache hierarchies in simulated systems
8. Speed-up for N = 100K, D = 50 and Q = 1K and 10K
9. Speed-up for D = 0, Q = 1K queue and N = 100K and 1M
10. Speed-up for D = 50, Q = 1K and N = 100K, 500K and 1M
11. Speed-up for N = 100K, Q = 1K and different values of D

LIST OF EQUATIONS

1. Instructions executed by the delay for-loop
2. Cache and memory latencies in terms of CPU cycles
3. Measurements per data item with cache and memory latencies
4. Measurements per data item without caches and with 0 memory latency
5. Q_mem overhead
6. CPU cycles without Q_mem overhead
7. Speed-up calculation

Chapter 1
INTRODUCTION

One way to improve the performance of multi-processor systems is to introduce hardware queues for direct processor-to-processor communication. A queue enables synchronization between the processors acting as data producer and data consumer, and it reduces accesses to caches and main memory, which tend to be comparatively slower. This project models such a queue in the Simics environment, enabling one-way communication between two processors.

1.1 Literature Review

The following sections describe relevant work in the area of hardware queues: queues in configurable processors, and an example in which queue sizing was done to determine the optimal queue size for a sample application.

1.1.1 Hardware queues in configurable processors

Configurable processors may be used as building blocks for System-on-Chip (SoC) designs [1]. Such systems have additions in the form of custom-tailored instruction sets, custom-tailored execution units and specialized communication interface ports, such as data queues for direct processor-to-processor communication. These additions contribute to higher system performance than is achieved with conventional fixed instruction sets.
With a queue, the processors’ execution units exchange data directly via the specialized data queues. These queues have the highest bandwidth of any task-to-task communication mechanism and can potentially support data rates as high as one transfer per cycle.

1.1.2 JPEG encoding with a queue

In a research study, the task of JPEG encoding was mapped onto a five-processor MPSoC (Multi-processor System-on-Chip) system. Two of the five processors were part of the testbench for that system and acted as the source and sink for the JPEG encoding process [2]. The remaining three processors were arranged linearly and communicated with each other via hardware queues. The source processor converted the input picture file, i.e. pixel-map data, into stream data, which was fed to the three linearly connected processors. The sink processor ultimately converted the output of the linearly connected processors to the JPEG format. To determine the optimal queue size, 32x32, 64x64, 128x128 and 256x256 picture sizes were initially encoded using hardware queues up to 20K entries deep. With queues that large, no significant processor stalling due to full queues was observed. However, significant stalling was seen for the 128x128 and 256x256 picture sizes when a 100-deep queue was used. To size the queues more precisely, trace information was gathered for the various picture sizes and analyzed statistically. The analysis indicated that the maximum fill depth for all the queues was a little less than 500. Thus, with a 500-deep queue, all the resolutions were encoded without any significant processor stalling.

1.2 Objective and scope of the project

The objective of this project is to model a software-controlled hardware queue for processor-to-processor communication and compare its performance with that of a system that does not use a queue. In this project, we have coded generic C application programs with computational delays embedded before each push-to-queue or write-to-array operation and after each pop-from-queue or read-from-array operation. Figure 1 illustrates a multi-processor system that uses a queue for inter-processor communication, whereas Figure 2 illustrates a multi-processor system with no inter-processor communication queue. In both figures, processors P0 and P1 each have an L1 cache and a dedicated L2 cache and share the main memory M.

Figure 1 Multi-processor system with a queue

Figure 2 Multi-processor system without a queue

A hardware queue has dedicated interface ports for queue data (data_in, data_out), queue status (Q_full, Q_empty, push_done, pop_done) and queue control (push, pop), as illustrated in Figure 3. In this project, a software-controlled queue has been implemented such that the dedicated interface ports are simulated using memory.

Figure 3 Hardware queue and its interface ports

1.3 Introduction to Simics

“Wind River Simics is a fast, functionally-accurate, full system simulator. Simics creates a high-performance virtual environment in which any digital system – from a single board to complex, heterogeneous, multi-board, multi-processor, multi-core systems – can be defined, simulated.” [3] Simics is an instruction-set simulator, not a processor simulator [4]. It can simulate systems with most modern processors. Software developers can simulate the target hardware platform and study the behavior of software applications on it [5].

Simics has a special single-cycle no-operation instruction called “MAGIC(n)”, which can be used to insert a breakpoint in a user program. Simics stops the simulation at the point where the magic instruction has executed and invokes a user-installed callback function written in Python. From this callback function, various simulation performance data can be dumped or collected for later analysis. The simulation resumes once the execution of the callback function is complete. A minimal example of how such a callback is installed is sketched below.
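The following few lines, condensed from the full script in Appendix B, show how a Python callback is attached to the magic instruction. They assume the Simics 3.x Python shell, where the SIM_* API and the CPU register attributes (such as cpu.edi) are available; the sketch is not runnable outside Simics.

#Minimal sketch: attach a Python callback to the magic instruction (condensed
#from the queue-model script in Appendix B; requires the Simics 3.x Python shell).
def hap_callback(user_arg, cpu, arg):
    #The target program selects an operation by writing a value to the edi
    #register before executing MAGIC(0); see Table 1 in Chapter 2.
    if cpu.edi == 1:
        print "Initial CPU statistics for", SIM_get_attribute(cpu, "name")

SIM_hap_add_callback("Core_Magic_Instruction", hap_callback, 100)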
Simics usually runs in ‘normal’ mode unless it is changed to ‘stall’ mode or ‘micro-architecture’ mode. The normal mode is the fastest execution mode and is optimized such that all instructions, including memory transactions, complete in a single cycle. In the stall mode, however, when a Simics object receives a memory transaction, the timing of the transaction can be modified by returning a stall time in terms of CPU cycles. Cache models can therefore be modeled correctly in the stall mode. This project uses the stall mode to correctly compare the system performances with and without the hardware queue.

1.4 Simulation Environment

Figure 4 illustrates the simulated environment without the queue: a multithreaded test C program and Python script A run on the target (x86 Red Hat Linux) inside the Simics environment, which itself runs on the host (x86 Windows 7). Script A is only used to dump simulation data.

Figure 4 Simics simulation without a simulated hardware queue; script A does not include the software queue model.

The target C program creates two threads, one as a producer and the other as a consumer. The producer uses delays to simulate computations and writes a mock data item into an array. The following pseudo-code illustrates the producer function.

Producer_function: Write to an array
    magic_instruction(1)                  //Dump CPU cycles count
    for (i = 0; i < data_items; i++) {
        introduce_computational_delay;
        write_to_array;
    }
    magic_instruction(2)                  //Dump CPU cycles count

The consumer reads these intermediate results one at a time, uses delays to simulate computations and generates its own results. The following pseudo-code illustrates the consumer function.

Consumer_function: Read from an array
    magic_instruction(1)                  //Dump CPU cycles count
    for (i = 0; i < data_items; i++) {
        read_from_array;
        introduce_computational_delay;
    }
    magic_instruction(2)                  //Dump CPU cycles count

Figure 5 illustrates the simulated environment with an inter-processor queue modeled in the Python programming language. Along with dumping simulation data, script B also includes the queue model.

Figure 5 Simics simulation with a simulated hardware queue; script B includes the software queue model.

The target C program similarly creates two threads, one as a producer and the other as a consumer. The producer in this case pushes a mock data item after using delays to simulate a computation, and the consumer pops that mock data item and processes it using delays to simulate a computation.
Producer_function: Push to the queue
    magic_instruction(1)                  //Dump CPU cycles count
    for (i = 0; i < data_items; i++) {
        introduce_computational_delay;
        push_to_queue;
    }
    magic_instruction(2)                  //Dump CPU cycles count

Consumer_function: Pop from the queue
    magic_instruction(1)                  //Dump CPU cycles count
    for (i = 0; i < data_items; i++) {
        pop_from_queue;
        introduce_computational_delay;
    }
    magic_instruction(2)                  //Dump CPU cycles count

Simulation results are discussed in Chapter 3.

1.5 Project Overview

Chapter 2 covers the software queue model; Chapter 3 reports simulation results and covers analysis of the simulation data; Chapter 4 includes a conclusion and discusses potential future work. Appendix A covers the steps for performing a Simics simulation, Appendix B contains the source code listing, and Appendix C lists the consolidated simulation data.

Chapter 2
QUEUE SIMULATION

For a one-cycle queue, it was determined that the queue must be designed using registers; a queue using memory proved to be slow and may be unsuitable for data streaming.

2.1 Software Queue Model

The software queue is modeled in the Python scripting language and registered as a “hap” function attached to the magic instruction. A ‘hap’ indicates an event in Simics, such as the execution of a magic instruction. The magic instruction inserts a breakpoint in the C code, halts the Simics simulation, and invokes its corresponding “hap” function. The target C programs communicate with the Python module via the CPU registers. The “hap” function in turn calls call_queue (Appendix B), which implements the software queue. The functionality of the software queue is selected by writing a specific value n, from 1 through 6, to the “edi” CPU register before the magic instruction executes. Table 1 lists these values and their corresponding operations.

Table 1 List of queue interface parameters

Value written to edi register (n)    “hap” function / operation
1    Print CPU statistics at the start of the producer and consumer tasks
2    Print CPU statistics at the end of the producer and consumer tasks
3    Invoke the software queue for a push operation; adds 1 CPU cycle for the push to the overall CPU cycles
4    Invoke the software queue for a pop operation; adds 1 CPU cycle for the pop to the overall CPU cycles
5    Check whether the software queue is full
6    Check whether the software queue is empty

In this project, only a single push or pop operation is simulated at a time; simultaneous push and pop requests to the queue are not modeled. In a hardware queue, push and pop requests might happen simultaneously, causing one request to stall while the other completes. This cannot be modeled with the software queue written in Python, which runs outside the Simics simulation environment; such modeling requires creating a Simics module in C that runs within the Simics environment. Therefore, checking for push_done and pop_done is not implemented. However, it is assumed that such simultaneous push and pop operations happen infrequently and thus will not greatly affect the performance. The sketch below summarizes the queue semantics that the Python model implements.
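As a minimal, self-contained illustration of these semantics, the following sketch models the queue as a bounded FIFO in which each successful push or pop is charged one CPU cycle, and a full or empty queue forces the caller to retry. The class name SoftQueue is ours, purely for illustration; the actual model in Appendix B operates on the CPU registers through the Simics API instead.

QUEUE_DEPTH = 1024

class SoftQueue:                     #hypothetical name; see the script in Appendix B
    def __init__(self, depth=QUEUE_DEPTH):
        self.items = []
        self.depth = depth
        self.cycles = 0              #cycles charged for queue accesses

    def push(self, value):           #edi = 3 in Table 1
        if len(self.items) >= self.depth:
            return False             #queue full: producer must retry (stall)
        self.items.append(value)
        self.cycles += 1             #one cycle per push
        return True

    def pop(self):                   #edi = 4 in Table 1
        if not self.items:
            return None              #queue empty: consumer must retry (stall)
        self.cycles += 1             #one cycle per pop
        return self.items.pop(0)     #always pop from location 0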
2.2 Target C programs

Two target C programs are written: one runs on the simulated target system when a queue is used, and the other runs on the simulated target system when a queue is not used. The programs allow the user to configure parameters such as the computational delays of the producer and the consumer and the number of data items exchanged between the two processors. The push and pop operations are initiated by calling their respective functions. These functions in turn call ASM macros, which invoke the Python queue model by writing to and reading from the CPU registers.

Initially, computational delays were added by inserting sleep() calls with sleep values in milliseconds and microseconds. These delays proved too large for the queue to be utilized efficiently and were unsuitable for data streaming. Therefore, dummy for-loops were used instead; they introduce computational delays by consuming CPU cycles. The dummy for-loop for count = 50 is shown below.

for (i=0; i < 50; i++) {} //Introduce computational delay by inserting CPU cycles

The code is compiled using GCC, and the objdump of a.out is illustrated in Figure 6. Refer to Appendix A for the steps to obtain the objdump of a.out.

Figure 6 Objdump of a delay for-loop used to simulate a computational delay

The instructions execute in the sequence shown below:

1. movl $0x0, 0xfffffff4(%ebp)
2. cmpl $0x63, 0xfffffff4(%ebp)
3. jle 8048440
4. lea 8048448
5. incl (%eax)
6. jmp 8048438
7. nop
8. cmpl $0x63, 0xfffffff4(%ebp)
9. jle 8048440
10. lea 8048448
11. incl (%eax)
12. jmp 8048438
13. nop
   .... loop 48 more times through instructions 2 through 7
14. cmpl $0x63, 0xfffffff4(%ebp)
15. jmp

Equation 1 shows the calculation of the number of instructions executed by the dummy for-loop for 50 iterations.

Equation 1 Instructions executed by the delay for-loop

Instructions executed = 1 + (6 * count) + 2 = 1 + (6 * 50) + 2 = 303

Given CPI = 1 (cycle per instruction), the 303 instructions are estimated to require 303 cycles of execution time. Thus, in general, for count = n the delay loop introduces (6 * n) + 3 CPU cycles. This count is verified by the short sketch below.
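As a quick sanity check of Equation 1, the following few lines of Python recompute the instruction count; the helper name delay_cycles is ours and is not part of the project code.

def delay_cycles(n):
    #Equation 1 with CPI = 1: one initialization instruction, six instructions
    #per loop iteration, and two instructions to exit the loop.
    return 1 + 6 * n + 2

assert delay_cycles(50) == 303    #matches the objdump-based count above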
2.3 Cache hierarchy in Simics simulations

Simics uses its own memory system to achieve high-speed simulation, and a simulated system does not include a cache model by default. However, users can model their own memory systems. A Simics script introduces the cache hierarchy into the simulated system, as illustrated in Figure 7. Each processor in the dual-core system simulated for this project has a 32KB write-through L1 data cache, a 32KB L1 instruction cache and a 256KB L2 cache with a write-back policy. Instruction and data accesses are separated by id-splitters and sent to the respective caches. The splitter allows correctly aligned accesses to go through and splits incorrectly aligned ones into two accesses. The transaction staller (trans-staller) simulates main memory latency. Refer to Appendix B for the Simics script that adds the cache hierarchy to a simulated system.

Figure 7 Cache hierarchies in simulated systems [6]

We have assumed the target real machine to be an Intel i5-2400 (Sandy Bridge) 3.1 GHz CPU. Table 2 lists the latencies for accessing its L1 and L2 caches and memory [7], also expressed in terms of CPU cycles. Equation 2 shows an example in which the equivalent latency of the L1 cache is calculated in CPU cycles.

Equation 2 Cache and memory latencies in terms of CPU cycles

Clock period of a 3.1 GHz CPU = 1 / (3.1 GHz) = 0.322580645 ns
For example, L1 cache latency = 4 ns = 4 / 0.322580645 cycles ~ 12 cycles

Table 2 Cache and memory latencies

Type of memory             Latency in ns                       Equivalent no. of CPU cycles (approximate)
L1 cache                   4 ns                                12
L2 cache                   12 ns                               37
RAM (assumed to be 64KB)   65 ns                               202
Main memory                L2 + RAM = 12 ns + 65 ns = 77 ns    239

These conversions are reproduced programmatically in the sketch below.
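The conversions in Equation 2 and Table 2 can be reproduced with the short helper below. It is our own illustration, not part of the project code, and it assumes the 3.1 GHz clock of the i5-2400.

CPU_GHZ = 3.1                          #assumed Intel i5-2400 clock

def ns_to_cycles(latency_ns):
    #Equation 2: cycles = latency_ns / clock period = latency_ns * frequency in GHz
    return int(round(latency_ns * CPU_GHZ))

#Reproduces Table 2: L1 = 12, L2 = 37, RAM = 202, main memory (L2 + RAM) = 239
for name, ns in [("L1", 4), ("L2", 12), ("RAM", 65), ("L2 + RAM", 77)]:
    print name, ns_to_cycles(ns)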
Table 3 lists the cache-related penalties. Note that, per the Simics script in Appendix B, the 9-cycle penalties apply to transactions an L1 cache issues to the next level (L2), and the 22-cycle penalties apply to transactions an L2 cache issues to the next level (memory).

Table 3 Read and write transaction penalties

Penalty                                                     No. of CPU cycles (approximate)
Incoming read transaction at an L1 cache                    12
Incoming write transaction at an L1 cache                   12
Incoming read transaction at an L2 cache                    37
Incoming write transaction at an L2 cache                   37
Read transaction issued by an L1 cache to the L2 cache      9
Write transaction issued by an L1 cache to the L2 cache     9
Read transaction issued by an L2 cache to memory            22
Write transaction issued by an L2 cache to memory           22

Chapter 3
SIMULATION DATA AND ANALYSIS

Test programs are executed for two simulated models, array-based and queue-based, and the performance data is reported in terms of CPU cycles. Each test program is divided into individual contributing segments in terms of the CPU cycles used. For example, the array-based test program's cycles include instruction accesses, instruction execution and array accesses, while the queue-based program's cycles include instruction accesses, instruction execution and accesses to the simulated queue. Table 4 outlines these contributing segments and provides a brief description of each. The following sections describe the details of the performance measurements.

Refer to Figure 3 in Chapter 1, which illustrates a hardware queue. It has buffers at both ends, which are used as dedicated interface ports in direct processor-to-processor communication. In the simulation, such dedicated interface ports are represented by memory addresses instead. Therefore, additional overhead in the form of memory latency is added to the total number of CPU cycles for accessing the simulated queue. This overhead is referred to as Q_mem in Table 4 and needs to be removed in order to treat the software queue as if it were a hardware queue with dedicated interface ports. The array-based program is likewise divided into its contributing segments.

Table 4 Code segments in memory-based and queue-based programs

CPU cycles of queue_program (withqueue.c):
Q_prog: cycles required to execute the queue-based program.
Q_prog_mem: cycles required to access the instructions of withqueue.c from the memory hierarchy, i.e. the L1 and L2 caches and memory.
Q_data: cycles to access data from the simulated queue.
Q_mem: cycles spent on cache and memory latencies when accessing the queue interface ports, which are simulated with memory addresses.
Q_stall: cycles executed due to producer or consumer stalls at the simulated queue interface ports, i.e. when there is a push request and the queue is full, or a pop request and the queue is empty.

CPU cycles of array_program (witharray.c):
array_prog: cycles required to execute the array-based program.
array_prog_mem: cycles required to access the instructions of witharray.c from the memory hierarchy, i.e. the L1 and L2 caches and memory.
array_instructions: cycles for the instructions that access the array data.
arrays_mem: cycles required to access array elements from the cache and memory hierarchy.
array_stalls: cycles executed due to consumer stalls, i.e. when the producer has not yet produced the data.

3.1 Speed-up Calculation

Measurements A, B and X, as indicated in Table 5, are expressed in terms of CPU cycles and are measured for the entire simulation run. They are calculated by taking the difference between the cycle count in the final CPU statistics and the cycle count in the initial CPU statistics. They are measured in two cases, as follows, and are listed in Table 6 and in Tables 12 and 14 in Appendix C:

I. When the L1 and L2 caches and memory latencies are included in the simulation. These measurements use the suffix ‘c’ (with cache). For example, “A” in the variable “Ac” denotes the number of CPU cycles for the segments checked in Table 5, and “c” indicates that non-zero latencies were used for the caches and memory (Tables 2 and 3).

II. When the L1 and L2 caches are not included in the simulation and the memory latency is set to zero. These measurements use the suffix ‘nc’ (no cache). For example, “A” in the variable “Anc” is the same as in case I, except that no caches were used and the memory latency was set to 0.

Table 5 Simulation data measurements

A = Q_prog + Q_prog_mem + Q_data + Q_mem + Q_stall
B = Q_prog + Q_prog_mem
X = array_prog + array_prog_mem + array_instructions + arrays_mem + array_stalls

The CPU cycles per data item are calculated from measurements A, B and X and are denoted by lower-case a, b and x respectively. The per-item measurements taken when caches and memory latencies are included in the simulation are denoted by ac, bc and xc. Equation 3 shows an example, where N signifies the number of data items.

Equation 3 Measurements per data item with cache and memory latencies

ac = Ac / N

Similarly, the per-item measurements taken without caches and with 0 memory latency are denoted by anc, bnc and xnc, as shown in Table 6. Equation 4 shows an example.

Equation 4 Measurements per data item without caches and with 0 memory latency

anc = Anc / N

Measurement t indicates the Q_mem overhead and is calculated using Equation 5.

Equation 5 Q_mem overhead

t = (ac – bc) – (anc – bnc)

Measurement u signifies the CPU cycles per data item after removing the t cycles from ac (Equation 6); that is, as if the queue were a hardware queue with dedicated interface ports.

Equation 6 CPU cycles without Q_mem overhead

u = ac – t

Table 6 summarizes these measurements and their individual contributing segments.

Table 6 Measurements for speed-up calculation

ac = Q_prog + Q_prog_mem + Q_data + Q_mem + Q_stall
    All the contributing segments of the queue-based program (withqueue.c) when caches and main memory latencies are used in the simulation.
bc = Q_prog + Q_prog_mem
    The segments of the queue-based program when caches and main memory latencies are used in the simulation, except for those contributed by queue accesses.
xc = array_prog + array_prog_mem + array_instructions + arrays_mem + array_stalls
    All the contributing segments of the array-based program (witharray.c) when caches and main memory latencies are used in the simulation.
ac – bc = Q_data + Q_mem + Q_stall
    Queue overhead, with cache and main memory delays, for the queue implemented in software.
anc = Q_prog + Q_data + Q_stall
    All the contributing segments of the queue-based program when caches are not used and the memory latency is set to 0.
bnc = Q_prog
    The segments of the queue-based program when caches are not used and the memory latency is set to 0, except for those contributed by queue accesses.
anc – bnc = Q_data + Q_stall
    Queue overhead without caches and with the memory latency set to 0 (as if the queue were accessed via ports with a 1-cycle delay).
t = (ac – bc) – (anc – bnc) = Q_mem
    The Q_mem overhead.
u = ac – t = Q_prog + Q_prog_mem + Q_data + Q_stall
    Cycles per data item after removing the Q_mem overhead from measurement ac.

After removing the Q_mem overhead, the speed-up is calculated as the ratio of measurement xc to u, as given in Equation 7.

Equation 7 Speed-up calculation

Speed-up = xc / u

The following sections report the simulation data for all the simulation runs, together with the performance analysis and a discussion of the results. A worked example of the speed-up calculation follows.
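As a worked example (not part of the original calculations), the sketch below applies Equations 3 through 7 to simulation run 1, using the raw cycle counts from Table 12 in Appendix C and the 26-cycle queue overhead from Table 7; it reproduces the speed-up of about 4.986 listed in Table 13.

N = 100000                                         #number of data items in run 1
Ac, Bc, Xc = 67899725.0, 17146242.0, 98458700.0    #Table 12, run 1 (with caches)
q_overhead_nc = 26.0                               #anc - bnc, assumed constant (Table 7)

ac, bc, xc = Ac / N, Bc / N, Xc / N    #Equations 3 and 4: cycles per data item
t = (ac - bc) - q_overhead_nc          #Equation 5: Q_mem overhead
u = ac - t                             #Equation 6: cycles without Q_mem overhead
print "speed-up =", xc / u             #Equation 7: prints ~4.986, matching Table 13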
3.2 Simulation Data and Analysis

Table 7 lists the different test parameters and the simulation data for these runs, in CPU cycles per data item. D signifies the computational delay count, in terms of iterations of the delay for-loop, and N the number of data items. The queue overhead (anc – bnc) remains more or less constant at about 26 cycles even though both D and N are varied. Therefore, its value with no caches and 0 memory latency is taken to be 26 cycles in all further calculations.

Table 7 Queue overhead for no caches and 0 memory latency

D     N            anc         bnc         anc – bnc
0     100,000      35.08308    9.02576     26.05732
0     1,000,000    35.12543    9.026061    26.09937
50    100,000      286.0951    259.7651    26.33001
50    1,000,000    286.1502    259.8502    26.29995

Increasing the queue size decreases the number of cycles spent on Q_stall, which is due to the queue being either full during a push operation or empty during a pop operation. This increases the speed-up slightly, as indicated in Table 8 and illustrated in Figure 8.

Table 8 Speed-up for D = 50, N = 100K and different queue sizes

D     N          Q size    Speed-up
50    100,000    1000      1.153829
50    100,000    10,000    1.162288

Figure 8 Speed-up for N = 100K, D = 50 and Q = 1K and 10K

The number of data items N passing through the queue was increased from 100K to 1M. As N increases, the percentage of CPU cycles used to access the queue decreases relative to the percentage of CPU cycles used to execute the program. In the case of the system without a queue, by contrast, more memory accesses for array elements are expected as N increases, which could result in increased cache misses. In addition, the queue latency for the push and pop operations, requiring one cycle each, does not change as N increases. This contributes to an increased speed-up, as shown in Tables 9 (for D = 0) and 10 (for D = 50) and illustrated in Figures 9 and 10.

Table 9 Speed-up for D = 0, Q = 1K and different values of N

D    N            Q size    Speed-up
0    100,000      1000      4.986199
0    1,000,000    1000      5.032121

Figure 9 Speed-up for D = 0, Q = 1K and N = 100K and 1M

Table 10 Speed-up for D = 50, Q = 1K and different values of N

D     N            Q size    Speed-up
50    100,000      1000      1.153829
50    500,000      1000      1.167059
50    1,000,000    1000      1.168999

Figure 10 Speed-up for D = 50, Q = 1K and N = 100K, 500K and 1M

Table 11 lists the speed-up values for different delay values; the results are illustrated in Figure 11. As the computational delays increase, the proportion of CPU cycles used for program execution grows relative to the cycles required for accessing the queue or the array in memory. The results indicate that the queue is efficient as long as the computational delays are small; larger computational delays result in a decreased speed-up.

Table 11 Speed-up for N = 100K, Q = 1K and different values of D

D      N          Q size    Speed-up
0      100,000    1000      4.986199
25     100,000    1000      1.305367
50     100,000    1000      1.153829
200    100,000    1000      1.040669
500    100,000    1000      1.019745

Figure 11 Speed-up for N = 100K, Q = 1K and different values of D

Chapter 4
CONCLUSION

The Simics simulation environment was used to evaluate and compare the performance of a system that uses an inter-processor queue with that of a system that does not.
A software-controlled hardware queue was modeled in the Python scripting language. Unlike a real hardware queue, dedicated interface ports were not implemented for queue accesses. Instead, memory addresses were used to simulate the queue accesses, which added Q_mem overhead to the simulations. The overhead was removed by dividing the queue-based and array-based programs into their individual contributing segments and separating out the CPU cycles that each contributed. A number of Simics simulations were performed to study the effect on performance of varying the queue size, the number of data items exchanged between the processors, and the computational delays of the producer and consumer tasks. It was seen that, unlike the memory latency, which grows with increased cache misses, the queue latency for push and pop operations remained the same as the number of data items increased. Additionally, larger computational delays meant that the CPU cycles used to process data were much greater than the cycles required to push and pop data from the queue. In conclusion, a queue should be used when tasks require small computational delays and many data items to be processed. This may be useful in applications requiring real-time data streaming.

Simulations using different computational delays for the producer and consumer resulted, as expected, in an increase in the number of stalls on whichever side had the shorter computational delay. Further study is needed to evaluate the performance of the queue in such scenarios. In addition, since the software-controlled queue did not support simultaneous push and pop requests, a C module running within the Simics simulation environment could be designed to overcome this shortcoming.

APPENDIX A. Simics Simulation Steps

Two BASH scripts were used to automate the simulation runs in two cases: the first where a queue is used and the second where the queue is not used for inter-processor communication. Refer to Appendix B for these scripts. Below are the steps to perform a Simics simulation using the simulated queue.

1. Use your Saclink username and password to connect to VPN.
2. Copy the files add_cache_hierarchy.simics and pythonscript.py to C:\simics-3.x.x\workspace.
3. Copy the files withqueue.c and magic_instruction.h to C:\simicsworkfolder on the host disk.
4. Open the file located at C:\simics-3.x.x\workspace\targets\x86-440bx\enterprise-common.simics. For a 2-processor target system, enter $num_cpus = 2.
5. Launch Simics and run it in stall mode by clicking the ‘View’ tab in the Simics window, selecting ‘Preferences…’, and changing the execution mode by selecting ‘Stall’ from the drop-down menu. Load the enterprise-common.simics configuration file into Simics by selecting ‘New session’ from the File menu and selecting the specified file. Enterprise has Red Hat Linux 7.3 installed. The base configuration has a single 20 MHz Pentium 4 processor.
6. When the OS completes booting, i.e. when the login prompt appears on the Simics console window, set the cpu-switch-time to 1 by executing the commands below in the Simics Command Window. The default value is 1000, which minimizes simulation time; setting it to 1 increases the simulator overhead but, more importantly, sets up a perfectly synchronized simulation and makes a detailed study of the caches possible.

simics> pselect cpu0
simics> cpu-switch-time 1

Repeat for cpu1.
The current value of cpu-switch-time can be checked with the command below in the Simics Command Window.

simics> @conf.sim.cpu_switch_time

7. Load the add_cache_hierarchy.simics file by selecting ‘Run Simics script file’ from the File menu.
8. Log in to the target machine as root.
9. Go to the root folder by issuing,
[root@enterprise root]# cd ..
10. Mount the host disk by entering the command below,
[root@enterprise /]# mount /host
11. Copy withqueue_script.txt to / by executing,
[root@enterprise /]# cp /host/simicsworkfolder/withqueue_script.txt /
12. Convert the script file to UNIX format by executing,
[root@enterprise /]# dos2unix withqueue_script.txt
13. Make the script file executable by,
[root@enterprise /]# chmod +777 withqueue_script.txt
14. Execute the pythonscript.py script by selecting ‘Run Python script file’ from the File menu.
15. Execute the withqueue_script.txt script file by,
[root@enterprise /]# ./withqueue_script.txt
16. Get the objdump of a.out by executing the commands below in the console window,
[root@enterprise /]# objdump -d a.out > dump.txt
[root@enterprise /]# vi dump.txt

When the program runs, the Python script will dump the CPU statistics on the Simics command window.

Below are the steps to perform a Simics simulation without using the simulated queue.

1. Repeat steps 1 and 2 as above.
2. Copy the files witharray.c and magic_instruction.h to C:\simicsworkfolder on the host disk.
3. Follow steps 4 through 10 as above. Step 4 need not be repeated if a simulation has been run earlier.
4. Copy witharray_script.txt to / by executing,
[root@enterprise /]# cp /host/simicsworkfolder/witharray_script.txt /
5. Convert the script file to UNIX format by executing,
[root@enterprise /]# dos2unix witharray_script.txt
6. Make the script file executable by,
[root@enterprise /]# chmod +777 witharray_script.txt
7. Repeat step 14 as above.
8. Execute the witharray_script.txt script file by,
[root@enterprise /]# ./witharray_script.txt
9. Repeat step 16 as above.

APPENDIX B. Source Code Listing

Add cache hierarchy to Simics simulation (add_cache_hierarchy.simics) [6]

## Transaction staller for memory @staller = pre_conf_object("staller", "trans-staller") ##Stall instructions 239 cycles to simulate memory latency @staller.stall_time = 239 ##Latency of (L2 + RAM) in CPU cycles ############### g-cache configuration for cpu0 ################ ## Create L2 cache (l2c0) for cpu0: 256KB with write-back @l2c0 = pre_conf_object("l2c0", "g-cache") @l2c0.cpus = conf.cpu0 @l2c0.config_line_number = 4096 @l2c0.config_line_size = 64 ##64 blocks. Implies 4096 lines @l2c0.config_assoc = 8 @l2c0.config_virtual_index = 0 @l2c0.config_virtual_tag = 0 @l2c0.config_write_back = 1 @l2c0.config_write_allocate = 1 @l2c0.config_replacement_policy = 'lru' @l2c0.penalty_read = 37 ##Stall penalty (in cycles) for any incoming read transaction @l2c0.penalty_write = 37 ##Stall penalty (in cycles) for any incoming write transaction @l2c0.penalty_read_next = 22 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 21. @l2c0.penalty_write_next = 22 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 21. @l2c0.timing_model = staller ## L1 - Instruction Cache for cpu0: 32KB @ic0 = pre_conf_object("ic0", "g-cache") @ic0.cpus = conf.cpu0 @ic0.config_line_number = 512 @ic0.config_line_size = 64 ##64 blocks.
Implies 512 lines @ic0.config_assoc = 8 @ic0.config_virtual_index = 0 @ic0.config_virtual_tag = 0 @ic0.config_write_back = 0 @ic0.config_write_allocate = 0 @ic0.config_replacement_policy = 'lru' @ic0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @ic0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @ic0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @ic0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7. @ic0.timing_model = l2c0 ## L1 - Data Cache for cpu0: 32KB Write-through @dc0 = pre_conf_object("dc0", "g-cache") @dc0.cpus = conf.cpu0 33 @dc0.config_line_number = 512 @dc0.config_line_size = 64 ##64 blocks. Implies 512 lines @dc0.config_assoc = 8 @dc0.config_virtual_index = 0 @dc0.config_virtual_tag = 0 @dc0.config_write_back = 0 @dc0.config_write_allocate = 0 @dc0.config_replacement_policy = 'lru' @dc0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @dc0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @dc0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @dc0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7. @dc0.timing_model = l2c0 ## Transaction splitter for L1 instruction cache for cpu0 @ts_i0 = pre_conf_object("ts_i0", "trans-splitter") @ts_i0.cache = ic0 @ts_i0.timing_model = ic0 @ts_i0.next_cache_line_size = 64 ## Transaction splitter for L1 data cache for cpu0 @ts_d0 = pre_conf_object("ts_d0", "trans-splitter") @ts_d0.cache = dc0 @ts_d0.timing_model = dc0 @ts_d0.next_cache_line_size = 64 ## ID splitter for L1 cache for cpu0 @id0 = pre_conf_object("id0", "id-splitter") @id0.ibranch = ts_i0 @id0.dbranch = ts_d0 ############### g-cache configuration for cpu1 ################ ## Create L2 cache (l2c1) for cpu1: 256KB with write-back @l2c1 = pre_conf_object("l2c1", "g-cache") @l2c1.cpus = conf.cpu1 @l2c1.config_line_number = 4096 @l2c1.config_line_size = 64 ##64 blocks. Implies 4096 lines @l2c1.config_assoc = 8 @l2c1.config_virtual_index = 0 @l2c1.config_virtual_tag = 0 @l2c1.config_write_back = 1 @l2c1.config_write_allocate = 1 @l2c1.config_replacement_policy = 'lru' @l2c1.penalty_read = 37 ##Stall penalty (in cycles) for any incoming read transaction @l2c1.penalty_write = 37 ##Stall penalty (in cycles) for any incoming write transaction @l2c1.penalty_read_next = 22 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 21. @l2c1.penalty_write_next = 22 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 21. @l2c1.timing_model = staller 34 ## L1 - Instruction Cache for cpu1: 32KB @ic1 = pre_conf_object("ic1", "g-cache") @ic1.cpus = conf.cpu1 @ic1.config_line_number = 512 @ic1.config_line_size = 64 ##64 blocks. 
Implies 512 lines @ic1.config_assoc = 8 @ic1.config_virtual_index = 0 @ic1.config_virtual_tag = 0 @ic1.config_write_back = 0 @ic1.config_write_allocate = 0 @ic1.config_replacement_policy = 'lru' @ic1.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @ic1.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @ic1.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @ic1.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7. @ic1.timing_model = l2c1 ## L1 - Data Cache for cpu1: 32KB Write-through @dc1 = pre_conf_object("dc1", "g-cache") @dc1.cpus = conf.cpu1 @dc1.config_line_number = 512 @dc1.config_line_size = 64 ##64 blocks. Implies 512 lines @dc1.config_assoc = 8 @dc1.config_virtual_index = 0 @dc1.config_virtual_tag = 0 @dc1.config_write_back = 0 @dc1.config_write_allocate = 0 @dc1.config_replacement_policy = 'lru' @dc1.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @dc1.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @dc1.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @dc1.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7. @dc1.timing_model = l2c1 ## Transaction splitter for L1 instruction cache for cpu1 @ts_i1 = pre_conf_object("ts_i1", "trans-splitter") @ts_i1.cache = ic1 @ts_i1.timing_model = ic1 @ts_i1.next_cache_line_size = 64 ## Transaction splitter for L1 data cache for cpu1 @ts_d1 = pre_conf_object("ts_d1", "trans-splitter") @ts_d1.cache = dc1 @ts_d1.timing_model = dc1 @ts_d1.next_cache_line_size = 64 ## ID splitter for L1 cache for cpu1 @id1 = pre_conf_object("id1", "id-splitter") @id1.ibranch = ts_i1 35 @id1.dbranch = ts_d1 ##Add Configuration @SIM_add_configuration([staller, l2c0, ic0, dc0, ts_i0, ts_d0, id0 , l2c1, ic1, dc1, ts_i1, ts_d1, id1], None); ## Timing model for cpu0_space and cpu1_space @conf.cpu0_mem.timing_model = conf.id0 @conf.cpu1_mem.timing_model = conf.id1 Python script and SW Queue (pscript.py) #=========================================================================== ======================================== #edi = 1 for printing initial statistics # = 2 for printing final statistics # = 3 for pushing data into software queue # = 4 for popping data from software queue # = 5 for checking the value of q_full flag # = 6 for checking the value of q_empty flag #esi = data to be pushed into software queue from C program #ebx = data popped from software queue and passed to C program #ecx = 1 if software queue is full, else 0 #edx = 1 if software queue is empty, else 0 # #Semaphores to access queue are not implemented as MAGIC instruction executes. #So it acts as a semaphore itself. It stops simulation and either push or pop #will execute, and not both. 
#===========================================================================

def call_queue(cpu):
    global QUEUE_DEPTH
    QUEUE_DEPTH = 1024
    global NO_DATA_ITEMS
    NO_DATA_ITEMS = 100000

    #Push data into the queue
    if cpu.edi == 3:   ##value 3 indicates a push operation
        name = SIM_get_attribute(cpu, "name")
        data_in = cpu.esi   ##Get the push data from the esi register
        global len_queue
        if len(queue) < QUEUE_DEPTH:
            queue.append(data_in)   ##Append the item at the tail of the queue
            current = SIM_get_attribute(cpu, "cycles")   ##Get current number of cycles executed
            SIM_set_attribute(cpu, "cycles", current+1)   ##Add 1 cycle for the push operation
            len_queue = len_queue + len(queue)   ##Accumulate queue lengths to compute the average
        else:
            print "Queue is full!!!"

    #Pop data from the queue
    elif cpu.edi == 4:   ##value 4 indicates a pop operation
        name = SIM_get_attribute(cpu, "name")
        if len(queue) != 0:
            data_out = queue.pop(0)   ##Always pop from location 0 of the queue
            log_addr_data_out = cpu.ebx   ##Get logical address of pop data
            phy_addr_data_out = SIM_logical_to_physical(cpu, 1, log_addr_data_out)   ##Get physical address of pop data
            SIM_write_phys_memory(cpu, phy_addr_data_out, data_out, 4)   ##Write 4 bytes to phy_addr_data_out
            current = SIM_get_attribute(cpu, "cycles")   ##Get current number of cycles executed
            SIM_set_attribute(cpu, "cycles", current+1)   ##Add 1 cycle for the pop operation
        else:
            print "Queue is empty!!!"

    #Check status of the qfull flag and pass it to the target C program
    elif cpu.edi == 5:
        log_addr_qfull = cpu.ecx
        phy_addr_qfull = SIM_logical_to_physical(cpu, 1, log_addr_qfull)
        if len(queue) < QUEUE_DEPTH:
            SIM_write_phys_memory(cpu, phy_addr_qfull, 0, 4)   ##Value 0 indicates queue is not full
        else:
            SIM_write_phys_memory(cpu, phy_addr_qfull, 1, 4)   ##Value 1 indicates queue is full

    #Check status of the qempty flag and pass it to the target C program
    elif cpu.edi == 6:
        log_addr_qempty = cpu.edx
        phy_addr_qempty = SIM_logical_to_physical(cpu, 1, log_addr_qempty)
        if len(queue) != 0:
            SIM_write_phys_memory(cpu, phy_addr_qempty, 0, 4)   ##Value 0 indicates queue is not empty
        else:
            SIM_write_phys_memory(cpu, phy_addr_qempty, 1, 4)   ##Value 1 indicates queue is empty

    else:
        print "Illegal operation"

#Hap callback function
def hap_callback(user_arg, cpu, arg):
    #Print initial statistics
    if cpu.edi == 1:
        name = SIM_get_attribute(cpu, "name")
        print "Callback 1 for initial stats", name
        eval_cli_line("%s" % name + ".ptime")

    #Print final statistics
    elif cpu.edi == 2:
        name = SIM_get_attribute(cpu, "name")
        print "Callback 2 for final stats", name
        eval_cli_line("%s" % name + ".ptime")
        print 'QUEUE_DEPTH=', QUEUE_DEPTH
        print 'Total queue length = ', len_queue
        avg_queue_len = len_queue / NO_DATA_ITEMS
        print 'Avg queue length =', avg_queue_len

    #Push data into the queue, pop data from the queue, or check whether the queue is full or empty
    elif (cpu.edi == 3 or cpu.edi == 4 or cpu.edi == 5 or cpu.edi == 6):
        call_queue(cpu)

    #Unknown callback
    else:
        print "Unknown callback"
        SIM_break_simulation("snore")

#MAIN
#----
queue = []
len_queue = 0
SIM_hap_add_callback("Core_Magic_Instruction", hap_callback, 100)   #100 is user_arg, ignored here

Target C program that uses the software queue (withqueue.c)

#include<stdio.h> #include<pthread.h> #include "magic_instruction.h" /*find it in local directory*/ #define
COMPUTATIONAL_DELAY_PRODUCER 50 //producer for-loop count #define COMPUTATIONAL_DELAY_CONSUMER 50 //consumer for-loop count #define NO_DATA_ITEMS 100000 //Number of data items exchanged by processors //--------------------------------------------Explanation of ASM code [8] ---------------------------------------------// //asm volatile ("movl %0, %%edi" \ // : /*no outputs*/ \ // : "g" (a) \ // : "edi"); \ // MAGIC(0); //As seen in the above ASM code, the value of the output operand which is referred by %0 is to be moved to the //edi register. //Operands have a single'%', whereas registers have '%%'. This helps GCC in distinguishing between operands and registers. //The keyword "volatile" is added to the ASM if the memory affected is not listed in the inputs and outputs of the ASM. //There are no outputs specified. //"a" is the input operand and "g" is the constraint on operand "a". It tells GCC that it is allowed to use any //register, memory or immediate integer operand, except for registers that are not general registers. //"edi" is the clobbered register and we will use and modify it by writing the value of "a" to it. So in this case, //GCC does not assume that the value held in this register is valid. 38 //MAGIC(0) is the single NOP instruction. //----------------------------------------------------------------------------------------------------------------------- // //---------------------------------------------------Register values-----------------------------------------------------// //edi = 1 for printing initial stats // = 2 for printing final stats // = 3 for pushing data into Python queue // = 4 for popping data from Python queue // = 5 for checking the value of q_full flag // = 6 for checking the value of q_empty flag // = 99 for calling dummy macro //esi = data to be pushed into Python queue from C program //ebx = data popped from Python queue and passed to C program //ecx = 1 if Python queue is full, else 0 //edx = 1 if Python queue is empty, else 0 //----------------------------------------------------------------------------------------------------------------------- // //Print initial and final stats by writing values 1 and 2 to edi register #define MAGIC_INSTRUCTION(a) { \ asm volatile ("movl %0, %%edi" \ : /*no outputs*/ \ : "g" (a) \ : "edi"); \ MAGIC(0); \ } \ //For push operation, write value 3 to edi register and the data value to be pushed into esi register #define QPUSH(f, k) \ { \ asm volatile ("movl %0, %%edi" \ : /*no outputs*/ \ : "g" (f) \ : "edi"); \ asm volatile ("movl %0, %%esi" \ : /*no outputs*/ \ : "g" (k) \ : "esi"); \ MAGIC(0); \ } //For pop operation, write value 4 to edi register and read the value written by Python script/queue from ebx register #define QPOP(f, k) \ { \ asm volatile ("movl %0, %%edi" \ : /*no outputs*/ \ : "g" (f) \ : "edi"); \ asm volatile ("movl %0, %%ebx" \ : /*no outputs*/ \ : "g" (k) \ : "ebx"); \ MAGIC(0); \ 39 } //If queue is full, ecx=1, else ecx=0 #define GET_QFULL_STATUS(f, k) \ { \ asm volatile ("movl %0, %%edi" \ : /*no outputs*/ \ : "g" (f) \ : "edi"); \ asm volatile ("movl %0, %%ecx" \ : /*no outputs*/ \ : "g" (k) \ : "ecx"); \ MAGIC(0); \ } //If queue is empty, edx=1, else edx=0 #define GET_QEMPTY_STATUS(f, k) \ { \ asm volatile ("movl %0, %%edi" \ : /*no outputs*/ \ : "g" (f) \ : "edi"); \ asm volatile ("movl %0, %%edx" \ : /*no outputs*/ \ : "g" (k) \ : "edx"); \ MAGIC(0); \ } void consumer(); void *producer(void *); void queue_push(long int); int queue_pop(); int prod = 0, cons = 0; //Main is the second thread 
int main(void) { pthread_t threadID1; void *exit_status; int delay_PRODUCER, delay_CONSUMER; long int data_items; delay_PRODUCER = COMPUTATIONAL_DELAY_PRODUCER; delay_CONSUMER = COMPUTATIONAL_DELAY_CONSUMER; data_items = NO_DATA_ITEMS; printf("----------------------WITH QUEUE measurement A----------------------------------\n"); printf("delay_PRODUCER = %d, delay_CONSUMER = %d, no. of data items= %ld\n", delay_PRODUCER, delay_CONSUMER, data_items); 40 pthread_create(&threadID1, NULL, producer, NULL); //create producer thread consumer(); //Function call for consumer pthread_join(threadID1, &exit_status); printf("--------------------------------------------------------------------------------\n"); return 0; } //Function for producer (application level) void *producer(void *arg) { long int i; int r; prod = 1; while (cons == 0) {} //Stall until the other thread starts executing MAGIC_INSTRUCTION(1); for (i=0; i < NO_DATA_ITEMS; i++) { for (r=0; r < COMPUTATIONAL_DELAY_PRODUCER; r++) {} //Introduce computational delay by inserting CPU cycles queue_push(i); } MAGIC_INSTRUCTION(2); } //Function for consumer (application level) void consumer() { long int j; int p; cons = 1; while (prod == 0) {} //Stall until the other thread starts executing MAGIC_INSTRUCTION(1); for (j=0; j < NO_DATA_ITEMS; j++) { queue_pop(); for (p=0; p < COMPUTATIONAL_DELAY_CONSUMER; p++) {} //Introduce computational delay by inserting CPU cycles } MAGIC_INSTRUCTION(2); } //Library function for push operation void queue_push(long int data_for_push) { int qfull; int *qfull_ptr = &qfull; 41 do { GET_QFULL_STATUS(5, qfull_ptr); //read value from ecx register, ecx=1 if queue is full, else ecx=0 if (qfull == 0) //if Q not full { QPUSH(3, data_for_push); //3 means push break; } } while(1); } //Library function for pop operation int queue_pop() { int qempty, data_from_queue; int *data_from_queue_ptr = &data_from_queue; int *qempty_ptr = &qempty; do { GET_QEMPTY_STATUS(6, qempty_ptr); //read value from edx register, edx=1 if queue is empty, else edx=0 if (qempty == 0) //if Q not empty { QPOP(4, data_from_queue_ptr); //4 means pop break; } } while(1); return (data_from_queue); } Target C program that uses an array for inter-processor communication (witharray.c) #include<stdio.h> #include<pthread.h> #include "magic_instruction.h" /*find it in local directory*/ #define COMPUTATIONAL_DELAY_PRODUCER 50 //producer for-loop count #define COMPUTATIONAL_DELAY_CONSUMER 50 //consumer for-loop count #define NO_DATA_ITEMS 100000 //Number of data items exchanged by processors //--------------------------------------------Explanation of ASM code [8] ---------------------------------------------// //asm volatile ("movl %0, %%edi" \ // : /*no outputs*/ \ // : "g" (a) \ // : "edi"); \ // MAGIC(0); //As seen in the above ASM code, the value of the output operand which is referred by %0 is to be moved to the //edi register. 42 //Operands have a single'%', whereas registers have '%%'. This helps GCC in distinguishing between operands and registers. //The keyword "volatile" is added to the ASM if the memory affected is not listed in the inputs and outputs of the ASM. //There are no outputs specified. //"a" is the input operand and "g" is the constraint on operand "a". It tells GCC that it is allowed to use any //register, memory or immediate integer operand, except for registers that are not general registers. //"edi" is the clobbered register and that we will use and modify it by writing the value of "a" to it. 
Target C program that uses an array for inter-processor communication (witharray.c)

#include <stdio.h>
#include <pthread.h>
#include "magic_instruction.h"   /*found in the local directory*/

#define COMPUTATIONAL_DELAY_PRODUCER 50   //producer for-loop count
#define COMPUTATIONAL_DELAY_CONSUMER 50   //consumer for-loop count
#define NO_DATA_ITEMS 100000              //Number of data items exchanged by the processors

//The inline ASM and the MAGIC(0) instruction are used exactly as in withqueue.c; see the explanation
//there and [8].

//Print initial and final stats by writing values 1 and 2 to the edi register
#define MAGIC_INSTRUCTION(a) \
{ \
    asm volatile ("movl %0, %%edi" \
        : /*no outputs*/ \
        : "g" (a) \
        : "edi"); \
    MAGIC(0); \
}

//Global arrays
long int array[NO_DATA_ITEMS];        //Array to store intermediate results produced by the producer
long int flag[NO_DATA_ITEMS] = {0};   //Used for synchronization between the two threads

//Global variables
int prod = 0, cons = 0;

void *producer(void *);
void consumer();
void write_to_array(int);
int read_from_array(int);

//Main is the second thread
int main(void)
{
    pthread_t threadID1;
    void *exit_status;
    int delay_CONSUMER, delay_PRODUCER;
    long int data_items;

    delay_PRODUCER = COMPUTATIONAL_DELAY_PRODUCER;
    delay_CONSUMER = COMPUTATIONAL_DELAY_CONSUMER;
    data_items = NO_DATA_ITEMS;

    printf("----------------------NO QUEUE measurement X------------------------------------\n");
    printf("delay_PRODUCER = %d, delay_CONSUMER = %d, no. of data items = %ld\n",
           delay_PRODUCER, delay_CONSUMER, data_items);

    pthread_create(&threadID1, NULL, producer, NULL);   //create the producer thread
    consumer();                                         //function call for the consumer
    pthread_join(threadID1, &exit_status);

    printf("--------------------------------------------------------------------------------\n");
    return 0;
}

//Function for the producer
void *producer(void *arg)
{
    long int j;
    int r;

    prod = 1;
    while (cons == 0) {}   //Stall until the other thread starts executing
    MAGIC_INSTRUCTION(1);
    for (j = 0; j < NO_DATA_ITEMS; j++)
    {
        for (r = 0; r < COMPUTATIONAL_DELAY_PRODUCER; r++) {}   //Introduce computational delay by inserting CPU cycles
        write_to_array(j);
    }
    MAGIC_INSTRUCTION(2);
    return NULL;
}

//Function for the consumer
void consumer()
{
    long int i;
    int p;

    cons = 1;
    while (prod == 0) {}   //Stall until the other thread starts executing
    MAGIC_INSTRUCTION(1);
    for (i = 0; i < NO_DATA_ITEMS; i++)
    {
        do
        {
            if (flag[i] == 1)   //wait until the producer has written slot i
            {
                read_from_array(i);
                break;
            }
        } while (1);
        for (p = 0; p < COMPUTATIONAL_DELAY_CONSUMER; p++) {}   //Introduce computational delay by inserting CPU cycles
    }
    MAGIC_INSTRUCTION(2);
}

void write_to_array(int w_data)
{
    array[w_data] = w_data;   //write data to the array
    flag[w_data] = 1;         //set the corresponding flag to 1
}

int read_from_array(int r_index)
{
    int r_data;
    r_data = array[r_index];   //read data from the array
    return (r_data);
}

BASH script for withqueue.c (withqueue_script.txt)

The BASH script automates the simulation runs when the queue is used for inter-processor communication.

#!/bin/bash
mkdir -p /temp
cp "/host/simicsworkfolder/magic_instruction.h" "/temp/"
cp "/host/simicsworkfolder/withqueue.c" "/temp/"
cd /temp
dos2unix *                                    #convert DOS line endings
setterm -powersave off -blank 0               #disable console power-saving and blanking
rm -f /etc/cron.daily/* /etc/cron.daily/*.*   #remove periodic cron jobs that could perturb the measurements
rm -f /etc/cron.monthly/*
rm -f /etc/cron.weekly/* /etc/cron.weekly/*.*
gcc withqueue.c -lpthread
./a.out

BASH script for witharray.c (witharray_script.txt)

The BASH script automates the simulation runs when the array is used.

#!/bin/bash
mkdir -p /temp
cp "/host/simicsworkfolder/magic_instruction.h" "/temp/"
cp "/host/simicsworkfolder/witharray.c" "/temp/"
cd /temp
dos2unix *
setterm -powersave off -blank 0
rm -f /etc/cron.daily/* /etc/cron.daily/*.*
rm -f /etc/cron.monthly/*
rm -f /etc/cron.weekly/* /etc/cron.weekly/*.*
gcc witharray.c -lpthread
./a.out

APPENDIX C. Simulation Data
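The tables below report measurements both for the entire run (upper-case symbols) and per data item (lower-case symbols, the whole-run counts divided by the number of data items). The speed-up column of Table 13 is computed as t = ac - bc - 26 and u = ac - t, with speed-up = xc / u; the 26-cycle term is the per-item queue overhead measured without caches and with zero memory latency (Table 14). As a quick cross-check, a minimal Python sketch reproducing the run 1 entry (variable names are this sketch's own):

# Cross-check of the Table 13 speed-up calculation, using the per-item
# cycle counts of simulation run 1.
a_c = 678.99725   # queue-based run, measurement A (cycles per data item)
b_c = 171.46242   # queue-based run, measurement B
x_c = 984.587     # array-based run, measurement X

t = a_c - b_c - 26   # cycles attributed to queue operations, net of the 26-cycle overhead
u = a_c - t          # algebraically equal to b_c + 26
speedup = x_c / u
print(speedup)       # ~4.9861994, matching the Table 13 entry for run 1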
Table 12  Simulation data with cache and memory latencies, Part I
(measurements in CPU cycles for the entire run; all runs include caches and memory latencies)

Run  Delay  Data items  Queue size  Ac             Bc             Ac - Bc      Xc
1    0      100,000     1,000       67,899,725     17,146,242     50,753,483   98,458,700
2    0      1,000,000   1,000       673,758,770    171,142,026    502,616,744  992,042,457
3    25     100,000     1,000       300,854,920    247,964,040    52,890,880   327,078,011
4    50     100,000     1,000       529,072,813    480,841,354    48,231,459   557,808,436
5    50     100,000     10,000      528,490,824    477,322,688    51,168,136   557,808,436
6    50     500,000     1,000       2,638,793,116  2,401,134,267  237,658,849  2,817,437,526
7    50     1,000,000   1,000       5,282,622,343  4,767,108,493  515,513,850  5,603,138,554
8    200    100,000     1,000       1,914,129,845  1,869,587,274  44,542,571   1,948,326,967
9    500    100,000     1,000       4,691,356,632  4,618,560,411  72,796,221   4,712,407,468

Table 13  Simulation data with cache and memory latencies, Part II
(measurements in CPU cycles per data item; t = ac - bc - 26, u = ac - t, speed-up = xc / u)

Run  ac           bc           ac - bc     xc           yc           xc - yc     t = ac-bc-26  u = ac-t    Speed-up
1    678.99725    171.46242    507.53483   984.587      208.63032    775.95668   481.53483     197.46242   4.9861994
2    673.75877    171.142026   502.616744  992.042457   172.792609   819.249848  476.616744    197.142026  5.03212063
3    3008.5492    2479.6404    528.9088    3270.78011   2484.17966   786.60045   502.9088      2505.6404   1.30536693
4    5290.72813   4808.41354   482.31459   5578.08436   4812.4994    765.58496   456.31459     4834.4135   1.15382855
5    5284.90824   4773.22688   511.68136   5578.08436   4812.4994    765.58496   485.68136     4799.2269   1.16228811
6    5277.586232  4802.26853   475.317698  5634.87505   4810.35179   824.523264  449.317698    4828.2685   1.16705917
7    5282.622343  4767.10849   515.51385   5603.13855   4804.17757   798.960989  489.51385     4793.1085   1.1689989
8    19141.29845  18695.8727   445.42571   19483.2697   18606.1998   877.0699    419.42571     18721.873   1.04066884
9    46913.56632  46185.6041   727.96221   47124.0747   46552.1881   571.88661   701.96221     46211.604   1.01974549

Table 14  Simulation data without caches and with 0 memory latency

                                       Entire run (CPU cycles)                Per data item (CPU cycles)
Run  Delay  Data items  Queue size    Anc          Bnc          Anc - Bnc    anc         bnc         anc - bnc
10   0      100,000     1,000         3,508,308    902,576      2,605,732    35.08308    9.02576     26.05732
11   0      1,000,000   1,000         35,125,431   9,026,061    26,099,370   35.125431   9.026061    26.09937
12   50     100,000     1,000         28,609,507   25,976,506   2,633,001    286.09507   259.76506   26.33001
13   50     1,000,000   1,000         286,150,185  259,850,237  26,299,948   286.150185  259.850237  26.299948

Note: anc - bnc is assumed as 26 cycles for all calculations; this is the 26-cycle overhead term used in
Table 13.

Bibliography

[1] S. Leibson and J. Kim (Tensilica), "Configurable Processors: A New Era in Chip Design", IEEE Computer Society, July 2005, pp. 51-59.

[2] G. Martin, "Multi-processor SoC-based Design Methodologies Using Configurable and Extensible Processors", Journal of Signal Processing Systems, vol. 53, 2008, pp. 113-127.

[3] Wind River Simics, URL: http://www.simics.net.

[4] Wind River Simics, "Simics User Guide for Windows", Simics version 3.0, revision 1406, 2008-02-20, p. 207, URL: http://www.simics.net.

[5] Wind River, URL: http://www.windriver.com/products/simics.

[6] Wind River Simics, "Simics User Guide for Windows", Simics version 3.0, revision 1406, 2008-02-20, pp. 209-213, URL: http://www.simics.net.

[7] Performance measurements for Intel Sandy Bridge, URL: http://www.7cpu.com/cpu/SandyBridge.html.

[8] Ibiblio, the Public's Library and Digital Archive, "GCC Inline Assembly HOWTO", URL: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html.