PERFORMANCE ANALYSIS OF A HARDWARE QUEUE IN SIMICS
A Project
Presented to the faculty of the Computer Engineering Program
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
Computer Engineering
by
Mukta Siddharth Jain
SUMMER
2012
© 2012
Mukta Siddharth Jain
ALL RIGHTS RESERVED
PERFORMANCE ANALYSIS OF A HARDWARE QUEUE IN SIMICS
A Project
by
Mukta Siddharth Jain
Approved by:
__________________________________, Committee Chair
Nikrouz Faroughi, Ph.D.
__________________________________, Second Reader
Behnam Arad, Ph.D.
____________________________
Date
Student: Mukta Siddharth Jain
I certify that this student has met the requirements for format contained in the University
format manual, and that this project is suitable for shelving in the Library and credit is to
be awarded for the project.
__________________________, Graduate Coordinator
Suresh Vadhva, Ph.D.
Computer Engineering Program
___________________
Date
Abstract
of
PERFORMANCE ANALYSIS OF A HARDWARE QUEUE IN SIMICS
by
Mukta Siddharth Jain
A hardware queue is one way to facilitate inter-processor communication. In this
project, such a queue is used in a simulated dual-core system. Simulation data is gathered
and analyzed to compare the performance of a dual-core system that uses a queue with
that of a dual-core system that does not. The systems are simulated using Wind River
Simics, a full-system simulator capable of functionally simulating a number of platforms
based on different architectures.
In this study, a software-controlled queue is modeled in the Python scripting
language and used in the Simics simulation environment. A producer thread (written in
C) communicates its computed data either through memory or via a queue to a consumer
thread, also written in C. The threads are executed in the Simics simulation environment.
The performance data and its analysis are reported for different values of computational
delays and different queue sizes.
The results indicate that, on average, a hardware queue increases performance
as long as computational delays are small. The queue becomes more efficient as the
number of data items communicated through it increases.
_______________________, Committee Chair
Nikrouz Faroughi, Ph.D.
_______________________
Date
ACKNOWLEDGEMENTS
I would like to thank Dr. Nikrouz Faroughi for his constant guidance and
encouragement. Thank you for taking time from your busy schedule and patiently
reviewing my work and report multiple times. I would also like to thank Dr. Behnam
Arad, who was kind enough to be my second reader.
To my husband Siddharth, who always stood by me and was my pillar of support.
I could not have completed this project without him. To my mother Medha and father
Sanjiv, who have always taught me to be patient, to persevere and to work hard in all my
endeavors. To my cousin Vrushali, who encouraged me to pursue higher education in the
United States. To my brother Kaushik, who in his own way gave me hope even though he
was far away in India. To my mother-in-law Rita, father-in-law Susheel and brother-in-law Saurabh for their encouragement and support during the stressful times.
Lastly, special thanks to my friends Arti and Priyanka for all their help and
support. Thank you for being there for me.
TABLE OF CONTENTS
Page
Acknowledgments...................................................................................................... vii
List of Tables ................................................................................................................ x
List of Figures ............................................................................................................. xi
List of Equations ........................................................................................................ xii
Chapter
1. INTRODUCTION ....................................................................................................1
1.1 Literature Review........................................................................................ 1
1.1.1 Hardware queues in configurable processors .............................. 1
1.1.2 JPEG Encoding with queue ......................................................... 2
1.2 Objective and scope of the project .............................................................. 3
1.3 Introduction to Simics ..................................................................................5
1.4 Simulation Environment ..............................................................................6
1.5 Project Overview .........................................................................................9
2. QUEUE SIMULATION ........................................................................................10
2.1 Software queue model................................................................................10
2.2 Target C programs .....................................................................................11
2.3 Cache hierarchy in Simics simulations ......................................................14
3. SIMULATION DATA AND ANALYSIS ............................................................17
3.1 Speed-up Calculation .................................................................................19
3.2 Simulation Data and Analysis ....................................................................22
4. CONCLUSION.....................................................................................................27
Appendix A. Simics Simulation Steps ...................................................................... 29
Appendix B. Source Code Listing .............................................................................32
Appendix C. Simulation Data ....................................................................................46
Bibliography ................................................................................................................49
LIST OF TABLES
Tables
Page
1. List of queue interface parameters .................................................................... 11
2. Cache and memory latencies ............................................................................ 16
3. Read and write transaction penalties................................................................. 16
4. Code segments in memory-based and queue-based programs ......................... 18
5. Simulation data measurements ......................................................................... 19
6. Measurements for speed-up calculation............................................................ 21
7. Queue overhead for no caches and 0 memory latency ..................................... 22
8. Speed-up for D = 50, N = 100K and different queue sizes ............................... 23
9. Speed-up for D = 0, Q = 1K and different values of N..................................... 24
10. Speed-up for D = 50, Q = 1K and different values of N................................... 25
11. Speed-up for N = 100K, Q = 1K and different values of D .............................. 26
12. Simulation data with cache and memory latencies Part I ................................. 47
13. Simulation data with cache and memory latencies Part II ................................ 48
14. Simulation data without caches and with 0 memory latency............................ 48
LIST OF FIGURES
Figures
Page
1. Multi-processor system with a queue ................................................................. 3
2. Multi-processor system without a queue ............................................................ 4
3. Hardware queue and its interface ports .............................................................. 4
4. Simics simulation without a simulated hardware queue ..................................... 6
5. Simics simulation with a simulated hardware queue .......................................... 8
6. Objdump of a delay for-loop used to simulate a computational delay ............. 13
7. Cache hierarchies in simulated systems ............................................................ 15
8. Speed-up for N = 100K, D = 50 and Q = 1K and 10K ..................................... 23
9. Speed-up for D = 0, Q = 1K queue and N = 100K and 1M .............................. 24
10. Speed-up for D = 50, Q = 1K and N = 100K, 500K and 1M............................ 25
11. Speed-up for N = 100K, Q = 1K and different values of D .............................. 26
LIST OF EQUATIONS
Equations
Page
1. Instructions executed by the delay for-loop ...................................................... 14
2. Cache and memory latencies in terms of CPU cycles ...................................... 15
3. Measurements per data item with cache and memory latencies ....................... 20
4. Measurements per data item without caches and with 0 memory latency........ 20
5. Q_mem overhead .............................................................................................. 20
6. CPU cycles without Q_mem overhead ............................................................. 20
7. Speed-up calculation ......................................................................................... 22
Chapter 1
INTRODUCTION
One way to optimize the performance of multi-processor systems is to
introduce hardware queues for direct processor-to-processor communication. A queue
enables synchronization between the processors acting as data producer and data
consumer, and thereby reduces accesses to caches and main memory, which tend to be
comparatively slow. This project models such a queue in the Simics environment,
enabling one-way communication between two processors.
1.1 Literature Review
The following sections describe relevant work in the area of hardware queues:
queues in configurable processors, and an example where queue sizing was performed to
determine the optimal queue size for a sample application.
1.1.1 Hardware queues in configurable processors
Configurable processors may be used as building blocks for System-on-Chip
(SoC) designs [1]. Such systems include additions in the form of custom-tailored
instruction sets, custom-tailored execution units and specialized communication interface
ports, such as data queues for direct processor-to-processor communication. These
additions contribute to higher system performance than is achieved with conventional
fixed instruction sets. With a queue, the processors' execution units exchange data
directly via the specialized data queues. These queues offer the highest bandwidth for
task-to-task communication and can potentially support data rates as high as one transfer
per cycle.
1.1.2 JPEG Encoding with queue
In a research study, the task of JPEG encoding was mapped onto a five-processor
MPSoC (Multi-processor System-on-Chip) system. Two of the five processors were
part of the testbench for that system and acted as the source and sink for the JPEG
encoding process [2]. The remaining three processors were arranged linearly and
communicated with each other via hardware queues. The source processor converted the
input picture file, i.e. pixel-map data, into stream data, which was fed to the three linearly
connected processors. The sink processor ultimately converted the output of the linearly
connected processors to the JPEG format.
In order to determine the optimal queue size, 32x32, 64x64, 128x128 and
256x256 picture sizes were initially encoded using hardware queues up to 20K entries
deep. With queues of that size, no significant processor stalling due to full queues was
observed. However, significant stalling was seen for the 128x128 and 256x256 picture
sizes when a 100-deep queue was used. To size the queues more precisely, trace
information was gathered for the various picture sizes and analyzed statistically. The
analysis indicated that the maximum fill depth across all the queues was a little less than
500. Thus, with a 500-deep queue, all the resolutions were encoded without any
significant processor stalling.
1.2 Objective and scope of the project
The objective of this project is to model a software-controlled hardware queue for
processor-to-processor communication and compare its performance with that of a
system that does not use a queue. We have coded generic C application programs with
computational delays embedded before each push-to-queue or write-to-array operation
and after each pop-from-queue or read-from-array operation. Figure 1 illustrates a
multi-processor system that uses a queue for inter-processor communication, whereas
Figure 2 illustrates a multi-processor system with no inter-processor communication
queue.
Figure 1 Multi-processor system with a queue
4
P0 with its
L1 cache
P1 with its
L1 cache
L2 cache of
P0
L2 cache of
P1
M
Figure 2 Multi-processor system without a queue
A hardware queue has dedicated interface ports for queue data, queue status and
queue control, as illustrated in Figure 3. In this project, a software-controlled queue
has been implemented such that the dedicated interface ports are simulated using
memory.
Figure 3 Hardware queue and its interface ports
1.3 Introduction to Simics
“Wind River Simics is a fast, functionally-accurate, full system simulator. Simics
creates a high-performance virtual environment in which any digital system – from a
single board to complex, heterogeneous, multi-board, multi-processor, multi-core systems
can be defined, simulated.” [3]
Simics is an instruction-set simulator and not a processor simulator [4]. It can
simulate systems built around most modern processors. Software developers can simulate
a target hardware platform and study the behavior of software applications on it [5].
Simics has a special single-cycle no-operation instruction called "MAGIC(n)", which can
be used to insert a breakpoint in the user program. Simics stops the simulation at the
point where the magic instruction has executed and invokes a user-installed callback
function written in Python. From this callback function, various simulation performance
data can be dumped or collected for later analysis. The simulation resumes once the
execution of the callback function is complete.
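For reference, the minimal sketch below shows how such a callback is registered,
following the hap-registration pattern used by the full script in Appendix B; the statistics
printed here are illustrative only.

#Minimal sketch of a magic-instruction "hap" callback (illustrative;
#see Appendix B for the full script used in this project).
def hap_callback(user_arg, cpu, arg):
    #cpu.edi carries the operation code written by the target C program
    if cpu.edi == 1:
        print "Initial CPU statistics for", SIM_get_attribute(cpu, "name")
    elif cpu.edi == 2:
        print "Final CPU statistics for", SIM_get_attribute(cpu, "name")

#Register the callback; Simics invokes it whenever a magic instruction executes.
SIM_hap_add_callback("Core_Magic_Instruction", hap_callback, 100)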
Simics usually runs in 'normal' mode unless it is changed to 'stall' mode or
'micro-architecture' mode. The normal mode is the fastest execution mode and is
optimized such that all instructions, including memory transactions, complete in a
single cycle.
In the stall mode, however, when a Simics object receives a memory transaction,
the timing of the transaction can be modified by returning a stall time in terms of CPU
cycles. Caches can therefore be modeled correctly in the stall mode. This project uses the
stall mode to correctly compare the system performances with and without the hardware
queue.
1.4 Simulation Environment
Figure 4 illustrates the simulated environment without the queue. Script A is used
only to dump simulation data.
Figure 4 Simics simulation without a simulated hardware queue; script A does not include the software queue model.
The target C program creates two threads, one as a producer and the other as a
consumer. The producer uses delays to simulate computations and writes a mock data
item into an array. The following code segment illustrates the pseudo-code for the
producer function.
Producer_function: Write to an array
magic_instruction(1) //Dump CPU cycles count
for (i=0; i < data_items; i++) {
    introduce_computational_delay;
    write_to_array;
}
magic_instruction(2) //Dump CPU cycles count
The consumer reads these intermediate results one at a time, uses delays to
simulate computations and generates its own results. The following code segment
illustrates the pseudo-code for the consumer function.
Consumer_function: Read from an array
magic_instruction(1) //Dump CPU cycles count
for (i=0; i < data_items; i++) {
    read_from_array;
    introduce_computational_delay;
}
magic_instruction(2) //Dump CPU cycles count
Figure 5 illustrates the simulated environment with an inter-processor queue
modeled in the Python programming language. Along with dumping simulation data,
script B also includes the queue model.
Figure 5 Simics simulation with a simulated hardware queue; script B includes the software queue model.
The target C program similarly creates two threads, one as a producer and the
other as a consumer. In this case the producer pushes a mock data item after using delays
to simulate a computation, and the consumer pops that mock data item and processes it,
again using delays to simulate a computation.
Producer_function: Push to the queue
magic_instruction(1) //Dump CPU cycles count
for (i=0; i < data_items; i++) {
    introduce_computational_delay;
    push_to_queue;
}
magic_instruction(2) //Dump CPU cycles count
Consumer_function: Pop from the queue
magic_instruction(1) //Dump CPU cycles count
for (i=0; i < data_items; i++) {
    pop_from_queue;
    introduce_computational_delay;
}
magic_instruction(2) //Dump CPU cycles count
Simulation results are discussed in Chapter 3.
1.5 Project Overview
Chapter 2 covers the software queue model; Chapter 3 reports simulation results
and covers analysis of the simulation data; Chapter 4 includes a conclusion and discusses
potential future work. Appendix A covers the steps for performing Simics simulations,
Appendix B contains the source code listing, and Appendix C lists the consolidated
simulation data.
Chapter 2
QUEUE SIMULATION
For a one-cycle queue, it was determined that the queue must be designed using
registers; a queue implemented in memory proved to be slow and may be unsuitable for
data streaming.
2.1 Software Queue Model
The software queue is modeled in the Python scripting language and registered as
a "hap" function attached to the magic instruction. A 'hap' indicates an event in Simics,
such as the execution of a magic instruction. The magic instruction inserts a breakpoint
in the C code, halts the Simics simulation, and invokes its corresponding "hap" function.
The target C programs communicate with the Python module via the CPU registers. The
"hap" function in turn calls call_queue (Appendix B), which implements the software
queue.
The functionality of the software queue is selected by writing specific values
to the "edi" CPU register with the execution of MAGIC(n), where 'n' can have any value
from 1 through 6. Table 1 lists these values and their corresponding operations.
Table 1 List of queue interface parameters

Value written to edi register (n)   "hap" function / operation
1   Print CPU statistics at the start of producer and consumer tasks
2   Print CPU statistics at the end of producer and consumer tasks
3   Invoke software queue for push operation; adds 1 CPU cycle for the push operation to the overall CPU cycles.
4   Invoke software queue for pop operation; adds 1 CPU cycle for the pop operation to the overall CPU cycles.
5   Check if the software queue is full.
6   Check if the software queue is empty.
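The sketch below condenses the push and pop branches of the call_queue function from
Appendix B (error messages, average-length bookkeeping, and the status checks for
n = 5 and 6 are omitted here):

#Condensed sketch of the push/pop dispatch in call_queue (see Appendix B).
def call_queue(cpu):
    if cpu.edi == 3 and len(queue) < QUEUE_DEPTH:        #push request
        queue.append(cpu.esi)                            #esi carries the data value
        SIM_set_attribute(cpu, "cycles", SIM_get_attribute(cpu, "cycles") + 1)
    elif cpu.edi == 4 and len(queue) != 0:               #pop request
        data_out = queue.pop(0)
        addr = SIM_logical_to_physical(cpu, 1, cpu.ebx)  #ebx carries the destination address
        SIM_write_phys_memory(cpu, addr, data_out, 4)    #write 4 bytes back to the C program
        SIM_set_attribute(cpu, "cycles", SIM_get_attribute(cpu, "cycles") + 1)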
In this project, only a single push or pop operation is simulated at a time; thus
simultaneous push and pop requests to the queue are not modeled. In a hardware queue,
push and pop requests might occur simultaneously, causing one request to stall while the
other completes. This behavior cannot be modeled with the software queue written in
Python, which runs outside the Simics simulation environment; modeling it would
require creating a Simics module in C that runs within the Simics environment.
Therefore, checking for push_done and pop_done is not implemented. However, it is
assumed that such simultaneous push and pop operations happen infrequently and thus
do not greatly affect the performance.
2.2 Target C programs
Two target C programs are written: one runs on the simulated target system
when a queue is used, and the other runs on the simulated target system where a queue is
not used. The programs allow the user to configure parameters such as the computational
delays of the producer and the consumer and the number of data items exchanged
between the two processors.
The push and pop operations are initiated by calling their respective functions.
These functions in turn call ASM macros, which invoke the Python queue model by
writing to and reading from the CPU registers.
Initially, computational delays were added by inserting sleep() calls with delay
values in milliseconds and microseconds. These delays proved too large for the queue to
be utilized efficiently and were unsuitable for data streaming. Therefore, dummy
for-loops were used to introduce computational delays; they add delays by inserting CPU
cycles. The dummy for-loop for count = 50 is shown below:

for (i=0; i < 50; i++) {} //Introduce computational delay by inserting CPU cycles
The code is compiled using GCC and the objdump of a.out is illustrated in Figure
6. Refer to Appendix A for the steps to obtain the objdump of a.out.
Figure 6 Objdump of a delay for-loop used to simulate a computational delay
The instructions execute in the sequence shown below:

1. movl $0x0, 0xfffffff4(%ebp)
2. cmpl $0x63, 0xfffffff4(%ebp)
3. jle 8048440
4. lea 8048448
5. incl (%eax)
6. jmp 8048438
7. nop
8. cmpl $0x63, 0xfffffff4(%ebp)
9. jle 8048440
10. lea 8048448
11. incl (%eax)
12. jmp 8048438
13. nop
.... loop 48 times through instructions 2 through 7
14. cmpl $0x63, 0xfffffff4(%ebp)
15. jmp
Equation 1 shows the calculation of the number of instructions executed by the
dummy for-loop for 50 iterations.

Equation 1 Instructions executed by the delay for-loop

Instructions executed = 1 + [6 * (count)] + 2
                      = 1 + (6 * 50) + 2
                      = 303

Given CPI = 1 (one cycle per instruction), the 303 instructions are estimated to require
303 cycles of execution time. Thus, in general, for count = n, the CPU cycles introduced
are (6 * n) + 3.
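As a quick sanity check, the formula can be evaluated for any loop count with a small
Python helper (hypothetical, not part of the project sources):

#Hypothetical helper: CPU cycles introduced by the delay for-loop at CPI = 1.
def delay_cycles(n):
    #1 initialization instruction + 6 instructions per iteration
    #+ 2 instructions for the final compare and exit
    return 1 + (6 * n) + 2

print delay_cycles(50)   #prints 303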
2.3 Cache hierarchy in Simics simulations
Simics uses its own memory system to achieve high-speed simulation, and a
simulated system by itself does not have a cache model included by default. However,
users can model their own memory systems.
A Simics script introduces cache hierarchy in the simulated system as illustrated
in Figure 7. Each processor in the dual-core system simulated for this project has a 32KB
write-through L1 data cache, a 32KB L1 instruction cache and a 256KB L2 cache with
write-back policy. Instruction and data accesses are separated out by id-splitters and are
sent to the respective caches. The splitter allows the correctly aligned accesses to go
through and splits the incorrectly aligned ones into two accesses. The transaction staller
(trans-staller) simulates main memory latency. Refer to Appendix B for the Simics script
to add cache hierarchy to a simulated system.
15
Figure 7 Cache hierarchies in simulated systems [6]
We have assumed the target real machine to be an Intel i5-2400 (Sandy Bridge)
CPU running at 3.1 GHz. Table 2 lists the latencies for accessing its L1 and L2 caches
and memory [7]. The latencies are then expressed in terms of CPU cycles; Equation 2
shows how the equivalent L1 cache latency in CPU cycles is calculated.

Equation 2 Cache and memory latencies in terms of CPU cycles

Clock period of a 3.1 GHz CPU = 1 / (3.1 GHz) = 0.322580645 ns

For example,
L1 cache latency = 4 ns = 4 / 0.322580645 cycles ~ 12 cycles
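The same conversion applies to every latency in Table 2; the small Python helper below
(hypothetical, for illustration only) reproduces the cycle counts:

#Hypothetical helper: convert a latency in ns to CPU cycles at a given clock.
def ns_to_cycles(latency_ns, freq_ghz=3.1):
    #cycles = latency / clock period; the period in ns is 1 / freq in GHz
    return int(round(latency_ns * freq_ghz))

print ns_to_cycles(4)    #L1 cache:  12 cycles
print ns_to_cycles(12)   #L2 cache:  37 cycles
print ns_to_cycles(65)   #RAM:       202 cycles
print ns_to_cycles(77)   #L2 + RAM:  239 cycles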
Table 2 Cache and memory latencies

Type of memory             Latency in ns             Equivalent no. of CPU cycles (approximate)
L1 cache                   4 ns                      12
L2 cache                   12 ns                     37
RAM (assumed to be 64KB)   65 ns                     202
Main memory                L2 + RAM = 12 ns + 65 ns = 77 ns   239
Table 3 lists all the cache-related penalties. (The penalty descriptions follow the
direction of the penalty_read_next and penalty_write_next settings in the Simics script
in Appendix B, i.e. transactions issued by a cache to the next level.)

Table 3 Read and write transaction penalties

Penalty                                                 No. of CPU cycles (approximate)
Incoming read transaction for L1 cache                  12
Incoming write transaction for L1 cache                 12
Incoming read transaction for L2 cache                  37
Incoming write transaction for L2 cache                 37
Transaction issued by L1 cache to read from L2 cache    9
Transaction issued by L1 cache to write to L2 cache     9
Transaction issued by L2 cache to read from memory      22
Transaction issued by L2 cache to write to memory       22
Chapter 3
SIMULATION DATA AND ANALYSIS
Test programs are executed for two simulated models, array-based and
queue-based, and the performance data is reported in terms of CPU cycles. Each test
program is divided into its individual contributing segments in terms of the CPU cycles
used. For example, the array-based test program includes array program instruction
access, instruction execution and array access. The queue-based program includes queue
program instruction access and instruction execution, which includes the instructions
that simulate the queue access buffers. Table 4 outlines these contributing segments and
provides a brief description of each. The following sections describe the details of the
performance measurements.
Refer to Figure 3, which illustrates a hardware queue. It has buffers at both ends,
which serve as dedicated interface ports for direct processor-to-processor
communication. In this project, such dedicated interface ports are simulated through
memory addresses instead. Therefore, during simulation, additional overhead in the form
of memory latency is added to the total number of CPU cycles for accessing the
simulated queue. This overhead is referred to as Q_mem in Table 4 and needs to be
removed in order to treat the software queue as if it were a hardware queue with
dedicated interface ports. The array-based program is likewise divided into its
contributing segments.
Table 4 Code segments in memory-based and queue-based programs

Measurement values in CPU cycles   Description

CPU cycles of queue_program:
  Q_prog               Cycles required to execute the queue-based program (withqueue.c).
  Q_prog_mem           Cycles required to access instructions of withqueue.c from the memory hierarchy, i.e. L1 and L2 caches and memory.
  Q_data               Cycles to access data from the simulated queue.
  Q_mem                Cycles required to access queue ports, simulated with memory addresses and 0 cache and memory latencies.
  Q_stall              Cycles executed due to either producer or consumer stalls through the simulated queue interface ports, i.e. when there is a push request and the queue is full, or a pop request and the queue is empty.

CPU cycles of array_program:
  array_prog           Cycles required to execute the array-based program (witharray.c).
  array_prog_mem       Cycles required to access instructions of witharray.c from the memory hierarchy, i.e. L1 and L2 caches and memory.
  array_instructions   Cycles for accessing the array data.
  arrays_mem           Cycles required to access array elements from the cache and memory hierarchy.
  array_stalls         Cycles executed due to consumer stalls, i.e. when the producer has not yet produced data.
3.1 Speed-up Calculation
Measurements A, B and X, as indicated in Table 5, are expressed in terms of CPU
cycles and are measured for the entire simulation run. They are calculated by taking the
difference between the cycle count in the final CPU statistics and the cycle count in the
initial CPU statistics. They are measured in two cases, as follows, and are listed in Table
6 and in Tables 12 and 14 in Appendix C:

I. When L1 and L2 caches and memory are included in the simulation. These
measurements use the suffix 'c' (with cache). For example, in the variable "Ac", "A" is
the number of CPU cycles for the items checked in Table 5 and "c" indicates that
non-zero latencies were used for the caches and memory (Tables 2 and 3).

II. When L1 and L2 caches are not included in the simulation and the memory
latency is set to zero. These measurements use the suffix 'nc' (no cache). For example,
"A" in the variable "Anc" is the same as in case I, except that no caches were used and
the memory latency was set to 0.
Table 5 Simulation data measurements

Measurement   Contributing segments included (√)
A             Q_prog, Q_prog_mem, Q_data, Q_mem, Q_stall
B             Q_prog, Q_prog_mem
X             array_prog, array_prog_mem, array_instructions, arrays_mem, array_stalls
The CPU cycles per data item are calculated from measurements A, B and X and
are denoted by the lower-case letters a, b and x respectively. Measurements taken when
caches and memory are included in the simulation are denoted by ac, bc and xc.
Equation 3 shows an example, where N is the number of data items:

Equation 3 Measurements per data item with cache and memory latencies

ac = Ac / N

Similarly, measurements taken without caches and with 0 memory latency are denoted
by anc, bnc and xnc, as shown in Table 6. Equation 4 shows an example:

Equation 4 Measurements per data item without caches and with 0 memory latency

anc = Anc / N

Measurement t indicates the Q_mem overhead and is calculated using Equation 5.

Equation 5 Q_mem overhead

t = (ac - bc) - (anc - bnc)

Measurement u is the number of CPU cycles per data item after removing the t cycles
from ac (Equation 6); that is, as if the queue were a hardware queue with dedicated
interface ports.

Equation 6 CPU cycles without Q_mem overhead

u = ac - t
Table 6 summarizes these measurements and their individual contributing segments.

Table 6 Measurements for speed-up calculation

Measurement in cycles per data item   Formula / Comment

ac = Q_prog + Q_prog_mem + Q_data + Q_mem + Q_stall
    Includes all the contributing segments of the queue-based program (withqueue.c) when caches and main memory are used in the simulation.

bc = Q_prog + Q_prog_mem
    Includes the segments of the queue-based program when caches and main memory are used in the simulation, except those contributed by queue accesses.

xc = array_prog + array_prog_mem + array_instructions + array_mem + array_stalls
    Includes all the contributing segments of the array-based program (witharray.c) when caches and main memory are used in the simulation.

ac - bc = Q_data + Q_mem + Q_stalls
    Q overhead with cache and main memory delays for the queue implemented in software.

anc = Q_prog + Q_data + Q_stalls
    Includes all the contributing segments of the queue-based program when caches are not used and memory latency is set to 0.

bnc = Q_prog
    Includes the segments of the queue-based program when caches are not used and memory latency is set to 0, except those contributed by queue accesses.

anc - bnc = Q_data + Q_stalls
    Q overhead without caches and with memory latency set to 0 (as if the queue were accessed via ports with a 1-cycle delay).

t = (ac - bc) - (anc - bnc) = Q_mem
    Q_mem overhead.

u = ac - t = Q_prog + Q_prog_mem + Q_data + Q_stalls
    Cycles per item after removing the Q_mem overhead from measurement ac.
After removing the Q_mem overhead, the speed-up is calculated by taking the
ratio of measurement xc over u, as given in Equation 7.

Equation 7 Speed-up calculation

Speed-up = xc / u
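Equations 3 through 7 reduce to a few lines of arithmetic. The hypothetical Python
helper below (not part of the project sources) shows the order of operations applied to
the raw measurements A, B and X:

#Hypothetical helper: compute speed-up from raw measurements (Equations 3-7).
def speed_up(Ac, Bc, Xc, Anc, Bnc, N):
    ac = Ac / float(N)            #Equation 3: per-data-item cycles, with caches
    bc = Bc / float(N)
    xc = Xc / float(N)
    anc = Anc / float(N)          #Equation 4: per-data-item cycles, no caches
    bnc = Bnc / float(N)
    t = (ac - bc) - (anc - bnc)   #Equation 5: Q_mem overhead
    u = ac - t                    #Equation 6: as if the ports were dedicated
    return xc / u                 #Equation 7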
The following sections report the simulation data for all the simulation runs.
Performance analysis is done and the results are discussed.
3.2 Simulation Data and Analysis
Table 7 lists the different test parameters and simulation data for these runs. D is
the computational delay count, i.e. the number of iterations of the delay for-loop. The Q
overhead remains more or less constant at about 26 cycles even as both D and N are
varied. Therefore, its value with no caches and 0 memory latency (anc - bnc) is taken as
26 cycles in all further calculations.
Table 7 Queue overhead for no caches and 0 memory latency

             Measurements in CPU cycles per data item
D     N           anc         bnc         anc - bnc
0     100,000     35.08308    9.02576     26.05732
0     1,000,000   35.12543    9.026061    26.09937
50    100,000     286.0951    259.7651    26.33001
50    1,000,000   286.1502    259.8502    26.29995
Increasing the queue size decreases the number of cycles spent in Q_stalls, which
occur when the queue is either full during a push operation or empty during a pop
operation. This increases the speed-up slightly, as indicated in Table 8 and illustrated in
Figure 8.
Table 8 Speed-up for D = 50, N = 100K and different queue sizes

D     N          Q size    Speed-up
50    100,000    1000      1.153829
50    100,000    10,000    1.162288

Figure 8 Speed-up for N = 100K, D = 50 and Q = 1K and 10K
The number of data items N passing through the queue was increased from 100K
to 1M. As N increases, the percentage of CPU cycles used to access the queue decreases
compared with the percentage of CPU cycles used to execute the program. In the system
without a queue, by contrast, more memory accesses for array elements are expected as
N increases, which could result in increased cache misses. In addition, the queue latency
for push and pop operations, requiring one cycle each, does not change as N increases.
This contributes to an increased speed-up, as shown in Tables 9 (for D = 0) and 10 (for
D = 50) and illustrated in Figures 9 and 10.
Table 9 Speed-up for D = 0, Q = 1K and different values of N

D    N           Q size   Speed-up
0    100,000     1000     4.986199
0    1,000,000   1000     5.032121

Figure 9 Speed-up for D = 0, Q = 1K and N = 100K and 1M
Table 10 Speed-up for D = 50, Q = 1K and different values of N

D    N           Q size   Speed-up
50   100,000     1000     1.153829
50   500,000     1000     1.167059
50   1,000,000   1000     1.168999

Figure 10 Speed-up for D = 50, Q = 1K and N = 100K, 500K and 1M
Table 11 lists the speed-up values for different delay values; the results are
illustrated in Figure 11. As the computational delays increase, the proportion of CPU
cycles used for program execution relative to the cycles required for accessing the queue
or the array from memory changes. The results indicate that the queue is efficient as long
as computational delays are small; larger computational delays result in decreased
speed-up.
Table 11 Speed-up for N = 100K, Q = 1K and different values of D

D     N         Q size   Speed-up
0     100,000   1000     4.986199
25    100,000   1000     1.305367
50    100,000   1000     1.153829
200   100,000   1000     1.040669
500   100,000   1000     1.019745

Figure 11 Speed-up for N = 100K, Q = 1K and different values of D
Chapter 4
CONCLUSION
The Simics simulation environment was used to evaluate and compare the
performance of a system that uses an inter-processor queue with that of a system that
does not. A software-controlled hardware queue was modeled in the Python scripting
language. Unlike a real hardware queue, dedicated interface ports were not implemented
for queue accesses. Instead, memory addresses were used to simulate the queue accesses,
which added Q_mem overhead to the simulations. The overhead was removed by
dividing the queue-based and array-based programs into their individual contributing
segments and separating out the CPU cycles that each contributed. A number of Simics
simulations were performed to study the effect on performance of varying the queue
size, the number of data items exchanged between processors and the computational
delays of the producer and consumer tasks.
It was seen that, unlike the memory latency, which increases with cache misses,
the queue latency for push and pop operations remained the same even as the number of
data items increased. Additionally, larger computational delays implied that the CPU
cycles used to process data would be much greater than the cycles required to push and
pop data from the queue. In conclusion, a queue should be used when tasks require small
computational delays and many data items to be processed. This may be useful in
applications requiring real-time data streaming.
Simulations using different computational delays for the producer and consumer
resulted, as expected, in an increased number of stalls on whichever side, producer or
consumer, had the shorter computational delay. Further study is needed to evaluate the
performance of the queue in such scenarios. In addition, since the software-controlled
queue did not support simultaneous push and pop requests, a C module running within
the Simics simulation environment could be designed to overcome this shortcoming.
APPENDIX A. Simics Simulation Steps
Two BASH scripts were used to automate the simulation runs in two cases: the first
where a queue is used and the second where the queue is not used for inter-processor
communication. Refer to Appendix B for these scripts.
Below are the steps to perform a Simics simulation using the simulated queue.
1. Use your Saclink username and password to connect to VPN.
2. Copy files add_cache_hierarchy.simics, pythonscript.py to location C:\simics-3.x.x\workspace.
3. Copy files withqueue.c and magic_instruction.h to location
C:\simicsworkfolder on the host disk.
4. Open the file located at C:\simics-3.x.x\workspace\targets\x86-440bx\enterprise-common.simics. For a 2-processor target system, enter $num_cpus = 2.
5. Launch Simics and run it in stall mode by clicking the ‘View’ tab in the Simics
window. Then select ‘Preferences…’. Change the Execution mode by selecting
‘Stall’ from the drop-down menu. Load the enterprise-common.simics
configuration file into Simics by selecting ‘New session’ from the File menu and
selecting the specified file. Enterprise has Red Hat Linux 7.3 installed. The base
configuration has a single 20 MHz Pentium 4 processor.
6. When the OS completes booting, i.e. when the login prompt appears on the
Simics console window, set the cpu-switch-time to 1 by executing the commands
below in the Simics Command Window. The default value is 1000, which
minimizes simulation time. Setting it to 1 increases the simulator overhead;
more importantly, however, it sets up a perfectly synchronized simulation and
makes a detailed study of the caches possible.
simics> pselect cpu0
simics> cpu-switch-time 1
Repeat for cpu1.
The current value of cpu-switch-time can be checked with the command below in
the Simics Command Window.
simics> @conf.sim.cpu_switch_time
7. Load the add_cache_hierarchy.simics file by selecting ‘Run Simics script file’
from the File menu.
8. Login to the target machine as root.
9. Go to the root folder by issuing,
[root@enterprise root]# cd ..
10. Mount the host disk by entering below command,
[root@enterprise /]# mount /host
11. Copy withqueue_script.txt to / by executing,
[root@enterprise /]# cp /host/simicsworkfolder/withqueue_script.txt /
12. Convert the script file to UNIX format by executing,
[root@enterprise /]# dos2unix withqueue_script.txt
13. Make the script file executable by,
[root@enterprise /]# chmod +777 withqueue_script.txt
14. Execute the pythonscript.py script by selecting 'Run Python script file' from the
File menu.
15. Execute the withqueue_script.txt script file by,
[root@enterprise /]# ./withqueue_script.txt
16. Get the objdump of a.out by executing the commands below in the console window,
[root@enterprise /]# objdump –d a.out > dump.txt
[root@enterprise /]# vi dump.txt
When the program runs, the Python script will dump the CPU statistics in the
Simics command window.
Below are the steps to perform Simics simulation without using the simulated queue.
1. Repeat steps 1 and 2 as above.
2. Copy files witharray.c and magic_instruction.h to location C:\simicsworkfolder
on the host disk.
3. Follow steps 4 through 10 as above. Step 4 need not be repeated if a simulation
has been run earlier.
4. Copy witharray_script.txt to / by executing,
[root@enterprise /]# cp /host/simicsworkfolder/witharray_script.txt /
5. Convert the script file to UNIX format by executing,
[root@enterprise /]# dos2unix witharray_script.txt
6. Make the script file executable by,
[root@enterprise /]# chmod +777 witharray_script.txt
7. Repeat step 14 as above.
8. Execute the witharray_script.txt script file by,
[root@enterprise /]# ./witharray_script.txt
9. Repeat step 16 as above.
APPENDIX B. Source Code Listing
Add cache hierarchy to Simics Simulation (add_cache_hierarchy.simics) [6]
## Transaction staller for memory
@staller = pre_conf_object("staller", "trans-staller")
##Stall instructions 239 cycles to simulate memory latency
@staller.stall_time = 239 ##Latency of (L2 + RAM) in CPU cycles
############### g-cache configuration for cpu0 ################
## Create L2 cache (l2c0) for cpu0: 256KB with write-back
@l2c0 = pre_conf_object("l2c0", "g-cache")
@l2c0.cpus = conf.cpu0
@l2c0.config_line_number = 4096
@l2c0.config_line_size = 64 ##64 blocks. Implies 4096 lines
@l2c0.config_assoc = 8
@l2c0.config_virtual_index = 0
@l2c0.config_virtual_tag = 0
@l2c0.config_write_back = 1
@l2c0.config_write_allocate = 1
@l2c0.config_replacement_policy = 'lru'
@l2c0.penalty_read = 37 ##Stall penalty (in cycles) for any incoming read transaction
@l2c0.penalty_write = 37 ##Stall penalty (in cycles) for any incoming write transaction
@l2c0.penalty_read_next = 22 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 21.
@l2c0.penalty_write_next = 22 ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 21.
@l2c0.timing_model = staller
## L1 - Instruction Cache for cpu0: 32KB
@ic0 = pre_conf_object("ic0", "g-cache")
@ic0.cpus = conf.cpu0
@ic0.config_line_number = 512
@ic0.config_line_size = 64 ##64 blocks. Implies 512 lines
@ic0.config_assoc = 8
@ic0.config_virtual_index = 0
@ic0.config_virtual_tag = 0
@ic0.config_write_back = 0
@ic0.config_write_allocate = 0
@ic0.config_replacement_policy = 'lru'
@ic0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction
@ic0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction
@ic0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@ic0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@ic0.timing_model = l2c0
## L1 - Data Cache for cpu0: 32KB Write-through
@dc0 = pre_conf_object("dc0", "g-cache")
@dc0.cpus = conf.cpu0
@dc0.config_line_number = 512
@dc0.config_line_size = 64 ##64 blocks. Implies 512 lines
@dc0.config_assoc = 8
@dc0.config_virtual_index = 0
@dc0.config_virtual_tag = 0
@dc0.config_write_back = 0
@dc0.config_write_allocate = 0
@dc0.config_replacement_policy = 'lru'
@dc0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction
@dc0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction
@dc0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@dc0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@dc0.timing_model = l2c0
## Transaction splitter for L1 instruction cache for cpu0
@ts_i0 = pre_conf_object("ts_i0", "trans-splitter")
@ts_i0.cache = ic0
@ts_i0.timing_model = ic0
@ts_i0.next_cache_line_size = 64
## Transaction splitter for L1 data cache for cpu0
@ts_d0 = pre_conf_object("ts_d0", "trans-splitter")
@ts_d0.cache = dc0
@ts_d0.timing_model = dc0
@ts_d0.next_cache_line_size = 64
## ID splitter for L1 cache for cpu0
@id0 = pre_conf_object("id0", "id-splitter")
@id0.ibranch = ts_i0
@id0.dbranch = ts_d0
############### g-cache configuration for cpu1 ################
## Create L2 cache (l2c1) for cpu1: 256KB with write-back
@l2c1 = pre_conf_object("l2c1", "g-cache")
@l2c1.cpus = conf.cpu1
@l2c1.config_line_number = 4096
@l2c1.config_line_size = 64 ##64 blocks. Implies 4096 lines
@l2c1.config_assoc = 8
@l2c1.config_virtual_index = 0
@l2c1.config_virtual_tag = 0
@l2c1.config_write_back = 1
@l2c1.config_write_allocate = 1
@l2c1.config_replacement_policy = 'lru'
@l2c1.penalty_read = 37 ##Stall penalty (in cycles) for any incoming read transaction
@l2c1.penalty_write = 37 ##Stall penalty (in cycles) for any incoming write transaction
@l2c1.penalty_read_next = 22 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 21.
@l2c1.penalty_write_next = 22 ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 21.
@l2c1.timing_model = staller
## L1 - Instruction Cache for cpu1: 32KB
@ic1 = pre_conf_object("ic1", "g-cache")
@ic1.cpus = conf.cpu1
@ic1.config_line_number = 512
@ic1.config_line_size = 64 ##64 blocks. Implies 512 lines
@ic1.config_assoc = 8
@ic1.config_virtual_index = 0
@ic1.config_virtual_tag = 0
@ic1.config_write_back = 0
@ic1.config_write_allocate = 0
@ic1.config_replacement_policy = 'lru'
@ic1.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction
@ic1.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction
@ic1.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@ic1.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@ic1.timing_model = l2c1
## L1 - Data Cache for cpu1: 32KB Write-through
@dc1 = pre_conf_object("dc1", "g-cache")
@dc1.cpus = conf.cpu1
@dc1.config_line_number = 512
@dc1.config_line_size = 64 ##64 blocks. Implies 512 lines
@dc1.config_assoc = 8
@dc1.config_virtual_index = 0
@dc1.config_virtual_tag = 0
@dc1.config_write_back = 0
@dc1.config_write_allocate = 0
@dc1.config_replacement_policy = 'lru'
@dc1.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction
@dc1.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction
@dc1.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@dc1.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@dc1.timing_model = l2c1
## Transaction splitter for L1 instruction cache for cpu1
@ts_i1 = pre_conf_object("ts_i1", "trans-splitter")
@ts_i1.cache = ic1
@ts_i1.timing_model = ic1
@ts_i1.next_cache_line_size = 64
## Transaction splitter for L1 data cache for cpu1
@ts_d1 = pre_conf_object("ts_d1", "trans-splitter")
@ts_d1.cache = dc1
@ts_d1.timing_model = dc1
@ts_d1.next_cache_line_size = 64
## ID splitter for L1 cache for cpu1
@id1 = pre_conf_object("id1", "id-splitter")
@id1.ibranch = ts_i1
@id1.dbranch = ts_d1
##Add Configuration
@SIM_add_configuration([staller, l2c0, ic0, dc0, ts_i0, ts_d0, id0 , l2c1, ic1, dc1, ts_i1, ts_d1, id1], None);
## Timing model for cpu0_space and cpu1_space
@conf.cpu0_mem.timing_model = conf.id0
@conf.cpu1_mem.timing_model = conf.id1
Python script and SW Queue (pythonscript.py)
#===========================================================================
#edi = 1 for printing initial statistics
#    = 2 for printing final statistics
#    = 3 for pushing data into software queue
#    = 4 for popping data from software queue
#    = 5 for checking the value of q_full flag
#    = 6 for checking the value of q_empty flag
#esi = data to be pushed into software queue from C program
#ebx = data popped from software queue and passed to C program
#ecx = 1 if software queue is full, else 0
#edx = 1 if software queue is empty, else 0
#
#Semaphores to access the queue are not implemented: the MAGIC instruction
#stops the simulation, so it acts as a semaphore itself; either a push or
#a pop will execute, not both.
#===========================================================================
def call_queue(cpu):
    global QUEUE_DEPTH
    QUEUE_DEPTH = 1024
    global NO_DATA_ITEMS
    NO_DATA_ITEMS = 100000

    #Push data into the queue
    if cpu.edi == 3: ##value 3 indicates a push operation
        name = SIM_get_attribute(cpu, "name")
        data_in = cpu.esi ##Get the data value to push
        global len_queue
        if len(queue) < QUEUE_DEPTH:
            queue.append(data_in) ##Append the item at the tail of the queue
            current = SIM_get_attribute(cpu, "cycles") ##Get current number of cycles executed
            SIM_set_attribute(cpu, "cycles", current+1) ##Add 1 cycle for the push operation
            len_queue = len_queue + len(queue)
        else:
            print "Queue is full!!!"

    #Pop data from the queue
    elif cpu.edi == 4: ##value 4 indicates a pop operation
        name = SIM_get_attribute(cpu, "name")
        if len(queue) != 0:
            data_out = queue.pop(0) ##Always pop from location 0 of the queue
            log_addr_data_out = cpu.ebx ##Get logical address of pop data
            phy_addr_data_out = SIM_logical_to_physical(cpu, 1, log_addr_data_out) ##Get physical address of pop data
            SIM_write_phys_memory(cpu, phy_addr_data_out, data_out, 4) ##Write 4 bytes to phy_addr_data_out
            current = SIM_get_attribute(cpu, "cycles") ##Get current number of cycles executed
            SIM_set_attribute(cpu, "cycles", current+1) ##Add 1 cycle for the pop operation
        else:
            print "Queue is empty!!!"

    #Check status of qfull flag and pass its status to the target C program
    elif cpu.edi == 5:
        log_addr_qfull = cpu.ecx
        phy_addr_qfull = SIM_logical_to_physical(cpu, 1, log_addr_qfull)
        if len(queue) < QUEUE_DEPTH:
            SIM_write_phys_memory(cpu, phy_addr_qfull, 0, 4) ##Value 0 indicates queue is not full
        else:
            SIM_write_phys_memory(cpu, phy_addr_qfull, 1, 4) ##Value 1 indicates queue is full

    #Check status of qempty flag and pass its status to the target C program
    elif cpu.edi == 6:
        log_addr_qempty = cpu.edx
        phy_addr_qempty = SIM_logical_to_physical(cpu, 1, log_addr_qempty)
        if len(queue) != 0:
            SIM_write_phys_memory(cpu, phy_addr_qempty, 0, 4) ##Value 0 indicates queue is not empty
        else:
            SIM_write_phys_memory(cpu, phy_addr_qempty, 1, 4) ##Value 1 indicates queue is empty

    else:
        print "Illegal operation"

#Hap callback function
def hap_callback(user_arg, cpu, arg):
    #Print initial statistics
    if cpu.edi == 1:
        name = SIM_get_attribute(cpu, "name")
        print "Callback 1 for initial stats", name
        eval_cli_line("%s" % name + ".ptime")

    #Print final statistics
    elif cpu.edi == 2:
        name = SIM_get_attribute(cpu, "name")
        print "Callback 2 for final stats", name
        eval_cli_line("%s" % name + ".ptime")
        print 'QUEUE_DEPTH=', QUEUE_DEPTH
        print 'Total queue length = ', len_queue
        avg_queue_len = len_queue / NO_DATA_ITEMS
        print 'Avg queue length =', avg_queue_len

    #Push data into the queue OR pop data from the queue OR check if queue is full OR check if queue is empty
    elif (cpu.edi == 3 or cpu.edi == 4 or cpu.edi == 5 or cpu.edi == 6):
        call_queue(cpu)

    #Unknown callback
    else:
        print "Unknown callback"
        SIM_break_simulation("snore")

#MAIN function
#-------------
from collections import deque ##imported but unused; a plain list serves as the queue
queue = []
len_queue = 0
SIM_hap_add_callback("Core_Magic_Instruction", hap_callback, 100) #100 is user_arg, ignored here
Target C program that uses the software queue (withqueue.c)
#include<stdio.h>
#include<pthread.h>
#include "magic_instruction.h" /*find it in local directory*/
#define COMPUTATIONAL_DELAY_PRODUCER 50 //producer for-loop count
#define COMPUTATIONAL_DELAY_CONSUMER 50 //consumer for-loop count
#define NO_DATA_ITEMS 100000 //Number of data items exchanged by processors
//--------------------------------------------Explanation of ASM code [8]---------------------------------------------//
//asm volatile ("movl %0, %%edi" \
//    : /*no outputs*/ \
//    : "g" (a) \
//    : "edi"); \
//MAGIC(0);
//In the ASM code above, the value of the operand referred to by %0 (the input operand "a") is moved to the
//edi register.
//Operands have a single '%', whereas registers have '%%'. This helps GCC distinguish between operands and
//registers.
//The keyword "volatile" is added to the ASM because the memory affected is not listed in the inputs and
//outputs of the ASM. There are no outputs specified.
//"a" is the input operand and "g" is the constraint on operand "a". It tells GCC that it is allowed to use any
//register, memory or immediate integer operand, except for registers that are not general registers.
//"edi" is the clobbered register; we use and modify it by writing the value of "a" to it, so GCC does not
//assume that the value held in this register remains valid.
//MAGIC(0) is the single NOP instruction.
//---------------------------------------------------------------------------------------------------------------------//
//---------------------------------------------------Register values-----------------------------------------------------//
//edi = 1 for printing initial stats
// = 2 for printing final stats
// = 3 for pushing data into Python queue
// = 4 for popping data from Python queue
// = 5 for checking the value of q_full flag
// = 6 for checking the value of q_empty flag
// = 99 for calling dummy macro
//esi = data to be pushed into Python queue from C program
//ebx = data popped from Python queue and passed to C program
//ecx = 1 if Python queue is full, else 0
//edx = 1 if Python queue is empty, else 0
//----------------------------------------------------------------------------------------------------------------------- //
//Print initial and final stats by writing values 1 and 2 to edi register
#define MAGIC_INSTRUCTION(a) \
{ \
    asm volatile ("movl %0, %%edi" \
        : /*no outputs*/ \
        : "g" (a) \
        : "edi"); \
    MAGIC(0); \
}

//For push operation, write value 3 to edi register and the data value to be pushed into esi register
#define QPUSH(f, k) \
{ \
    asm volatile ("movl %0, %%edi" \
        : /*no outputs*/ \
        : "g" (f) \
        : "edi"); \
    asm volatile ("movl %0, %%esi" \
        : /*no outputs*/ \
        : "g" (k) \
        : "esi"); \
    MAGIC(0); \
}

//For pop operation, write value 4 to edi register and the destination address for the popped value into ebx register
#define QPOP(f, k) \
{ \
    asm volatile ("movl %0, %%edi" \
        : /*no outputs*/ \
        : "g" (f) \
        : "edi"); \
    asm volatile ("movl %0, %%ebx" \
        : /*no outputs*/ \
        : "g" (k) \
        : "ebx"); \
    MAGIC(0); \
}

//If queue is full, ecx=1, else ecx=0
#define GET_QFULL_STATUS(f, k) \
{ \
    asm volatile ("movl %0, %%edi" \
        : /*no outputs*/ \
        : "g" (f) \
        : "edi"); \
    asm volatile ("movl %0, %%ecx" \
        : /*no outputs*/ \
        : "g" (k) \
        : "ecx"); \
    MAGIC(0); \
}

//If queue is empty, edx=1, else edx=0
#define GET_QEMPTY_STATUS(f, k) \
{ \
    asm volatile ("movl %0, %%edi" \
        : /*no outputs*/ \
        : "g" (f) \
        : "edi"); \
    asm volatile ("movl %0, %%edx" \
        : /*no outputs*/ \
        : "g" (k) \
        : "edx"); \
    MAGIC(0); \
}
void consumer();
void *producer(void *);
void queue_push(long int);
int queue_pop();
int prod = 0, cons = 0;
//Main is the second thread
int main(void)
{
pthread_t threadID1;
void *exit_status;
int delay_PRODUCER, delay_CONSUMER;
long int data_items;
delay_PRODUCER = COMPUTATIONAL_DELAY_PRODUCER;
delay_CONSUMER = COMPUTATIONAL_DELAY_CONSUMER;
data_items = NO_DATA_ITEMS;
printf("----------------------WITH QUEUE measurement A----------------------------------\n");
printf("delay_PRODUCER = %d, delay_CONSUMER = %d, no. of data items= %ld\n",
delay_PRODUCER, delay_CONSUMER, data_items);
pthread_create(&threadID1, NULL, producer, NULL); //create producer thread
consumer(); //Function call for consumer
pthread_join(threadID1, &exit_status);
printf("--------------------------------------------------------------------------------\n");
return 0;
}
//Function for producer (application level)
void *producer(void *arg)
{
long int i;
int r;
prod = 1;
while (cons == 0) {} //Stall until the other thread starts executing
MAGIC_INSTRUCTION(1);
for (i=0; i < NO_DATA_ITEMS; i++)
{
for (r=0; r < COMPUTATIONAL_DELAY_PRODUCER; r++) {} //Introduce computational delay by inserting CPU cycles
queue_push(i);
}
MAGIC_INSTRUCTION(2);
}
//Function for consumer (application level)
void consumer()
{
long int j;
int p;
cons = 1;
while (prod == 0) {} //Stall until the other thread starts executing
MAGIC_INSTRUCTION(1);
for (j=0; j < NO_DATA_ITEMS; j++)
{
queue_pop();
for (p=0; p < COMPUTATIONAL_DELAY_CONSUMER; p++) {} //Introduce computational delay by inserting CPU cycles
}
MAGIC_INSTRUCTION(2);
}
//Library function for push operation
void queue_push(long int data_for_push)
{
int qfull;
int *qfull_ptr = &qfull;
do {
GET_QFULL_STATUS(5, qfull_ptr); //read value from ecx register, ecx=1 if queue is full, else ecx=0
if (qfull == 0) //if Q not full
{
QPUSH(3, data_for_push); //3 means push
break;
}
} while(1);
}
//Library function for pop operation
int queue_pop()
{
int qempty, data_from_queue;
int *data_from_queue_ptr = &data_from_queue;
int *qempty_ptr = &qempty;
do {
GET_QEMPTY_STATUS(6, qempty_ptr); //read value from edx register, edx=1 if queue is empty, else edx=0
if (qempty == 0) //if Q not empty
{
QPOP(4, data_from_queue_ptr); //4 means pop
break;
}
} while(1);
return (data_from_queue);
}
Target C program that uses an array for inter-processor communication
(witharray.c)
#include<stdio.h>
#include<pthread.h>
#include "magic_instruction.h" /*find it in local directory*/
#define COMPUTATIONAL_DELAY_PRODUCER 50 //producer for-loop count
#define COMPUTATIONAL_DELAY_CONSUMER 50 //consumer for-loop count
#define NO_DATA_ITEMS 100000 //Number of data items exchanged by processors
//--------------------------------------------Explanation of ASM code [8] ---------------------------------------------//
//asm volatile ("movl %0, %%edi" \
//    : /*no outputs*/ \
//    : "g" (a) \
//    : "edi"); \
//MAGIC(0);
//As seen in the above ASM code, the value of the operand referred to by %0 is moved to the edi register.
//Since there are no output operands specified, %0 refers to the first input operand.
//Operands have a single '%', whereas registers have '%%'. This helps GCC distinguish between operands
//and registers.
//The keyword "volatile" is added to the ASM because memory affected by the instruction is not listed in
//the inputs and outputs of the ASM.
//"a" is the input operand and "g" is the constraint on operand "a". It tells GCC that it may use any
//register, memory, or immediate integer operand, except for registers that are not general registers.
//"edi" is the clobbered register: the code modifies it by writing the value of "a" to it, so GCC does not
//assume that the value held in this register remains valid.
//MAGIC(0) expands to the single NOP-like magic instruction that Simics intercepts.
//-----------------------------------------------------------------------------------------------------------------------//
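For completeness, a minimal sketch of how magic_instruction.h could define MAGIC(0) for an x86 target is shown below. This is an assumption based on the standard Simics convention, in which the architecturally harmless instruction xchg %bx, %bx is trapped by the simulator and raises the Core_Magic_Instruction hap; the authoritative version is the header shipped with Simics.
#ifndef MAGIC_INSTRUCTION_H
#define MAGIC_INSTRUCTION_H

/* Sketch (assumed, per the Simics x86 convention): xchg %bx, %bx is a
   no-op on real hardware, but Simics intercepts it and triggers the
   Core_Magic_Instruction hap. On x86 only magic number 0 exists, so
   the argument n is unused. */
#define MAGIC(n) do {                   \
        asm volatile ("xchg %bx, %bx"); \
} while (0)

#endif /* MAGIC_INSTRUCTION_H */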
//Print initial and final stats by writing values 1 and 2 to edi register
#define MAGIC_INSTRUCTION(a) \
{ \
    asm volatile ("movl %0, %%edi" \
        : /*no outputs*/ \
        : "g" (a) \
        : "edi"); \
    MAGIC(0); \
}
//Global arrays
long int array[NO_DATA_ITEMS]; //Array to store intermediate results produced by the producer
long int flag[NO_DATA_ITEMS] = {0}; //Used for synchronization between the two threads; all flags start at 0
//Global variables
int prod = 0, cons = 0;
void *producer(void *);
void consumer();
void write_to_array(int);
int read_from_array(int);
//The main thread runs the consumer; the producer runs in a second thread
int main(void)
{
long int i;
pthread_t threadID1;
void *exit_status;
int delay_CONSUMER, delay_PRODUCER;
long int data_items;
delay_PRODUCER = COMPUTATIONAL_DELAY_PRODUCER;
delay_CONSUMER = COMPUTATIONAL_DELAY_CONSUMER;
data_items = NO_DATA_ITEMS;
printf("----------------------NO QUEUE measurement X------------------------------------\n");
printf("delay_PRODUCER=%d, delay_CONSUMER=%d, no. of data items= %ld\n",
delay_PRODUCER, delay_CONSUMER, data_items);
pthread_create(&threadID1, NULL, producer, NULL); //create producer thread
consumer(); //Function call for consumer
pthread_join(threadID1, &exit_status);
printf("--------------------------------------------------------------------------------\n");
return 0;
}
//Function for producer
void *producer(void *arg)
{
long int j;
int r;
prod = 1;
while (cons == 0) {} //Stall until the other thread starts executing
MAGIC_INSTRUCTION(1);
for (j=0; j < NO_DATA_ITEMS; j++)
{
for(r=0; r < COMPUTATIONAL_DELAY_PRODUCER; r++) {} //Introduce computational delay by inserting CPU cycles
write_to_array(j);
}
MAGIC_INSTRUCTION(2);
return NULL; //pthread start routines return a pointer; none is needed here
}
//Function for consumer
void consumer()
{
long int i;
int p;
cons = 1;
while (prod == 0) {} //Stall until the other thread starts executing
MAGIC_INSTRUCTION(1);
for(i=0; i < NO_DATA_ITEMS; i++)
{
do {
if (flag[i] == 1)
{
read_from_array(i);
break;
}
} while(1);
for(p=0; p < COMPUTATIONAL_DELAY_CONSUMER; p++) {} //Introduce computational delay by inserting CPU cycles
}
MAGIC_INSTRUCTION(2);
}
void write_to_array(int w_data)
{
array[w_data] = w_data; //write data to array
flag[w_data] = 1; //Set corresponding flag to 1
}
int read_from_array(int r_index)
{
int r_data;
r_data = array[r_index]; //read data from array
return (r_data);
}
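A caveat on the busy-wait loops in both listings: they rely on the compiler re-reading the shared variables (prod, cons, and flag) from memory on every iteration, which holds at gcc's default optimization level but not necessarily with -O1 or higher. A minimal sketch of the volatile qualifiers that would make the re-reads explicit, shown for the witharray.c globals (an editorial addition; prod and cons in withqueue.c would need the same treatment):
#define NO_DATA_ITEMS 100000

/* Sketch: volatile forces a fresh load of each shared variable on every
   busy-wait iteration, so an optimizing compiler cannot cache the value
   in a register and spin forever. Editorial addition only; the listings
   above are compiled without optimization flags. */
volatile long int flag[NO_DATA_ITEMS];
volatile int prod = 0, cons = 0;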
BASH script for withqueue.c (withqueue_script.txt)
The BASH script is used for automating the simulation runs when the queue is used for inter-processor communication.
#!/bin/bash
mkdir -p /temp
cp "/host/simicsworkfolder/magic_instruction.h" "/temp/"
cp "/host/simicsworkfolder/withqueue.c" "/temp/"
cd /temp
dos2unix *
setterm -powersave off -blank 0
rm -f /etc/cron.daily/*.*
rm -f /etc/cron.daily/*
rm -f /etc/cron.monthly/*
rm -f /etc/cron.weekly/*.*
rm -f /etc/cron.weekly/*
gcc withqueue.c -lpthread
./a.out
BASH script for witharray.c (witharray_script.txt)
The BASH script is used for automating the simulation runs when the array is used for inter-processor communication.
#!/bin/bash
mkdir -p /temp
cp "/host/simicsworkfolder/magic_instruction.h" "/temp/"
cp "/host/simicsworkfolder/witharray.c" "/temp/"
cd /temp
dos2unix *
setterm -powersave off -blank 0
rm -f /etc/cron.daily/*.*
rm -f /etc/cron.daily/*
rm -f /etc/cron.monthly/*
rm -f /etc/cron.weekly/*.*
rm -f /etc/cron.weekly/*
gcc witharray.c -lpthread
./a.out
APPENDIX C. Simulation data
Table 12 Simulation data with cache and memory latencies Part I

Measurements in CPU cycles for the entire run

Sim. run no. | Delay | No. of data items | Queue size | Caches included | Memory included | Ac | Bc | Ac - Bc | Xc
1 | 0   | 100,000   | 1000   | Y | Y | 67,899,725    | 17,146,242    | 50,753,483  | 98,458,700
2 | 0   | 1,000,000 | 1000   | Y | Y | 673,758,770   | 171,142,026   | 502,616,744 | 992,042,457
3 | 25  | 100,000   | 1000   | Y | Y | 300,854,920   | 247,964,040   | 52,890,880  | 327,078,011
4 | 50  | 100,000   | 1000   | Y | Y | 529,072,813   | 480,841,354   | 48,231,459  | 557,808,436
5 | 50  | 100,000   | 10,000 | Y | Y | 528,490,824   | 477,322,688   | 51,168,136  | 557,808,436
6 | 50  | 500,000   | 1000   | Y | Y | 2,638,793,116 | 2,401,134,267 | 237,658,849 | 2,817,437,526
7 | 50  | 1,000,000 | 1000   | Y | Y | 5,282,622,343 | 4,767,108,493 | 515,513,850 | 5,603,138,554
8 | 200 | 100,000   | 1000   | Y | Y | 1,914,129,845 | 1,869,587,274 | 44,542,571  | 1,948,326,967
9 | 500 | 100,000   | 1000   | Y | Y | 4,691,356,632 | 4,618,560,411 | 72,796,221  | 4,712,407,468
Table 13 Simulation data with cache and memory latencies Part II

Measurements in CPU cycles per data item

Sim. run no. | ac | bc | ac - bc | xc | yc | xc - yc | t = ac - bc - 26 | u = ac - t | Speedup = xc / u
1 | 678.99725   | 171.46242  | 507.53483  | 984.587    | 208.63032  | 775.95668  | 481.53483  | 197.46242 | 4.9861994
2 | 673.75877   | 171.142026 | 502.616744 | 992.042457 | 172.792609 | 819.249848 | 476.616744 | 197.14203 | 5.03212063
3 | 3008.5492   | 2479.6404  | 528.9088   | 3270.78011 | 2484.17966 | 786.60045  | 502.9088   | 2505.6404 | 1.30536693
4 | 5290.72813  | 4808.41354 | 482.31459  | 5578.08436 | 4812.4994  | 765.58496  | 456.31459  | 4834.4135 | 1.15382855
5 | 5284.90824  | 4773.22688 | 511.68136  | 5578.08436 | 4812.4994  | 765.58496  | 485.68136  | 4799.2269 | 1.16228811
6 | 5277.586232 | 4802.26853 | 475.317698 | 5634.87505 | 4810.35179 | 824.523264 | 449.317698 | 4828.2685 | 1.16705917
7 | 5282.622343 | 4767.10849 | 515.51385  | 5603.13855 | 4804.17757 | 798.960989 | 489.51385  | 4793.1085 | 1.1689989
8 | 19141.29845 | 18695.8727 | 445.42571  | 19483.2697 | 18606.1998 | 877.0699   | 419.42571  | 18721.873 | 1.04066884
9 | 46913.56632 | 46185.6041 | 727.96221  | 47124.0747 | 46552.1881 | 571.88661  | 701.96221  | 46211.604 | 1.01974549
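As a worked example, consider run 1: t = 507.53483 - 26 = 481.53483, u = 678.99725 - 481.53483 = 197.46242, and Speedup = 984.587 / 197.46242 ≈ 4.9862. The 26-cycle term is the per-item value anc - bnc from Table 14, which is assumed as 26 cycles for all calculations.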
Table 14 Simulation data without caches and with 0 memory latency

Anc, Bnc, and Anc - Bnc are measured in CPU cycles for the entire run; anc, bnc, and anc - bnc are measured in CPU cycles per data item.

Sim. run no. | Delay | No. of data items | Queue size | Caches included | Memory included | Anc | Bnc | Anc - Bnc | anc | bnc | anc - bnc
10 | 0  | 100,000   | 1000 | N | N | 3,508,308   | 902,576     | 2,605,732  | 35.08308   | 9.02576    | 26.05732
11 | 0  | 1,000,000 | 1000 | N | N | 35,125,431  | 9,026,061   | 26,099,370 | 35.125431  | 9.026061   | 26.09937
12 | 50 | 100,000   | 1000 | N | N | 28,609,507  | 25,976,506  | 2,633,001  | 286.09507  | 259.76506  | 26.33001
13 | 50 | 1,000,000 | 1000 | N | N | 286,150,185 | 259,850,237 | 26,299,948 | 286.150185 | 259.850237 | 26.299948

Note: anc - bnc is assumed as 26 cycles for all calculations.
Bibliography
[1] S. Leibson and J. Kim, “Configurable Processors: A New Era in Chip Design”, Tensilica, IEEE Computer Society, July 2005, pp. 51–59.
[2] G. Martin, “Multi-processor SoC-based Design Methodologies Using Configurable and Extensible Processors”, Journal of Signal Processing Systems 53, 2008, pp. 113–127.
[3] Wind River Simics, URL: http://www.simics.net.
[4] Wind River Simics, ‘Simics User Guide for Windows.pdf’, Simics version 3.0, Revision 1406, Date 2008-02-20, p. 207, URL: http://www.simics.net.
[5] Wind River, URL: www.windriver.com/products/simics.
[6] Wind River Simics, ‘Simics User Guide for Windows.pdf’, Simics version 3.0, Revision 1406, Date 2008-02-20, pp. 209–213, URL: http://www.simics.net.
[7] Performance measurements for Intel SandyBridge, URL: http://www.7-cpu.com/cpu/SandyBridge.html.
[8] Ibiblio, Public’s Library and Digital Archive, “GCC Inline Assembly HOWTO”, URL: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html.