Effective Dispatching for Simultaneous Multi-Threading (SMT) Processors by
Capping Per-Thread Resource Utilization
Tilak K. D. Nagaraju, Caleb Douglas, Wei-Ming Lin, Eugene John
Abstract
Simultaneous multithreading (SMT) provides a technique to improve resource utilization by sharing key datapath components among multiple independent threads.
When critical resources are shared by multiple threads,
effective use of these resources proves to be the most
important factor in fully exploiting the system potential.
Allowing any of the threads to overwhelm these shared
resources not only leads to unfair thread processing but
may also result in severely degraded overall performance.
Preventing idling threads from clogging the critical
resources in the pipeline thus becomes essential to sustaining
system performance. In this paper, we show that, by simply
setting a cap on the number of critical Issue Queue
(IQ) entries each thread is allowed to occupy, system
performance is easily enhanced by a significant margin.
An even more pronounced advantage of the proposed
technique over other advanced dispatching algorithms is
that the performance gain is obtained with minimal
additional hardware.
Index Terms—Computer Architecture, Superscalar, Simultaneous Multi-threading

T. K. D. Nagaraju, C. Douglas, W.-M. Lin and E. John are with the Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX 78249-0669, USA. Manuscript received November 16, 2011.
I. Introduction
Building on the traditional superscalar processor, Simultaneous Multi-Threading (SMT) offers an improved mechanism to enhance overall system performance without having to invest a proportional amount of extra hardware. In a conventional superscalar processor, not only are the functional units far from fully utilized with only one thread running at any time, but switching from one thread (task) to another also involves the intervention of the operating system, which incurs extra overhead in CPU power. Simple multithreaded processors remove the need for task switching by the OS but still do not fully exploit the capacity of the functional units, since a single thread does not usually have enough instruction-level parallelism. A coarse-grained multithreaded processor does not interleave instructions from different threads in processing, while a fine-grained one allows cycle-by-cycle interleaving of different threads’ instructions, but neither permits instructions from different threads to be issued in the same clock cycle. SMT takes this one step further in order to exploit the full potential of the functional units. The defining characteristic of SMT processors is the sharing of key datapath components among multiple independent threads, which ensures improved resource utilization.
SMT exploits not only thread-level parallelism (TLP) among the various threads [10], [11], but equally the instruction-level parallelism (ILP) available within each thread. Consequently, due to the sharing of resources, the amount of hardware required by an SMT system is significantly less than that of multiple superscalar machines achieving similar performance. The resources most commonly shared in an SMT system include components that are easier to share in terms of control complexity (such as cache memory and the physical register bank) and those that are better to share in terms of cost (such as the Issue Queue (IQ), reservation stations and the various functional units). On the other hand, more thread-specific components along the datapath (such as the Re-Order Buffer (ROB)) are assumed to remain under per-thread ownership. Due to the requirement of resource sharing, these hardware components tend to remain busy in order to accommodate instructions from all threads. Although allowing these instructions to share these resources realizes the full performance potential afforded by SMT [6], it tends to introduce extra control complexity in managing the critical timing path and the processor’s cycle time. To retain the exploitation of both TLP and ILP, a solution must be introduced that minimizes the complexity among these shared resources without significantly affecting the ILP exploitation. At the same time, proper intelligence has to be incorporated into the resource-sharing mechanism to ensure that threads share these components in the most efficient and fair manner.
II. Instruction Dispatch
Many different terminologies have been adopted for the instruction pipelining stages (e.g. issue, dispatch, etc.) in superscalar systems, and their usage becomes even more ambiguous in an SMT system, in which more resource sharing is required than in a superscalar system. For example, instruction “issuing” has been used to refer to different stages of processing in different articles. Throughout this paper, we adopt the terminologies used by most SMT articles.
In a typical single-thread superscalar system, instructions are “dispatched” from the ROB into the reservation stations (either centralized or functional-unit-specific) when space is available, and then “issued” to the corresponding functional unit whenever the issuing conditions are met, i.e. the operands are ready and the requested functional unit becomes available. However, in a basic SMT system, each thread has its own ROB, and instructions from these thread-specific ROBs have to “compete” for a shared Issue Queue (IQ) through a dispatch scheduling algorithm. This IQ can be considered a centralized reservation station shared not only among the functional units but also among the threads in real time. A functional block diagram of this basic design is depicted in Figure 1.

[Fig. 1. A Simple Four-Thread SMT Instruction Processing Flow: per-thread ROBs (ROB-0 through ROB-3) dispatch renamed instructions into a shared Issue Queue (a centralized reservation station), which issues them to the functional units (FU), followed by the reorder/write stage.]
Due to the significantly large size of each IQ entry, the number of entries in this shared resource is usually much smaller than the number of ROB entries. Since its output feeds into such a tightly shared resource, the instruction dispatch stage is considered one of the most critical stages affecting overall system performance. There have been many different techniques for scheduling instructions to be dispatched from the ROB with the intent of exploiting ILP. Most techniques rely on simple instruction-level readiness to reduce the instructions’ waiting time in the IQ [7]. To better utilize the shared IQ entries, instructions from different threads should be given different priorities depending on the threads’ “activeness” in real time. Threads have been shown to demonstrate various types of transient behaviors throughout their life span, including stretches of time with fluctuating ILP. Instructions from threads of lower ILP (or simply “slower” in instruction issuing) would clog the IQ if they are allowed to continue
to be dispatched into the IQ. On the other hand, instructions from threads of higher ILP (or simply “faster” in instruction issuing) should be given a higher priority in utilizing the IQ. There have not been many research results on how to share the IQ in a more time-adaptive manner, allowing different threads to utilize (occupy) the IQ on an “on-demand” basis. In [5], an adaptive technique is proposed that allows all ROBs to be “shared” among threads, accommodating more active threads with extra ROB entries borrowed from the ROBs of less active threads, where the “activeness” of a thread is based on the ratio between the numbers of issue-bound and commit-bound instructions in the thread’s ROB. The control complexity involved in sharing the ROBs may be the most inhibitive factor in justifying the performance gain from such an approach. Some other more advanced scheduling techniques, such as the one presented in [12], combine more information from different stages in the pipeline to optimize the scheduling/dispatching result, albeit requiring significantly more hardware and control logic.
In this paper, we choose to retain the per-thread ROBs without any sharing among them, and rely on a simple scheduling algorithm for dispatching instructions from different threads. Our analysis shows that the activeness of a thread can fluctuate very unpredictably in time due to several factors, e.g. cache misses, branch mispredictions, write-port latencies, etc. A thread that has been active can suddenly become “inactive” (or, more precisely, much less active) for any of these reasons and stay “idle” in the pipeline for a long duration of time. A write cache miss can easily delay the in-order commit stage, while a read miss can significantly slow down the pipeline due to data dependencies. A branch misprediction requires all speculative instructions to be flushed, and the activeness of the thread consequently suffers until the correct path of instructions is re-established. To make things worse, threads that have recently been more active (than other threads) tend to occupy more of the shared resources (for example, the IQ and reservation stations, among others). When these threads suddenly become inactive, the shared resources typically cannot be released for other threads to use, mainly due to various system or operating limitations inhibiting them from doing so. Consequently, there may be only a very limited amount of shared resources left for the other threads, thus limiting the potential performance. While there could be many different intelligent ways to redistribute the resources, they tend to be either very costly in hardware or simply infeasible to implement for real-time operation due to the excessive logic required to realize the intelligence. To further improve system resource utilization at this level of high-speed instruction processing, one cannot afford a design that involves too much intelligence for real-time implementation.
There have been many non-adaptive dispatching algorithms proposed, including simple round-robin, within-clock-cycle round-robin [2] and two-op-block [7]. None of these algorithms assigns priorities among threads; instead, all are based on a pre-assigned thread index order. The basis for performance comparison in this paper is the traditional simple round-robin dispatching, which starts by dispatching all dispatchable instructions from one thread according to the index order and moves on to the next thread as long as the dispatching bandwidth allows for more instructions; the index order is then rotated so that the next thread starts first in the next clock cycle.
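To make the baseline concrete, the following is a minimal behavioral sketch of this round-robin dispatch policy. This is illustrative code, not the authors' simulator; the per-thread ROBs and the shared IQ are simplified to plain Python lists, and all names are ours.

    def round_robin_dispatch(robs, iq, iq_size, bandwidth, start):
        """One cycle of baseline round-robin dispatch.

        robs:      per-thread lists of dispatchable instructions (oldest first)
        iq:        the shared issue queue, a list capped at iq_size entries
        bandwidth: maximum number of instructions dispatched per cycle
        start:     index of the thread that dispatches first this cycle
        Returns the start index for the next cycle.
        """
        n = len(robs)
        dispatched = 0
        for i in range(n):                      # visit threads in rotated index order
            t = (start + i) % n
            while robs[t] and dispatched < bandwidth and len(iq) < iq_size:
                iq.append((t, robs[t].pop(0)))  # dispatch oldest instruction into the IQ
                dispatched += 1
        return (start + 1) % n                  # rotate the starting thread for next cycle

Note that a single thread with many dispatchable instructions can fill most of the IQ in a few cycles under this policy, which is exactly the dominance behavior quantified in Section IV.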
III. Simulation Environment
This section first describes the simulation environment, including the simulator and the workloads used in our simulations.

A. Simulator
We used M-Sim [4], a multi-threaded microarchitectural simulation environment, to estimate the performance and power of the proposed scheme. M-Sim includes accurate models of the pipeline structures necessary for an SMT model, such as explicit register renaming, concurrent execution of multiple threads, detailed power estimation using the Wattch framework [8], and separate per-thread reorder buffers and register files. The Issue Queue, the execution functional units and the Load-Store Queue (LSQ) are shared among the threads, but the branch predictor is exclusive to each thread. The detailed processor configuration is shown in Table I.
TABLE I. Configuration of the Simulated Processor

Parameter                 Configuration
Machine Width             8-wide fetch/issue/commit
Window Size               16-entry Issue Queue, 48-entry Load/Store Queue, 128-entry ROB
Functional Units &        8 Int Add (1/1); 4 Int Mult (3/1) / Div (20/19);
Latency (total/issue)     4 Load/Store (2/1); 8 FP Add (2); 4 FP Mult (4/1) / Div (12/12) / Sqrt (24/24)
Physical Registers        256 integer and floating point
L1 I-Cache                64 KB, 2-way set-associative, 128-byte line
L1 D-Cache                32 KB, 4-way set-associative, 256-byte line
L2 Cache (unified)        2 MB, 8-way set-associative, 512-byte line
BTB                       2048-entry, 2-way set-associative
Branch Predictor          2K-entry gShare, 10-bit global history per thread
Pipeline Structure        5-stage front-end (fetch-dispatch), scheduling, register file access (2 stages), execution, write back, commit
Memory                    64-bit wide, 200-cycle access latency
B. Workloads
For multi-threaded workloads, we use mixes from the SPEC CPU2000 benchmark suite [1], [3], [8] based on ILP classification. Each benchmark is initialized in accordance with the procedure specified by the SimPoint tool [9], and then up to 100 million instructions are simulated. Once the first 100 million instructions are committed from any of the threads, the simulation is terminated.
In order to categorize multithreaded workloads, each of the benchmarks is first simulated individually in a single-threaded superscalar environment. As shown in Table II, three types of ILP, namely low ILP (memory bound), medium ILP and high ILP (execution bound), are identified, and six mixes of multi-threaded workloads are formed based on different combinations of ILP types. Mixes 5 and 6 are both used to demonstrate their discrepancies in various performance comparisons even though they have the same ILP classification.

TABLE II. Simulated Multi-threaded Workload

Mix     Benchmarks                       Classification (ILP): Low  Med  High
Mix 1   perlbmk, mesa, swim, crafty                            1    0    3
Mix 2   mcf, equake, vpr, lucas                                4    0    0
Mix 3   applu, ammp, mgrid, galgel                             0    4    0
Mix 4   mcf, equake, mesa, crafty                              2    0    2
Mix 5   perlbmk, lucas, galgel, gcc                            1    2    1
Mix 6   mesa, swim, apsi, mgrid                                1    2    1
IV. Motivation
In a typical SMT system, a shared Issue Queue (IQ) is usually used as a set of centralized reservation stations for all functional units. When an instruction from a thread is dispatched from its corresponding ROB, it is assigned an entry in the IQ, where it waits for its source operands to become ready and for an available functional unit to issue to. Each entry in the IQ (or the corresponding reservation station in the traditional model) is required to hold all the necessary information for the instruction, including the contents of all source operands, before it can be issued. Increasing the number of entries in the shared IQ is thus usually hampered by the cost factor, and the utilization of these limited resources therefore becomes very critical to overall system performance.

The proposed technique is based on two conjectures: if the IQ is not properly managed, imbalanced occupancy among threads does occur, and the threads dominating the IQ slots are likely to frequently become much less active or even stall. This section is devoted to supporting these conjectures.
A. IQ Occupancy Analysis
We first look into how the IQ may be dominated by a single thread in SMT processing. From one of our analytical simulations, Figure 2 shows the percentage of time that at least one thread is occupying at least the given number of IQ entries, where R denotes the number of ROB entries and q the size of the IQ used.

[Fig. 2. IQ Occupancy Dominance Rate: percentage of clock cycles in which at least one thread occupies at least the given number of IQ slots (1-32), plotted for Mixes 1-6 under (R, q) = (128, 32) and (64, 32).]

When the ROB size is relatively large compared to the IQ size (as shown in the figure for (R, q) = (128, 32)), the
dominance tends to be more severe due to more instructions being available for dispatching from a thread. For example, in the results from running Mix 3, over 50% of the time there is at least one thread occupying 27 of the 32 IQ entries, and over 80% of the time at least one thread occupies at least half of the entries, clearly indicating the imbalance in resource usage. Such single-thread usage dominance may easily lead to performance degradation when the dominating thread suddenly stalls, leaving very few precious entries for the other threads to compete for. Even for the mixes that are not as imbalanced (Mixes 2 and 4), roughly 40% of the time at least one thread is still taking up half of the resources. Figure 3 shows the average percentage over all six mixes. It clearly shows that, on average, in over 60% of clock cycles there is at least one thread occupying more than 50% of the IQ slots, and three quarters of the resources are tied up by one thread over 30% of the time. When such a dominating thread somehow becomes “inactive”, the consequence can be significant.
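For concreteness, the dominance statistic plotted in Figures 2 and 3 can be derived from a per-cycle occupancy trace along the following lines. This is a minimal sketch; the trace format is our assumption, not an M-Sim structure.

    def dominance_rate(trace, q):
        """trace: list, one entry per cycle, of per-thread IQ occupancy tuples.
        Returns, for each threshold k in 1..q, the fraction of cycles in which
        SOME thread holds at least k IQ slots."""
        cycles = len(trace)
        return {k: sum(1 for occ in trace if max(occ) >= k) / cycles
                for k in range(1, q + 1)}

    # e.g. dominance_rate([(27, 2, 2, 1), (16, 8, 4, 4)], 32)[16] -> 1.0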
[Fig. 3. Average IQ Occupancy Rate: percentage of clock cycles, averaged over all mixes, in which at least one thread occupies at least the given number of IQ slots (1-32), for (R, q) = (128, 32) and (64, 32).]
B. Inactive Resource-Dominating Threads
Another simulation was performed to look into the likelihood that a thread becomes inactive while already occupying a significant number of IQ slots. For each of the six mixes, Figure 4 shows, out of all clock cycles in which a thread occupies at least a specified number of IQ slots, the percentage of those cycles in which the thread is “inactive”. A very lenient condition is used here to classify a thread as “inactive”: the thread has not issued any instruction for at least 10 consecutive clock cycles. The results clearly show that, for all mixes, threads very frequently become inactive while occupying a large number of IQ slots. For example, in Mix 2, when a thread takes up at least 19 of the 32 IQ slots, it is in the “inactive” mode as defined over 80% of the time, and the inactiveness percentage goes up to 90% when a thread occupies over 75% of the slots. In general, the more prominent the per-thread dominance, the higher the likelihood of inactiveness. This scenario will undoubtedly block the flow of the other threads, which can only compete for the leftover IQ slots.

[Fig. 4. Percentage of Time When a Thread Becomes Inactive When Occupying a Specified Number of IQ Slots: inactivity percentage for Mixes 1-6 versus the number of IQ slots occupied by a thread (8-32).]

V. Capping Per-Thread IQ Usage
There are many different ways of controlling the threads’ usage of the IQ. One straightforward approach is to simply cap the number of slots each thread is allowed to occupy at any time. Let q and n denote the size of the IQ and the number of threads, respectively. Obviously, any cap value below q/n will not lead to any performance gain: with a cap value c < q/n there will be at least q − n·c (> 0) slots constantly left unused, which is essentially the same as using a smaller IQ. Thus, the cap value should never be set lower than q/n, and the performance with a cap value c1 will always be worse than that with a cap value c2 if

    c1 < c2 < q/n.    (1)
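For example, with the configuration studied here (q = 32 IQ entries and n = 4 threads, so q/n = 8), a cap of c = 6 would leave at least 32 − 4·6 = 8 slots permanently unused, making the system behave as if it had only a 24-entry IQ.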
For a cap value c ≥ q/n, overall performance may vary depending on the transient behavior of the threads. In general, if the cap value is set too high, the benefit of allowing other threads to use the IQ when a dominating thread becomes idle will not be as prominent. On the other hand, if the cap value is set too low, overall performance may suffer when the amount of “concurrent” activity among threads is not high; that is, when not all threads are active at the same time, the utilization of the IQ is sacrificed by a low cap value. An argument in favor of a low cap value, however, stems from situations where thread ILP is low (and out-of-order issuing and execution is thus not prevalent), in which case allowing a higher cap value does not benefit thread throughput much, if at all.
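As noted later in this section, enforcing such a cap needs only one occupancy counter per thread and a comparator against the cap value. The sketch below layers this check onto the round-robin dispatcher shown earlier; it is illustrative code under our simplified model, not the authors' implementation.

    def capped_dispatch(robs, iq, iq_size, bandwidth, start, occupancy, cap):
        """Round-robin dispatch with a per-thread cap on IQ occupancy.
        occupancy: per-thread counters mirroring each thread's current IQ usage."""
        n = len(robs)
        dispatched = 0
        for i in range(n):
            t = (start + i) % n
            while (robs[t] and dispatched < bandwidth
                   and len(iq) < iq_size
                   and occupancy[t] < cap):     # the only change: the cap comparator
                iq.append((t, robs[t].pop(0)))
                occupancy[t] += 1               # counter incremented on dispatch
                dispatched += 1
        return (start + 1) % n

    # occupancy[t] would be decremented when an instruction of thread t leaves
    # the IQ (i.e., when it issues), keeping the counter equal to t's IQ usage.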
Plots in Figure 5 show the IPC (Instructions per Clock Cycle) values for different cap values. The combinations of ROB size and IQ size tested are (R, q) = (128, 32), (96, 32) and (64, 32). As aforementioned, any cap value below 8 (= q/n) leads to detrimental effects, which is clearly demonstrated in these figures. The monotonically decreasing performance claimed in Equation 1 as the cap value is reduced below the threshold of q/n is also clearly verified. Note that when the cap value is set at 32 (= q), the system becomes the original default configuration without any cap restriction.

[Fig. 5. IPC Performance Comparison With Fixed Cap Values: IPC of Mixes 1-6 versus cap values 1-32, with one panel for each of (R, q) = (128, 32), (96, 32) and (64, 32).]

In order to further illustrate how different cap values affect the performance of different mixes, the plots in Figure 6 and Figure 7 give the IPC comparison for each individual mix, with cap values set no lower than 8, using the combination (R, q) = (128, 32). From these results, we can easily conclude that for some mixes the proposed capping approach has a very significant impact on performance; however, different mixes reach their performance peaks at different cap values. For example, Mix 2 peaks at a cap value of 13 with a margin of performance gain close to 9%, and Mix 5 peaks at a very small cap value of 11 with a performance gain close to 7%. For some other mixes the gain is not as significant, and for these mixes the peak performance mostly occurs at a cap value very close to q, indicating that the method has less effect on them.

[Fig. 6. IPC Performance Comparison for Mixes 1-3 under (R, q) = (128, 32): per-mix IPC versus cap values 8-32.]

[Fig. 7. IPC Performance Comparison for Mixes 4-6 under (R, q) = (128, 32): per-mix IPC versus cap values 8-32.]

Since it is impossible to prescribe in advance a different cap value for a given mix in order to achieve the best performance, either a fixed cap value has to be used or more intelligence needs to be incorporated for an adaptive approach. Here we first show the performance gain from a fixed cap value. Figure 8 shows the average performance gain over all six mixes when a fixed cap value is used; again, only cap values of at least 8 are considered. When the ROB size is small compared to the IQ size, for example R = 48 and R = 64, the best cap values are toward the lower range: R = 48 peaks at a cap value of 13 and R = 64 peaks at 15. On the other hand, when the ROB size is relatively larger compared to the IQ size, the best cap value tends to move up to the higher range: R = 96 peaks at 23 and R = 128 peaks at 25. In general, from these mixes, if the cap value is set anywhere between 40% and 75% of the IQ size, a very dependable performance gain can be expected.

[Fig. 8. Average IPC Improvement with Various (R, q) Combinations: average % improvement over all six mixes versus cap values 8-32, for (R, q) = (128, 32), (96, 32), (64, 32) and (48, 32).]

Compared to other adaptive dispatching algorithms, which require extensive additional storage hardware and control logic that may delay real-time processing, this technique achieves a very respectable performance gain with very little extra logic: a few extra counters for retaining thread IQ occupancy information and a few comparators for checking against the set cap value.
VI. Adaptive Capping
As shown in the previous section, the capping approach shows very promising results, but an arbitrarily assigned fixed cap value does not always lead to a desirable gain for different mixes. It remains to be seen whether the cap value can be dynamically adjusted for different mixes, or even, for the same mix, adjusted according to the real-time transient behaviors demonstrated by each individual program.

The proposed adaptive capping method is based on the notion that the more threads are currently “actively” moving through the system, the more strictly the sharing of IQ slots among threads should be enforced; on the other hand, if fewer threads are currently “active”, then each thread should be allowed to occupy more IQ slots. This leads to a simple conjecture that the cap value should be dynamically adjusted according to the number of “active” threads at any given time.
There are two issues to address in order to effectively implement this approach: how to dynamically classify the “activeness” of a thread, and how to set the corresponding cap values for the different activeness scenarios, that is, for different numbers of active threads. For the “activeness” issue, we choose to set a time threshold: if a thread has not issued for a certain consecutive number of clock cycles, it is declared “inactive”. Note that the true activeness of a thread should always be a relative, degree-wise quantification rather than the boolean classification used here; such an overly simplified classification is adopted for the sake of fast processing and minimal hardware requirements. The value of this setting, the consecutive number of clock cycles a thread has not been issuing (denoted as cc_nis), obviously affects performance dearly, since it directly determines how likely a thread is to be declared “inactive”. If this number is set too small, it becomes more likely that more threads will be declared “inactive”, which leads to higher cap values, which in turn may jeopardize the dispatch chances of the threads that are actually active. On the other hand, if this threshold is set too high, it becomes less likely for a thread to be declared “inactive” (and a truly inactive thread is declared so only after a longer delay), leading in general to smaller cap values, which may restrict truly active threads from fully utilizing the IQ while waiting for the inactive threads to be declared “inactive”. The number of “active” threads thus classified is denoted as n_active.
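A sketch of this classification logic follows, assuming one small idle-cycle counter per thread; the function and variable names are ours, as the paper specifies only the counters and the cc_nis threshold.

    def update_activeness(issued, idle_cycles, cc_nis):
        """issued: per-thread booleans, True if the thread issued this cycle.
        idle_cycles: per-thread counters of consecutive non-issuing cycles.
        Returns per-thread active flags; n_active is their sum."""
        active = []
        for t, did_issue in enumerate(issued):
            idle_cycles[t] = 0 if did_issue else idle_cycles[t] + 1
            active.append(idle_cycles[t] < cc_nis)  # inactive after cc_nis idle cycles
        return active

    # Example: with cc_nis = 10, a thread is declared inactive once it has not
    # issued for 10 consecutive cycles; n_active = sum(update_activeness(...)).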
The second issue is the exact setting of the cap value (denoted as n_cap) for different numbers of active threads. A systematic formula is adopted in our proposed technique to dynamically set this cap value:

    n_cap = q × (1 − n_active × d)

where q is the total size of the IQ and d represents a set decrement percentage that further reduces the cap value for each additional currently “active” thread, to promote the sharing of the IQ among the “active” threads. For example, if q = 32, then the cap values will be dynamically set according to Table III under different d values. Note that cap values are supposed to be integers only; thus, the cap values shown in the table for d = 15% are essentially treated as 32, 27, 22, 17, and 12, respectively.
TABLE III. Dynamic Cap Setting (q = 32)

d       n_active = 0      1       2       3       4
5%             32.0     30.4    28.8    27.2    25.6
10%            32.0     28.8    25.6    22.4    19.2
15%            32.0     27.2    22.4    17.6    12.8
20%            32.0     25.6    19.2    12.8     6.4

For the purpose of a simple illustration, this dynamic setting follows a straightforward linear decrement percentage with each additional “active” thread; however, a nonlinear decrement scheme is likely to lead to a better overall setting, which is not discussed in this paper due to its arbitrariness.
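The dynamic setting is cheap to compute. The snippet below, a sketch assuming truncation to an integer (as implied by the treated values 32, 27, 22, 17, 12 quoted above), reproduces the d = 15% row of Table III:

    def adaptive_cap(q, n_active, d):
        """n_cap = q * (1 - n_active * d), truncated to an integer number of slots."""
        return int(q * (1.0 - n_active * d))

    print([adaptive_cap(32, n, 0.15) for n in range(5)])   # -> [32, 27, 22, 17, 12]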
The setting of this decrement percentage value, d, is also critical to an effective scheme. If this percentage is set too small, the difference between the cap values used for different numbers of active threads becomes too small for the benefits of the capping method to be realized; on the other hand, if the decrement is set too high, the utilization of the IQ may easily suffer when the activeness of threads is not declared properly. That is, there is a strong correlation between the settings of cc_nis and d in an effective approach. In our simulations, a spectrum of cc_nis values from 1 to 20 is tested, along with a set of three different d values: 5%, 10% and 20%. All six mixes are tested individually (see Figure 9 and Figure 10). We can see that different mixes of benchmark programs react very differently to the combinations of cc_nis and d. Some mixes, such as Mixes 2, 4 and 5, prefer a larger decrement percentage, all performing better with d = 20%, while the other three mixes (1, 3 and 6) perform worse with d = 20% than with the smaller d values.

[Fig. 9. IPC Performance Comparison for Mixes 1-3 under (R, q) = (128, 32) with Adaptive Capping: per-mix IPC versus cc_nis (1-20) for d = 5%, 10% and 20%.]

[Fig. 10. IPC Performance Comparison for Mixes 4-6 under (R, q) = (128, 32) with Adaptive Capping: per-mix IPC versus cc_nis (1-20) for d = 5%, 10% and 20%.]
The average improvement from this technique over the six mixes is shown in Figure 11. Compared to the fixed-cap approach of the previous section, where the best improvement with (R, q) = (128, 32) lingers around 2%, this adaptive approach further raises the improvement to close to 3% when the right cc_nis and d are selected. Again, such an improvement does not require a significant increase in hardware: a few additional counters and comparators suffice.

[Fig. 11. Average IPC Improvement with Various d Values Using Adaptive Capping: average % IPC improvement over the six mixes versus cc_nis (1-20) for d = 5%, 10% and 20%.]
VII. Conclusion

This paper has clearly demonstrated that the utilization of resources shared among the threads in an SMT system can significantly affect overall performance. By simply capping the resource usage of any one thread, the utilization of the critical resources can be vastly improved. Further adjusting such a cap according to real-time thread behaviors demonstrates additional performance gain. A nonlinear setting of the cap values should further improve the potential by allowing more flexible cap adjustment when the size of the IQ is changed. An autonomous adjustment scheme, allowing the system to self-adjust the cap value toward the best performance by monitoring the transient behavior of the threads, would be even more appealing. These directions will be studied in future research.
References
[1] D. Burger, T. Austin. “The SimpleScalar tool set: Version 2.0.” Tech.
Report, Dept. of CS, Univ. of Wisconsin-Madison, June 1997 and
documentation for all Simplescalar releases.
[2] M. Debnath, B. Lee, and W.-M. Lin, “Prioritized Out-of-Order Instruction Dispatching Techniques for Simultaneous Multi-Threading (SMT) Processors,” WIP session of the 30th IEEE Real-Time Systems Symposium (RTSS), Dec. 2009.
[3] J. Henning, “SPEC CPU2000: Measuring CPU Performance in the New Millennium,” IEEE Computer, 33(7):28-35, July 2000.
[4] J. Sharkey, “M-Sim: A Flexible, Multi-threaded Simulation Environment.” Tech. Report CS-TR-05-DP1, Department of Computer
Science, SUNY Binghamton, 2005.
[5] J. Sharkey, D. Balkan and D. Ponomarev, “Adaptive Reorder Buffers for SMT Processors,” Proc. 15th International Conference on Parallel Architectures and Compilation Techniques, 2006.
[6] J. Sharkey and D. Ponomarev, “Efficient Instruction Schedulers
for SMT Processors,” Proc. 12th Intl Symp. High Performance
Computer Architecture (HPCA), 2006.
[7] J. Sharkey and D. Ponomarev, “Exploiting Operand Availability
for Efficient Simultaneous Multithreading,” IEEE Transactions on
Computers, vol. 56, no. 2, February 2007.
[8] T. Sherwood, et al. “Automatically Characterizing Large Scale
Program Behavior,” Proc. ASPLOS, 2002.
[9] Standard Performance Evaluation Corporation (SPEC) website,
http://www.spec.org/
[10] D. Tullsen et al., “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” Proc. Intl Symp. Computer Architecture, 1995.
[11] D. Tullsen et al., “Exploiting Choice: Instruction Fetch and Issue on
an Implementable Simultaneous Multithreading Processor,” Proc.
Intl Symp. Computer Architecture, 1996.
[12] H. Wang, R. Sangireddy, “Optimizing Instruction Scheduling through Combined In-Order and O-O-O Execution in SMT Processors,” IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 3, pp. 389-403, March 2009.

Tilak K. D. Nagaraju (Tilak Kumar Develapura Nagaraju) received his B.S. degree in Electronics and Communications Engineering in 2005 from New Horizon College of Engineering, Bangalore, India. He received his M.S. degree in Electrical Engineering from the University of Texas at San Antonio (UTSA) in December 2008 and an M.S. degree in Computer Engineering, also from UTSA, in August 2011.
Caleb Douglas is a Senior currently pursuing
his B.S. degree in Electrical and Computer
Engineering at the University of Texas at San
Antonio. Caleb was a member of the NSF-REU funded ESCAPE (Experimental Study on Computer Architecture and Performance Evaluation) program at UTSA in 2011. He plans to graduate in Fall 2012 and then either pursue an M.S. in Computer Engineering or enter industry. His research interests are in Computer Architecture, specifically CPU Performance Evaluation.
Wei-Ming Lin received the B.S. degree in Electrical Engineering from National Taiwan University, Taipei, Taiwan, in 1982, and the M.S. and
Ph.D. degrees in Electrical Engineering from
the University of Southern California, Los Angeles, in 1986 and 1991, respectively. He was
an assistant professor in the Department of Electrical and Computer Engineering at Mississippi
State University before joining the University
of Texas at San Antonio (UTSA) in 1993, and,
since 2004, he has been a professor of Electrical
Engineering there, and also the Associate Dean for Graduate Studies in
the College of Engineering since 2006. His research interests include
distributed and parallel computing, computer architecture, computer networks and internet security.
Eugene John received his Ph.D. in Electrical Engineering from the Pennsylvania State
University. He is currently a Professor in the
Department of Electrical and Computer Engineering at the University of Texas at San
Antonio. His research interests include Low Power VLSI Design, Computer Architecture, Multi-core Systems, Nanoelectronic Systems, Computer Performance Analysis, Multimedia and Network Processing, and Reconfigurable Computing.