An Elementary Processor Architecture

advertisement
An Elementary
Processor
Simultaneous
Instruction
Issuing
with
Hiroaki
Hirata,
Kozo
Akio
Kimura,
Nishimura,
Satoshi
Yoshimori
Media
1006
Nagamine,
Nakase
Research
Matsushita
Electric
Kadoma
this
paper,
our
we
processor
threads
a single
begin
execution
This
unless
parallel
the utilization
that
there
speed-up,
conventional
two
Our
unit
conflicts
bet ween
oped
a new
loop
execution
flow
mechanism
static
Simulation
loop
code
threads
be
loops
or VLIW
con-
in parallel
and
iterations,
by
a 3.72
tributed
tion
a
the
are
have
difficult
it
access.
On
units
thread
predicting
in
the
two
arises
target
sor architecture:
Introduction
nique
to
due to remote
We
are
currently
tion
system
which
them
through
ation
of high
power.
creates
computer
quality
Furthermore,
faithfully
are
developing
as possible,
also needed.
an
virtual
integrated
worlds
graphics.
visualiza-
and represents
However,
the
requires
great
in order
to model
the real world
In this
numerical
paper,
processing
nique
as
computations
we present
a processor
at
a thread
ther
a control
pipelined
Our
material
is
for dmect commercial
advantage, the ACM copyright notice and the title of the pubkatlon
and
its date appear, and notice is given that copying IS by perrmsslon of the
Association for Computing Machinery. To copy otherwise, or to republish,
requmes a fee and/or specific pernusslon,
provided
@ 1992
that
the
cop]es
are
not
made
or
distributed
ACM 0.89791 .509.7/92/0005/0136
$1.50
136
performance
does
instruction
of a renot
help
in
we introduced
into
When
other
our
proces-
and
parallel
tech-
long
latencies
a thread
encoun-
rapidly
switches
hand,
parallel
multi-
is a latency-hiding
to
When
an
be issued
dependency
technique
de-
from
because
within
the
another
is especially
tech-
instruction
thread
effective
of eithread,
is exe-
in a deeply
processor.
goal
processor
granted
able
data
inter-
of the single
during
level.
is not
or
from
operation
processor
a processor
an independent
This
the
instruction
from
cuted.
Permission to copy without fee all or part of this
the
result
multithreading
active
the
memof func-
multithreading
access.
On
within
utiliza-
utilization
problems,
concurrent
of data,
threads.
threading
processor
techniques
remain
ac-
case of a dis-
executions.
these
memory
an absence
between
gener-
images
intensive
ters
past
of future
The
attempts
the
branch
concurrent
multithreading.
1
if the
of multithreading
systems.
of data
optimization
conditional
the
threads(parallel
functional
to the
to overcome
kinds
low
can
and
obstacle
executed
In order
machines.
hand,
for
due to remote
a processor
execution
possible
low
latencies
dependencies
Another
peatedly
system,
other
within
time
a number
In
tests
coarse-grained
many
branches.
rare-
intersection
on multiprocessing
long
the
and
generating
inherent
create
memory
from
for
processing
has
executes
conditional
shared
control
to parallelize
however,
instruction
Another
of the
test
in a parallel
system.
ray-tracing
algorithms,
can easily
can result
lays.
graphics,
to be executed
and
tional
devel-
multiple
makes
ex-
functional
technique.
using
architecture,
which
we
and
processor
a graphics
algorithms
part
This
base
such
these
a large
thread,
over
to the efficient
to control
for
In
program.
cesses
run
famous
images.
Each
as the
of computer
very
processes)
results
achieved
are
parallelism
ory
scheduling
of our
could
account
improves
a 2.02
can
In order
scheme,
to parallelize
vector
unit.
is also applicable
loop.
used
which
whole
processor.
of a single
architecture[l]
machine
alistic
can
unit
greatly
processor,
architecture
ecution
instructions
functional
and four
respectively,
RISC
co.jp)
diosity
simultaneously
these
are
In
different
scheme
of the functional
on a nine-functional-unit
times
and
execution
by executing
from
are issued
units,
Ltd.
Japan
In the field
throughput.
instructions
thread)
functional
flicts.
show
machine
architecture,
(not
to multiple
improves
Mochizuki,
Nishizawa
Co.,
571
a multithreaded
propose
which
Teiji
Threads
Laboratory,
Osaka.
Abstract
architecture
Yoshiyuki
and
Industrial
(hirataQisl.mei.
In
Architecture
from
Multiple
is to
design
achieve
which
an elementary
processor
system
than
rat her
The
idea
ware
resource
for
of parallel
opt imizat
an efficient
is oriented
in a large
use in
and
cost-effective
specifically
scale
for use as
multiprocessor
an uniprocessor
multithreading
ion where
arose
several
syst em.
from
processors
hardare
united
and reorganized
fully
used.
system
each
Figure
organization.
physical
utilization
1(a)
and
of the
T represents
In this
busiest
improved
united
they
to
conflict
with
functional
unit.
The
programmer’s
;**O
;
of the
Physical
/
Processor
be united
to be
on
from
(a)
Conventional
unless
for the
the
processor
Figure
l(b)
is almost
cal processor
ganization
in Figure
putation,
parallel
assign
ble to our
In
Interconnection
loop
we
ourselves
will
field
of
physical
or-
techniques
developed
or doacross
techniques
are also
●
com-
The
In section
2, our
...........
..................................
4, preceding
outlining
paper
via
works
remarks
with
are made
Processor
In
An
section
architecture.
in section
Figure
are
5.
in
called
Figure
cache,
thread
a program
slots.
counter,
an instruction
physically
Each
fetch
shared
unit
among
average,
count
queue
some
instructions
succeeding
by the program
slots,
the
B=
and
counter.
S x C words,
C is the
instruction
slot,
associated
with
up a logical
processor,
while
and
unit
The
where
number
all
decode
functional
A
with
pairs,
logical
instruction
is provided
unit
thread
An
least
and
units
tion
a buffer
the
instruction
buffer
S is the
of cycles
processors
organization
decode
which
required
sued
saves
for
in
would
unit
and
unit.
The
the
that
The
So, on
unit
threads
en-
t bread
can
instruction
bottleneck
for
a case,
in-
is done
queue
one of the
however,
an instruction
decodes
and
it.
cache
a processor
snot her
data
the
from
The
cache
an instruc-
instruction
a load/store
instructions
using
if it is free
When
in
threads.
instruction
In such
type,
Branch
buffer
be needed.
gets
unit
B instruct,he instruc-
operation
multiple
one
become
slots.
on a RISC
are checked
size needs
number
decode
queue
the
operation.
might
t bread
unit
fill
from
fetching
instruction,
fetching
unit
assumed.
are
fashion
at most
cycles
to
in B cycles.
the
fetches
C
This
buffer
ers a branch
fetch
unit
attempts
the
once
based
processors.
has
elementary
system
every
unit.
an interleaved
is filled
processor
makes
and
wit h many
unit
thread
queue
and fetch
2, the
queue
fetch
one
struction
and
in
united
1: Multiprocessor
for
tion
Con-
Organization
instruction
using
instruction
tions
Model
Hardware
shown
set
architecture
architectures
our
Organization
preempt
As
*
●
multi-
Architecture
Machine
several
●
Functional
Units
effectiveness
results.
on multithreaded
(b)
as follows.
processor
the
ar-
the
2.1.1
Sequencer
and
concurrent
simulation
to be compared
of our
is organized
3 demonstrates
architecture
2.1
. . . . . .. . . . .+
. . . . . . . . . . . .. . . . . . . . . . . . . .-
Physical
~ ~
Processor ...
~j, .........
......... ...
,...,,,.,,,,.,,,,.,,:,
... ....,,:
.. ... ...,,..,,,..,,.,,
.. ... .. ...
.. .. ... ... ~e@ter
applica-
multithreading
multithreaded
Section
most
parallel
rest of this
is addressed.
2
and Memories
an Instruction
..................................
..................................
,::,:.:,,,,:::::::::::::::::,:.:::and a Data Caches
Processor
~,,
,:
,:
,:
:;
● 0::
,:
,!
::
,:
::
physi-
of numerical
concentrate
upon
to simply
threading.
cluding
the
to processors,
interest
presented
the
as doall
iterations
view
although
execution
such
paper,
chitectural
of our
In
as the
in
.Z1
Network
same
processor.
this
restrict
1(a),
is different.
for multiprocessor,
which
as same
logical
organization
the
different
executed
to compete
of
set
Wlte,isk?r
. ... .. .. ..
............ . .........
Logical
view
Functional
Llnits
the utilization
instructions
one another
sequencer
$
☛☛
of the
be expected
and
....J......
, . ... J.....
number
Consequently,
simultaneously
z
and Memon&
unit
cycles
could
so that
90~0.
multiple
are issued
-J
. .. . ..
instruction
latency
execution
could
Network
the
●
l(b),
30x3=
that
IV is the
processors
unit
Interconnection
of a processor
of the
where
total
in Figure
processor,
threads
the
functional
nearly
assume
unit
is an issue
L
case, three
one as shown
of the
unit,
be
on
of the functional
as U = A?# x 100[Yo],
program.
into
example,
functional
utilization
could
are executed
3070 because
The
units
multiprocessor
threads
For
busiest
is about
of invocations
unit,
a general
processor.
dependency.
is defined
functional
shows
Multiple
of the
in Figure
level
so that
l(a)
are executed
dependencies
within
is
the
of an instruction
scoreboarding
of dependencies.
set is
architecture
technique,
Otherwise,
and
is-
issuing
is
interlocked.
indicated
Issued
to be at
of thread
struction
to access
units.
structions
cache.
137
instructions
schedule
Unless
issued
are dynamically
units
an
and
instruction
from
other
scheduled
delivered
conflicts
decode
units
to
with
by
in-
functional
other
in-
over the same
Instruction
Cache
le
------‘------------K-R&&er
------
------
(allocated
executing
the
unit,
struction
for
thread)
unit,
there
it
decode
station,
could
from
they
issued
dependencies
is used
bank
has
ports
rent
with
thread
station
in our
one,
two
more
register
is executed,
is logically
ports
are unnecessary
multithreading,
bound
in order.
depth
and
stations.
2.1.2
banks,
and
one
bank
logical
write
data
processor
allocated
for
processor.
3(a)
the
banks
than
the
are
bank
special
between
logical
model
include
level.
our
machine
files,
multiple
facilities.
network
is also
thread
slots,
The
between
in-
functional
a disadvantage.
Pipeline
are
More
stages,
The
concur-
queue
the
thread
The
exclu-
by
followed
unit
is checked
138
dynamic
pipeline
two
in
two
Figure
decode(D)
followed
by
are shown
is read
in one cycle.
scheduling
3(b).
ion
from
the
stages,
logistages
followed
multiple
by
a
execution
stage.
a buffer
In stage
DI,
In stage
or not.
in
In
fet ch(IF)
for convenience,
is tested.
if it is issuable
of the logical
of a superpipelined
instruct
by a write(W)
IF stages
of the instruction
the
as shown
stage,
an instruction
a
the instruction
is a modification
pipeline,
followed
schedule(S)
When
shows
which
processor
Each
is provided
slots.
register
cal processor
can be
to support
thread
as
of which
port.
operand
In order
than
well
to a thread.
because
physical
as
each
other
registers
scheduling
in
Instruction
RISC
into
banks
queue
and
processor
differs
processor,
the
the
and
Figure
banks
to the
instruction
to be free of
set private
with
complexity
processor
logical
communications
and
dynamic
the
register-transfer
cost of register
algorithm[2]
file,
enable
disadvantages
in Tomasulo’s
register
register
contrast,
at the
a logical
though
A standby
is one,
In
Registers
that
the hardware
units
instruc-
between
guarantees
which
creased
same
binding
it.
Files
Queu
slots
access any
to
Some
in a standby
of order,
not
processors
even
the
thread
bank
registers
re-
Register
and
three
are guaranteed
is divided
read
out
unit
does
conflicts.
from
with
bound
architecture.
register
in standby
stays
.. ---- . . ..
Lai-ge
a register
can al-
Consequently,
whose
instructions
as a full
read
a decode
latch
and
operations
instruction
general-purpose
floating-point
stored
from
a reservation
large
their
instruction
add
station
. . . ~-------
sive one-to-one
unit.
stations
cause resource
are executed
is a simple
because
with
be sent to the ALU.
a thread
are issued
station
A
a shift
a succeeding
thread
from
while
schedule
Standby
instructions
to
in-
by an instruction
in a standby
to proceed
issued
For example,
selected
-------
(allocated
for
ready thread)
immediately
the
. . . ..-.
Set
organization
Otherwise,
by the instruction
it is selected.
units
2: Hardware
is sent
is not
--------
K-Register
(allocated
for
waiting
thread)
executed.
is stored
until
if previously
tions
and
an instruction
schedule
mains
instruction
unit
is arbitrated
When
low
the
functional
..
Set
Figure
functional
------
Ki&ister
Set
in fact,
the format
or type
D2, the instruction
Stage
instruction
and,
of an instruction
S is inserted
schedule
for
units.
Z11WIIW21DI
IWISJEXI]EX2
. . . . . . , . - . - . . . -- . --.., --...,
,,,
‘------ . . . . . . . . . . . . .!----- -
Lw
Table
1
--------
Functional
D]
IF2/1
category
unit
instructions
(b)
pipeline
Instruction
Figure
of base
processor
RISC
3: Instruction
processor
of EX
stages
instruction.
For example,
we assumed
each operation
ble
result
1.
The
before
ber
the
instruction
of EX
of stages
may
to access the
cache,
is two
ery
the
cycles.
cycle.
the
issue
cache
the
are
the
compare
1
2
shift
1
2
Integer
multiply
1
6
Multiplier
divide
1
6
FP Adder
add/subtract
1
4
compare
1
absolute/negate
1
Shifter
I
aa well
operands
are
S. A scoreboarding
register
is flaged
the
stages.
EX
initiation
same
instruction
a source,
and 12 in Figure
In
the
fetch
end
3(a).
case
of stage
of
D1,
unknown
ately
11.
five.
some
after
13 is four
What
threads
cycles
The
same
branch
point.
is
into
to
Figure
is worse,
3(b),
But
it
encounter
could
the
last
the
delay
delay
at the
sor
11
threads
of instruc-
Consequently,
a potential
as other
to
arithmetical
throughput.
3(a)
than
five
time.
the
ing
memory
tructions
is
ment
able
139
issues
are copied
buffer,
The
access
state
and
processor
out-of-order
can
the
by
proces-
number
the
of
number
be resident
overhead
on
to save
memory.
buffer,
of the
which
When
load/store
instructions,
and recorded
when
is pipelined
context
contains
requirements.
execution
than
rapidly
as the
exceed
threads
element
deleted
frames
a logical
long
not
reducing
to/from
important
access requirement
11
It
frames,
the
running
As
the
hold
respectively.
context
between
and
for
(which
is achieved
does
processor,
register
areas
words
state),
binding
and allocated
as save
more
register
grouped
save
status
switching
frame.
contexts
frame,
are used
scheduled
context
physical
Another
if
be
register
save
is conceptually
address
has
logical
a context
or restore
at the
between
context
of physical
address
of thread-specific
slots,
to
and floating-point
and program
processor
and
Multithreading
a context
register
the
threads
immedi-
called
pieces
changing
outcome
same
types
machine
register,
instruction
counter
thread
purpose
An
program
Since
instruction
Figure
more
is dam-
some
has
an instruction
entity
status
various
11 is a branch
in
a single
a thread.
the
unit
of general
a thread
of
are also required
executed
become
branches
stage
11 as
Assume
Concurrent
status
3(b).
13 is an instruction
In
in
branch
other
as well
of
Support
a thread
between
an
while
executed.
enhance
and
of instruction
if a conditional
cycles.
2
performance
scheme
and
in ev-
is, as-
fetch
delays,
with
That
instruction
are
of branches
pair
on the
cycles
threads
multithreading
together
back
the
required
thread
however,
of branches,
Each
delay
instructions,
to the
even
are
in Figure
at that
and
additional
instructions.
pipeline
is sent
instruction
and
three
RISC
request
is still
avoid
in
other
delay
2.1.3
be
a destination
off
single
sets,
is written
to
S and
from
the
the
because
operation
instructions
registers
12 uses the result
at least
base
from
stage
can
4
H
type
to
instructions
value
corresponding
in
This
read
of data-dependent
suming
in the
bit
on
2
n
121[4)1
architecture,
parallel
hide
as the instruc-
of these
that
In our
tions
the
num-
can be accepted
result
3),
to a register.
Required
store
2
number
assumed
of load/store
latency
W,
a result
1
are blocked
in Ta-
is the
of the
cycles
of
of cycles
(i.e.
latency
instructions
In stage
(section
number
instruction
two
(cycle)
issue
1
I
kind
as listed
execution
issue
latency
Other
So, the
one cycle.
the
data
issue
simulation
means
Since
on the
had latencies
another
issued.
required
tion
and
before
be
in our
finishes
stages),
of
pipeline
is dependent
latency
latency
add fsubt ract
aged.
number
of
logical
is obvious
The
IIatencies
Integer
I
of multithreaded
pipeline
issue/result
ALU
Barrel
Instruction
and
functional
I Dz I S~EX]
]EX2 I
------ , -----,------,------,-.
,!!!
!
. . . . . . . . ..- - J.-----L-.
.....
lIFzID1lWIS
(a)
unit
-,.-.---,
. . . . . . . . . . . . . . . . ...!
Z21 IF, I
1:
instructions
is
a thread
these
in
ins-
in the access require-
they
and
frame
outstand-
are performed.
standby
of instructions.
stations
This
enmakes
signed
to the
before
the
The
mode.
The
mode,
struction
selection
instruction
schedule
unit
(A,
and
B,
policy
C
in
denote
in-
the
hand,
other
slots.)
it
which
are
tures.
mechanism
Before
the
thread
should
wait
until
instructions
not
performed
source
Therefore,
could
because
also
the
thread
buffer
followed
are
by
struction
address
ensures
imprecise
fault.
floating-point
save
switching
with
starvation
other
because
thrashing
by
the
in-
incorrect
applied
must
choices
problems.
Instruction
In this
section,
to
trap.
can
We will
crest e
report
by the
lection
policy
selection
policy
instruction
Figure
each
thread
other
struction
the
Unless
instructions
priorities
are
assigned,
unit.
rotated.
In
of
the
priority
an issued
from
are
schedule
priorities
example
A unique
other
them.
enlarge
it
order
The
instruction
quires
is selected
to
with
avoid
lowest
to
the
doall
would
order
ion
be
to
we
provide
queue-structured
the
also
usual
at
the
the
case,
end
of
sets
own
next
of other
thread
unique
logical
corresponding
thread,
of
to
therefore,
loops,
between
can
unneceswith
through
queue
reOne
memory.
overhead,
registers.
connected
one
of loops
processors.
communication
with
registers
is
type
logical
communication
the
it
communicate
of doacross
processor
number
its
it executes.
type
reduce
same
registers
Each
the execution
are
The
through
in-
connection
queue
the trade-offs
starvation,
priority
in
It
processors
communicant
is occupied
In the
by setting
iterations
of
ab-
They
to functional
units.
to which
by
But
solution
the
conflicts
slots
case
another.
Simple
processor.
logical
of
due to data
counters
to special
the
2.2
execution
processor
generates
identifiers
for
processor
respectively.
in section
switch
to program
which
logical
parallel
processors
activating
logical
to log-
case of a physical
compiler.
and
each
Loop
on our
is inserted
pro cessor
is assigned
thread
and
iterations
a loop.
slots
se-
policy
from
instruction
address
But
an
priorities.
slot.
to
2.3.2
is available
presented
a context
by the
instruction
In
unit.
mode
an-
pitch.
4 illustrates
with
so as not
loops
Single
kth
supporting
our
in
instruction
schedule
is preferable,
pipeline
multi-level
higher
instruction
the
and a physical
as logical
a
in the
instruction
iteration
sary
in the
parallel
architec-
in section
loop
slots,
threads
A fast-fork
Strategy
dynamic
compiler
in
= O, 1,. . .) iteration
mode,
of threads
to
the
the
other
assigning
example,
kth(i
a change-priority
future.
we present
priority.
supports
parallelize
using
of
by
facilities
In this
entirely
be con-
architecture
Scheduling
is to
parallelize
explicit-rotation
of the
be informed
2.2
other
n thread
sence is suppressed
page
absence
traps
The
are discussed
For
with
a loop.
also
on
be
by a data
such
executed
of a single
the ni+
The
is one
each
cache/memory
in the
re-
mechanism
can
is to aid
machine,
processors.
executes
handling.
is triggered
including
paper
this
execution
processor
When
access
interrupts
to
concerning
care,
and
solution
This
restartable
similar
conditions
sidered
register.
but
Parallel
ical
re-executed,
indicated
Overview
of threads
access re-
the
and
the highest
architecture
Execution
2.3.1
multithreaded
context.
in
again
instruction
exception
Context
Detailed
instructions
decoded
Parallel
execution
memory
of the
is the
architecture.
in the
outstanding
the
Mechanisms
our
is
this
the restarting
as a piece
is resumed,
quirement
being
But
such
saved
and
in
with
is exe-
respectively.
2.3
load/store
instruction
instructions
be flushed.
quests
are
swit thing
rota-
instruction
One
features
The
the
processor
but
latency,
by software.
our
4.
mode,
threads
to
num-
Figure
complex.
logical
instructions
in short
load/store
is possible
the
A load/store
of context
the
pipeline
all issued
somewhat
out,
are performed.
always
main
of programs
is switched
difficult
These
2.3.3,
the restart
is possible.
in
of
code
implicit-
at a given
as shown
why
of two
mode
the
explicit-rotation
mode.
the
one
alternative
occurs
processor
reasons
in
In
rotation
explicit-rotation
schedule
when
priority
explicit-rotation
the
a change-priority
logical
are two
to
is controlled
when
on the
works
and
interval),
in the
of priority
is done
to
thread
priority
On
the
Dynamic
the highest
instruction.
rotation
There
4:
unit
is switched
of cycles(
cuted
had
mode
privileged
ber
rot ation
Figure
schedule
mode
a
rot ation
tion
which
implicit-rotation
through
—.
‘lime
slot
instruction
modes:
I
thread
rotation.
is as-
simple
140
and
topology
registers
between
effective
among
is important
hardware
connection
logical
when
processors
considering
cost and its effect.
One
topology
net-
is a ring
Logical
Prccessor
. .. .. .. ... .. .... ---write
:
read
wn”te
:
Logical
Processor
~------------------,
:
Queue \ ‘cad
Register : 0
~
;0
:1
compiler
Logical
Processor
. .. .. .. ... .. .. .. .. ..
wn”te
\
;
Queue ~ ‘eat
Register : 0
:1
!3
;3
:
:.
:3
#*F+~
;R:giyxstj
:
:.
For
~
I
i
tures,
!3
source
.:
: 31 “
:R:gis.sxs?q
;Regi~~y77s:~j
-
code
architecture.
But
which
use of the
to issue
instructions.
algorithm
Our
work,
as shown
loops’
encountered
ence.
When
ence,
data
cessors.
should
however,
available
Queue
registers
instructions.
ters,
for
next
logical
to
are
one iteration
several
convert
loops
differ-
logical
Some
to one.
reading
to
from
so that
pro-
the
previous
it-
tween
queue
ters.
Full/empty
reducing
registers
and
bits
into
registers.
data
The
as the
move
as scoreb oar ding
to
instructions
have
This
means
tion.
The
know
two
the
As
Static
Code
mentioned
graphics
in
overheads
it
2.3.3
control
sequence
cause
the sequence
dependent
high
registers
compiler
of other
threads,
time
Using
are
of a single
thread
because
whether
proach
the
scheduled
does
not
aims
at flooding
tions
by parallel
On
putation
the
each
thread.
means
a high
of
or not.
have
other
over
of such
hand,
the
behavior
in
is often
of
shortening
sta-
compiler
The
resource
to determine
if there
the compiler
station
to
section
shows
which
or VLIW
Figure
when
is executed.
Loop
Iter-
6.
This
Although
of
our
architec-
difficult
a sample
program
to paral-
of our
program
traverses
it is a simple
that
that
are
architectures.
fragment
application
example,
execution
is fundamental
to
it demon-
scheme
nany
is
a linear
using
a
application
= header;
while
( ptr
! = NULL
= a *
tmp
the
if(tmp
)
{
( (plm->point:
+ b
*
-> X:)
( (ptr->point
)-->y)
+ c;
<O)
break;
ptr
ap-
conflicts,
= ptr
->
next;
>
it
instrucFigure
code.
caee of numerical
a loop.
ptr
of instruc-
sufficient
scheduled
possible
is marked.
of Sequential
loops
code
structu~e”
time
scheduling
this
vector
the
is constant
resource
with
list.
consideration
rate
this
entry
programs.
algorithm[4].
processing
issuing
execution
it
no other
Short
in-
instruction
the
also to tell
standby
how-
those
in a standby
is selected.
Execution
source
linked
loop
by data-
instructions
units
using
in
be-
employ
on
Though
control
functional
programs,
execution-time
concentrates
in the
for
enables
only
pipeliner
scheduler,
the
the
from
When
at an issu-
table
mode
used
but
parallelize
shown
strates
to obtain
without
conflict,
an example,
can
the
in order
scheduling
code
number
computer
predict
execution,
could
list
the
and
for
cases,
compiler
reorders
of
to
dynamically
such
a simple
processing
tions
In
the
than
before
is determined
throughput,
case
thread
a
branches.
techniques
The
of
the
ation
is not
code
marked,
and
instruction
table
Eager
The
difficult
explicit-rot
below.
a softwiwe
is stored
corre-
differs
at ions
ture
1, in
issued
reserva-
entry
regis-
queue
is often
be
whose
standby
is not
function
units.
a resource
described
Our
instruction
which
be-
bits.
section
to
the
developed
of the
dependencies
code.
of
So,
c}pportunities
algorithm
conflicts,
to
issue
refer-
Scheduling
programs,
only
re-
conflict.
schedule
Our
of the
the
most
table
without
an entry
instruction
lelize
2.3.2
not
to avoid
we have
the
is an
architec-
i~pplicable
miss
instruction
resource
entries
is a resource
could
a standby
a NOP
If
a resource
in the point
all of the
checks
cause
station.
ing
to
reference
general/floating-point
attached
pipelining
reservation
to the
are mapped
is interpreted
speregis-
writing
software
is determined
is
by
queue
and
or floating-point
registers
disabled
two
respectively
registers,
available
and
to a standby
ever,
postpone
employs
also
generate
table
might
makes
and
sponds
st ruct ions.
unrolling
tech-
which
VLIW
is also
Consequently,
algorithm
cycle
reservation
it
which
but
6],
for
a technique
algorithm
stations
table
would
techniques,
Loop
enabled,
the
processor
such
queue
enabled
When
general-purpose
to
than
of
iteration-differ-
overhead.
is reduced
type
&e
technique[3].
cial
ence
doacross
have
through
create
tion
registers
Many
has more
could
are
5.
be relayed
difference
one such
Figure
of queue
in p~actice
a loop
This
eration
in
view
scheduling
pipelining[5,
would
straight
of standby
5: Logical
code
technique
a resource
Such
a new
Queue Register
software
conflicts.
instructions
~~
Figure
judicious
scheduling
employs
our
:3
more
example,
effective
:1
:
employs
niques.
predict
Therefore,
6: A sample
program
(written
in C)
comthe
Each
the
the value
141
iteration
of ptr
is executed
is passed
on a logical
through
queue
processor,
and
registers.
The
Estimation
3
Iteration
3.1
ptr
. . . .. ..
ptr’
. . .. . . . .
In the following
II ....
ptr”
........
sults
ptr’”
1 ‘“”’”--”?
‘ pi
kill
of our
cache
vtr””
4
7: Parallel
execution
of while
(see
used
two
Figure
2).
The
compiler
generates
wait
for
can
preceding
mization
with
in
trates
the
parallel
ceives
the value
immediately
the
execute
the
A thread
have
call
been
this
The
scheme.
rest
of the
can
initiate
executed
plays
that,
loop
of several
banks
two
iteration
eager
who
point.
selection
which
in instruction
in preserving
program.
the
By using
policy
does
The
until
thread
earliest
the
gets
stop
for the thread
without
the
struction,
the
with
the highest
priority
highest
should
priority.
This
location
of tmp
ensure
and
tmp
is performed
the source
code.
the
priority.
tries
at
compiled
by
state
translated
to
terval
use
only
and
the
special
In
only
this
thread
that
writes
in order
instructions
between
execution
the
contexts
of the
with
to
of two
it
the
loop,
ware
is
the
with
it
ment
slots
highest
respect
the
times
in
mechreg-
by
use
the
as the
pro-
multithreaded
exe-
Sequential
ex-
version
whose
of
instruction
3(b).
Multithread-
to
unit,
to
up
attained
(1.83).
142
our
compiler
with
with
one
9970.
is saturated
four
all
thread
load/store
from
unit
execution
of the
is the
reason
improves
in
of a factor
the
to
though
improve-
two
thread
are provided,
unit,
3.22
hard-
duplicated
slots,
one
slots
functional
at only
unit,
than
thread
in-
load/store
improvement
increasing
This
rotation
cycles.
is not
(2.89/1.83=1.58)
eight
on
were
is eight
by parallel
although
A performance
by
the
unit
processor
possible
an op-
executed
sequences
when
is gained
streams,
with
was
par-
in C was
simulator.
ratios
a processor
speed-up
written
code
schedule
of the busiest
10.4w79.8%.
loop.
for
multiwe have
can be easily
instruction
instruction
When
becomes
ratio
another
before
traced
instruction
is still
object
of our
program,
program
RISC
The
speed-up
of
results
which
The
of a single-threaded
utilization
memory
of threads
we
is defined
execution.
program,
is less effective
instruction
are also provided
It
Parallel
pixel.
to be used
pro cessor.
the
the
branch
architecture,
simulation
and
case
an 1.83
in-
until
after
store
by the
are performed
Some
alive
instruc-
or overlapped
processor
a commercial
2 lists
in
of 2.89
is global
assumed
simulated
delayed
As an application
option.
a workstation,
a thread
execute
presents
timization
other
is valid
When
to
consist
of each
of the sequential
based
machine.
at every
tries
and to kill
an interlocked
guarantees
consistency
after
into
slots
instruction
another
simultaneously
machines
required
by
a ray-tracing
priority.
variable
compiler
which
thread
a special
highest
it is put
If the
the
of other
Such
with
might
facilities,
in Figure
allelized
of preceding
threads.
units
cache
simulation
with
the execution
chosen
execution
the operation
given
section
threaded
Table
running
All
our
cycles
the
the
it
1.
equipped
Speed-up
This
instruction
loop,
our
by sequential
pri-
not
the
heterogeneous
ing
se-
of succeeding
from
configurations
seven
Latencies
as a criterion.
is shown
3.2
of an
execution
has exited
But
conflict.
on a RISC
schedthe
the
a thread
unit
of
to be accessed
estimate
of total
program
iterations.
When
caches
access
not
highest
the
to
ratio
means
guarantees
hinder
order
ecution
remaining
priority,
in our
to
a data
prediction
suc-
So, we
thread
the
iteration
multilevel
branch
portion
to
execution
with
implemented
of those
units.
are not
to those
might
execution.
consists
in Table
cution
execution.
is the
paper
consists
in order
listed
pipeline
part
It
this
In
body.
acknowledged
are
next
continues
policy
executes
iterations
thread
selection
is not
ority
the
been
attempts
In practice,
no bank
speed-up
in the
the
reSince
ist er windows.
re-
immediately
an important
priority.
that
the
tion
7 illus-
executed
yet
simulation
architecture.
of functional
One
load/store
wm
anisms,
opti-
thread
passes
an iteration
mant ics oft he sequential
highest
in
to
of the
Figure
and
in the sequential
instruction
ule units
executed
Such
Each
thread
iteration,
After
scheme
the
ptr
having
is a variation
machines.
from
a variable
complete.
dataflow
thread
iteration.
to
execution
of ptr
ceeding
for
without
to a variable
preceding
to
versions
be initiated
iterations
respect
use of variables
value
multiple
iterations
we present
that
kinds
other
unit.
there
so that
assumed
load/store
by
sections,
Assumptions
all hit.
We
loop
and
multithreading
has not
we
units.
Figure
four
simulation
were
other threads -
Model
parallel
simulator,
p.tr::’ ‘ ‘
i
Simulation
the
why
times.
speed-up
the
load/store
the
speed-
Addition
ratios
of
by
Table
2: Speed-up
ratio
by parallel
multitthreading
Table
3: Tradeoff
allelism
with
no.
of
thread
slots
one
with
unit
load/store
units
without
with
without
with
standby
standby
standby
standby
stat ion
stat ion
stat ion
station
-E
2
2.02
1.31
1.79
1.83
2.01
2.02
4
3.72
2.43
4
2.84
2.89
3.68
3.72
8
8
3.22
3.22
5.68
5.79
speed-up
ratio
stations
0N2.2%.
This
lelism
low
within
plication
We
the
improvement
an instruction
programs
parallelism,
is due
stream.
whose
thread
greater
improvement
also simulated
a machine
provided
with
tion
units.
fetch
speed-up.
those
improve
Such
organization
tively
replaced
delay
due
sharing
by
to
an
1.80
5.80.
fetch
instruction
are
and
This
same
between
the
thread
also
examined
intervals
application
execution
cycles,
program,
influence
or sixteen
the
(2n
on the
cycles
cycles
where
rotation
interval
performance,
seems
slightly
the
The
the
did
though
is
main
objective
with
ple instruction
similar
ing
the
brid
same
A
hybrid
tions
processor
where
which
number
hardware
that
half
resirlts.
Values
for
hard-
a simple
have
it is clear
that
D=
threading
is introduced
for
the instruction
others.
On the
the
hand,
is to
main
the
of obj ect ives
architectures
im-
of cost-effectiveness,
best
into
the
improve
difference
of the two
1 is the
a more
is to enhance
other
viewpoint
tablle
of D.
architecture
This
and
This
of S produces
increase
ta-
are com-
a (d, 1)-processor,
of multithreading
comparison
in the
processors
3(a)
performance.
eight
support
In this
3.2,
both
We
Effect
will
choice
if parallel
a processor
multi-
architecture.
compare
static
2.3.2.
Kernel
1.
Code
code
The
The
Scheduling
scheduling
sample
source
strategies
program
code
dis-
is the
is shown
Liv-
in Figure
8.
stream
us-
we discuss
and
of Static
in section
ermore
multi-
section,
coarse-
is represented
a thread
slot
issues
of thread
slots.
For
cost
is dxs
hy-
DO IK=l,
fine-
1
N
X(K)
= Q+Y(K)*(R*Z(K+IO)
+T*Z(K+II))
per
ports.
of instructions
and issues
cycle.
The
Each
instructions
instruction
and
is provided
thread
slot
instruction
without
window
tion.
as
4 lists
The
compiler,
by
proces-
Kernel
1 (written
with
our
and
code
execution
code
the
was
generated
schedulers.
one load/store
a
average
source
unit.
Strategy
and strategy
checks
the
scheduling
approach
with
It
cy-
143
a standby
was scheduled
machine
a simple
B represents
a resource
itera-
a commercial
A represents
approach,
and
code
fcm one
simulated
scheduling
window
by
object
The
the
every
in FORTRAN)
cycles
compiled
double
dependencies
is filled
8: Livermore
Table
the
instruction
in both
it requires
Figure
S is
as same
the
cycle
in the
(Zl,S)-
of instruc-
cycle,
For example,
words
by
comparison,
is almost
(d x 2,s/2)-processor
of register
of size D,
each
set of (d,s)-processor,
dependencies
here
number
a fair
of (d,s)-processor
Although
register
number
every
additional
for
the
of superscalar
from
instruction
D is the maximum
bandwidth
sors.
can
a single
employ
of (1 ,dx s-)-processor.
fetch
slot
as in section
which
the
parallelism.
processor,
the
from
objective
But,
Superscalar
architectures.
program
architectures
grained
thread
issuing
to superscalar
requires
increase
than
throughput.
t bread
cussed
each
the
possible.
in
Design
simulator,
boost
provided
superior.
Multithreading
our
that
speed-up
makes
not
—
to
are not
We used
3(b)
in Figure
various
using
—
proposed
units.
in Figure
shown
machine
3.4
3.3
par-
2.79 ml
Simulated
functional
But,
with
n is 0w8).
they
ratios.
shown
pipeline
single
rotation
much
of eight
significant
and
slots
—
1.52 R
1
they
the simulation
demonstrates
as
that
3 lists
pipeline
is respec-
is hidden
but
because
are speed-up
posed
are
possible.
We
slot
]
4.37
[7, 8] are
the
Table
a slight
shows
conflicts
cache
employed
ware.
instruc-
the
5.79
techniques
thread
,
5.79
performance,
ble
slots
and
does provide
1.79
and
case of ap-
thread
ratios
that
instruction
paral-
Some
la
superscalar
in fine-grained
caches
speed-up
2, except
In the
is rich
whose
instruction
Achieved
in Table
to poor
by
can be achieved.
private
and
two
load/store
cle.
In
speed-up
ratio)
2
Standby
this
between
(speed-up
reservation
has
list
the list
tab [e
table.
Strategy
B
achieves
the
is
overall
performance
superior
to
improvement
other
strategies.
by 0N19.3%
Table
4: Comparison
of static
code
Table
scheduling
5:
loop
no.
optimization
of
thread
slots
1
1
strategy
strategy
optimized
A
B
50
no.
.
8.83
8.125
8
8
8
8
small
in
nearly
this
the
B should
The
These
8cycle)
code
Although
to
contains
in
for
three
load
one iteration.
This
the
architecture.
tion
proach
of perfor-
3.5
Effect
In section
tial
of Eager
2.3.3,
the
loop
iterations
simulation
results
gram
shown
in
one load/store
The
code
requires
average
one
iteration.
when
Figure
allied
An
and footers
performs
threads
number
at most
one.
multiple
instructions
source
but
in
the
of iterations
the
As
the
in
concept
lization
HEP[9],
MASA[1O],
utilization.
is
of
fetch
as
units
is
Prasadh’s
an
bandwidth
must
multifrom
but
is still
in our
architecture,
unless
there
are re-
a superscalar
threads
Prasadh
hybrid
and
to sustain
be considered.
four-threaded,
instruction
fetch
In
not
as
For
unit
times
bandwidth
uti-
the
instruction
multiple
functional
example,
although
eight-functional-unit
eight
our
cost-
architec-
Functional
but
present
same.
are
factor,
func-
WU[18]
the
3.3.
archi-
increase
non-superscalar
important
about
to
does
in section
sufficient
also
sor achieves
speed-up,
of thirty-two
procesit requires
words
an
per
cy-
cle.
when
number
of
5
be-
Concluding
Comments
times.
Works
of concurrent
propose
multiple
as a multithreaded
this
paper,
oriented
tiprocessor
a
The
Torng[17]
uses
as mentioned
tecture
Related
hand,
effective
ex-
parallel
each cycle
can be issued
slots,
loop
idea
to be issued,
instructions
architectures
in
ap-
This
instructions
other
which
In
4
each
architecture
ratio
ratio
of our
such
sequential
speed-up
unit
ture
rate.
two
a VLIW
thread
execution
is small.
and
which
tional
of ptr.
code,
one
other
threads
a multithreading
opinion,
number
overhead
the
combina-
other
take
for
an iteration
parallel
maximum
to 56/17=3.29
the
with
multi-
switches.
issuing
hy-
interleaving
context
basic
of issued
On
concurrent
however,
architecture,
control
The
conflicts.
Daddis
executing
speed-up
more
sequential
next
is still
to initiate
than
5 lists
version
of thread
thread
includes
than
Table
the
compete
the total
has
object
to the
dependency
better
increases,
closer
pro-
of the
parallel
increase
a thread
code
ecution
comes
sample
iteration.
number
the
inter-loop
parallel
number
reports
machine
version
of the
is available,
headers
the
of the
is passed
register,
data
by the
iterations
of sequen-
section
simulated
one
a small
iteration.
Since
for
of ptr
enables
limited
The
cycles
With
value
a preceding
as the
execution
sequential
56 cycles
thread
soon
scheme
This
to
sup-
also employs
enables
instruction
In their
tecture
of the
a queue
slots
execution
6.
execution
a new
through
eager
unit.
execution
the
Execution
is addressed.
of the
two
Pleszkun[16]
ex-
[14]
keeps
a block
architecture,
during
increase
threading.
such
or-
thread
access.
Our
on
multithreading
and
to
is closely
saturation.
machine[15]
working
Farrens
are
Neumann
In
and
a thread
memory
our
parallel
[13],
a remote
builds
In
with
of single
in which
of
streams.
APRIL
of interleaving.
scheme
to continue
and
cycles
reason
type
interleaving
instruction
performance
data-flow/von
latter
cycle-by-cycle
machines,
interleaving
achieves
instructions
poor
encounters
strategy
the program.
(3+1)x2=8
is the
are
it
17
multiple
these
block
threading
A
cost,
to optimize
so at least
ef-
brid
between
strategy
improve
ecution
B, however,
at a lower
in order
instruction,
required
differences
21.67
support
from
cycle
one iteration
1
processors
instructions
der
strategy
effectiveness
be employed
object
The
A and
program.
same
one store
mance
code.
of strategy
of sequential
32.5
4-
port
non-optimized
for
3
until
fectiveness
execution
slots
2
8
8
(unit:
over
execution
,
E=EEREl
6
7
8-
eager
of
thread
42
,
of
strategy
non-
11
Evaluation
iterations
multithreading
Horizon[l
1],
and
threaded,
P-RISC
tor
[12].
144
of
for
proposed
use
systems.
ray-tracing
is discussed
we
Through
program,
we
two-load/store-unit
2.02
speed-up
a
over
multithreaded
as a base
processor
the
simulation
ascertained
that
processor
achieves
a sequential
archiof
machine,
mulusing
a twoa facand
a four-threaded
speed-up.
processor
threaded
architecture
nique.
achieves
We also examined
This
effectiveness
the
combined
with
evaluation,
however,
is achieved
when
coarse-grained
parallelism
a factor
of
performance
3.72
superscalar
shows
and ignores
scalar
cost-
Intl.
employs
only
354,
May
B.J.
Smith,
fine-grained
plicable
for our
of software
for
architecture,
pipelining.
use
One
instance
of such
loop
execution
One
It
weak
point
grams.
parallelizes
We
with
of this
are
currently
and
the
is the
many
which
poor
other
of
[11]
Processing,
Symp.
Y. Mochizuki,
Architecture
ing,”
In
with
Proc.
Fukuoka,
A. Nishimura,
“A
of Intl.
finite
puting
Japan,
pp.
87-96,
[13]
Processor
Instruction
Symp,
R. M. Tomasulo,
ploiting
“An
R.S.
Multiple
November
Issu-
Nikhil
nal
of Research
pp.
25-33,
and
January
Units,”
Munshi
Loops
and
on
for
Report
5546,
June
vol.
[14]
Jour-
11, no.
“Scheduling
Processors,”
In
E.G.
1,
ing
Theory,
[5] R.F.
W.D.
SIGPLAN’84
pp.
48-57,
[6] M.
Lam,
June
“Software
Design
1988.
Murakami,
“SIMP
(Single
struction
Processor
Intl.
for
85, May
the
FPS-
of
the
ACM
Proc.
[16]
the
15th
MultiSymAnnual
pp.
Processor
of 1988
November
“Can
Lim,
443-
Archi-
Supercom-
1988.
Dataflow
In
Subsume
Proc.
of the
16th
Architecture,
D. Kranz
Proc.
Computer
and
pp.
and J. Kubiatow-
Architecture
of
the
17th
for
Mul-
Annual
Architecture,
A.
Multiple
Gupta,
pp.
Intl.
104-114,
“Explorirlg
Hardware
Architecture:
the
Contexts
in
Preliminary
Annual
pp.
Int!.
273-280,
An
Proc.
[17]
Effective
Machines,”
In
June
Ben-
a Multi-
Results,”
Symp.
In
on Computer
1989.
pp.
Symp.
on
In Proc.
Computer
Neumann
of the 15th
Architecture,
Annual
pp.
131-
1988.
Farrens
and
A.R.
Improved
of the 18th
Pleszkun,
Annual
Inti.
pp.
362-369,
G.E.
Daddis
Jr.
and
rent
Execution
Throughput
Symp.
May
H.C.
of Multiple
In
In
1991.
Torng,
Proc.
Processing,
for
,“
on Computer
“Tlhe
Instruction
Processors,”
on Parallel
‘(Strategies
Processor
Architecture,
Conf.
318-
a Dataflow/von
Architecture,”
Superscalar
on Programming
Implementation,
M.K.
‘(Toward
Iannucci,
Achieving
Construction,
VLIW
Conf.
and
pp.
Concur-
Streams
of -the 20th
1:76-83,
on
Inti.
August
1991.
N.
Irie,
M.
Instruction
Pipelining):
Architecture,”
Symp.
for
In
Compiler
of the SIGPLAN’88
June
Compiler
Pipelining:
Technique
Language
nual
on
“A
Proc.
Processor
of the 16th
R.A.
Int[.
Schedul-
1984.
328,
[7] K.
Fortran
Symp.
Scheduling
Proc.
and Job-Shop
Computer,”
Smith,
In
on Computer
In
Weber
of
140, June
“A
Scientific
Parallel
Technical
1976.
Touzeau,
164
A
for
Architecture,
Arvind,
A
Architecture,
Sequential
IBM
[15]
In Computer
on
1978.
“MASA:
of
MIMD
Conf.
1990.
efits
1987.
Coffman,
August
Proc.
35-41,
B.H.
on
Hybrid
[4]
Fujita,
Intl.
1989.
“APRIL:
processor
B. Simons,
344-
Ex-
1967.
Parallel
T.
B.J.
Symp.
A. Agarwal,
Proc.
[3] A.
Annual
pp.
Resource
1978
Computing?”
Intl.
Symp.
In IBM
Development,
and
tiprocessing,”
1991.
Algorithm
Arithmetic
6-8,
Computer
pp.
Nuemann
icz,
on Supercomputing,
Efficient
pp.
Horizon,”
Conf.,
June
[2]
17th
Y. Nakase
Multithreaded
Simultaneous
Shared
Architecture
and
for
262-272,
Nishizawa,
the
1988.
Thistle
Annual
T.
of
processor.
References
H. Hirata,
Horowitz,
in a Super-
Architecture,
of the
In
on
June
M.R.
von
and
and
Processor
tecture
[12]
[1]
M.A.
of
pro-
on evaluating
Proc.
Computing,”
Intl.
451,
variety
of the
Proc.
Pipelined,
In
Halstead
bolic
are
application
design
In
Computer
“A
threaded
as an eager
the effectiveness
working
detailed
[10] R.H.
scheme.
architectures.
confirm
and
Scheduling
1990.
Parallel
idea
possibilities
loops
other
paper
by using
effects
investigate
on
Computer,”
ap-
from
is presented
We should
architecture
is derived
Pro cessor ,“
Symp.
Lam
Static
par-
algorithm
multithreading
intention
scheme.
programs.
cache
also
parallel
to be parallelized
tested
our
of
scheduling
which
We
broadened
difficult
code
M.S.
Beyond
tech-
[9]
a static
Smith,
“Boosting
allelism.
We developed
M.D.
best
the
a processor
[8]
of multi-
Kuga
and
Stream
A Novel
In
on Computer
/
Multiple
High-Speed
Proc.
[18]
S. Tomita,
of the
Architecture,
In-
ture,”
Single
16th
R.G.
tion
Prasadh
In
Processing,
An-
pp. 78-
1989.
145
and
C. Wu,
of a Multi-Threaded
Proc.
pp.
of the
1:84-91,
“A
Benchmark
RISC
Processc)r
20th
Int[.
August
Conf.
1991.
EvaluaArchitecon Parallel
Download