IMPLEMENTATION OF VECLIW ARCHITECTURE FOR EXECUTING
MULTI-SCALAR INSTRUCTIONS

K. Prathyusha 1, A. Madhu 2, K. Shashidhar 3

1 M.Tech, VLSI System Design, Dhruva Institute of Engineering and Technology
2 Assistant Professor, Dept. of ECE, Dhruva Institute of Engineering and Technology
3 Associate Professor, Dept. of ECE, Dhruva Institute of Engineering and Technology
Abstract- This paper proposes a new processor architecture for accelerating data-parallel applications, based on the combination of the VLIW and vector processing paradigms. It uses the VLIW architecture for processing multiple independent scalar instructions concurrently on parallel execution units. Data-level parallelism is expressed by a vector ISA and processed on the same parallel execution units of the VLIW architecture. The proposed processor, called VECLIW (Vector Long Instruction Word), has a unified register file of 64x32-bit registers in the decode stage for storing scalar/vector data. VECLIW can issue up to six scalar operations in each cycle for parallel processing a set of operands and producing up to six results. However, it cannot issue more than one memory operation at a time, which loads/stores 192-bit scalar data from/to the data cache. Six 32-bit results can be written back into the VECLIW register file.

Keywords- VLIW architecture; vector processing; data-level parallelism; unified datapath.

I. INTRODUCTION

One of the most important methods for achieving high performance is taking advantage of parallelism. The simplest way to take advantage of parallelism is through pipelining, which overlaps instruction execution to reduce the total time needed to complete an instruction sequence. Virtually all processors since about 1985 have used the pipelining technique to improve performance by exploiting instruction-level parallelism (ILP). Instructions can be processed in parallel because not every instruction depends on its immediate predecessor. After eliminating data and control stalls, pipelining can achieve an ideal performance of one clock cycle per operation (CPO). To further improve performance, the CPO would have to be decreased to less than one.
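The arithmetic behind this is simple: with no stalls, the ideal cycle count for a batch of independent operations is the operation count divided by the issue width. A minimal sketch (the function name and cycle model are ours, for illustration only):

```python
from math import ceil

def ideal_cycles(num_ops: int, issue_width: int) -> int:
    """Ideal cycle count for num_ops independent operations on a
    machine issuing issue_width operations per clock (no stalls)."""
    return ceil(num_ops / issue_width)

# A single-issue pipeline reaches at best CPO = 1.
assert ideal_cycles(600, 1) == 600
# A six-issue machine such as VECLIW can approach CPO = 1/6.
assert ideal_cycles(600, 6) == 100
```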
Obviously, the CPO cannot be reduced below one if the issue width is only one operation per clock cycle. Therefore, multiple-issue scalar processors fetch multiple scalar instructions and allow multiple instructions to issue in a clock cycle. Vector processors, however, fetch a single vector instruction (υ operations) and issue multiple operations per clock cycle. Statically/dynamically scheduled superscalar processors issue a varying number of operations per clock cycle and use in-order/out-of-order execution. Very Long Instruction Word (VLIW) processors, in contrast, issue a fixed number of operations, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among the independent operations explicitly indicated by the instruction.

VLIW architectures are characterized by instructions that each specify several independent operations. Thus, a VLIW instruction is not a CISC instruction, which typically specifies several dependent operations. VLIW instructions are like RISC instructions except that they are longer, to allow them to specify multiple independent scalar operations. A VLIW instruction can be thought of as several RISC instructions packed together, where a RISC instruction typically specifies one operation. The explicit encoding of multiple operations into a VLIW instruction leads to dramatically reduced hardware complexity compared to a superscalar. Thus, the main advantage of VLIW is that a highly parallel implementation is much simpler and cheaper to build than equivalently concurrent RISC or CISC chips.

VLIW and superscalar implementations of traditional scalar instruction sets share some characteristics: multiple execution units and the ability to execute multiple operations simultaneously. However, the parallelism is explicit in a VLIW instruction, whereas it must be discovered by hardware at run time in superscalar processors. Thus, for high performance, VLIW implementations are simpler and cheaper than superscalars because of further hardware simplifications. However, VLIW instructions require more compiler support.

This paper proposes a new processor architecture for accelerating data-parallel applications by combining the VLIW and vector processing paradigms on multiple execution units. It is based on the VLIW architecture for processing multiple scalar instructions concurrently. Moreover, data-level parallelism (DLP) is expressed efficiently using vector instructions and processed on the same parallel execution units of the VLIW architecture. Thus, the proposed processor, VECLIW, exploits ILP using VLIW instructions and DLP using vector instructions. The use of a vector Instruction Set Architecture (ISA) leads to expressing programs in a more concise and efficient way (high semantic content), encoding parallelism explicitly in each vector instruction, and using simple design techniques (heavy pipelining and functional unit replication) that achieve high performance at low cost.

Thus, vector processors remain the most effective way to exploit data-parallel applications. Therefore, many vector architectures have been proposed in the literature to accelerate data-parallel applications. To exploit VLIW and vector techniques, Salami and Valero proposed and evaluated adding vector capabilities to a μSIMD-VLIW core to speed up the execution of the DLP regions while reducing the fetch bandwidth requirements. Wada et al. introduced a VLIW vector media coprocessor, "vector coprocessor (VCP)," that included three asymmetric execution pipelines with cascaded SIMD ALUs. To improve performance efficiency, they reduced the area ratio of the control circuit while increasing the ratio of the arithmetic circuit.

This paper combines the VLIW and vector processing paradigms to accelerate data-parallel applications. On a unified parallel datapath, our proposed VECLIW processes multiple scalar instructions packed in a VLIW as well as vector instructions, by issuing up to six scalar/vector operations in each cycle. However, it cannot issue more than one memory operation at a time, which loads/stores 192-bit scalar/vector data from/to the data cache. Six 32-bit results can be written back into the VECLIW register file.

Figure 1: pipelining
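The compiler-side packing that makes this parallelism explicit can be sketched with a toy greedy packer that groups independent operations into fixed-width bundles. The tuple format and dependency test below are illustrative assumptions, not the VECLIW encoding:

```python
def pack_vliw(ops, width=6):
    """Greedily pack operations into VLIW bundles of at most `width`
    slots. Each op is (dest, src1, src2). An op may not join a bundle
    that already writes one of its sources (RAW) or its dest (WAW)."""
    bundles = [[]]
    for dest, src1, src2 in ops:
        written = {d for d, _, _ in bundles[-1]}
        if len(bundles[-1]) == width or {dest, src1, src2} & written:
            bundles.append([])  # dependency or full bundle: start a new one
        bundles[-1].append((dest, src1, src2))
    return bundles

ops = [("r1", "r2", "r3"),   # independent
       ("r4", "r5", "r6"),   # independent: shares the first bundle
       ("r7", "r1", "r4")]   # reads r1 and r4: forced into a new bundle
assert len(pack_vliw(ops)) == 2
```

A hardware superscalar would have to rediscover the same independence at run time; here it is fixed at pack time, which is exactly the simplification the text describes.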
II. THE ARCHITECTURE OF VECLIW PROCESSOR

VECLIW is a load-store architecture with simple hardware, fixed-length instruction encoding, and a simple code-generation model. It supports few addressing modes to specify operands: register, immediate, and displacement addressing modes. In the displacement addressing mode, a constant offset is sign-extended and added to a scalar register to form the memory address for loading/storing 192-bit data. VECLIW has a simple and easy-to-pipeline ISA, which supports the general categories of operations (data transfer, arithmetic, logical, and control).

Figure 2: VECLIW ISA Formats

VECLIW uses a fixed length for encoding scalar/vector instructions. All VECLIW instructions are 6x32 bits, i.e., 192 bits ([191:0]), which simplifies instruction decoding. Figure 2 shows the VECLIW instruction formats (R-format, I-format, and J-format). The first 32-bit instruction ([31:0]) can be a scalar/vector/control instruction. However, all the remaining 32-bit instructions ([63:32], [95:64], [127:96], [159:128], and [191:160]) must be scalar. This simplifies the implementation of VECLIW and does not affect the performance. However, control instructions encode only one operation. In this paper, a subset of the VECLIW ISA is used to build a simple and easy-to-explain version of the 32-bit VECLIW architecture.

Figure 3 shows the block diagram of our proposed VECLIW processor, which has a common datapath for executing VLIW/vector instructions. The instruction cache stores the 192-bit VECLIW instructions of an application. The data cache loads/stores the scalar/vector data needed for processing scalar/vector instructions. A single register file is used for both multi-scalar and vector elements. The control unit feeds the parallel execution units with the required operands (scalar/vector elements), and up to six results can be produced each clock cycle. Scalar/vector loads/stores take place from/to the data cache of VECLIW at a rate of 192 bits (six elements: 6x32 bits) per clock cycle. Finally, the writeback stage writes up to 6x32-bit results per clock cycle into the VECLIW register file, coming from the memory system or from the execution units. The use of unified hardware for processing multi-scalar/vector data makes for efficient exploitation of resources even when the percentage of DLP is low.
Figure 3: VECLIW datapath for executing multi-scalar/vector instructions
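The slot boundaries of the 192-bit instruction word ([31:0] through [191:160]) can be illustrated with plain shifts and masks. A minimal sketch, with helper names of our choosing:

```python
def split_vecliw(word: int):
    """Split a 192-bit VECLIW instruction into its six 32-bit
    individual instructions; slot 0 is bits [31:0]."""
    assert 0 <= word < 1 << 192
    return [(word >> (32 * i)) & 0xFFFFFFFF for i in range(6)]

def pack_vecliw(slots):
    """Inverse: combine six 32-bit slots into one 192-bit word."""
    assert len(slots) == 6
    word = 0
    for i, s in enumerate(slots):
        word |= (s & 0xFFFFFFFF) << (32 * i)
    return word

slots = [0x11111111, 0x22222222, 0x33333333,
         0x44444444, 0x55555555, 0x66666666]
assert split_vecliw(pack_vecliw(slots)) == slots
```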
Compared with the baseline scalar processor (a five-stage pipeline), the complexity of the decode, execute, and writeback stages of VECLIW is about six times higher. In more detail, VECLIW has a modified five-stage pipeline for executing multi-scalar/vector instructions by: (1) fetching a 192-bit instruction, (2) decoding or reading the operands of six individual instructions, (3) executing six scalar/vector operations, (4) accessing memory to load/store 192-bit data, and (5) writing back six results. The VECLIW instruction pointed to by the PC is read from the instruction cache in the fetch stage and stored in the instruction fetch/decode (IF/ID) pipeline register. The control unit in the decode stage reads the fetched instruction from the IF/ID pipeline register to generate the proper control signals needed for processing multiple scalar or vector data.
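The fill-and-drain behavior of such a five-stage flow can be sketched with a toy model that advances one bundle per clock through the pipeline registers (stage names from the text; the retire accounting is our illustrative simplification, with no stalls modeled):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def drain_cycles(bundles):
    """Cycles for all bundles to complete in a stall-free five-stage
    pipeline: each cycle every bundle advances one stage, and a
    bundle retires at the end of the cycle it spends in WB."""
    pipeline = [None] * len(STAGES)
    to_issue = list(bundles)
    completed, cycles = [], 0
    while len(completed) < len(list(bundles)):
        cycles += 1
        pipeline = [to_issue.pop(0) if to_issue else None] + pipeline[:-1]
        if pipeline[-1] is not None:     # reached WB this cycle
            completed.append(pipeline[-1])
    return cycles

# One bundle needs the full pipeline depth; after that, one
# bundle completes per cycle: n bundles take 5 + (n - 1) cycles.
assert drain_cycles([0]) == 5
assert drain_cycles(range(10)) == 14
```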
The register file of the VECLIW has eight banks (B0 to B7) of eight elements each (B0.0 to B0.7, B1.0 to B1.7, …, and B7.0 to B7.7). Scalar/vector data are accessed from the VECLIW register file using a 3-bit bank number (Bn) concatenated with a 3-bit start index (Si). 2x6x32-bit operands can be read and 6x32 bits can be written to the VECLIW register file each clock cycle. Thus, the control unit reads the Si.Bn fields of RS (register source), RT (register target), and RD (register destination) of each individual instruction in the fetched VLIW, as well as the VLR (vector length register), to generate the sequence of control signals needed for reading/writing multi-scalar/vector data from/to the VECLIW register file. The VECLIW register file can be seen as 64x32-bit scalar registers or as 8x8x32-bit vector registers (eight 8-element vector registers).

The six individual instructions packed in a VECLIW instruction are decoded and their operands are read from the unified register file (RsVal1/RtVal1, RsVal2/RtVal2, RsVal3/RtVal3, RsVal4/RtVal4, RsVal5/RtVal5, and RsVal6/RtVal6) according to the six pairs of RS/RT fields (RS1/RT1, RS2/RT2, RS3/RT3, RS4/RT4, RS5/RT5, and RS6/RT6), respectively. Moreover, the 14-bit immediate values ([13:0], [45:32], [77:64], [109:96], [141:128], and [173:160]) of the I-format are signed/unsigned extended into 6x32-bit immediate values (ImmVal1, ImmVal2, ImmVal3, ImmVal4, ImmVal5, and ImmVal6). These RsVal, RtVal, and ImmVal values are stored in the ID/EX pipeline register for processing in the execute stage. In addition to decoding the individual instructions and accessing the VECLIW register file, the RS.Si, RT.Si, RD.Si, and ImmVal values are loaded into counters called RScounter, RTcounter, RDcounter, and ImmCounter, respectively. For vector instructions, the control unit stalls the fetch stage and iterates the process of reading vector elements, incrementing RScounter, RTcounter, and RDcounter by six and the immediate value (ImmCounter) by 16, and calculating the destination registers. After issuing all the operations of a vector instruction, it is removed from the IF/ID pipeline register and a new instruction is fetched from the instruction cache.
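The bank/index addressing of the unified register file can be sketched as follows: the 3-bit Bn concatenated with the 3-bit Si selects one of the 64 scalar registers, and a bank can be read as an 8-element vector register. The flat backing array and the class/method names are our illustrative assumptions:

```python
class VECLIWRegFile:
    """Unified 64x32-bit register file viewed as 8 banks (B0..B7) of
    8 elements each, or as eight 8-element vector registers."""

    def __init__(self):
        self.regs = [0] * 64  # flat scalar view

    @staticmethod
    def addr(bn: int, si: int) -> int:
        assert 0 <= bn < 8 and 0 <= si < 8
        return (bn << 3) | si  # 3-bit Bn concatenated with 3-bit Si

    def read_scalar(self, bn: int, si: int) -> int:
        return self.regs[self.addr(bn, si)]

    def write_scalar(self, bn: int, si: int, value: int) -> None:
        self.regs[self.addr(bn, si)] = value & 0xFFFFFFFF

    def read_vector(self, bn: int):
        """Bank Bn seen as an 8-element vector register."""
        return [self.read_scalar(bn, si) for si in range(8)]

rf = VECLIWRegFile()
rf.write_scalar(2, 5, 0xABCD)        # element B2.5 == scalar register 21
assert rf.regs[21] == 0xABCD
assert rf.read_vector(2)[5] == 0xABCD
```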
The execute units of VECLIW operate on the operands prepared in the decode stage and perform the operations specified by the control unit, which depend on the opcode1/function1, opcode2/function2, opcode3/function3, opcode4/function4, opcode5/function5, and opcode6/function6 fields of the individual instructions in the VECLIW. For load/store instructions, the first execute unit adds RsVal1 and ImmVal1 to form the effective address. For register-register instructions, the execute units perform the operations specified by the control unit on the operands fed from the register file (RsVal1/RtVal1, RsVal2/RtVal2, RsVal3/RtVal3, RsVal4/RtVal4, RsVal5/RtVal5, and RsVal6/RtVal6) through the ID/EX pipeline register. For register-immediate instructions, the execute units perform the operations on the source values (RsVal1, RsVal2, RsVal3, RsVal4, RsVal5, and RsVal6) and the extended immediate values (ImmVal1, ImmVal2, ImmVal3, ImmVal4, ImmVal5, and ImmVal6). In all cases, the results of the execute units are placed in the EX/MEM pipeline register.

The VECLIW registers can be loaded/stored individually using load/store instructions. The displacement addressing mode is used for calculating the effective address by adding the sign-extended immediate value (ImmVal1) to the RS register value (RsVal1) of the first individual instruction in the VECLIW. In addition, the ImmVal1 register is incremented by 16 to prepare the address of the next 6x32-bit element group of the vector data. In the first implementation of our proposed VECLIW processor, six elements (192 bits) can be loaded/stored per clock cycle.

Finally, the writeback stage of VECLIW stores the 6x32-bit results coming from the memory system or from the execution units in the VECLIW register file. Depending on the effective opcode of each individual instruction in the VECLIW, the register destination field is specified by either RT or RD. The control signals 6xWr2Reg are used for enabling the writing of the 6x32-bit results into the VECLIW register file.

III. SYNTHESIS REPORT OF 192-BIT INSTRUCTION
i. Slice Logic Utilization:
Number of Slice Registers: 3379 out of 69120 (4% utilization).
Number of Slice LUTs: 11977 out of 69120 (17% utilization).
Number used as logic: 11593 out of 69120 (16% utilization).
Number used as memory: 384 out of 17920 (2% utilization).
Number used as RAM: 384.

ii. Slice Logic Distribution:
Number of LUT flip-flop pairs used: 14311.
Number with an unused flip-flop: 10932 out of 14311 (76%).
Number with an unused LUT: 2334 out of 14311 (16%).
Number of fully used LUT-FF pairs: 1045 out of 14311 (7%).
Number of unique control sets: 69.

iii. IO Utilization:
Number of IOs: 386.
Number of bonded IOBs: 386 out of 640 (60% utilization).

iv. Macro Statistics:
Registers: 3358.
Flip-flops: 3358.

IV. CONCLUSION

This paper proposes a new processor architecture called VECLIW for accelerating data-parallel applications. VECLIW executes multi-scalar and vector instructions on the same parallel execution datapath. VECLIW has a modified five-stage pipeline for (1) fetching a 192-bit instruction (six individual instructions), (2) decoding or reading the operands of the six instructions packed in the VECLIW, (3) executing six operations on parallel execution units, (4) loading/storing 192-bit (6x32-bit scalar) data from/to the data memory, and (5) writing back 6x32-bit scalar results.

V. FUTURE WORK

In the future, the performance of our VECLIW will be evaluated on scientific and multimedia kernels/applications.

REFERENCES

[1] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, September 2011.
[2] J. Fisher, "VLIW Architecture and the ELI-512," Proc. 10th International Symposium on Computer Architecture, Stockholm, Sweden, pp. 140-150, June 1983.
[3] Philips, Inc., An Introduction to Very-Long Instruction Word (VLIW) Computer Architecture, Philips Semiconductors, 1997.
[4] F. Quintana, R. Espasa, and M. Valero, "An ISA comparison between superscalar and vector processors," in VECPAR, vol. 1573, Springer-Verlag London, pp. 148-160, 1998.
[5] C. Kozyrakis and D. Patterson, "Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks," Proc. 35th International Symposium on Microarchitecture, Istanbul, Turkey, pp. 283-293, November 2002.
[6] J. Gebis, Low-complexity Vector Microprocessor Extensions, Ph.D. Thesis, Massachusetts Institute of Technology, 2007.
[7] T. Wada, S. Ishiwata, K. Kimura, T. Miyamori, and M. Nakagawa, "A VLIW vector media coprocessor with cascaded SIMD ALUs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 9, pp. 1285-1296, 2009.
[8] E. Salami and M. Valero, "A vector-μSIMD-VLIW architecture for multimedia applications," Proc. IEEE International Conference on Parallel Processing, ICPP-2005, pp. 69-77, 2005.
[9] P. Yiannacouras, J. Steffan, and J. Rose, "Portable, flexible, and scalable soft vector processors," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 8, pp. 1429-1442, August 2012.

K. PRATHYUSHA is currently pursuing her M.Tech degree in Very Large Scale Integration (VLSI System Design) in the ECE department of Dhruva Institute of Engineering and Technology, from 2013 to 2015.

ALAKUNTLA MADHU is currently working as an Assistant Professor at Dhruva Institute of Engineering and Technology.

K. SHASHIDHAR is currently working as an Associate Professor in the Dept. of ECE at Dhruva Institute of Engineering and Technology.