IMPLEMENTATION OF VECLIW ARCHITECTURE FOR EXECUTING MULTI-SCALAR INSTRUCTIONS

K. Prathyusha¹, A. Madhu², K. Shashidhar³
¹ M.Tech, VLSI System Design branch, Dhruva Institute of Engineering and Technology
² Assistant Professor, Dept. of ECE, Dhruva Institute of Engineering and Technology
³ Associate Professor, Dept. of ECE, Dhruva Institute of Engineering and Technology

Abstract- This paper proposes a new processor architecture for accelerating data-parallel applications, based on the combination of the VLIW and vector processing paradigms. It uses a VLIW architecture for processing multiple independent scalar instructions concurrently on parallel execution units. Data-level parallelism is expressed by the vector ISA and processed on the same parallel execution units of the VLIW architecture. The proposed processor, which is called VECLIW (Vector Long Instruction Word), has a unified register file of 64×32-bit registers in the decode stage for storing scalar/vector data. VECLIW can issue up to six scalar operations in each cycle for parallel processing of a set of operands, producing up to six results. However, it cannot issue more than one memory instruction, which loads/stores 192-bit scalar data from/to the data cache. Six 32-bit results can be written back into the VECLIW register file.

Keywords- VLIW architecture; vector processing; data-level parallelism; unified datapath.

I. INTRODUCTION
One of the most important methods for achieving high performance is taking advantage of parallelism. The simplest way to take advantage of parallelism is through pipelining, which overlaps instruction execution to reduce the total time needed to complete an instruction sequence. Essentially all processors since about 1985 have used the pipelining technique to improve performance by exploiting instruction-level parallelism (ILP). Instructions can be processed in parallel because not every instruction depends on its immediate predecessor. After eliminating data and control stalls, pipelining can achieve an ideal performance of one clock cycle per operation (CPO).
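The speedup that pipelining provides can be illustrated with a small sketch (illustrative only; the stage and instruction counts are hypothetical, not measurements of VECLIW): a k-stage pipeline with no stalls completes n instructions in about n + k - 1 cycles instead of n × k.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Without pipelining, each instruction occupies all stages in turn.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # With pipelining, a new instruction enters every cycle once the
    # pipeline is full: n + (k - 1) cycles total (no stalls assumed).
    return n_instructions + n_stages - 1

# Example: 100 instructions on a five-stage pipeline, like the baseline
# scalar processor referenced later in the paper.
n, k = 100, 5
print(unpipelined_cycles(n, k))  # 500
print(pipelined_cycles(n, k))    # 104
```

As n grows, the cycles-per-operation figure approaches the ideal CPO of one discussed above.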
To further improve performance, the CPO must be decreased to less than one. Obviously, the CPO cannot be reduced below one if the issue width is only one operation per clock cycle. Therefore, multiple-issue scalar processors fetch multiple scalar instructions and allow multiple instructions to issue in a clock cycle. Vector processors, however, fetch a single vector instruction (υ operations) and issue multiple operations per clock cycle. Statically/dynamically scheduled superscalar processors issue a varying number of operations per clock cycle and use in-order/out-of-order execution. Very Long Instruction Word (VLIW) processors, in contrast, issue a fixed number of operations, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among the independent operations explicitly indicated by the instruction.

VLIW architectures are characterized by instructions that each specify several independent operations. Thus, a VLIW instruction is not a CISC instruction, which typically specifies several dependent operations. VLIW instructions are like RISC instructions, except that they are longer in order to allow them to specify multiple independent scalar operations; a VLIW instruction can be thought of as several RISC instructions packed together, where a RISC instruction typically specifies one operation. The explicit encoding of multiple operations into a VLIW instruction leads to dramatically reduced hardware complexity compared to a superscalar, though VLIW instructions require more compiler support. Thus, the main advantage of VLIW over superscalar implementations of traditional scalar instruction sets is that the highly parallel implementation is much simpler and cheaper to build than equivalently concurrent RISC or CISC chips.
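The idea of a VLIW instruction as several independent RISC-like operations packed together can be sketched as follows (a toy encoding invented for illustration, not the actual VECLIW format):

```python
# Toy model: a "VLIW word" is a tuple of independent operations, each
# (opcode, dest, src1, src2). All slots issue in the same cycle.
vliw_word = (
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r5", "r6"),
    ("mul", "r7", "r8", "r9"),
)

def issue(word, regs):
    # Read all sources before writing any result: the slots are
    # independent, so no slot may consume a value produced by another
    # slot of the same word.
    ops = {"add": lambda a, b: a + b,
           "sub": lambda a, b: a - b,
           "mul": lambda a, b: a * b}
    results = {d: ops[op](regs[s1], regs[s2]) for op, d, s1, s2 in word}
    regs.update(results)

regs = {f"r{i}": i for i in range(10)}
issue(vliw_word, regs)
print(regs["r1"], regs["r4"], regs["r7"])  # 5 -1 72
```

Discovering that the three slots are independent is the compiler's job at build time, whereas a superscalar must discover the same fact in hardware at run time.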
Both VLIW and superscalar implementations share multiple execution units and the ability to execute multiple operations simultaneously. However, the parallelism is explicit in a VLIW instruction, while it must be discovered by hardware at run time in superscalar processors. Thus, for high performance, VLIW processors are simpler and cheaper than superscalars because of these further hardware simplifications.

Building on multiple execution units, this paper proposes a new processor architecture for accelerating data-parallel applications through the combination of the VLIW and vector processing paradigms. It is based on a VLIW architecture for processing multiple scalar instructions concurrently. Moreover, data-level parallelism (DLP) is expressed efficiently using vector instructions and processed on the same parallel execution units of the VLIW architecture. Thus, the proposed processor, VECLIW, exploits ILP using VLIW instructions and DLP using vector instructions. The use of a vector instruction set architecture (ISA) leads to expressing programs in a more concise and efficient way (at a higher semantic level), encoding parallelism explicitly in each instruction, and using simple design techniques (heavy pipelining and functional unit replication) that achieve high performance at low cost.

Related work has added vector capabilities to VLIW cores. Salami and Valero proposed and evaluated adding vector capabilities, called μSIMD-VLIW, to a VLIW core to speed up the execution of the DLP regions while reducing the fetch bandwidth requirements. Wada et al. introduced a VLIW vector media coprocessor, the "vector coprocessor (VCP)," that includes three asymmetric execution pipelines with cascaded SIMD vector ALUs. To improve performance efficiency, they reduced the area ratio of the control circuit while increasing that of the arithmetic circuit. This paper combines the VLIW and vector processing paradigms to accelerate data-parallel applications.
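The contrast between issuing scalar operations one by one and expressing DLP with a single vector instruction can be sketched as follows (toy semantics, not the actual VECLIW vector ISA; the 8-element vector length and six lanes follow the register file and issue width described later in the paper):

```python
# Toy vector-add: one "instruction" applies the same operation to every
# element of two 8-element vector registers.
def vadd(v1, v2):
    return [(a + b) & 0xFFFFFFFF for a, b in zip(v1, v2)]

def cycles_for_vector_op(vlen, lanes):
    # With `lanes` parallel execution units, a vlen-element vector
    # operation needs ceil(vlen / lanes) issue cycles.
    return -(-vlen // lanes)

print(vadd([1] * 8, [2] * 8))      # [3, 3, 3, 3, 3, 3, 3, 3]
print(cycles_for_vector_op(8, 6))  # 2
```

A single fetched vector instruction thus keeps all execution lanes busy for several cycles, which is the fetch-bandwidth saving the μSIMD-VLIW work above targets.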
On a unified parallel datapath, our proposed VECLIW processes multiple scalar instructions packed in a VLIW, as well as vector instructions, by issuing up to six scalar/vector operations in each cycle. However, it cannot issue more than one memory operation at a time, which loads/stores 192-bit scalar/vector data from/to the data cache. Six 32-bit results can be written back into the VECLIW register file. Vector processors remain the most effective way to exploit data-parallel applications, and many vector architectures have been proposed in the literature to accelerate them.

Figure 1: Pipelining

II. THE ARCHITECTURE OF VECLIW PROCESSOR
VECLIW is a load-store architecture with simple hardware, fixed-length instruction encoding, and a simple code-generation model. It supports few addressing modes to specify operands: register, immediate, and displacement addressing modes. In the displacement addressing mode, a constant offset is sign-extended and added to a scalar register to form the memory address for loading/storing 192-bit data. VECLIW has a simple and easy-to-pipeline ISA, which supports the general categories of operations (data transfer, arithmetic, logical, and control) in three instruction formats (R-format, I-format, and J-format). The first 32-bit instruction ([31:0]) can be a scalar, vector, or control instruction; however, the remaining 32-bit instructions ([63:32], [95:64], [127:96], [159:128], and [191:160]) must all be scalar. This simplifies the implementation of VECLIW and does not affect performance. Moreover, control instructions encode only one operation. In this paper, a subset of the VECLIW ISA is used to build a simple and easy-to-explain version of the 32-bit VECLIW architecture. Figure 3 shows the block diagram of our proposed VECLIW processor, which has a common datapath for executing VLIW/vector instructions. The instruction cache stores the 192-bit VECLIW instructions of an application.
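The 192-bit instruction layout described above ([31:0] for the first slot through [191:160] for the sixth) can be sketched as plain bit-field extraction (field positions follow the text; the slot contents here are arbitrary test values):

```python
def split_vecliw(word192):
    # Extract the six 32-bit slots: slot 0 is bits [31:0], slot 1 is
    # [63:32], ..., slot 5 is [191:160]. Only slot 0 may hold a
    # vector/control instruction; slots 1-5 must be scalar.
    return [(word192 >> (32 * i)) & 0xFFFFFFFF for i in range(6)]

# Pack six arbitrary 32-bit values into one 192-bit word, then recover.
slots_in = [0x11111111, 0x22222222, 0x33333333,
            0x44444444, 0x55555555, 0x66666666]
word = 0
for i, s in enumerate(slots_in):
    word |= s << (32 * i)

assert split_vecliw(word) == slots_in
```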
The data cache loads/stores the scalar/vector data needed for processing scalar/vector instructions. A single register file is used for both multi-scalar and vector elements. The control unit feeds the parallel execution units with the required operands (scalar/vector elements), and up to six results can be produced each clock cycle. Scalar/vector loads/stores take place from/to the data cache of VECLIW at a rate of 192 bits (six elements: 6×32-bit) per clock cycle. Finally, the writeback stage writes up to 6×32-bit results per clock cycle, coming from the memory system or from the execution units, into the VECLIW register file. The use of unified hardware for processing multi-scalar/vector data makes for efficient exploitation of resources even when the percentage of DLP is low.

Figure 2: VECLIW ISA formats

VECLIW uses fixed-length encoding for scalar/vector instructions. All VECLIW instructions are 6×32 bits, i.e., 192 bits ([191:0]), which simplifies instruction decoding. Figure 2 shows the VECLIW ISA formats.

Figure 3: VECLIW datapath for executing multi-scalar/vector instructions

Compared with the baseline scalar processor (five-stage pipeline), the complexity of the decode, execute, and writeback stages of VECLIW is about six times higher. In more detail, VECLIW has a modified five-stage pipeline for executing multi-scalar/vector instructions by: (1) fetching the 192-bit instruction, (2) decoding and reading the operands of the six individual instructions, (3) executing six scalar/vector operations, (4) accessing memory to load/store 192-bit data, and (5) writing back up to six results. The VECLIW instruction pointed to by the PC is read from the instruction cache in the fetch stage and stored in the instruction fetch/decode (IF/ID) pipeline register. The control unit in the decode stage reads the fetched instruction from the IF/ID pipeline register to generate the proper control signals needed for processing multiple scalar or vector data.
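The five stages enumerated above can be sketched as a simple in-order model (a toy simulator for illustration; no stalls, hazards, or the vector-iteration mechanism are modeled):

```python
# Toy in-order model of the modified five-stage pipeline: each cycle,
# every VLIW bundle advances one stage, and the bundle reaching the
# writeback stage retires at the end of that cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def run(bundles):
    pipe = [None] * len(STAGES)  # pipe[i] = bundle currently in stage i
    to_issue = list(bundles)
    completed, cycles = [], 0
    while to_issue or any(stage is not None for stage in pipe):
        cycles += 1
        # Shift every bundle one stage and fetch a new bundle into IF.
        pipe = [to_issue.pop(0) if to_issue else None] + pipe[:-1]
        if pipe[-1] is not None:  # writeback completes this cycle
            completed.append(pipe[-1])
            pipe[-1] = None
    return completed, cycles

done, cycles = run(["vliw0", "vliw1", "vliw2"])
print(done, cycles)  # ['vliw0', 'vliw1', 'vliw2'] 7
```

Three bundles complete in 3 + 5 - 1 = 7 cycles, matching the ideal pipelined timing from the introduction.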
The register file of VECLIW has eight banks (B0 to B7) of eight elements each (B0.0 to B0.7, B1.0 to B1.7, …, and B7.0 to B7.7). Scalar/vector data are accessed from the VECLIW register file using a 3-bit bank number (Bn) concatenated with a 3-bit start index (Si). Each clock cycle, 2×6×32-bit operands can be read from, and 6×32-bit results can be written to, the VECLIW register file. Thus, the control unit reads the Si.Bn fields of the RS (register source), RT (register target), and RD (register destination) of each individual instruction in the fetched VLIW, as well as the VLR (vector length register), to generate the sequence of control signals needed for reading/writing multi-scalar/vector data from/to the VECLIW register file. The VECLIW register file can be seen as 64 32-bit scalar registers or as 8×8×32-bit vector registers (eight 8-element vector registers).

Moreover, the 14-bit immediate values ([13:0], [45:32], [77:64], [109:96], [141:128], and [173:160]) of the I-format are signed/unsigned extended into 6×32-bit immediate values (ImmVal1, ImmVal2, ImmVal3, ImmVal4, ImmVal5, and ImmVal6), which are stored in the ID/EX pipeline register for processing in the execute stage. In addition to decoding the individual instructions and accessing the VECLIW register file, the RS.Si, RT.Si, RD.Si, and ImmVal values are loaded into counters called RScounter, RTcounter, RDcounter, and ImmCounter, respectively. For vector instructions, the control unit stalls the fetch stage and iterates the process of reading vector elements, incrementing RScounter, RTcounter, and RDcounter by six and the immediate value (ImmCounter) by 16, and calculating the destination registers. After issuing every operation of the vector instruction, it is removed from the IF/ID pipeline register and a new instruction is fetched from the instruction cache.
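The register-file addressing described above (a 3-bit bank number Bn concatenated with a 3-bit start index Si selecting one of 64 registers) and the 14-bit immediate extension can be sketched as follows (the exact concatenation order of Bn and Si is our assumption; the paper does not spell it out):

```python
def reg_index(bn, si):
    # 3-bit bank number concatenated with 3-bit start index selects one
    # of the 64 32-bit registers (8 banks x 8 elements).
    assert 0 <= bn < 8 and 0 <= si < 8
    return (bn << 3) | si

def sign_extend_14(imm14):
    # Sign-extend a 14-bit I-format immediate to a 32-bit value.
    imm14 &= 0x3FFF
    if imm14 & 0x2000:          # sign bit of the 14-bit field
        imm14 -= 1 << 14
    return imm14 & 0xFFFFFFFF

print(reg_index(7, 7))              # 63 (last register, B7.7)
print(hex(sign_extend_14(0x3FFF)))  # 0xffffffff (i.e., -1)
```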
The six individual instructions packed in a VECLIW instruction are decoded, and their operands (RsVal1/RtVal1, RsVal2/RtVal2, RsVal3/RtVal3, RsVal4/RtVal4, RsVal5/RtVal5, and RsVal6/RtVal6) are read from the unified register file according to the six pairs of RS/RT fields (RS1/RT1, RS2/RT2, RS3/RT3, RS4/RT4, RS5/RT5, and RS6/RT6), respectively.

The execute units of VECLIW operate on the operands prepared in the decode stage and perform the operations specified by the control unit, which depend on the opcode1/function1, opcode2/function2, opcode3/function3, opcode4/function4, opcode5/function5, and opcode6/function6 fields of the individual instructions in VECLIW. For load/store instructions, the first execute unit adds RsVal1 and ImmVal1 to form the effective address. For register-register instructions, the execute units perform the operations specified by the control unit on the operands fed from the register file (RsVal1/RtVal1 through RsVal6/RtVal6) through the ID/EX pipeline register. For register-immediate instructions, the execute units perform the operations on the source values (RsVal1 through RsVal6) and the extended immediate values (ImmVal1 through ImmVal6). In all cases, the results of the execute units are placed in the EX/MEM pipeline register.

The VECLIW registers can be loaded/stored individually using load/store instructions. The displacement addressing mode is used for calculating the effective address by adding the sign-extended immediate value (ImmVal1) to the RS register value (RsVal1) of the first individual instruction in VECLIW. In addition, the ImmVal1 register is incremented by 16 to prepare the address of the next 6×32-bit element group of the vector data. In the first implementation of our proposed VECLIW processor, six elements (192 bits) can be loaded/stored per clock cycle.

Finally, the writeback stage of VECLIW stores the 6×32-bit results coming from the memory system or from the execution units into the VECLIW register file. Depending on the effective opcode of each individual instruction in VECLIW, the register destination field is specified by either RT or RD. The 6×Wr2Reg control signals are used for enabling the writing of the 6×32-bit results into the VECLIW register file.
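The effective-address calculation above (RsVal1 plus the sign-extended ImmVal1, with ImmVal1 then stepped by 16 for successive vector accesses) can be sketched as follows (the step of 16 follows the text; the base address and offset are arbitrary example values):

```python
def effective_addresses(rs_val1, imm_val1, n_accesses):
    # The first execute unit forms base + offset; for vector data the
    # immediate register is then incremented by 16 per access, as the
    # paper describes, to address successive 6x32-bit element groups.
    addrs = []
    for _ in range(n_accesses):
        addrs.append((rs_val1 + imm_val1) & 0xFFFFFFFF)
        imm_val1 += 16
    return addrs

print(effective_addresses(0x1000, 8, 4))  # [4104, 4120, 4136, 4152]
```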
III. SYNTHESIS REPORT OF 192-BIT INSTRUCTION

i. Slice Logic Utilization:
Number of Slice Registers: 3379 out of 69120; utilization is 4%.
Number of Slice LUTs: 11977 out of 69120; utilization is 17%.
Number used as Logic: 11593 out of 69120; utilization is 16%.
Number used as Memory: 384 out of 17920; utilization is 2%.
Number used as RAM: 384.

ii. Slice Logic Distribution:
Number of LUT-Flip Flop pairs used: 14311.
Number with an unused Flip Flop: 10932 out of 14311; utilization is 76%.
Number with an unused LUT: 2334 out of 14311; utilization is 16%.
Number of fully used LUT-FF pairs: 1045 out of 14311; utilization is 7%.
Number of unique control sets: 69.

iii. IO Utilization:
Number of IOs: 386.
Number of bonded IOBs: 386 out of 640; utilization is 60%.

iv. Macro Statistics:
# Registers: 3358
Flip-Flops: 3358

IV. CONCLUSION
This paper proposes a new processor architecture called VECLIW for accelerating data-parallel applications. VECLIW executes multi-scalar and vector instructions on the same parallel execution datapath, using a modified five-stage pipeline for (1) fetching a 192-bit instruction (six individual instructions), (2) decoding and reading the operands of the six instructions packed in VECLIW, (3) executing six operations on parallel execution units, (4) loading/storing 192-bit (6×32-bit scalar) data from/to data memory, and (5) writing back 6×32-bit scalar results.

V. FUTURE WORK
In the future, the performance of our VECLIW will be evaluated on scientific and multimedia kernels/applications.

REFERENCES
[1] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, September 2011.
[2] J. Fisher, "VLIW Architecture and the ELI-512," Proc. 10th International Symposium on Computer Architecture, Stockholm, Sweden, pp. 140-150, June 1983.
[3] Philips, Inc., An Introduction to Very-Long Instruction Word (VLIW) Computer Architecture, Philips Semiconductors, 1997.
[4] F. Quintana, R. Espasa, and M. Valero, "An ISA comparison between superscalar and vector processors," in VECPAR, vol. 1573, Springer-Verlag London, pp. 148-160, 1998.
[5] C. Kozyrakis and D. Patterson, "Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks," Proc. 35th International Symposium on Microarchitecture, Istanbul, Turkey, pp. 283-293, November 2002.
[6] J. Gebis, Low-Complexity Vector Microprocessor Extensions, Ph.D. Thesis, Massachusetts Institute of Technology, 2007.
[7] T. Wada, S. Ishiwata, K. Kimura, T. Miyamori, and M. Nakagawa, "A VLIW vector media coprocessor with cascaded SIMD ALUs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 9, pp. 1285-1296, 2009.
[8] E. Salami and M. Valero, "A vector-μSIMD-VLIW architecture for multimedia applications," Proc. IEEE International Conference on Parallel Processing (ICPP 2005), pp. 69-77, 2005.
[9] P. Yiannacouras, J. Steffan, and J. Rose, "Portable, flexible, and scalable soft vector processors," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 8, pp. 1429-1442, August 2012.

K. PRATHYUSHA is currently pursuing her M.Tech degree in Very Large Scale Integration (VLSI System Design) in the ECE department of Dhruva Institute of Engineering and Technology, from 2013 to 2015.

ALAKUNTLA MADHU is an Assistant Professor currently working in Dhruva Institute of Engineering and Technology.

K. SHASHIDHAR is currently working as an Associate Professor in the Dept. of ECE, Dhruva Institute of Engineering and Technology.