Embedded Computer Architecture
Exploiting ILP: VLIW architectures
TU/e 5KK73
Henk Corporaal

What are we talking about?
ILP = Instruction-Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel.
VLIW = Very Long Instruction Word architecture.
Instruction format example of a 5-issue VLIW:
  | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

Single-issue RISC vs VLIW
[Figure: the compiler takes a stream of single-operation instructions and packs the operations, with nops where needed, into long instructions. A RISC CPU executes 1 instruction = 1 operation per cycle; a 3-issue VLIW executes 1 instruction = up to 3 operations per cycle.]

Topics Overview
• How to speed up your processor?
  – What options do you have?
• Operation/Instruction-Level Parallelism
  – Limits on ILP
• VLIW
  – Examples
  – Clustering
• Code generation (2nd slide set)
• Hands-on

Speed-up: Pipelined Execution of Instructions
Simple 5-stage pipeline:
  IF: Instruction Fetch
  DC: Instruction Decode
  RF: Register Fetch
  EX: Execute instruction
  WB: Write result register
[Figure: successive instructions flowing through the IF, DC, RF, EX and WB stages, one stage per cycle.]
Purpose of pipelining:
• Reduce #gate_levels in the critical path
• Reduce CPI close to one (instead of a large number, as for a multicycle machine)
• More efficient hardware
Problems:
• Hazards cause pipeline stalls
  – Structural hazards: add more hardware
  – Control hazards, branch penalties: use branch prediction
  – Data hazards: bypassing required

Speed-up: Pipelined Execution of Instructions — Superpipelining
• Split one or more of the critical pipeline stages
• Superpipelining degree S:
  S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)
  where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op

Speed-up: Powerful Instructions (1)
MD-technique: multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Vector instruction example:
  for (i = 0; i < 64; i++)
    c[i] = a[i] + 5*b[i];
or c = a + 5*b, in assembly:
  set   vl,64
  ldv   v1,0(r2)
  mulvi v2,v1,5
  ldv   v1,0(r1)
  addv  v3,v1,v2
  stv   v3,0(r3)

Speed-up: Powerful Instructions (1) — SIMD computing
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploits data locality of e.g. image-processing applications
• Dense encoding (few instruction bits needed)
[Figure: SIMD execution method — nodes 1..K all execute instruction 1, then instruction 2, ..., instruction n, in lock-step over time.]

Speed-up: Powerful Instructions (1) — Sub-word parallelism
• SIMD on a restricted scale
• Used for multimedia instructions
• Examples: MMX, SSE, Sun VIS, HP MAX-2, AMD 3DNow!, TriMedia II
• Example operation: Σ_{i=1..4} |a_i − b_i|  (sum of absolute differences over four sub-words)
[Figure: four sub-word lanes of the two operands processed in parallel by one instruction.]
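To make the sub-word idea concrete, here is a minimal C sketch (my own illustration, not from the slides) of the Σ|a_i − b_i| example: four 8-bit values are packed into one 32-bit word and a single SAD-style routine processes all four lanes. On a real multimedia ISA (MMX/SSE, VIS, MAX-2, TriMedia) the whole loop would map onto one or a few sub-word instructions; the plain-C version only shows the packing and per-lane arithmetic.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Sum of absolute differences over the four 8-bit sub-words of two
 * packed 32-bit operands: sum_{i=1..4} |a_i - b_i|.
 * A multimedia instruction set would do this in a single operation;
 * here each lane is extracted and processed explicitly. */
static uint32_t sad4_u8(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        int32_t av = (a >> (8 * lane)) & 0xFF;
        int32_t bv = (b >> (8 * lane)) & 0xFF;
        sum += (uint32_t)abs(av - bv);
    }
    return sum;
}

int main(void)
{
    /* a = {10, 200, 30, 40}, b = {12, 190, 35, 40}, packed into 32 bits */
    uint32_t a = 10u | (200u << 8) | (30u << 16) | (40u << 24);
    uint32_t b = 12u | (190u << 8) | (35u << 16) | (40u << 24);
    printf("SAD = %u\n", sad4_u8(a, b));   /* 2 + 10 + 5 + 0 = 17 */
    return 0;
}
```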
Speed-up: Powerful Instructions (2)
MO-technique: multiple operations per instruction. Two options:
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)
VLIW instruction example (one field per functional unit):
  FU 1: sub  r8, r5, 3
  FU 2: and  r1, r5, 12
  FU 3: mul  r6, r5, r2
  FU 4: ld   r3, 0(r5)
  FU 5: bnez r5, 13

VLIW architecture: central Register File
[Figure: a shared, multi-ported register file feeding nine execution units, grouped into three issue slots of three units each.]
Q: How many ports does the register file need for an n-issue machine?

Philips oldie: TriMedia TM32A processor
• 0.18 micron, area 16.9 mm2
• 200 MHz (typ), 1.4 W, i.e. 7 mW/MHz (a MIPS processor: 0.9 mW/MHz)
[Figure: die photo showing the I/O interface, I-cache and D-cache with tags, sequencer/decode, and the function units: ALU0–ALU4, SHIFTER0/1, DSPALU0/2, DSPMUL1/2, IFMUL1/2 (float), FALU0/3, FCOMP2, FTOUGH1.]

Speed-up: Powerful Instructions (2) — VLIW characteristics
• Only RISC-like operation support => short cycle times
• Flexible: can implement any FU mixture
• Extensible
• Tight inter-FU connectivity required
• Large instructions (up to 1024 bits)
• Not binary compatible !!!
  – But good compilers exist

Speed-up: Multiple instruction issue (per cycle)
Who guarantees semantic correctness, i.e. decides which instructions can be executed in parallel?
• The user: specifies multiple instruction streams
  – Multi-processor: MIMD (Multiple Instruction Multiple Data)
• The hardware: run-time detection of ready instructions
  – Superscalar
• The compiler: compiles into a dataflow representation
  – Dataflow processors

Multiple instruction issue: three approaches
Example code:
  a := b + 15;
  c := 3.14 * d;
  e := c / f;
Translation to a DDG (Data Dependence Graph):
[Figure: three chains — ld &b, add 15, st &a; ld &d, mul 3.14, st &c; ld &f, div (using the multiply result), st &e.]

Generated Code
  Instr.  Sequential code        Dataflow code
  I1      ld   r1,M(&b)          ld   M(&b)     -> I2
  I2      addi r1,r1,15          addi 15        -> I3
  I3      st   r1,M(&a)          st   M(&a)
  I4      ld   r1,M(&d)          ld   M(&d)     -> I5
  I5      muli r1,r1,3.14        muli 3.14      -> I6, I8
  I6      st   r1,M(&c)          st   M(&c)
  I7      ld   r2,M(&f)          ld   M(&f)     -> I8
  I8      div  r1,r1,r2          div            -> I9
  I9      st   r1,M(&e)          st   M(&e)
Three approaches:
• An MIMD may execute two streams: (1) I1–I3, (2) I4–I9
  – No dependences between the streams; in practice communication and synchronization are required between streams
• A superscalar issues multiple instructions from the sequential stream
  – It must obey the dependences (true and name dependences)
  – Reverse engineering of the DDG is needed at run-time
• Dataflow code is a direct representation of the DDG

Multiple Instruction Issue: Dataflow processor
[Figure: result tokens flow back to a token-matching unit and token store; matched tokens generate instructions from the instruction store, which are sent to reservation stations in front of FU-1 … FU-K.]
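The dataflow organization above fires an instruction as soon as all of its input tokens have arrived. The C sketch below (my own illustration, not from the slides) models this token-matching rule for the nine-instruction example: every node counts its outstanding operands, and a node becomes ready the moment that count drops to zero, independent of program order.

```c
#include <stdio.h>

#define N 9

/* Dataflow token-matching sketch for the DDG of
 *   a := b + 15;  c := 3.14 * d;  e := c / f;
 * succ[i] lists the consumers of instruction i (0 = I1 ... 8 = I9, -1 = none).
 * An instruction fires once all of its operand tokens have arrived. */
static const int succ[N][2] = {
    { 1, -1},   /* I1: ld &b    -> I2     */
    { 2, -1},   /* I2: addi 15  -> I3     */
    {-1, -1},   /* I3: st &a              */
    { 4, -1},   /* I4: ld &d    -> I5     */
    { 5,  7},   /* I5: muli     -> I6, I8 */
    {-1, -1},   /* I6: st &c              */
    { 7, -1},   /* I7: ld &f    -> I8     */
    { 8, -1},   /* I8: div      -> I9     */
    {-1, -1},   /* I9: st &e              */
};

int main(void)
{
    int pending[N] = {0};   /* operand tokens still missing per node */
    int fired[N]   = {0};

    for (int i = 0; i < N; i++)
        for (int k = 0; k < 2; k++)
            if (succ[i][k] >= 0) pending[succ[i][k]]++;

    /* Unit-latency simulation: in each cycle, fire every ready node. */
    for (int cycle = 1, done = 0; done < N; cycle++) {
        int ready[N], n = 0;
        for (int i = 0; i < N; i++)
            if (!fired[i] && pending[i] == 0) ready[n++] = i;
        for (int j = 0; j < n; j++) {
            int i = ready[j];
            fired[i] = 1; done++;
            printf("cycle %d: fire I%d\n", cycle, i + 1);
            for (int k = 0; k < 2; k++)
                if (succ[i][k] >= 0) pending[succ[i][k]]--;
        }
    }
    return 0;
}
```

With unit latencies the nine instructions complete in four steps, which is the length of the longest chain in the DDG (I4 → I5 → I8 → I9).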
Instruction Pipeline Overview
[Figure: instruction pipelines compared — no pipelining (CISC): IF DC RF EX; RISC: IF DC/RF EX WB; superscalar: k parallel IF/DC/ISSUE/RF/EX/ROB/WB pipes; superpipelined: IF split into IF1..IFs and EX into EX1..EX5; VLIW: a single IF and DC followed by k parallel RF/EX/WB pipes; dataflow: k parallel RF/EX/WB pipes.]

Four-dimensional representation of the architecture design space <I, O, D, S>
[Figure: architectures positioned along four axes — Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D' and Superpipelining degree 'S' — with RISC near the origin and SIMD, Vector, CISC, Superscalar, MIMD, Dataflow, Superpipelined and VLIW placed according to the table below.]
Note: MIMD would better be treated as a separate, 5th dimension!

Architecture design space
Typical values of K (# of functional units or processor nodes) and <I, O, D, S> for different architectures:

  Architecture     K     I    O    D     S     Mpar
  CISC             1     0.2  1.2  1.1   1     0.26
  RISC             1     1    1    1     1.2   1.2
  VLIW             10    1    10   1     1.2   12
  Superscalar      3     3    1    1     1.2   3.6
  Superpipelined   1     1    1    1     3     3
  Vector           7     0.1  1    64    5     32
  SIMD             1024  1    1    1024  1.2   1229
  MIMD             32    32   1    1     1.2   38
  Dataflow         10    10   1    1     1.2   12

  S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)
  Mpar = I · O · D · S

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism (ILP)
  – limits on ILP
• VLIW
  – Examples
• Clustering
• Code generation
• Hands-on

General organization of an ILP architecture
[Figure: CPU with instruction fetch and decode units fed from the instruction memory, a register file and bypassing network, function units FU-1 … FU-5, and a data memory.]

Motivation for ILP
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like
  – multimedia (image, audio, video, 3-D, holographic)
  – intelligent search and filtering engines
  – neural, fuzzy, genetic computing
• More functionality
• Use of existing code (compatibility)
• Low power: P = f · C · Vdd^2

Low power through parallelism
• Sequential processor
  – Switching capacitance C
  – Frequency f
  – Voltage V
  – P = f · C · V^2
• Parallel processor (two times the number of units)
  – Switching capacitance 2C
  – Frequency f/2
  – Voltage V' < V
  – P = (f/2) · 2C · V'^2 = f · C · V'^2

Measuring and exploiting available ILP
• How much ILP is there in applications?
• How to measure parallelism within applications?
  – Using an existing compiler
  – Using trace analysis
• Track all the real data dependences (RaW) of instructions from the issue window
  – register dependences
  – memory dependences
• Check for correct branch prediction
  – if the prediction is correct, continue
  – if wrong, flush the schedule and start in the next cycle

Trace analysis
Program:
  for i := 0..2
    A[i] := i;
  S := X + 3;
Compiled code:
        set  r1,0
        set  r2,3
        set  r3,&A
  Loop: st   r1,0(r3)
        add  r1,r1,1
        add  r3,r3,4
        brne r1,r2,Loop
        add  r1,r5,3
Trace (16 instructions):
  set r1,0 ; set r2,3 ; set r3,&A
  st r1,0(r3) ; add r1,r1,1 ; add r3,r3,4 ; brne r1,r2,Loop
  st r1,0(r3) ; add r1,r1,1 ; add r3,r3,4 ; brne r1,r2,Loop
  st r1,0(r3) ; add r1,r1,1 ; add r3,r3,4 ; brne r1,r2,Loop
  add r1,r5,3
How parallel can you execute this code?
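As a rough illustration of the trace-analysis idea (my own sketch, not part of the slides), the C program below replays the 16-instruction trace and schedules every instruction one cycle after the last of its producers, under the idealized assumptions introduced two slides further on: perfect branch prediction, unlimited resources, unit latency, and only true (RaW) register dependences counted. Because it renames away all WAR/WAW conflicts and speculates freely past branches, it finds a 5-cycle schedule (ILP 3.2), slightly above the more conservative 6-cycle hand schedule shown on the next slide.

```c
#include <stdio.h>

#define NREG 8          /* r0..r7 are enough for this trace */
#define TRACE_LEN 16

/* One trace entry: destination register and up to two source registers
 * (-1 = unused).  Stores and branches have no destination register. */
struct instr { const char *txt; int dst, src1, src2; };

/* The 16-instruction trace of the A[i] := i loop followed by S := X+3. */
static const struct instr trace[TRACE_LEN] = {
    {"set r1,0",         1, -1, -1},
    {"set r2,3",         2, -1, -1},
    {"set r3,&A",        3, -1, -1},
    {"st r1,0(r3)",     -1,  1,  3},
    {"add r1,r1,1",      1,  1, -1},
    {"add r3,r3,4",      3,  3, -1},
    {"brne r1,r2,Loop", -1,  1,  2},
    {"st r1,0(r3)",     -1,  1,  3},
    {"add r1,r1,1",      1,  1, -1},
    {"add r3,r3,4",      3,  3, -1},
    {"brne r1,r2,Loop", -1,  1,  2},
    {"st r1,0(r3)",     -1,  1,  3},
    {"add r1,r1,1",      1,  1, -1},
    {"add r3,r3,4",      3,  3, -1},
    {"brne r1,r2,Loop", -1,  1,  2},
    {"add r1,r5,3",      1,  5, -1},
};

int main(void)
{
    int ready[NREG] = {0};   /* cycle in which each register value is produced */
    int lpar = 0;

    for (int i = 0; i < TRACE_LEN; i++) {
        /* Ideal processor: only true (RaW) dependences constrain placement. */
        int cyc = 0;
        if (trace[i].src1 >= 0 && ready[trace[i].src1] > cyc) cyc = ready[trace[i].src1];
        if (trace[i].src2 >= 0 && ready[trace[i].src2] > cyc) cyc = ready[trace[i].src2];
        cyc += 1;                                /* unit latency */
        if (trace[i].dst >= 0) ready[trace[i].dst] = cyc;
        if (cyc > lpar) lpar = cyc;
        printf("cycle %2d: %s\n", cyc, trace[i].txt);
    }
    printf("L_serial = %d, L_parallel = %d, ILP = %.1f\n",
           TRACE_LEN, lpar, (double)TRACE_LEN / lpar);
    return 0;
}
```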
Trace analysis: parallel trace
  cycle 1:  set r1,0        set r2,3        set r3,&A
  cycle 2:  st r1,0(r3)     add r1,r1,1     add r3,r3,4
  cycle 3:  st r1,0(r3)     add r1,r1,1     add r3,r3,4     brne r1,r2,Loop
  cycle 4:  st r1,0(r3)     add r1,r1,1     add r3,r3,4     brne r1,r2,Loop
  cycle 5:  brne r1,r2,Loop
  cycle 6:  add r1,r5,3
Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7

Ideal Processor
Assumptions for an ideal/perfect processor:
1. Register renaming
   – infinite number of virtual registers => all register WAW & WAR hazards avoided
2. Branch and jump prediction
   – perfect => all program instructions are available for execution
3. Memory-address alias analysis
   – addresses are known; a store can be moved before a load provided the addresses are not equal
Also:
– unlimited number of instructions issued per cycle (unlimited resources), and an unlimited instruction window
– perfect caches
– 1-cycle latency for all instructions (including FP *, /)
Programs were compiled with the MIPS compiler at maximum optimization level.

Upper Limit to ILP: Ideal Processor
[Chart: instruction issues per cycle (IPC) for six programs — gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doduc 118.7, tomcatv 150.1. Integer: 18–60; FP: 75–150.]

Window Size and Branch Impact
• Change from an infinite window to one that examines 2000 instructions and issues at most 64 per cycle
[Chart: IPC for the same six programs under different branch predictors — perfect, tournament/selective, standard 2-bit BHT(512), profile-based/static, and no prediction. FP: 15–45; Integer: 6–12.]

Limiting the Number of Renaming Registers
• Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor (slightly better than the tournament predictor)
[Chart: IPC versus the number of renaming registers — infinite, 256, 128, 64, 32, none. FP: 11–45; Integer: 5–15.]

Memory Address Alias Impact
• Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor, 256 renaming registers
[Chart: IPC under different alias-analysis models — perfect, global/stack perfect, inspection, none. FP: 4–45 (Fortran, no heap); Integer: 4–9.]

Reducing Window Size
• Assumptions: perfect disambiguation, 1K selective predictor, 16-entry return stack, 64 renaming registers, issue as many instructions as the window allows
[Chart: IPC versus window size — infinite, 256, 128, 64, 32, 16, 8, 4. FP: 8–45; Integer: 6–12.]

How to Exceed the ILP Limits of This Study?
• WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not in memory
• Unnecessary dependences
  – the compiler did not unroll loops, so the iteration variable creates a dependence (see the sketch after this list)
• Overcoming the dataflow limit: value prediction — predicting values and speculating on the prediction
  – Address value prediction and speculation predicts addresses and speculates by reordering loads and stores; this could provide better alias analysis
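To illustrate the "unnecessary dependences" point: in a simple accumulation loop the only recurrences are the iteration variable and the accumulator themselves, and unrolling with multiple accumulators removes most of that serial chain. A minimal C sketch (my own example, not from the slides):

```c
#include <stddef.h>
#include <stdio.h>

/* Rolled version: every iteration depends on the previous one through
 * both 'i' and 'sum', so the trace contains one long serial add chain. */
long sum_rolled(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by four with separate partial sums: the four accumulators are
 * independent, so an ILP machine can run the four chains in parallel.
 * (Assumes n is a multiple of 4 to keep the sketch short.) */
long sum_unrolled4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

int main(void)
{
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%ld %ld\n", sum_rolled(a, 8), sum_unrolled4(a, 8));  /* 36 36 */
    return 0;
}
```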
Conclusions
• The amount of parallelism is limited
  – higher in multimedia and signal-processing applications
  – higher in kernels
• Trace analysis detects all types of parallelism
  – task, data and operation level
• The parallelism detected depends on
  – the quality of the compiler
  – the hardware
  – source-code transformations

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
  – Examples
    • C6
    • TM
    • IA-64: Itanium, ....
    • TTA
• Clustering
• Code generation
• Hands-on

VLIW: general concept
[Figure: a VLIW architecture with 7 function units — three integer FUs, two load/store units and two floating-point FUs — fed in parallel from a wide instruction register loaded from the instruction memory; the integer and floating-point register files connect the FUs to the data memory.]

VLIW characteristics
• Multiple operations per instruction
• One instruction issued per cycle (at most)
• The compiler is in control
• Only RISC-like operation support
  – short cycle times
  – easier to compile for
• Flexible: can implement any FU mixture
• Extensible / scalable
However:
• tight inter-FU connectivity required
• not binary compatible !!
  – (new long instruction format)
• low code density

VelociTI C6x datapath
[Figure: the TI VelociTI C6x datapath.]

VLIW example: TMS320C62
TMS320C62 VelociTI processor:
• 8 operations (of 32 bits) per instruction (256 bits)
• Two clusters
  – 8 FUs: 4 FUs per cluster (2 multipliers, 6 ALUs)
  – 2 x 16 registers
  – one bus available to write into the register file of the other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All instructions conditional
• Originally: 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
• 128 KB on-chip RAM

VLIW example: Philips TriMedia TM1000
• 5 issue slots; instruction register fed from a 32 kB instruction cache
• Register file: 128 registers, 32 bit, 15 ports
• Data cache: 16 kB
• Function units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt

Intel EPIC Architecture IA-64
Explicit Parallel Instruction Computer (EPIC)
• IA-64 architecture -> Itanium, first realization in 2001
Register model:
• 128 64-bit integer registers (each with an extra NaT bit), register stack, rotating
• 128 82-bit floating-point registers, rotating
• 64 1-bit boolean (predicate) registers
• 8 64-bit branch target address registers
• system control registers
See http://en.wikipedia.org/wiki/Itanium

EPIC Architecture: IA-64
• Instructions are grouped in 128-bit bundles
  – 3 x 41-bit instructions
  – 5 template bits, which indicate the instruction types and the stop location
• Each 41-bit instruction
  – starts with a 4-bit opcode, and
  – ends with a 6-bit guard (boolean/predicate) register id
• Supports speculative loads
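As a concrete view of the bundle format just described, here is a small C sketch (my own illustration; the field positions follow the commonly documented IA-64 layout, with the 5 template bits in the least-significant positions) that pulls the template and the three 41-bit slots out of a 128-bit bundle:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the 128-bit IA-64 bundle layout: bits 0-4 hold the 5-bit
 * template, and bits 5-45, 46-86 and 87-127 hold the three 41-bit
 * instruction slots.  The bundle is modelled as two 64-bit halves
 * (lo = bits 0..63, hi = bits 64..127). */
struct bundle { uint64_t lo, hi; };

#define SLOT_MASK ((1ULL << 41) - 1)

static unsigned template_of(struct bundle b) { return (unsigned)(b.lo & 0x1F); }

static uint64_t slot(struct bundle b, int i)
{
    switch (i) {
    case 0:  return (b.lo >> 5) & SLOT_MASK;                   /* bits 5..45   */
    case 1:  return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK; /* bits 46..86  */
    default: return (b.hi >> 23) & SLOT_MASK;                  /* bits 87..127 */
    }
}

int main(void)
{
    /* A made-up bit pattern, just to show the extraction. */
    struct bundle b = { 0x0123456789ABCDEFULL, 0xFEDCBA9876543210ULL };
    printf("template = 0x%02x\n", template_of(b));
    for (int i = 0; i < 3; i++)
        printf("slot %d   = 0x%011llx\n", i, (unsigned long long)slot(b, i));
    /* Within each 41-bit slot the major opcode sits in the top 4 bits and
     * the 6-bit qualifying-predicate (guard) register id in the low 6 bits. */
    return 0;
}
```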
Itanium organization
[Figure: block diagram of the Itanium organization.]

Itanium 2: McKinley
[Figure: the Itanium 2 (McKinley) organization.]

EPIC Architecture: IA-64
• EPIC allows for more binary compatibility than a plain VLIW:
  – function unit assignment is performed at run-time
  – the pipeline locks when FU results are not yet available
• See the other website (course 5MD00) for more info on IA-64:
  – www.ics.ele.tue.nl/~heco/courses/ACA
  – (look at the related material)

What did we talk about?
ILP = Instruction-Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel.
VLIW = Very Long Instruction Word architecture.
Example instruction format (5-issue):
  | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

VLIW evaluation
[Figure: the ILP organization again — instruction fetch/decode units, register file, bypassing network, FU-1 … FU-5 and data memory. The control problem: with N function units the bypassing network grows as O(N^2) and the register file as O(N)–O(N^2).]

VLIW evaluation
Strong points of VLIW:
– Scalable (add more FUs)
– Flexible (an FU can be almost anything, e.g. multimedia support)
Weak points, with N FUs:
• Bypassing complexity: O(N^2)
• Register file complexity: O(N)
• Register file size: O(N^2)
• The register file design restricts FU flexibility
Solution: .................................................. ?

Solution — TTA: Transport Triggered Architecture
[Figure: function units (adders, multiplier, comparators, load/store) attached to a transport network instead of a central, fully connected register file.]

Transport Triggered Architecture
General organization of a TTA:
[Figure: instruction fetch and decode units, a register file and FU-1 … FU-5 all connected through the transport (bypassing) network; instruction and data memories as before.]

TTA structure; datapath details
[Figure: load/store units, integer ALUs, a float ALU, a boolean register file, integer and float register files, an immediate unit and the instruction unit, all connected to the transport buses through sockets; instruction and data memories.]

TTA hardware characteristics
• Modular: building blocks are easy to reuse
• Very flexible and scalable
  – easy inclusion of Special Function Units (SFUs)
• Very low complexity
  – more than 50% reduction in the number of register ports
  – reduced bypass complexity (no associative matching)
  – up to 80% reduction in bypass connectivity
  – trivial decoding
  – reduced register pressure
  – easy register file partitioning (a single port is enough!)

TTA software characteristics
  add r3, r1, r2
becomes three moves:
  r1 -> add.o1;  r2 -> add.o2;  add.r -> r3
That does not look like an improvement !?!
• More difficult to schedule !
• But: extra scheduling optimizations become possible

Programming TTAs: how to do data operations?
1. Transport of operands to the FU
   • operand move(s) to the operand register(s)
   • trigger move to the trigger register (this starts the operation)
2. Transport of results from the FU
   • result move(s) from the result register
[Figure: FU pipeline with operand and trigger registers at the input, internal stages, and a result register at the output.]
Example:
  add r3,r1,r2
becomes
  r1 -> Oint     // operand move to the integer unit
  r2 -> Tadd     // trigger move to the integer unit
  ....           // addition operation in progress
  Rint -> r3     // result move from the integer unit
How to do control flow?
1. Jump:    #jump-address -> pc
2. Branch:  #displacement -> pcd
3. Call:    pc -> r;  #call-address -> pcd
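To make the operand/trigger/result protocol above concrete, here is a small C model (my own sketch, not from the slides) of a single add unit: a move to the operand register only latches a value, while a move to the trigger register starts the operation, whose result then appears in the result register.

```c
#include <stdio.h>

/* Toy model of one TTA-style integer add FU with an operand register,
 * a trigger register and a result register.  A move to the operand
 * register only latches data; a move to the trigger register latches
 * data AND triggers the operation (zero latency in this toy model). */
struct add_fu {
    int operand;   /* add.o1 */
    int result;    /* add.r  */
};

static void move_to_operand(struct add_fu *fu, int value)
{
    fu->operand = value;                 /* r1 -> add.o1 */
}

static void move_to_trigger(struct add_fu *fu, int value)
{
    fu->result = fu->operand + value;    /* r2 -> add.o2 (the trigger port) */
}

static int move_from_result(const struct add_fu *fu)
{
    return fu->result;                   /* add.r -> r3 */
}

int main(void)
{
    int r1 = 40, r2 = 2, r3;
    struct add_fu add = {0, 0};

    /* add r3,r1,r2 expressed as three data transports: */
    move_to_operand(&add, r1);           /* operand move */
    move_to_trigger(&add, r2);           /* trigger move */
    r3 = move_from_result(&add);         /* result move  */

    printf("r3 = %d\n", r3);             /* 42 */
    return 0;
}
```

In the slides' notation the trigger port is written add.o2 or Tadd; the key point is that the operation is a side effect of a transport, which is what lets the compiler schedule individual moves, as the next scheduling example shows.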
Scheduling example
VLIW code:
  add r1,r1,r2
  sub r4,r1,95
TTA code (the add result is moved directly to the subtract, bypassing the register file):
  r1 -> add.o1,  r2 -> add.o2
  add.r -> sub.o1,  95 -> sub.o2
  sub.r -> r4
[Figure: the moves mapped onto a load/store unit, integer ALUs, the integer register file and the immediate unit.]

TTA instruction format
General MOVE field:
  g   : guard specifier
  i   : immediate specifier
  src : source
  dst : destination
A general MOVE instruction contains multiple such fields:
  | move 1 | move 2 | move 3 | move 4 |
How to use immediates?
  Small (6 bits):  | g | 1 | imm  | dst |
  Long (32 bits):  | g | 0 | Ir-1 | dst |  + 32-bit imm

Programming TTAs: how to do conditional execution?
Each move is guarded. Example:
  r1 -> cmp.o1     // operand move to the compare unit
  r2 -> cmp.o2     // trigger move to the compare unit
  cmp.r -> g       // put the result in boolean register g
  g:r3 -> r4       // the guarded move takes place only when r1 = r2

Register file port pressure for TTAs
[Chart: exploited ILP degree (roughly 1.0 to 3.5) as a function of the number of register-file read ports (1–5) and write ports (1–5).]

Summary of TTA advantages
• Better usage of transport capacity
  – instead of 3 transports per dyadic operation, about 2 are needed
  – number of register ports reduced by at least 50%
  – inter-FU connectivity reduced by 50–70%
• No full connectivity required
• Both the transport capacity and the number of register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
• Flexible: FUs can incorporate arbitrary functionality
• Scalable: the number of FUs, register files, etc. can be changed
• FU splitting results in extra exploitable concurrency
• TTAs are easy to design and can have short cycle times

TTA automatic DSE (design space exploration)
[Figure: the Move framework — an optimizer, steered by user interaction, proposes architecture parameters; a parametric compiler produces parallel object code and performance feedback; a hardware generator produces the chip and cost feedback; together they trace a Pareto curve over the solution space (performance versus cost).]

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering and Reconfigurable components
• Code generation
• Hands-on

Clustered VLIW
• Clustering = splitting up the VLIW data path
  – the same can be done for the instruction path (loop buffers)
[Figure: three clusters, each with its own loop buffer, three FUs and a register file, sharing a level-1 instruction cache, a level-1 data cache and a level-2 (shared) cache.]

Clustered VLIW: why clustering?
• Timing: faster clock
• Lower cost
  – silicon area
  – T2M (time to market)
• Lower energy
What's the disadvantage?
Want to know more: see the PhD thesis of Andrei Terechko.
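Before moving on to reconfigurable fabrics: both IA-64 predication and the TTA guarded moves above implement the same idea, if-conversion — turning a short branch into conditionally executed operations. A hedged C-level sketch of what the compiler does (my own illustration, not from the slides):

```c
#include <stdio.h>

/* Branchy source: the compiler would normally emit a compare followed by
 * a conditional branch around the assignment. */
int max_branching(int a, int b)
{
    int r = a;
    if (b > a)
        r = b;
    return r;
}

/* If-converted form: the condition becomes a 1-bit guard (an IA-64
 * predicate register or a TTA boolean register), and the assignment
 * becomes a guarded operation that always issues but only takes effect
 * when the guard is true.  No branch is left to predict. */
int max_predicated(int a, int b)
{
    int g = (b > a);          /* cmp.r -> g  */
    int r = a;
    r = g ? b : r;            /* g: b -> r   */
    return r;
}

int main(void)
{
    printf("%d %d\n", max_branching(3, 7), max_predicated(3, 7));  /* 7 7 */
    return 0;
}
```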
Fine-grained reconfigurable: Xilinx XC4000 FPGA
[Figure: the XC4000 fabric — an array of Configurable Logic Blocks (CLBs) connected through switch matrices and programmable interconnect, surrounded by I/O blocks (IOBs) with input/output buffers, programmable delay, slew-rate control and passive pull-up/pull-down. Each CLB contains the 4-input function generators F and G, the H function generator, two flip-flops with clock-enable and set/reset control, and the C1–C4, H1, DIN and S/R inputs.]

Recent coarse-grain reconfigurable architectures
• SmartCell (2009)
  – read http://www.hindawi.com/journals/es/2009/518659.html
• Montium (reconfigurable VLIW)
• RAPID
• NIOS II
• RAW
• PicoChip
• PACT XPP64
• ADRES (IMEC)
• many more ....

Xilinx Zynq with 2 ARM processors
[Figure: the Zynq device, combining a dual-ARM processing system with programmable logic.]

ADRES
• Combines a VLIW and a reconfigurable array
• PEs have local registers
• Top-row PEs share registers

PACT XPP: Architecture
• XPP (Extreme Processing Platform)
  – a hierarchical structure consisting of PAEs
• PAEs
  – coarse-grain PEs
  – adaptive
  – clustered in PACs; PA = PAC + CM
• A hierarchical configuration tree
• Memory elements (alongside the PAs)
• I/O elements (on each side of the chip)

RAW with mesh network
[Figure: a RAW tile with its compute pipeline and mesh routers; 8 32-bit channels, registered at the input; the longest wire equals the length of a tile.]

Granularity makes differences
                        Fine-grained     Coarse-grained
  Clock speed           Low              High
  Configuration time    Long             Short
  # of blocks           Large            Small
  Flexibility           High             Low
  Power                 High             Low
  Area                  Large            Small

Reconfiguration time vs. data path granularity: HW or SW reconfigurable?
[Figure: design points plotted against data-path granularity (fine to coarse) and reconfiguration time — an FPGA needs a reset and spatial mapping; loop-buffer/context-based temporal mapping sits in between; sub-word parallelism and a VLIW reconfigure every cycle at the coarse-grained end.]
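To tie the granularity comparison above to something concrete: the fine-grained end of the spectrum is essentially the 4-input look-up table inside an FPGA CLB, which can be modelled as nothing more than a 16-bit truth table indexed by its inputs. A rough sketch of the idea in C (not a model of any particular Xilinx primitive):

```c
#include <stdint.h>
#include <stdio.h>

/* A 4-input LUT is a 16-entry truth table: bit i of 'config' is the output
 * for input pattern i.  Reconfiguring the "hardware" means rewriting these
 * configuration bits - which is why fine-grained fabrics need many bits and
 * long configuration times, while a coarse-grained PE just selects among a
 * few word-level operations. */
static int lut4(uint16_t config, int i3, int i2, int i1, int i0)
{
    unsigned index = (unsigned)((i3 << 3) | (i2 << 2) | (i1 << 1) | i0);
    return (config >> index) & 1;
}

int main(void)
{
    /* Configure the LUT as a 2-input XOR of i0 and i1 (i2, i3 unused):
     * the output is 1 exactly for the input patterns where i0 != i1. */
    uint16_t xor2 = 0x6666;
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("%d xor %d = %d\n", a, b, lut4(xor2, 0, 0, b, a));
    return 0;
}
```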