Intro to the “c6x” VLIW processor
● Texas Instruments TMS320C6000 series
● TMS320C6700 subseries: includes floating point
● VLIW = Very Long Instruction Word

Operations in Parallel
[Figure: registers feeding function units in parallel, built up in steps. Bypassing is added; the registers become non-orthogonal A and B register files; the function units are L1, S1, M1, D1 and L2, S2, M2, D2. See TI's picture.]

Specialized Function Units
● L units: arithmetic, compare, and logical ops
● S units: arithmetic, logical, branches, constant generation
● M units: multiplies
● D units: address generation / memory accesses

[Figure: complicated hardware vs. explicit parallelism]

Simple VLIW encoding
● Slots that cannot be utilized are filled with no-ops
● Bad for code density, cache utilization, energy, ...

C6X: Packets
● One bit of each instruction indicates whether the next instruction can be executed in parallel (0 = “EOP”, end of packet)
● Any slot can go to any function unit
● A packet cannot cross an 8-word boundary
● Resources constrain which instructions can be combined in the same packet
● You can branch into the middle of a packet!
[Figure: example parallel-bit patterns: 0 1 0 1 1 1 1 1 and 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1]

Explicit scheduling
Delay slots must be respected: no HW interlocks or scoreboarding
● Multiply: 1 delay slot
● Load: 4 delay slots
● Branch: 5 delay slots
Right:
        B5 := B3 * B2
        <nop>              (the multiply's delay slot)
        B7 := B5 + B1
Wrong:
        B5 := B3 * B2
        B7 := B5 + B1      (reads the old B5: the product has not arrived yet)

Predicated execution
Why? To get rid of branches (5 delay slots * 8 wide ....)
Basic idea: a comparison result is stored to a condition register; this register is then used as an operand of other instructions, and its value causes those operations to be selectively enabled or squashed.
[Condition registers: A1, A2, B0, B1, B2]
Example: if (B3 < B4) B3++; else B4++;

Predicated execution
With branches:
        cmp   B3, B4
        bge   L2
        <nop>
        B3 := B3+1
        b     DONE
        <nop>
L2:     B4 := B4+1
DONE:
With predicates:
        cmplt B3, B4, B0
[B0]    B3 := B3+1
[!B0]   B4 := B4+1
...and the last two can be issued in parallel!
Control dependency has been converted to data dependency...

Assembly details
        .text
        .align  32
        .global proc
proc:
        mvk     4, b3
||      mvk     5, b4
        cmpgt   b3, b4, b0
[ b0]   mvk.S2  9, b5
[!b0]   mvk.S1  8, a5
        stw     a5, *-a15[4]
        .....

Fetch/execute pipeline
PG   generate program address
PS   program address send
PW   program memory access
PR   fetch reaches CPU boundary
DP   instruction dispatch
DC   instruction decode
E1   execute 1
E2   execute 2
E3   execute 3
E4   execute 4
E5   execute 5

Addressing Modes
Mode              C equivalent
*R                (*R)
*+R[ucst5]        (R[ucst5])
*-R[ucst5]        (R[-ucst5])
*+R[offsetR]      (R[offsetR])
*-R[offsetR]      (R[-offsetR])
Special case: 15-bit offsets:
*+B15[ucst15]
*+B14[ucst15]

Addressing Modes
Pre/post increment/decrement
*++R, *R++
*++R[ucst5], *R++[ucst5]
*--R[ucst5], *R--[ucst5]
*++R[offsetR], *R++[offsetR]
*--R[offsetR], *R--[offsetR]

Resources
http://www.cs.cmu.edu/~tcal/15745/
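
Addressing modes as C (sketch)
The "C equivalent" column on the Addressing Modes slides can be spelled out as small C helpers. This is only a sketch under stated assumptions: R is modeled as a pointer to 32-bit words, so an offset counts elements the way C indexing does, and the function names are hypothetical illustrations, not TI intrinsics or library calls.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not TI intrinsics. R is modeled as a pointer to
   32-bit words, so offsets scale by the element size as in C indexing. */

/* *+R[n]: access R[n]; R itself is left unchanged. */
int32_t load_plus(const int32_t *r, int n) { return r[n]; }

/* *-R[n]: access R[-n]; R itself is left unchanged. */
int32_t load_minus(const int32_t *r, int n) { return r[-n]; }

/* *++R[n]: advance R by n elements, then access the new location. */
int32_t load_preinc(const int32_t **r, int n)
{
    assert(r && *r);
    *r += n;
    return **r;
}

/* *R++[n]: access the current location, then advance R by n elements. */
int32_t load_postinc(const int32_t **r, int n)
{
    assert(r && *r);
    int32_t v = **r;
    *r += n;
    return v;
}

/* *--R[n]: move R back by n elements, then access the new location. */
int32_t load_predec(const int32_t **r, int n)
{
    assert(r && *r);
    *r -= n;
    return **r;
}

/* *R--[n]: access the current location, then move R back by n elements. */
int32_t load_postdec(const int32_t **r, int n)
{
    assert(r && *r);
    int32_t v = **r;
    *r -= n;
    return v;
}
```

For example, with R pointing at element 2 of {10, 20, 30, 40, 50}, load_plus(R, 1) reads 40 and leaves R alone, while load_preinc(&R, 2) first moves R to element 4 and then reads 50. The pre/post forms take a pointer to R because they update the base register as a side effect.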