Architectural Exploration: Area-Performance tradeoff in 802.11a Transmitter

Arvind
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology
March, 2007
http://csg.csail.mit.edu/arvind


802.11a Transmitter Overview

[Diagram: transmitter block chain with inputs "headers" and "data" and the label "24 Uncoded bits": Controller, Scrambler, Encoder, Interleaver, Mapper, IFFT, Cyclic Extend.]

Must produce one OFDM symbol every 4 μsec.
Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol.
One OFDM symbol = 64 complex numbers.
IFFT transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers.
The IFFT accounts for 85% of the area.


Preliminary results
[MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind

    Design Block     Lines of Code (BSV)   Relative Area
    Controller        49                    0%
    Scrambler         40                    0%
    Conv. Encoder    113                    0%
    Interleaver       76                    1%
    Mapper           112                   11%
    IFFT              95                   85%
    Cyc. Extender     23                    3%

Complex arithmetic libraries constitute another 200 lines of code.


Combinational IFFT

[Diagram: in0 … in63 pass through three stages, each made of 16 Bfly4 blocks followed by a Permute, producing out0 … out63.]

Reuse the same circuit three times to reduce area.


Design Alternatives

Reuse a block over multiple cycles.

[Diagram: a circuit with two copies of f and one g is replaced by a circuit with one f, reused, and one g.]

We expect:
Throughput to decrease
Area to decrease

The clock needs to run faster for the same throughput ⇒ hyper-linear increase in energy.


Circular pipeline: Reusing the Pipeline Stage

[Diagram: in0 … in63 feed a single stage of 16 Bfly4 blocks and a Permute; a Stage Counter controls recirculation; the result appears on out0 … out63.]


Superfolded circular pipeline: Just one Bfly-4 node!

[Diagram: in0 … in63 enter through 64 2-way muxes; four 16-way muxes select the inputs of the single Bfly4, and four 16-way demuxes write its outputs back; a Permute follows. Control: Stage 0 to 2, Index 0 to 15, with an "Index == 15?" test; outputs are out0 … out63.]
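The control implied by the superfolded diagram can be sketched as two nested counters. This is only an illustrative sketch: the module name and the register widths are assumptions, and the datapath (muxes, demuxes, Permute, Bfly4) is omitted. Only the ranges "Stage 0 to 2" and "Index 0 to 15" and the "Index == 15?" test come from the slide.

```bsv
// Hypothetical control skeleton for the superfolded pipeline.
// One Bfly4 is applied per cycle; 16 index steps make one stage,
// and 3 stages make one IFFT, i.e. 3 * 16 = 48 cycles per symbol.
module mkSuperFoldedControl (Empty);
   Reg#(UInt#(2)) stage <- mkReg(0);   // current IFFT stage: 0 to 2
   Reg#(UInt#(4)) index <- mkReg(0);   // current Bfly4 position: 0 to 15

   rule step (True);
      if (index == 15) begin
         // Last Bfly4 of this stage: restart the index, advance the stage.
         index <= 0;
         stage <= (stage == 2) ? 0 : stage + 1;
      end
      else begin
         index <= index + 1;
      end
   endrule
endmodule
```

In a full design the same `stage` and `index` values would drive the mux and demux selects shown in the diagram.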


Pipelining a block

[Diagram: three implementations of the same function between inQ and outQ. Combinational (C): f1, f2, f3 with no registers in between. Pipeline (P): f1, f2, f3 separated by registers. Folded Pipeline (FP): a single block f reused.]

Clock? Area? Throughput?


Synchronous pipeline

[Diagram: x → inQ → f1 → sReg1 → f2 → sReg2 → f3 → outQ]

    rule sync_pipeline (True);
      inQ.deq();
      sReg1 <= f1(inQ.first());
      sReg2 <= f2(sReg1);
      outQ.enq(f3(sReg2));
    endrule

This is real IFFT code; just replace f1, f2 and f3 with stage_f code.

This rule can fire only if inQ has an element and outQ has space.

Atomicity: Either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated.


Stage functions f1, f2 and f3

    function f1(x);
      return (stage_f(1,x));
    endfunction

    function f2(x);
      return (stage_f(2,x));
    endfunction

    function f3(x);
      return (stage_f(3,x));
    endfunction

The stage_f function was given earlier.


Problem: What about pipeline bubbles?

[Diagram: x → inQ → f1 → sReg1 → f2 → sReg2 → f3 → outQ, with a red token and a green token in the pipeline registers]

    rule sync_pipeline (True);
      inQ.deq();
      sReg1 <= f1(inQ.first());
      sReg2 <= f2(sReg1);
      outQ.enq(f3(sReg2));
    endrule

Red and Green tokens must move even if there is nothing in the inQ!
Also, if there is no token in sReg2 then nothing should be enqueued in the outQ.

Modify the rule to deal with these conditions: use valid bits or the Maybe type.


The Maybe type data in the pipeline

    typedef union tagged {
      void   Invalid;
      data_T Valid;
    } Maybe#(type data_T);

[Diagram: a register holding data plus a valid/invalid tag]

Registers contain Maybe type values.

    rule sync_pipeline (True);
      if (inQ.notEmpty())
        begin
          sReg1 <= tagged Valid f1(inQ.first());
          inQ.deq();
        end
      else
        sReg1 <= tagged Invalid;

      case (sReg1) matches
        tagged Valid .sx1: sReg2 <= tagged Valid f2(sx1);
        tagged Invalid:    sReg2 <= tagged Invalid;
      endcase

      case (sReg2) matches
        tagged Valid .sx2: outQ.enq(f3(sx2));
      endcase
    endrule


Folded pipeline

[Diagram: x → inQ → f → outQ, with sReg feeding the output of f back to its input and a stage register selecting the source]

    rule folded_pipeline (True);
      if (stage==0)
        begin
          sxIn = inQ.first();
          inQ.deq();
        end
      else
        sxIn = sReg;

      sxOut = f(stage,sxIn);

      if (stage==n-1)
        outQ.enq(sxOut);
      else
        sReg <= sxOut;

      stage <= (stage==n-1) ? 0 : stage+1;
    endrule

Notice that stage is a dynamic parameter now, and there is no for-loop.
Need type declarations for sxIn and sxOut.
The same code will work for superfolded pipelines by changing n and the stage function f.


802.11a Transmitter Synthesis results (Only the IFFT block is changing)

The same source code.

    IFFT Design                Area (mm²)   Throughput Latency (CLKs/sym)   Min. Freq Required
    Pipelined                  5.25          4                               1.0 MHz
    Combinational              4.91          4                               1.0 MHz
    Folded (16 Bfly-4s)        3.97          4                               1.0 MHz
    Super-Folded (8 Bfly-4s)   3.69          6                               1.5 MHz
    SF (4 Bfly-4s)             2.45         12                               3.0 MHz
    SF (2 Bfly-4s)             1.84         24                               6.0 MHz
    SF (1 Bfly-4)              1.52         48                               12 MHz

All these designs were done in less than 24 hours!
TSMC 0.18 micron; numbers reported are before place and route.


Why are the areas so similar?

Folding should have given a 3x improvement in IFFT area, BUT a constant twiddle allows low-level optimization on a Bfly-4 block: a 2.5x area reduction!
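As a worked check on the synthesis table, the minimum clock frequency follows from the requirement of one OFDM symbol every 4 μsec. The symbols f_min, N_clk, T_sym and k below are notation introduced here for illustration, not taken from the slides. The relation N_clk = 48/k is an assumption (three stages of 16 Bfly-4 evaluations shared among k Bfly-4 units) that reproduces the super-folded rows only; the first three rows all report 4 CLKs/sym.

```latex
\begin{align*}
f_{\min} &= \frac{N_{\text{clk}}}{T_{\text{sym}}}, \qquad T_{\text{sym}} = 4\,\mu\text{s} \\[4pt]
N_{\text{clk}} = 4:\quad  f_{\min} &= \frac{4}{4\,\mu\text{s}}  = 1.0\ \text{MHz} \\
N_{\text{clk}} = 6:\quad  f_{\min} &= \frac{6}{4\,\mu\text{s}}  = 1.5\ \text{MHz} \\
N_{\text{clk}} = 48:\quad f_{\min} &= \frac{48}{4\,\mu\text{s}} = 12\ \text{MHz} \\[4pt]
N_{\text{clk}} &= \frac{3 \times 16}{k}, \qquad k \in \{8, 4, 2, 1\} \;\Rightarrow\; N_{\text{clk}} \in \{6, 12, 24, 48\}
\end{align*}
```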


Parameterize the synchronous pipeline

[Diagram: x → inQ → f1 → sReg[1] → … → sReg[n-1] → fn → outQ]

n and stage are static parameters.

    Vector#(n, Reg#(t)) sReg <- replicateM(mkReg(tagged Invalid));

    rule sync_pipeline (True);
      if (inQ.notEmpty())
        begin
          (sReg[1]) <= tagged Valid f(0,inQ.first());
          inQ.deq();
        end
      else
        (sReg[1]) <= tagged Invalid;

      for (Integer stage = 1; stage < n-1; stage = stage+1)
        // assumed loop body, following the three-stage Maybe rule
        case (sReg[stage]) matches
          tagged Valid .sx: (sReg[stage+1]) <= tagged Valid f(stage,sx);
          tagged Invalid:   (sReg[stage+1]) <= tagged Invalid;
        endcase

      case (sReg[n-1]) matches
        tagged Valid .sx: outQ.enq(f(n-1,sx));
      endcase
    endrule


Syntax: Vector of Registers

Register
Suppose x and y are both of type Reg. Then
    x <= y    means    x._write(y._read())

Vector of (say) Int
    x[i]           means    sel(x,i)
    x[i] = y[j]    means    x = update(x,i, sel(y,j))

Vector of Registers
    x[i] <= y[j]    does not work. The parser thinks it means
                    (sel(x,i)._read)._write(sel(y,j)._read), which will not type check.
    (x[i]) <= y[j]  does work!
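The parenthesized form can be seen in a complete, if tiny, module. This is a minimal sketch under assumptions: the module name, the depth of 4, the Bit#(8) payload and the constant fed into the chain are all hypothetical and chosen only to make the example self-contained.

```bsv
import Vector::*;

// Hypothetical example: a 4-deep chain of Maybe-valued registers.
module mkRegChain (Empty);
   Vector#(4, Reg#(Maybe#(Bit#(8)))) sReg <- replicateM(mkReg(tagged Invalid));

   rule shift (True);
      // Parenthesized left-hand side, as on the slide: select the
      // register first, then write to it.
      (sReg[0]) <= tagged Valid 8'h2A;

      // The loop is unrolled statically; each register takes the value
      // its predecessor held at the start of the cycle.
      for (Integer i = 0; i < 3; i = i + 1)
         (sReg[i+1]) <= sReg[i];
   endrule
endmodule
```

Because all the writes sit in one rule, they happen atomically, which is the same property the synchronous pipeline rule relies on.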