Platform Design
ASIP: Application Specific Instruction-set Processor
TU/e 5kk70
Henk Corporaal
Bart Mesman

Application domain specific processors (ADSP or ASIP)

[Figure: processor spectrum along the axes flexibility and efficiency: programmable CPU, programmable DSP, application domain specific processor, application specific processor.]

4/13/2015 Platform Design H.Corporaal and B. Mesman

Application domain specific processors (ADSP or ASIP)

An ADSP takes a well-defined application domain as a starting point:
• exploits characteristics of the domain (computation kernels)
• still programmable within the domain
e.g. MPEG2 coding uses the 8*8 DCT transform; DECT, GSM, etc.

GP (general purpose) implementation of the application domain:
• performance: clock speed + ILP
• flexible development (new applications)
• manual design, large effort

ADSP implementation of the application domain:
• performance: ILP, DLP, tuning to domain
• cost effective (high volume)
• problems: specification; design time and effort => synthesized cores

www.adelantetech.com

Per part: clock (MHz), size (gates), ROM (Kbyte), RAM (Kbyte).

Speech Components
• ADPCM: Full duplex ITU-T G.726 compliant and 40 kbit/s speech-compression encoder/decoder.
  clock 4, size 5,100, ROM 1.3, RAM 0.128
• ADPCM-16: Full duplex 16 channel ITU-T G.726 compliant 16, 24, 32 and 40 kbit/s speech-compression encoder/decoder.
  clock 32, size 10,200, ROM 1.3, RAM 2.048
• IW-ASR Speech Recognition: Template-based, speaker-dependent, isolated-word automatic speech recognition.
  clock 1.3, size 9,000, ROM 6, RAM approx. 1 Kbyte/word
• G.723.1: Low bit-rate ITU-T G.723.1 compliant speech compression at 6.3 kbit/s; can be combined with G.723.1A.
  clock 20, size 24,000, ROM 22, RAM 2.3
• G.723.1A: Extended version of G.723.1 that reduces the bit rate by a silence compression scheme. Uses voice activity detection and comfort-noise generation. Fully compliant with Annex A of the speech-compression standard CODEC G.723.1. Yields no additional hardware cost.
  clock 20, size 24,000, ROM 22, RAM 2.3
• Speech Synthesis: Phrase-concatenated speech synthesis.
  Depends on compression requirements

Telecommunications
• Echo Cancellation: High-performance echo-cancellation and suppression processor.
  clock 4, size 6,000, ROM 2.80, RAM 0.15
• DTMF: Full-duplex DTMF transceiver.
  clock 2, size 4,000, ROM 1.00, RAM 0.15
• Caller-ID: On-hook and off-hook caller line identification. Includes DTMF and V.23.
  clock 3, size 6,000, ROM 2.10, RAM 0.15
• Reed-Solomon: Full-duplex Reed-Solomon codec.
  size 7,000, ROM 3.75, RAM 0.15
• Viterbi Decoder: Configurable rate, code and constraint length. Configurable traceback depth. Supports soft and hard decision making. Supports code puncturing.
  clock depending on throughput, size 5,000 to 9,000, ROM --, RAM --
• V.23 modem: ITU-T V.23 compliant 1200 baud FSK modem.
  size 6,000, ROM 0.80, RAM 0.15

Other
• Pink Noise Generator: Low-ripple pink noise filter with a filter characteristic of -3 ± 0.08 dB per octave over the bandwidth 20 Hz to 20 kHz.
  size 4,000, ROM 0.10, RAM 0.10
• CCIR 656/601: Digital video converter: CCIR to raw-video data and vice versa.
  size 1,500, ROM none, RAM none

Design process

[Figure: design flow. The application(s) and an instance of the processor model (e.g. VLIW with shared RFs, set by parameters) feed SW (code generation) and HW design. SW estimations: cycles/algorithm, occupation. HW estimations: nsec/cycle, area, power/instruction. Decision boxes "OK?" and "more appl.?" close the loop; when done, go to phase 2.]

3 phases
1. exploration
2. HW design (layout) + processing
3. design of the application SW
Fast, accurate and early feedback

Problem statement

A compiler is retargetable if it can generate code for a 'new' processor architecture specified in a machine description file.

A guarded register transfer pattern (GRTP) is a register transfer pattern (RTP) together with the control bits of the instruction word that control the RTP.

  a := b + c | instr = xxxx0101

GRTPs contain all inter-RT-conflict information.

Instruction set extraction (ISE) is the process of generating all possible GRTPs for a specific processor.

Problem statement

[Figure: the algorithm spec passes through the front end (FE) to a CDFG; the processor spec (instance) passes through ISE to GRTPs; code generation takes the CDFG and the GRTPs and produces machine code. Annotation: in ch. 4 this is part of the code generator.]

Example: Simple processor [Leupers]

[Figure: datapath with Inp, RAM, PC (+1), IM, REG and outp. Instruction word I.(20:0) with fields I.(20:13), I.(12:5), I.(4), I.(3:2) and I.(1:0).]

Example: Simple processor [Leupers]

Instruction                           Instruction bits (20..0)
PC := PC + 1                          xxxxxxxxxxxxxxxxxxxxx
REG := Inp                            xxxxxxxxxxxxxxxxx011x
REG := IM[PC].(20..13)                xxxxxxxxxxxxxxxxx001x
REG := RAM[IM[PC].(12..5)]            xxxxxxxxxxxxxxxxx1x1x
REG := REG - Inp                      xxxxxxxxxxxxxxxxx0101
REG := REG - IM[PC].(20..13)          xxxxxxxxxxxxxxxxx0001
REG := REG - RAM[IM[PC].(12..5)]      xxxxxxxxxxxxxxxxx1x01
REG := REG + Inp                      xxxxxxxxxxxxxxxxx0100
REG := REG + IM[PC].(20..13)          xxxxxxxxxxxxxxxxx0000
REG := REG + RAM[IM[PC].(12..5)]      xxxxxxxxxxxxxxxxx1x00
RAM[IM[PC].(12..5)] := REG            xxxxxxxxxxxxxxxx1xxxx
outp := REG                           xxxxxxxxxxxxxxxxxxxxx
RAM_NOP                               xxxxxxxxxxxxxxxx0xxxx

ASIP/VLIW architectures

The A|RT designer template serves as an example (= set of rules, a model).

Differences with the VLIW processors of ch. 4:
1. Parallel FUs
   • ASUs = complex application specific FUs (beyond subword parallelism), e.g. biquad, median, DCT, etc.
   • larger grain size, more heterogeneous, more pipelines
2. Register files
   • many register files (>5 vs. 1 or 2)
   • limited number of ports (3 vs. 15)
   • limited size (<16 vs. 128)
3. Issue slots
   • all in parallel vs. 5

[Figure: template with register files RF1 to RF8 in pairs in front of FU1 to FU4, flags, instruction registers IR1 to IR4, instruction memory and control.]
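
As an illustration of the GRTP table of the simple processor, the guards can be stored as strings of `0`, `1` and `x` (don't care), and two register transfers can share one instruction word only if their guards never demand different values for the same bit. This is a minimal sketch under that assumption; the names `guard`, `compatible` and `parallel_pairs` are hypothetical helpers for this example, not part of any tool mentioned in the slides.

```python
from itertools import combinations

WIDTH = 21  # instruction word of the simple processor, bits 20..0


def guard(low_bits):
    """Left-pad a guard, given by its low-order bits, with don't-cares."""
    return low_bits.rjust(WIDTH, "x")


# GRTPs copied from the table: register transfer -> guard on the instruction word.
GRTPS = {
    "PC := PC + 1": guard(""),
    "REG := Inp": guard("011x"),
    "REG := IM[PC].(20..13)": guard("001x"),
    "REG := RAM[IM[PC].(12..5)]": guard("1x1x"),
    "REG := REG - Inp": guard("0101"),
    "REG := REG - IM[PC].(20..13)": guard("0001"),
    "REG := REG - RAM[IM[PC].(12..5)]": guard("1x01"),
    "REG := REG + Inp": guard("0100"),
    "REG := REG + IM[PC].(20..13)": guard("0000"),
    "REG := REG + RAM[IM[PC].(12..5)]": guard("1x00"),
    "RAM[IM[PC].(12..5)] := REG": guard("1xxxx"),
    "outp := REG": guard(""),
    "RAM_NOP": guard("0xxxx"),
}


def compatible(g1, g2):
    """True if no bit position is required to be 0 by one guard and 1 by the other."""
    return all(a == "x" or b == "x" or a == b for a, b in zip(g1, g2))


def parallel_pairs(grtps):
    """All pairs of register transfers whose guards can hold in the same word."""
    return [(a, b) for a, b in combinations(grtps, 2)
            if compatible(grtps[a], grtps[b])]


if __name__ == "__main__":
    store = GRTPS["RAM[IM[PC].(12..5)] := REG"]
    print(compatible(store, GRTPS["REG := REG + Inp"]))  # store and add can coexist
    print(compatible(store, GRTPS["RAM_NOP"]))           # bit 4 cannot be both 1 and 0
```

Under this encoding every pair of register transfers that writes REG is mutually exclusive, which is the kind of inter-RT-conflict information the problem statement attributes to GRTPs.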

ASIP/VLIW architectures

Additional characteristics of the A|RT designer template:
• interconnect network: busses + input multiplexers
  - mux control is part of the instruction
  - control can change every clock cycle
  - network can be incomplete
  - busses can be merged
• memories are modeled as FUs
  - separate data in and data out
  - 2 inputs (data in and address) and 1 output
• each FU can generate one or more flags
• instruction format (per issue slot):
  mux 1 | mux 2 | read address RF 1 | write address RF 1 | read address RF 2 | write address RF 2 | control FU | output drivers

ASIP/VLIW architectures: example

[Figure: datapath with RF1 and RF2 at an ALU and RF3 and RF4 at a MAC, connected by bus1 and bus2. Instruction bits 19..10: mux 2, read RF1, write RF1, read RF2, write RF2, ALU instr. Instruction bits 9..0: mux 3, read RF3, write RF3, read RF4, write RF4, MAC instr.]

ASIP/VLIW architectures: example

GRTP                     Instruction bits 19..10    Instruction bits 9..0
RF1 = ALU (RF1, RF2)     xccccxxccc                 xxxxxxxxxx
RF2 = ALU (RF1, RF2)     xcxccccccc                 xxxxxxxxxx
RF3 = ALU (RF1, RF2)     xcxccxxccc                 cxxccxxxxx
RF3 = MAC (RF3, RF4)     xxxxxxxxxx                 ccccccxccc
RF4 = MAC (RF3, RF4)     xxxxxxxxxx                 xccxxccccc
RF2 = MAC (RF3, RF4)     cxxxxccxxx                 xccxxcxccc

ASIP/VLIW architectures: design flow

[Figure: the algorithm spec and pragmas feed datapath synthesis, which produces RTs; controller synthesis follows; estimations of area, power and timing lead to the decision "OK?"; on "no", change the pragmas.]

Example pragmas:
  assign ( a+b, ALU, fu_alu1)
  assign ( a+_, ALU, fu_alu2)
  assign ( _+_, ALU, fu_alu3)

Example RT:
  RF1 : x = RF2 : y, RF3 : z | ALU = ADD, Inmux = bus2

VLIW makes relatively simple code selection possible.

ASIP/VLIW architectures: list scheduling

[Figure: list scheduling of a data-flow graph of multiplications (*) and additions (+) between IPB and OPB onto a MULT and an ALU. A candidate list feeds the conflict and priority computation, which selects the scheduled operation.]
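
The list scheduling loop in the figure (candidate list, conflict and priority computation, scheduled operation) can be sketched as follows. This is a generic textbook-style list scheduler, not the A|RT designer implementation. It assumes unit latency for every operation, a critical-path priority, and at least one unit for every resource type in use; the small graph at the bottom is a made-up example, not the graph on the slide.

```python
def list_schedule(ops, deps, units):
    """Greedy list scheduling.

    ops:   {operation name: resource type}
    deps:  {operation name: set of predecessor names}
    units: {resource type: number of units available per cycle}
    Returns {operation name: cycle}. Every operation takes one cycle (assumption).
    """
    # Successor lists, needed for the priority function.
    succs = {o: [] for o in ops}
    for o, preds in deps.items():
        for p in preds:
            succs[p].append(o)

    # Priority = length of the longest path from the operation to a sink.
    prio = {}

    def height(o):
        if o not in prio:
            prio[o] = 1 + max((height(s) for s in succs[o]), default=0)
        return prio[o]

    for o in ops:
        height(o)

    schedule = {}
    cycle = 0
    while len(schedule) < len(ops):
        # Candidate list: unscheduled operations whose predecessors
        # all finished in an earlier cycle.
        candidates = [
            o for o in ops
            if o not in schedule
            and all(p in schedule and schedule[p] < cycle for p in deps.get(o, ()))
        ]
        # Highest priority first; the name breaks ties deterministically.
        candidates.sort(key=lambda o: (-prio[o], o))
        free = dict(units)
        for o in candidates:
            if free.get(ops[o], 0) > 0:  # resource conflict check
                free[ops[o]] -= 1
                schedule[o] = cycle
        cycle += 1
    return schedule


# Hypothetical data-flow graph: a2 = (m1 * m2 result) + m3 result.
OPS = {"m1": "MULT", "m2": "MULT", "m3": "MULT", "a1": "ALU", "a2": "ALU"}
DEPS = {"a1": {"m1", "m2"}, "a2": {"a1", "m3"}}
UNITS = {"MULT": 1, "ALU": 1}

if __name__ == "__main__":
    print(list_schedule(OPS, DEPS, UNITS))
```

With one MULT and one ALU the example needs four cycles; raising `UNITS["MULT"]` to 2 shortens the schedule, which is the kind of trade-off the exploration phase evaluates.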

ASIP/VLIW architectures: feedback

[Figure: feedback views of the tool: resource load, architecture view, cycle count, bus load, life-time analysis.]

Low power aspects

Implementation Independent Design Database (Mistral2)
• Estimation Database: area, speed, power

Architecture
EXU          ACTIVITY    AREA    POWER
alu_1             20%     261      105
acs_asu_1         83%    2382     3816
or_asu_1          10%     611      122
romctrl_1         16%      65       21
acu_1             36%     294      205
ipb_1             20%     107       43
opb_1             11%     163       35
ctrl                     1864     3597
total                    5747     7944

GSM Viterbi decoder: default solution

Cycle count: 13750

EXU          ACTIV     AREA     POWER
alu_1          96%     3469     46196
romctrl_1      48%       39       259
acu_1          26%      327      1209
ipb_1           5%      131       105
opb_1          23%     1804      5801
ctrl                   9821    135035
total                 15591    188605

• controller responsible for 70% of power consumption
  - maximum resource sharing
  - heavy decision making: "main" loop with 16 metrics computations per iteration
• EXU numbers include registers for local storage

GSM Viterbi decoder: no loop-folding

Cycle count: 14247

EXU          ACTIV     AREA     POWER
alu_1          92%     3411     45073
romctrl_1      45%       39       255
acu_1          25%      294      1087
ipb_1           5%      107        86
opb_1          22%     1661      5340
ctrl                   4919     70087
total                 10431    121928

• area down by 33%
• power down by 35%
• next step: reduce the number of program steps with a second ALU

GSM Viterbi decoder: 2 ALUs

Cycle count: 9739

EXU          ACTIV     AREA     POWER
alu_1          69%     1797     12248
alu_2          65%     1393      8916
romctrl_1      67%       39       255
acu_1          37%      294      1087
ipb_1           8%      149       119
opb_1          33%     2136      6871
ctrl                   8957     87235
total                 14766    116731

• cycle count down 30%
• area up 42%
• power down by 5%
• next step: introduce an ASU to reduce the ALU load
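
The relative changes quoted for the first three design points can be recomputed from the cycle counts and the "total" rows of the tables. The sketch below does only that arithmetic; the tuple layout and the helper name `percent_change` are choices made for this example.

```python
# (name, cycle count, total area, total power), copied from the slides.
DESIGNS = [
    ("default", 13750, 15591, 188605),
    ("no loop-folding", 14247, 10431, 121928),
    ("2 ALUs", 9739, 14766, 116731),
]


def percent_change(old, new):
    """Signed relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


def step_changes(designs):
    """Percent change in cycles, area and power between consecutive design points."""
    rows = []
    for (_, c0, a0, p0), (name, c1, a1, p1) in zip(designs, designs[1:]):
        rows.append((name,
                     percent_change(c0, c1),
                     percent_change(a0, a1),
                     percent_change(p0, p1)))
    return rows


if __name__ == "__main__":
    for name, cycles, area, power in step_changes(DESIGNS):
        print(f"{name:16s} cycles {cycles:+6.1f}%  area {area:+6.1f}%  power {power:+6.1f}%")
```

The computed values are close to the rounded figures on the slides: removing loop-folding costs a few percent in cycles but cuts area and power by about a third, and the second ALU trades area for a shorter schedule.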

GSM Viterbi decoder: 1 x ACS-ASU

func ACS ( M1, M2, d ) MS, MS8 =
begin
  MS  = if ( M1+d > M2-d ) -> ( M1+d ) || ( M2-d ) fi;
  MS8 = if ( M1-d > M2+d ) -> ( M1-d ) || ( M2+d ) fi;
end;

Cycle count: 1930

EXU          ACTIV    AREA    POWER
alu_1          20%     261      105
acs_asu_1      83%    2382     3816
or_asu_1       10%     611      122
romctrl_1      16%      65       21
acu_1          36%     294      205
ipb_1          20%     107       43
opb_1          11%     163       35
ctrl                  1864     3597
total                 5747     7944

• cycle count down 5X
• power down 20X !

GSM Viterbi decoder: 4 x ACS-ASU

Cycle count: 425

EXU           ACTIV    AREA    POWER
alu_1           94%     243       97
acs_asu_1       95%    1041      420
acs_asu_2       95%    1041      420
acs_asu_3       95%    1041      420
acs_asu_4       95%    1041      420
split_asu_1     47%      90       18
or_asu_1        47%     592      118
romctrl_1       28%      48        6
acu_1           98%     212       85
ipb_1           23%      60        6
opb_1           50%     369       80
ctrl                   1306      555
total                  7084     2645

• cycle count down another 5X
• area up 23%
• power down another 3X !

GSM Viterbi example: summary

[Figure: bar chart (Implementation Independent Design Database, Mistral2) of power, area and cycles on an axis from 0 to 20000 for the design points default, loop, 2 ALU, 1 ACS and 4 ACS. Annotation: 72x !]

Discussion: phase 3

[Figure: two-phase flow. Exploration phase: application(s) and the processor model feed SW (code generation) and HW design, looping on "OK?" and "more appl.?". Then the processor model is frozen. Application software development: application(s) go through SW (code generation) with constraint driven compilation, looping on "OK?".]

Discussion: problems with VLIWs

Code size and instruction bandwidth
• code compaction = reduce code size after scheduling
  possible compaction ratio? e.g. $p_0 = 0.9$ and $p_1 = 0.1$
  information content (entropy) $H = -\sum_i p_i \log_2 p_i \approx 0.47$
  maximum compression factor $1/H \approx 2$
• control parallelism during scheduling = switch between different processor models (10% of code = 90% of runtime)
• architecture: reduce the number of control bits for operand addresses
  e.g. 128 registers (TM) -> 28 bits per issue slot for addresses only => use stacks and FIFOs

[Figure: template with RF1 to RF4, FU1 to FU4, flags, instruction registers IR1 to IR4, instruction memory and control.]

Conclusions
• ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
• The methodology is interesting for IP creation.
• The key problem is retargetable compilation.
• A (distributed) VLIW model is a good compromise between HW and SW.
• Although an automatic process can generate a default solution, the process is usually interactive and iterative for efficiency reasons. The key is fast and accurate feedback.
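
As a worked check of the code compaction estimate in the discussion of VLIW code size, the entropy of a binary source with p0 = 0.9 and p1 = 0.1 can be computed directly. The sketch assumes the instruction bits behave like independent draws from that distribution, which is the simplification implied by the slide; the function name `entropy` is chosen for this example.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


if __name__ == "__main__":
    h = entropy([0.9, 0.1])
    print(f"entropy: {h:.3f} bits per instruction bit")
    print(f"upper bound on compression factor: {1 / h:.2f}")
```

Each stored bit then carries a little under half a bit of information, so an ideal coder could shrink the code by roughly a factor of two and no more.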