Hot Chips 16 August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott Mahlke Krisztián Flautner, Koen Van Nieuwenhove CCCP Research Group University of Michigan http://cccp.eecs.umich.edu ARM Limited Hot Chips 16 August 24, 2004 OptimoDE Overview • OptimoDE – A configurable VLIW-styled Data Engine architecture – Targeted at intensive data processing • Characteristics – Very wide performance envelope • Power / area / speed tradeoff • Exploiting parallelism in applications – Unlimited data path configuration options – User extensible through ISA customization • Semi-automatic design system – User-in-the-loop design, retargetable compiler toolchain Hot Chips 16 August 24, 2004 SRAM I/O S M2 M switch S S S FIFO S M DATA 2 M1 DATA 1 CTRL M Data ARM CPU Memory Memory SDRAM Controller AHB Bus Matrix SDRAM Control OptimoDE in a System On Chip Data Engine MEM DMA Controller Interrupt Controller SoC Hot Chips 16 August 24, 2004 OptimoDE Architecture Model Function units Memories … interconnect Functional Units – ALU, ACU, Multipliers – Custom Controller … • I/O ports • – RAM ( asynch / synch ) – ROM • I/O ports – addressable – handshake protocol • Registers Memory Registers – Register files regs regs regs … regs interconnect Interconnect Controller • Interconnect – Direct connection – Shared bus mC • Controller • • All layers required Intra-layer configuration Hot Chips 16 August 24, 2004 Design Toolchain mA Evaluation mA Definition OptimoDE OptimoDE Library Library Librarian User User Library Library mA mA .inc .c1 .inc xxxxx xx dct xxxxx instantiate DesignDE mA mA .inc .c2 .inc mA mA .inc .c3 .inc DEvelop mA .tmp save ISS set target #include … run / profile 001010 110011 010110 110111 load main() { … = dct(); create_resource LIFETIME } LOAD Hot Chips 16 August 24, 2004 Compiler Toolchain DEvelop + + 2 C Source Description 3 INPUT * + + 1 3 2 * * 4 + 5 + 7 6 * * 8 1 OUTPUT analysis/opti bind 0 1 2 3 4 5 Analysis feedback * + 1 2 * 4 * + 3 6 * + 8 7 * 5 * + 9 10 Microcode schedule Syntax checks Match architecture Optimize code and Dataflow analysis and dataflow graph register use Hot Chips 16 August 24, 2004 32-point DCT Microarchitecture inport rom outport ram_1 acu_1 acu_2 imm_1 ram_2 alu_1 alu_2 acu_3 imm_2 control • 2 Custom FUs, 2 RAM, 1 ROM, 3 ACU, 2 I/O ports • Designer responsible for creating custom units manually Hot Chips 16 August 24, 2004 Retargetable Customization • Prototype 2 technologies in OptimoDE – Automated ISA customization – Retargetable customization to an “application-area” • Customizing for 1 application – Programmability Nominally programmable – Critical problem – Cannot sustain performance across similar applications – How well does a custom ISA generalize • 5 encryption algorithms, create custom design for each • Average loss >80% versus native [MICRO, 2003] – Proactive generalization creates a retargetable design Hot Chips 16 August 24, 2004 Creating Custom Instructions • Candidate discovery – Identify customization opportunities • Examine program DFG • Partition DFG at: – Memory operations – Unprofitable edges • Enumerate candidate subgraphs within each partition Hot Chips 16 August 24, 2004 Grouping and Selection • Group candidate subgraphs with same structure Group 1 • Estimate performance and cost for each group • Greedily select groups to implement in hardware subject to budget • 1 CFU created per group Group 1 Cost: 2 Adders Gain: 10,000 Cycles Group 3 Group 2 Group 2 Cost: 0.5 Adders Gain: 1,000 Cycles Group 4 Cost: 1 Adder Cost: 1 Adder Gain: 1,500 Cycles Group Gain: 3 2,500 Cycles Group 4 Hot Chips 16 August 24, 2004 Proactively Generalize Groups Input 1 • Cost-effectively extend group functionality to enable reuse 0x8 >> 0xFF Input 1 Input 2 • Wildcard – multiple functionality at nodes | >> 0xFF + Input 2 Output • Subsumed – configurable interconnect to bypass nodes 0x8, 0x4 +,- Output |,& Hot Chips 16 August 24, 2004 Native Speedups 9 8 ARM 926EJ (200 MHz) OptimoDE (333 MHz, 5 Issue: 3 ALU, 1 Mem, 1 Brn) OptimoDE + 1 CFU (333 MHz) 7 Speedup 6 5 4 3 2 1 0 3Des AES Blowfish Md5 Rc4 SHA 33d 33dd3d d eess3d eeesss- - -33 dd B3 ees es -B -B lldooews s w 33l odw ffiiss 3d d eefssi s- hh hR 33eds -R c 3d d ee-Rs c c44 4 es s--A A 33 - EE BBllo3d ddeeAsES SS wef s s--SS BBlBlo o w wfwf f iiss-hhSh hhaa Bl loo w ow f iisish --3a fis shh --B-3 3ddee de ss BBhl -B Blloow lo wsfi Bl loo w w fi s w o f BBllwofi f iisshhfish shh s -R Bl o w ow wffhis-R -R cc4 Blofis ishh--cA4 4 Bl wh- AE ow f i AE ES S fis sh -S h S R Rc-S h RcRc4c44--h3a a R 4 -3 3dd c4 -B d ee s -B lo es s lo w w fi R Rc4fish sh c4 -R - c R Rc4Rc4 4 c4 A R-A E R c4ES S c4 S AEAE-Sh ha A S S AE E - -3a S-S-B3de d e Bl lo s s ow w f AEAE fish ish S S- -R AEAE Rc c4 S-S- 4 AA A AE E ESES S- SShSh ShSha a ShSh a-3a -3 a-a -B de de Bl lo s s ow w f ShSh fishi sh a- a R ShShRc4 c4 a- a AA ShS ESES h a- a Sh-S a ha Speedup Speedup Hot Chips 16 August 24, 2004 Importance of Generalization 77 7 3des 33 22 11 00 Blowfish Rc4 AES 66 6 Key: application run – application designed for Sha Base Base Generalized Generalized 55 5 44 4 3 2 1 0 Hot Chips 16 August 24, 2004 Case Study - Md5 10 9 8 Speedup 7 6 5 4 3 2 1 0 0 2 4 6 8 CFU Cost Budget (Relative to 32-bit ripple-carry adder) 10 Hot Chips 16 August 24, 2004 OptimoDE Design for this Point Input 1 Input 2 + Input 3 + ALU 1 RF RF ALU 2 ACU SRAM … Input 4 CFU RF RF + 0x1B >> Control Memory 0x5 << | Output Input 2 Input 3 Input 1 ^ & ^ Output Hot Chips 16 August 24, 2004 Die Area Breakdown OptimoDE = 5.5 in 0.13 m ARM 926EJ = 5.0 mm2 in 0.13 m mm2 CFU Reg. File and 2% Interconnect 9% Control Memory 11% ALU 1 RF RF ALU 2 ACU SRAM … Control Memory CFU RF RF Baseline Design 78% Hot Chips 16 August 24, 2004 Conclusions • OptimoDE – Configurable VLIW-style data engine architecture – Automated tools for implementing embedded signal and data processing solutions • Automatic retargetable customization – Customized design combined with cost-effective generalization – Performance programmability - Performance stability across a family of similar applications Hot Chips 16 August 24, 2004 For More Information • CCCP group website – cccp.eecs.umich.edu • ARM OptimoDE information – www.arm.com/products/CPUs/families/OptimoDE.html Hot Chips 16 August 24, 2004 Designing for a Domain 1 Fraction of Native Speedup Attained 0.9 0.8 0.7 3des AES Blowfish Rc4 Sha Md5 Mean 0.6 0.5 0.4 0.3 0.2 0.1 0 Rc4 Rc4, Sha Blowfish, Rc4, Sha 3des, Blowfish, Rc4, Sha Apps Designed For 3des, AES, Blowfish, Rc4, Sha