Design Methodology for Customizable Programmable Processors
Berkeley – Finland Day, Oct. 18, 2002

Prof. Jarmo Takala
Institute of Digital and Computer Systems
Tampere University of Technology
Tampere, Finland
Tel: +358 – 33115 3879; Email: jarmo.takala@tut.fi

Outline
- Motivation
- Transport Triggered Architecture (TTA)
- Design Methodology for TTAs
- Research at TUT
- Conclusions

J.Takala/TUT Berkeley – Finland Day, Oct.18, 2002

Motivation
- Programmable processors are often used in products using digital signal processing (DSP)
  - Flexibility
  - Ease of verification
- Traditionally, DSP processor architectures have been developed based on average performance in several benchmark tasks (~100)
  - User applications often contain only a subset of the total benchmarks
- Efficiency can be improved by customizing the architecture according to the given tasks

Motivation
- DSP applications are often hard real-time constrained
  - Execution should be deterministic
  - Dynamic run-time behaviours should be avoided
  - Static scheduling lends itself to DSP
- Current design complexities call for an increase in designer productivity
  - High-level languages should be used
- DSP algorithms contain inherent parallelism
  - Instruction level parallelism (ILP) should be maximized

What is needed?
- Application-driven design process with easy design space exploration
- Replace hardware complexity by software complexity
  - Compiler-driven process
- Use templated architecture
  - Flexible: heterogeneous function units
  - Modular: scalability
  - Orthogonal: compiler friendly

Choices for Architecture Template
[Figure: ILP architectures compared by which steps are done at compilation time (software) and which at run time (hardware). Steps: Frontend, Determine Dependencies, Determine Independencies, Bind Function Units, Bind Datapaths & Execute. Architectures shown: sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW).]

VLIW Gained Popularity in DSP
[Figure: VLIW CPU block diagram. Labels: Instruction Memory, Instruction Fetch, Instruction Decode, Bypassing Network, Register File, FU-1 to FU-5, Data Memory.]

Transport Triggered Architecture
- VLIW drawbacks
  - Bypass complexity
  - Register file complexity
  - Register file design restricts FU flexibility
  - Operation encoding format restricts FU flexibility
- Reverse programming paradigm [H. Corporaal, 94]
  - Data transport triggers the operation
  - Instruction set contains only a single instruction: move

From VLIW to TTA
[Figure: block diagrams of a VLIW and a TTA side by side. Labels: Instruction Memory, Instruction Fetch, Instruction Decode, Bypassing Network, Register File, FU-1 to FU-5, Data Memory.]

TTA Datapath
[Figure: TTA datapath. Labels: Data Memory, two Load/Store Units, two Integer ALUs, Float ALU, Socket, Integer RF, Float RF, Boolean RF, Instruction Unit, Immediate Unit, Instruction Memory.]

Function Units
- Operands are written to operand registers (O)
- The operation is performed when the last operand is written to the trigger register (T)
- Pipeline is synchronized with control bits (C)
- Standard interface
- Optional shadow register
[Figure: function unit pipeline with registers O, T, and R (result), logic stages with control bits (C), optional stages, and the signals FU_ready, Result_ready, Global_lock.]

ILP Architectures
[Figure: the same comparison of compilation-time and run-time steps, extended with TTA. Steps: Frontend, Determine Dependencies, Determine Independencies, Bind Function Units, Bind Datapaths, Execute. Architectures shown: sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW), independence (TTA).]

TTA Characteristics: HW
- Modular
  - Can be constructed with standard building blocks
- Very flexible and scalable
  - FU functionality can be arbitrary
  - Supports user-defined Special Function Units (SFU)
- Lower complexity
  - Reduction in the number of register ports
  - Reduced bypass complexity
  - Reduction in bypass connectivity
  - Reduced register pressure
  - Trivial decoding (implies long instructions)

TTA Characteristics: SW
- Traditional operation-triggered instruction:
    mul r1,r2,r3;
- Transport-triggered instruction:
    r1 → mul.o; r2 → mul.t; mul.r → r3;
  or
    r1 → mul.o, r2 → mul.t; mul.r → r3;
- Resembles dataflow and time-stationary coding
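
The transport-triggered programming model described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not the MOVE project's simulator: the names `FunctionUnit`, `read`, and `run` are hypothetical, each move is executed sequentially, and the function unit produces its result immediately instead of passing through a pipeline.

```python
# Illustrative model of transport-triggered execution.
# All names are hypothetical; latency and parallel buses are ignored.

class FunctionUnit:
    """Function unit with operand (o), trigger (t) and result (r) ports."""

    def __init__(self, operation):
        self.operation = operation  # e.g. lambda o, t: o * t
        self.o = 0                  # operand register
        self.r = 0                  # result register

    def write(self, port, value):
        if port == "o":
            # Writing the operand register only stores the value.
            self.o = value
        elif port == "t":
            # Writing the trigger register starts the operation.
            self.r = self.operation(self.o, value)
        else:
            raise ValueError(f"cannot write port {port!r}")


def read(name, regs, units):
    """Read a general register ("r1") or an FU result port ("mul.r")."""
    if "." in name:
        unit, port = name.split(".")
        if port != "r":
            raise ValueError(f"cannot read port {port!r}")
        return units[unit].r
    return regs[name]


def run(moves, regs, units):
    """Execute (source, destination) moves in program order."""
    for src, dst in moves:
        value = read(src, regs, units)
        if "." in dst:
            unit, port = dst.split(".")
            units[unit].write(port, value)
        else:
            regs[dst] = value
    return regs


# r1 -> mul.o; r2 -> mul.t; mul.r -> r3;
regs = {"r1": 6, "r2": 7, "r3": 0}
units = {"mul": FunctionUnit(lambda o, t: o * t)}
program = [("r1", "mul.o"), ("r2", "mul.t"), ("mul.r", "r3")]
run(program, regs, units)
print(regs["r3"])  # 42
```

The program contains only moves; the multiplication happens as a side effect of the move into `mul.t`, which is the sense in which the instruction set has a single instruction.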
TTA Design Tools
- Design tools based on the TTA architecture template have been developed at Delft University of Technology (DUT), Delft, the Netherlands
  - MOVE project led by Prof. Henk Corporaal
- Fully parametric C/C++ compiler
  - Buses, connections, function units, register files, etc.
- Design space explorer
- Processor generator

Code Generation Trajectory (MOVE Project at DUT)
[Figure: code generation flow. Blocks: Application (C/C++), Architecture Description, Compiler Frontend (GCC or SUIF), Sequential Code, Sequential Simulator, Profiling Data, Compiler Backend, Parallel Code, Parallel Simulator; both simulators have I/O.]

TTA Specific Optimizations
- TTA allows extra scheduling optimizations
  - E.g., software bypassing
- Bypassing can eliminate the need for RF access
- Example:
    r1 → add.o, r2 → add.t;
    add.r → r3;
    r3 → sub.o, r4 → sub.t;
    sub.r → r5;
  Translates to:
    r1 → add.o, r2 → add.t;
    add.r → sub.o, r4 → sub.t;
    sub.r → r5;
- However, more difficult to schedule!
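
The bypassing transformation in the example can be sketched as a small peephole pass over a list of moves. This is only an illustration under stated assumptions, not the MOVE compiler's scheduler: the function name `software_bypass` and the `live_out` parameter are hypothetical, moves are treated as a flat sequence without instruction bundling, and a result register is assumed to hold its value until the same unit is triggered again.

```python
# Illustrative software-bypassing pass (hypothetical helper, flat move list).

def software_bypass(moves, live_out=()):
    """Forward FU results directly to their consumers.

    moves:    list of (source, destination) pairs in program order.
    live_out: registers whose values are still needed after this block.
    """
    moves = list(moves)
    removed = set()
    for i in range(len(moves)):
        src, dst = moves[i]
        # Candidate: an FU result port written to a general register.
        if not (src.endswith(".r") and "." not in dst):
            continue
        trigger = src.split(".")[0] + ".t"
        uses, retriggered, redefined, ok = [], False, False, True
        for j in range(i + 1, len(moves)):
            s, d = moves[j]
            if s == dst:
                if retriggered:
                    # The FU result has changed; forwarding would be wrong.
                    ok = False
                    break
                uses.append(j)
            if d == dst:
                redefined = True  # later reads see a new value
                break
            if d == trigger:
                retriggered = True
        if not ok or not uses:
            continue
        # Read the value from the FU result port instead of the register.
        for j in uses:
            moves[j] = (src, moves[j][1])
        # The register write is dead unless the value is needed later.
        if redefined or dst not in live_out:
            removed.add(i)
    return [m for k, m in enumerate(moves) if k not in removed]


original = [
    ("r1", "add.o"), ("r2", "add.t"),
    ("add.r", "r3"),
    ("r3", "sub.o"), ("r4", "sub.t"),
    ("sub.r", "r5"),
]
for move in software_bypass(original):
    print(" -> ".join(move))
```

On the sequence from the slide, the pass removes the write to `r3` and feeds `add.r` straight into `sub.o`. If `r3` is passed in `live_out`, the consumer is still forwarded but the register write is kept. A real scheduler also has to account for bus availability and result timing, which is why bypassed code is harder to schedule.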
Design Space Exploration (MOVE Project at DUT)
[Figure: exploration flow in two phases. Resource Optimization: Application (C/C++), Resources (Mach), Frontend, Map&Schedule, Select Resources, Simulator, Design Points. Connectivity Optimization: FU models, Cost Functions, Map&Schedule, Reduce Connections, Simulator, Design Point.]

Exploration: Resource Optimization (MOVE Project at DUT)
- Pareto curve represents the lowest bound of the architecture configurations found
- One architecture is selected for further optimization
[Figure: explored design points with Pareto curve; diagram of the selected architecture with two ALUs, three LSUs, two IRUs, and three IUs.]

Exploration: Connectivity Optimization (MOVE Project at DUT)
- Reduced connections decrease bus delay
- Critical connections have been removed
[Figure: the selected architecture (two ALUs, three LSUs, two IRUs, three IUs) with reduced connectivity.]

Topics to be Investigated
- Poor code density
  - Good target for code compression techniques
  - A priori information about the application, thus instruction probabilities are known
- Estimations
  - Power estimation
  - Fast estimations with sufficient accuracy
- Flexibility, reuse
  - Applications may change, thus additional resources need to be assigned although not needed by the original application
- Tool-assisted special function unit generation
  - Analysis support
  - Model creation support
  - Characterization support
- Parameterized processor generator
  - Interconnections, control, etc. may be realized in several ways depending on the target
- Low-power optimizations
- Clustered TTAs
- Interprocessor communication schemes
These topics are considered in the FlexDSP Project at TUT

New Design Environment (Target of FlexDSP Project at TUT)
[Figure: design flow. Blocks: Functionality (C/C++), Frontend, FU models (C, HDL), Cost Functions (area, power, speed), Operation Analysis, Resource Constraints, Design Space Exploration, SFU Generation, Parametric Compiler, Parallel Object Code, Code Compression, Parametric Processor Generator, TTA Processor HDL Code.]

Conclusions
- Design methodologies allowing processor customization will improve efficiency in certain application areas, e.g., multimedia, telecom
- TTA is a promising candidate for an architectural template for customized processors
  - In particular, support for custom function units allows powerful tailoring
- Results of the MOVE project at DUT have already proven the concept
  - Parameterized compiler allows tool-assisted design space exploration
- Still more research needed on
  - Hardware implementations
  - Enhanced compiler strategies