Liquid Architecture Microarchitecture Optimization for Embedded Systems D. Schuehler, B. Brodie, R. Chamberlain, R. Cytron, S. Friedman, J. Fritts, P. Jones, P. Krishnamurthy, J. Lockwood, S. Padmanabhan, and H. Zhang Dept. of Computer Science and Engineering Washington University in St. Louis Supported by NSF ITR-0313203 Liquid Architecture • Configurable architecture that can adapt to needs of particular application • E.g., within an FPGA – Soft-core processors • E.g., as an embedded processor – Tensilica supports configuration at fab time – Stretch support configuration at run time • Today’s discussion is on performance analysis and configuration choice Block Diagram Event Bus Statistics Module LEON SPARCcompatible processor External Memory I-Cache D-Cache AHB LED UART Memory Controller ` Network Interface ` Control Packet Processor FPX FPGA APB Adapter Layered Internet Protocol Wrappers Boot Rom Microarchitecture Configurability • Instruction set • Memory subsystem – Cache size (I and D) – Associativity – Cache line size • Co-processor(s) • Instruction pipeline • Full HDL source is available Design Flow Internet Write and compile embedded SPARC application with GCC Identify configuration for candidate architecture Reconfigure FPX hardware via Internet and upload system software. Execute program on FPX Platform and measure runtime performance Method Time / Cycles Cycle-accurate profiling .text main addQuery findMatch computeKey computeBase computeStep fillQuery Rnd • Choose methods to profile from the user interface Method Address Range .text main Lo addQuery 0x4000027C 0x400003EF Hi findMatch computeKey computeBase computeStep fillQuery Rnd Event Bus Method PC Statistics Module .text 0x4000035A main Lo addQuery 0x4000027C 0x400003EF Hi findMatch computeKey computeBase computeStep fillQuery Rnd CLK Event Bus Function PC .text Statistics Module Lo main 0x4000027C ≤ 0x4000035A ≤ addQuery findMatch computeKey computeBase computeStep fillQuery Rnd INCR Counter 0x400003EF Hi CLK Event Bus Function PC .text Statistics Module Lo main 0x4000027C ≤ 0x4000035A ≤ 0x400003EF Hi addQuery INCR findMatch Counter computeKey computeBase Lo 0x400005D8 ≤ 0x4000035A ≤ computeStep fillQuery Rnd INCR Counter 0x4000061F Hi CLK Event Bus PC Statistics Module Lo 0x4000027C ≤ 0x4000035A INCR ≤ 0x400003EF Hi Counter To User Lo 0x400005D8 ≤ 0x4000035A INCR ≤ Counter 0x4000061F Hi CLK Where is time spent? 100% 90% % of total runtime 80% Rest coreLoop findMatch 70% 60% BLASTN biosequence search application 50% 40% 30% 20% 10% 0% 128K 32K Size of hash table (Bytes) Function Time / Cycles Cache Hits / Misses Read Write .text main addQuery findMatch computeKey computeBase computeStep fillQuery Rnd Expand to measure cache hits/misses Measure Several Configurations Impact of D-cache Configuration 100 Total findMatch coreLoop 98 hit rate (%) 96 BLASTN biosequence search application 94 92 90 88 86 128K, 1Kx1 128K, 32Kx1 128K, 16Kx2 32K, 1Kx1 32K, 32Kx1 32K, 16Kx2 Size of hash table, D-cache configuration Impact of I-cache Configuration 35 Run time (secs) 30 1KB I-Cache 4KB I-Cache 25 BLASTN biosequence search application 20 15 10 5 0 128K 32K BLASTN hash table sizes (Bytes) Function .text main addQuery findMatch computeKey computeBase computeStep fillQuery Rnd Time / Cycles Cache Hits / Misses Read Write Pipeline Stalls Branch Predict Time for Single Run 100000 80000 10000 Time (sec) 1800 1000 100 10 1 SimpleScalar 3.0 LEON Almost 2 orders of magnitude faster than simulation Implications of Slow Simulation • Focus has historically been on measuring the performance of a single thread of a single application • Real apps are often executed in a multitasking environment – Impacts cache behavior – Ignores OS (system call) performance • Liquid architecture system enables direct measurement, including OS OS Boot Sequence Summary • Run-time reconfigurable processors will be available sooner rather than later • Determining desired configuration is a difficult design task – Large search space – Depends on accurate performance data • Liquid architecture system enables direct measurement of performance properties Current and Future Work • Evaluation of several arch. design ideas • Automated search of the design space • Characterizing performance analysis methods – Analytic models – Simulation models – Direct execution models • Usable as is for evaluating soft-core procs • Like to extend to higher-speed procs