Chapter 7, part 3: Hardware/Software Co-Design
High Performance Embedded Computing
Wayne Wolf
© 2007 Elsevier

Topics
- Multi-objective optimization.
- Co-synthesis for control.
- Co-synthesis for caches.
- Co-synthesis for reconfigurable platforms.
- Hardware/software co-simulation.
© 2006 Elsevier

Multi-objective optimization
- Operations research provides notions for optimization functions with multiple objectives.
- Pareto optimality: a solution is Pareto optimal if no objective can be improved without making another objective worse.

GOPS
- Feasibility factor is computed from throughput factors.
- Upper-bound throughput for RMS: [equation missing]
- Upper-bound throughput for EDF: [equation missing]

Upper bound feasibility
- Upper-bound feasibility tests: [equation missing]

Lower bound feasibility test
- Lower bound: [equation missing]

Feasibility factor
- Feasibility factor λ_P: [equation missing]
- Use the feasibility factor to prune the search space and as an optimization objective.

Genetic algorithms
- Modeled as:
  - Genes = strings of symbols.
  - Mutations = changes to strings.
- Types of moves:
  - Reproduction makes a copy of a string.
  - Mutation changes a string.
  - Crossover interchanges parts of two strings.

MOGAC
- Technology tables characterize hardware components.
- Genetic model:
  - Processing element allocation string lists all PEs and their types.
  - Task allocation string shows assignment of tasks to PEs.
  - Link allocation string maps communication to links.
  - IC allocation string maps tasks to chips.

MOGAC optimization procedure
- Forms initial solution.
- Repeats evolve/evaluate cycle.
- Evaluation determines noninferior solutions.
- Some noninferior solutions may not survive evolution.
- Clusters solutions to reduce run time.
[Dic98] © 1998 IEEE

MOGAC constraints
- nis(x): noninferior solutions in x.
- dom(a,b) = 1 if a is not dominated by b.
- Cluster rank: [equation missing]

Energy-aware task scheduling
- Yang et al. schedule multiprocessors for energy.
- Combine design-time and runtime:
  - At design time, the scheduler evaluates scheduling/allocation choices, optimizes with genetic algorithms, and generates a table.
  - At run time, heuristics use the table to choose the best scheduling/allocation pattern.

Co-synthesis for wireless
- Wireless systems are bandwidth and energy limited.
- COWLS uses parallel recombinative simulated annealing.
- Ranked by communication time, computation time, utilization.
- Scheduling influences both power consumption and timing.
- Slack determines idle time.

Control and I/O synthesis
- Control finite-state machine (CFSM) model describes control-dominated systems.
  - Event-driven model.
  - Finite, non-zero, unbounded reaction times.
- Implementations:
  - Hardware is logic guarded by latches.
  - Software is synthesized from an s-graph that models the control flow graph.
- Can be used as an intermediate representation for Esterel, etc.

Modal process model
- Chou et al. use modal models: I/O behavior depends on the current mode and on inputs.
- Abstract control types define control operations with known properties.

Interface synthesis
- Chou et al. represent I/O as control flow graphs.
  - Generate tasks, allocate I/O ports, split wide-word operations, use memory-mapped I/O where necessary, generate I/O sequencer.
- Daveau et al. synthesize communication by allocating operations to units in a library.
  - A communication unit must provide the required services, use the right protocol, and run at the required data rate.

Cache modeling for co-synthesis
- Cache state affects task execution time.
- Li and Wolf used a two-state model for processes in cache:
  - One execution time if the process is in the cache.
  - Another execution time if it is not in the cache.
- This model is more abstract than a cache line model.
[Li99] © 1999 IEEE

Co-synthesis with caches
- System cost: [equation missing]
- Hierarchical scheduling algorithm:
  - Schedule tasks (each one or more processes) over the hyperperiod.
  - Refine the schedule by moving processes within a task.
- Dynamic urgency models how a process uses the cache: [equation missing]

Wuytack et al.
Methodology for dynamic memory management:
1. Define application using abstract data types.
2. Refine ADTs into concrete data structures.
3. Virtual memory divided among several memory managers.
4. Split virtual memory segments into groups to parallelize data accesses.
5. Order background memory accesses to optimize bandwidth.
6. Allocate physical memories.

Co-synthesis for reconfigurable systems
- FPGA fabric can hold different accelerators at different times.
- Combinations of accelerators may be limited.
- Must take floorplan into account.
- Schedule must take reconfiguration time and energy into account.

CORDS
- CORDS uses evolutionary algorithms similar to MOGAC.
- Adds reconfiguration delay to costs based on current schedule state.
- Dynamic priority of a task depends on slack + reconfiguration delay.
- Increases dynamic priority of tasks with low reconfiguration time to group together several reconfigurations and save energy.

Nimble
- Performs fine-grained partitioning for instruction-level parallelism.
- Platform described in architecture description language.
- Program represented as control flow graph.
- Selects interesting parts of loops by analyzing control dependence graph.
[Li00] © 2000 IEEE

Hardware/software co-simulation
- Must connect models with different models of computation, different time scales.
- Simulation backplane manages communication.
- Becker et al. used PLI in Verilog-XL to add C code that communicates with software models, and UNIX networking to connect the hardware simulator.

Mentor Graphics Seamless
- Hardware modules described using standard HDLs.
- Software can be loaded as C or binary.
- Bus interface module connects hardware models to processor instruction set simulator.
- Coherent memory server manages shared memory.
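
Worked sketch: noninferior (Pareto-optimal) solutions
The multi-objective optimization and MOGAC slides rely on the idea of noninferior solutions: candidates that no other candidate dominates. The Python sketch below shows one plain way to compute that set. It is an illustration only, not MOGAC's actual evaluation code. The names `dominates`, `noninferior`, and `designs`, the two objectives (cost, power), and the assumption that every objective is minimized are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a dominates b, assuming all objectives are minimized.

    a dominates b when a is no worse than b in every objective and
    strictly better in at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def noninferior(solutions: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Hypothetical candidate designs as (cost, power) pairs; values are made up.
designs = [(10.0, 5.0), (8.0, 7.0), (12.0, 6.0), (9.0, 5.0)]
front = noninferior(designs)
print(front)
```

With these made-up values, (10, 5) is dominated by (9, 5) and (12, 6) is dominated by (10, 5), so only (8, 7) and (9, 5) remain. This pairwise comparison is quadratic in the number of candidates, which is acceptable for a small population but is one reason a real tool might cluster solutions first.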