PPT - ECE 751 Embedded Computing Systems

Lecture 8: Embedded Processor Issues
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf
High Performance Embedded Computing
© 2007 Elsevier
Topics




Bus encoding.
Security-oriented architectures.
CPU simulation.
Configurable processors.
© 2006 Elsevier
Bus encoding

Encode information on the bus to reduce toggles and dynamic energy consumption.
Estimate energy consumption by counting toggles.
Bus encoding is invisible to the rest of the architecture.
Some schemes transmit side information about the encoding.
[Figure labels: CPU, enc, encoded bus, dec, mem, side information.]
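A minimal sketch of toggle counting as an energy proxy. The function name `count_toggles` and the assumption that every bus line starts at 0 are illustrative, not taken from the slides.

```python
def count_toggles(values, width=32):
    """Count bit transitions between successive words driven on a bus."""
    mask = (1 << width) - 1
    toggles = 0
    previous = 0  # assumption: every bus line starts at 0
    for value in values:
        value &= mask
        # XOR marks the lines that change; count the set bits.
        toggles += bin(previous ^ value).count("1")
        previous = value
    return toggles


if __name__ == "__main__":
    print(count_toggles([0b0000, 0b1111, 0b1110], width=4))
```

Comparing this count for an unencoded and an encoded word sequence gives a rough first estimate of what an encoding saves. Any side-information lines should be counted as well.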
Bus-invert coding



Stan and Burleson: take advantage of correlation between successive bus values.
Send either the true or the complement form of each bus value, whichever minimizes toggles.
 Why might this approach work well?
The bus can be broken into fields, with bus-invert coding applied to each field.
 How might the bus be divided?
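The true/complement choice can be sketched as follows. This is a simplified illustration, assuming one extra invert line as side information and a bus that starts at 0. The names `bus_invert_encode` and `bus_invert_decode` are hypothetical.

```python
def popcount(x):
    return bin(x).count("1")


def bus_invert_encode(values, width=8):
    """Return (bus_word, invert_flag) pairs for a sequence of data words."""
    mask = (1 << width) - 1
    bus = 0  # assumption: initial bus state is all zeros
    encoded = []
    for value in values:
        value &= mask
        # If more than half of the lines would toggle, send the complement.
        if popcount(bus ^ value) > width // 2:
            bus, invert = value ^ mask, 1
        else:
            bus, invert = value, 0
        encoded.append((bus, invert))
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words from (bus_word, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [word ^ mask if invert else word for word, invert in encoded]


if __name__ == "__main__":
    data = [0x00, 0xFF, 0x0F, 0xF0]
    enc = bus_invert_encode(data)
    print(enc, bus_invert_decode(enc) == data)
```

Applying the same functions to each field of a wider bus, with one invert line per field, gives the partitioned variant mentioned above.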
Working zone encoding

Musoll et al.:
Used to encode address buses.
Uses the observation that most of a program's execution time is spent in a small range of addresses.
Divides addresses into sets called working zones.
An address in a working zone is sent as an offset from the zone's base in a one-hot code.
Why is a one-hot code used?
Addresses that are not in a working zone have their entire value sent.
Compared to bus-invert coding, what would you expect to be the advantages and disadvantages of this approach?
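A heavily simplified sketch of the hit/miss behavior described above, not the published scheme. It assumes a fixed number of zone base registers, round-robin replacement on a miss, and offsets limited to `offset_bits` so they fit a one-hot word. All names are hypothetical.

```python
def wze_encode(addresses, num_zones=2, offset_bits=8):
    """Encode addresses as ('hit', zone, one_hot_offset) or ('miss', zone, address)."""
    zones = [None] * num_zones  # base address of each working zone
    next_victim = 0             # assumption: round-robin zone replacement
    out = []
    for addr in addresses:
        for idx, base in enumerate(zones):
            if base is not None and 0 <= addr - base < offset_bits:
                # In a zone: send only the offset, as a one-hot word.
                out.append(("hit", idx, 1 << (addr - base)))
                break
        else:
            # Not in any zone: send the full address and start a new zone there.
            zones[next_victim] = addr
            out.append(("miss", next_victim, addr))
            next_victim = (next_victim + 1) % num_zones
    return out


if __name__ == "__main__":
    for item in wze_encode([0x1000, 0x1001, 0x8000, 0x1003, 0x8001]):
        print(item)
```

Any two distinct one-hot words differ in exactly two bit positions, which is worth keeping in mind when weighing the question above.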
Address bus encoding




Benini et al.: cluster correlated address bits, then encode the clusters.
Compute correlation coefficients of transition variables to determine clusters.
Clusters must not become too large, since this can increase encode/decode logic.
Use logic synthesis to design encoders and decoders for each cluster.
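A minimal sketch of the correlation step. It assumes a transition variable is 1 when an address bit toggles between successive addresses, and it uses the ordinary Pearson coefficient. Both are assumptions; the slides do not give the exact definition used by Benini et al.

```python
from math import sqrt


def transition_bits(addresses, bit):
    """1 where `bit` toggles between successive addresses, else 0."""
    return [((a ^ b) >> bit) & 1 for a, b in zip(addresses, addresses[1:])]


def correlation(xs, ys):
    """Pearson correlation coefficient; 0.0 if either sequence is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    trace = [0, 3, 3, 0, 0, 3]
    t0, t1 = transition_bits(trace, 0), transition_bits(trace, 1)
    print(correlation(t0, t1))
```

Bits whose transition variables are strongly correlated are candidates for the same cluster. The clustering itself and the synthesized encoders are outside this sketch.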
Benini et al. results

What important tradeoffs of the address encoding
technique are not shown in the table below?
[Ben98] © 1998 IEEE
Dictionary-based encoding




Takes advantage of the observation that many values are repeated on buses.
Divides the bus into three parts.
Only the upper bits of the bus are stored in the dictionary; they are matched against the dictionary entry selected by the index part.
When the upper bits match, they are put in a high-Z state and only the remaining bits are sent; otherwise all bits are sent.
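A minimal sketch of the match-or-send-all behavior. It assumes the three parts are an upper field, an index field, and a low field that is always sent, with illustrative widths of 16, 4, and 12 bits. The field layout and the name `dictionary_encode` are assumptions, not details from Lv et al.

```python
def dictionary_encode(words, low_bits=12, index_bits=4):
    """Return (hit, upper_or_None, index, low) tuples for each bus word."""
    low_mask = (1 << low_bits) - 1
    index_mask = (1 << index_bits) - 1
    dictionary = {}  # index -> most recently stored upper bits
    sent = []
    for word in words:
        low = word & low_mask
        index = (word >> low_bits) & index_mask
        upper = word >> (low_bits + index_bits)
        if dictionary.get(index) == upper:
            # Hit: upper lines are left undriven (high-Z), so nothing is sent there.
            sent.append((True, None, index, low))
        else:
            # Miss: all bits are sent and the dictionary entry is updated.
            dictionary[index] = upper
            sent.append((False, upper, index, low))
    return sent


if __name__ == "__main__":
    for item in dictionary_encode([0xABCD1234, 0xABCD1678, 0x12341234]):
        print(item)
```

The decoder would keep an identical dictionary and substitute the stored upper bits whenever a hit is signaled.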
Lv et al. dictionary-based architecture
[Lv03] © 2003 IEEE
Lv et al. energy savings
[Lv03] © 2003 IEEE
Security-oriented architectures

There are a variety of security attacks:
Typical desktop/server attacks, such as Trojan horses and viruses.
Physical access allows side-channel attacks.
Cryptographic instruction sets have been developed for several architectures.
Embedded system architectures must add protection for side effects and consider energy consumption.
Secure architectures

SmartMIPS and ARM SecureCore offer security extensions.
These include encryption instructions, specialized memory management units, etc.
SAFE-OPS:
Designed to protect against software modification.
The compiler embeds a watermark into the code based on register assignment.
An FPGA accelerator checks the validity of the watermark during execution.
Power attacks


Kocher et al.: An adversary can observe power consumption at the pins and deduce data and instructions within the CPU.
Yang et al.: Dynamic voltage/frequency scaling (DVFS) can be used as a countermeasure.
[Yan05] © 2005 ACM Press
CPU simulation





Performance vs. energy/power simulation.
Temporal accuracy.
Trace vs. execution.
Simulation vs. direct execution.
Simulate using appropriate benchmarks for embedded systems:
Don’t use SPEC CPU benchmarks!
Embedded benchmarks include EEMBC, MediaBench, and MiBench.
Benchmarks often should be domain-specific.
Trace-based analysis



Instrumentation generates side information.
PC sampling checks the PC value during execution.
Can measure control flow and memory accesses.
Program counter (PC) sampling


Example: Unix prof.
Interrupts are used to sample the PC periodically.
Must run on the platform.
Doesn’t provide a complete trace.
Subject to sampling problems: undersampling, periodicity problems.
Generates a call-graph report that indicates the percentage of execution time spent in each function.
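The periodicity problem can be illustrated with a toy model. This sketch assumes a per-cycle list of function names stands in for the PC and that a sample is taken every `period` cycles. It does not model how prof itself is implemented.

```python
from collections import Counter


def sample_pc(pc_trace, period):
    """Histogram of the PC values observed once every `period` cycles."""
    return Counter(pc_trace[period - 1::period])


if __name__ == "__main__":
    # Hypothetical program: a 4-cycle loop spending 3 cycles in f and 1 in g.
    trace = ["f", "f", "f", "g"] * 25
    print(sample_pc(trace, 4))  # sampling in lockstep with the loop
    print(sample_pc(trace, 3))  # sampling out of step with the loop
```

When the sampling period matches the loop period, every sample lands on the same instruction and the profile attributes all of the time to `g`, even though `f` accounts for three quarters of the cycles.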
Program instrumentation


Example: dinero.
Modify the program to write trace information.
Track entry into basic blocks.
Requires editing object files.
Provides a complete trace.
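A source-level sketch of basic-block instrumentation. Real tools typically insert the trace calls into object code, as noted above; here the calls are written by hand. The block labels `B0` to `B3` and the helper `enter_block` are hypothetical.

```python
trace = []  # complete record of basic-block entries


def enter_block(block_id):
    """Trace hook that an instrumentation tool would insert at each block entry."""
    trace.append(block_id)


def instrumented_sum_of_evens(values):
    enter_block("B0")          # function entry
    total = 0
    for v in values:
        enter_block("B1")      # loop body
        if v % 2 == 0:
            enter_block("B2")  # branch taken
            total += v
    enter_block("B3")          # function exit
    return total


if __name__ == "__main__":
    print(instrumented_sum_of_evens([1, 2, 4]))
    print(trace)
```

Unlike sampling, the resulting trace records every block entry in order, at the cost of slowing the program down.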
Microarchitecture-modeling simulators

Varying levels of detail:
An instruction scheduler is not cycle-accurate.
A cycle timer is cycle-accurate.
Can simulate for performance or energy/power.
Typically written in a general-purpose programming language (e.g., C), not a hardware description language.
Cycle-accurate simulator
Models the microarchitecture.
Simulating one instruction requires executing routines for instruction fetch, decode, execute, etc.
Models pipeline state.
Microarchitectural registers are exposed to the simulator.
[Figure labels: I-box, IR, PC, reg.]

Trace-based vs. execution-based

Trace-based:
Gather the trace first, then generate timing information.
Basic timing information is simpler to generate.
Full timing information may require regenerating information from the original execution.
Requires owning the platform.
Execution-based:
Simulator fully executes the instruction.
Requires a more complex simulator.
Requires explicit knowledge of the microarchitecture, not just instruction execution times.
Power simulation


Model capacitance in the processor.
Keep track of activity in the processor.
Requires full simulation.
Activity determines capacitive charge/discharge, which determines power consumption.
CPU power simulators include:
SimplePower and Wattch for embedded general-purpose processors.
Trimaran with EPIC Explorer for embedded VLIW processors.
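A minimal sketch of an activity-based energy estimate. It assumes each transition on a unit dissipates (1/2)·C·Vdd², with one lumped capacitance per unit. The unit names and values are made up for illustration and do not describe how SimplePower or Wattch model a processor.

```python
def dynamic_energy(activity, capacitance, vdd):
    """Sum 0.5 * C * Vdd^2 over all recorded transitions.

    activity:    dict mapping unit name -> number of transitions (from simulation)
    capacitance: dict mapping unit name -> lumped switched capacitance in farads
    vdd:         supply voltage in volts
    """
    return sum(
        0.5 * capacitance[unit] * vdd ** 2 * transitions
        for unit, transitions in activity.items()
    )


if __name__ == "__main__":
    activity = {"alu": 1000, "regfile": 500}          # hypothetical counts
    capacitance = {"alu": 2e-12, "regfile": 1e-12}    # hypothetical values
    print(dynamic_energy(activity, capacitance, vdd=1.0))
```

Dividing the energy by the simulated execution time gives an average power figure. The activity counts are why a full simulation is needed.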
Automated CPU design

Customize aspects of the CPU for the application:
Instruction set.
Functional units.
Memory system (including register files).
Buses, I/O, and peripherals.
Tools help design and implement custom CPUs.
FPGAs make it easier to implement custom CPUs.
An application-specific instruction processor (ASIP) has a custom instruction set.
A configurable processor is generated by a tool set.
Techniques



Architecture optimization tools help choose the instruction set and microarchitecture.
Configuration tools implement the microarchitecture (and perhaps the compiler).
Early example: MIMOLA [1984] analyzed programs, created the microarchitecture and instructions, and synthesized logic.
CPU configuration process
Tensilica configuration options
© 2004 Tensilica
Tensilica EEMBC comparison
© 2004 Tensilica
Tensilica energy consumption by subsystem
© 2006 Tensilica
Toshiba MePcore
LISA language
[Hof01] © 2001 IEEE
LISA descriptions and generation




Memory model includes registers and other memories.
Uses clause binds operations to hardware.
Timing is specified by PIPELINE, IN, ACTIVATION, ENTITY.
Generates a hierarchical VHDL design.
PEAS-III

Synthesis driven by:
Architectural parameters such as number of pipeline stages.
Declaration of functional units.
Instruction format definitions.
Interrupt conditions and timing.
Micro-operations for instructions and interrupts.
Generates both simulation and synthesis models in VHDL.
Instruction set synthesis


Generate the instruction set from the application program and other requirements.
Sun et al. analyzed the design space for a simple BYTESWAP() program.
[Sun04] © 2004 IEEE
Complex function definition

Atasu et al. try to combine many operations into an instruction:
Disjoint operator graphs.
Multi-output instructions.
Operator graph must be convex: a value cannot leave, then re-enter, the instruction.
The textbook discusses several other approaches.
[Ata03] © 2003 ACM Press
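The convexity constraint can be checked on a dataflow graph with a short sketch. It assumes the graph is a DAG given as a successor map and treats a candidate instruction as a set of nodes. This illustrates the constraint only; it is not the search algorithm of Atasu et al.

```python
def reach(adjacency, starts):
    """All nodes reachable from `starts` in one or more steps."""
    seen, stack = set(), list(starts)
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_convex(successors, selected):
    """True if no path leaves `selected` and later re-enters it."""
    selected = set(selected)
    predecessors = {}
    for node, succs in successors.items():
        for s in succs:
            predecessors.setdefault(s, []).append(node)
    downstream = reach(successors, selected)
    upstream = reach(predecessors, selected)
    # An outside node that is both fed by and feeds the selection breaks convexity.
    return not ((downstream & upstream) - selected)


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": []}
    print(is_convex(graph, {"a", "c"}), is_convex(graph, {"a", "b"}))
```

In the example, selecting `a` and `c` without `b` is rejected because the value produced by `a` leaves the candidate instruction through `b` and re-enters at `c`.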
Limited-precision arithmetic




Fang et al. used affine arithmetic to analyze the numerical characteristics of algorithms.
Mahlke et al. synthesize variable bit-width architectures given bit-width requirements.
Cluster operations to find a small number of distinct bit widths.
What advantages and disadvantages might this approach have?
[Mah01] © 2001 IEEE