Introduction to Embedded Systems
Rabie A. Ramadan
rabieramadan@gmail.com
http://www.rabieramadan.org/classes/2014/embedded/

Topics
• Embedded microprocessor market.
• Categories of CPUs.
• RISC, DSP, and multimedia processors.
• CPU mechanisms.

Demand for Embedded Processors
• Embedded processors account for over 97% of total processors sold.
• Sales are expected to increase by roughly 15% each year.

Evaluating Processors

Performance
• Latency: the time required to execute an instruction from start to finish.
• Throughput: the rate at which instructions are finished.
• At the program level, computer architects also speak of average performance or peak performance. Peak performance is often calculated assuming that instruction throughput proceeds at its maximum rate and all processor resources are fully utilized.
• Embedded system designers often talk about program performance in terms of worst-case (or sometimes best-case) performance. This is not simply a characteristic of the processor; it is determined for a particular program running on a given processor.

Cost
• The purchase price of the processor.
• In VLSI design, cost is often measured in terms of the silicon area required to implement a processor, which is closely related to chip cost.

Energy and power
• In modern processors, energy and power consumption must be measured for a particular program and data for accurate results.

Predictability
• An important characteristic for embedded systems.
• When designing real-time systems, we want to be able to predict execution time.
• More difficult to measure.

Security
• An important characteristic of all processors, including embedded processors.
• Security is inherently unmeasurable: the fact that we do not know of a successful attack on a system does not mean that such an attack cannot exist.
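
The difference between latency and throughput, the two performance measures defined earlier in this section, can be made concrete with a small calculation. The sketch below assumes an idealized pipeline with no stalls in which one instruction enters per cycle; the function names and the 5-stage, 10 ns figures are illustrative assumptions, not values from the slides.

```python
# Illustrative model of latency vs. throughput for an idealized pipeline.
# Assumptions (not from the slides): no stalls, one instruction enters per cycle.


def instruction_latency_ns(stages: int, cycle_ns: float) -> float:
    """Latency: time for one instruction to pass through every stage."""
    return stages * cycle_ns


def total_time_ns(stages: int, cycle_ns: float, n_instructions: int) -> float:
    """The first instruction needs `stages` cycles; each later instruction
    finishes one cycle after the one before it."""
    return (stages + n_instructions - 1) * cycle_ns


def throughput_per_ns(stages: int, cycle_ns: float, n_instructions: int) -> float:
    """Throughput: instructions finished per nanosecond over the whole run."""
    return n_instructions / total_time_ns(stages, cycle_ns, n_instructions)


if __name__ == "__main__":
    stages, cycle_ns = 5, 10.0  # hypothetical 5-stage pipeline, 10 ns cycle
    print("latency (ns):", instruction_latency_ns(stages, cycle_ns))
    for n in (1, 10, 1000):
        print(n, "instructions -> throughput per ns:",
              throughput_per_ns(stages, cycle_ns, n))
```

In this model the latency of a single instruction stays fixed at five cycles, while throughput rises toward one instruction per cycle as the run gets longer. This is why the two measures have to be reported separately.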
Basic Computer Architecture
[Figure: Von Neumann architecture: memory (instructions and data), processor (CU, ALU, registers), input unit, output unit]

Levels of Parallelism
• Bit level parallelism
  - Within arithmetic logic circuits
• Instruction level parallelism (ILP)
  - Multiple instructions execute per clock cycle
  - Pipelining (instruction, data)
  - Multiple issue: very long instruction word (VLIW)
• Memory system parallelism
  - Overlap of memory operations with computation
• Operating system parallelism
  - More than one processor
  - Multiple jobs run in parallel
  - Loop level
  - Procedure level

Flynn’s Taxonomy
• Single Instruction stream - Single Data stream (SISD)
• Single Instruction stream - Multiple Data stream (SIMD)
• Multiple Instruction stream - Single Data stream (MISD)
• Multiple Instruction stream - Multiple Data stream (MIMD)

Single Instruction stream - Single Data stream (SISD)
[Figure: Von Neumann architecture: memory supplies one instruction stream to the CU and one data stream to the ALU]

Single Instruction stream - Multiple Data stream (SIMD)
[Figure: one CU sends the same instruction to several processing elements (PEs), each exchanging its own data with memory]
• Instructions of the program are broadcast to more than one processor.
• Each processor executes the same instruction synchronously, but using different data.
• Used for applications that operate upon arrays of data.

Multiple Instruction stream - Multiple Data stream (MIMD)
• Each processor has a separate program.
• An instruction stream is generated for each program on each processor.
• Each instruction operates upon different data.
• Two memory organizations: shared memory and distributed memory.

Shared vs Distributed Memory
[Figure: distributed memory: processors (P), each with a local memory (M), connected by a network; shared memory: processors connected by a bus to one memory]
Distributed memory
• Each processor has its own local memory.
• Message-passing is used to exchange data between processors.
Shared memory
• Single address space.
• All processes have access to the pool of shared memory.

Distributed Memory
[Figure: four nodes, each with a memory (M), a processor (P), and a network interface (NI), connected by a network]
• Processors cannot directly access another processor’s memory.
• Each node has a network interface (NI) for communication and synchronization.
• Each processor executes different instructions asynchronously, using different data.

Shared Memory
[Figure: several CU/PE pairs, each with its own instruction and data streams, attached to one memory]
• Each processor executes different instructions asynchronously, using different data.
• Uniform memory access (UMA)
  - Each processor has uniform access to memory (symmetric multiprocessor, SMP).
• Non-uniform memory access (NUMA)
  - Time for memory access depends on the location of data.
  - Local access is faster than nonlocal access.
  - Easier to scale than SMPs.

Distributed Shared Memory
• Makes the main memory of a cluster of computers look as if it is a single memory with a single address space.
• Shared memory programming techniques can be used.

Multicore Systems
• Many general purpose processors
• GPU (Graphics Processing Unit)
• GPGPU (General Purpose GPU)
• Hybrid
The trend is:
• Board composed of multiple many-core chips sharing memory
• Rack composed of multiple boards
• A room full of these racks

Other axes of comparison
• RISC vs. CISC: instruction set style.
• Instruction issue width.
• Static vs. dynamic scheduling for multiple-issue machines.
• Scalar vs. vector processing.
• Single-threaded vs. multithreading.
A single CPU can fit into multiple categories.

RISC vs. CISC: Complex Instruction Set Computer (CISC)
• “High level” instruction set.
• A single instruction executes several “low level” operations, e.g., a load, an arithmetic operation, and a memory store.
• Examples: VAX, Intel x86, IBM 360/370.

Features of CISC
• Small number of general purpose registers.
• Instructions take multiple clocks to execute.
• Few lines of code per operation.

RISC vs. CISC: Reduced Instruction Set Computer (RISC)
• RISC is a CPU design that recognizes only a limited number of instructions.
• Simple instructions.
• Instructions are executed quickly.
• Examples: MIPS, DEC Alpha, Sun SPARC, IBM 801.

Features of RISC
• “Reduced” instruction set.
• Executes a series of simple instructions instead of a complex instruction.
• Most instructions execute within one clock cycle.
• Incorporates a large number of general registers for arithmetic operations to avoid storing variables on a stack in memory.
• Pipelining = speed.

Single issue versus multiple issue
• Instruction issue width is an important aspect of processor performance.
• Processors that can issue more than one instruction per cycle generally execute programs faster.
• They do so at the expense of increased power consumption and higher cost.

Static versus dynamic scheduling
• Static scheduling: instruction issue is determined when the program is written.
• Dynamic scheduling: the hardware determines at runtime which instructions are issued.
• Superscalar is a common technique for dynamic instruction issue (e.g., Tomasulo's algorithm).

Embedded vs. general-purpose processors
Embedded processors may be customized for a category of applications.
• Customization may be narrow or broad.
We may judge embedded processors using different metrics:
• Code size.
• Energy efficiency.
• Memory system performance.
• Predictability.

Embedded RISC processors
• RISC processors often have simple, highly-pipelinable instructions.
• Pipelines of embedded RISC processors have grown over time:
  - ARM7 has a 3-stage pipeline.
  - ARM9 has a 5-stage pipeline.
  - ARM11 has an 8-stage pipeline.

RISC processor families
ARM:
• ARM7 has in-order execution, and no memory management or branch prediction.
• ARM9 and ARM11 have out-of-order execution, memory management, and branch prediction.
MIPS:
• MIPS32 4K has a 5-stage pipeline.
• The 4KE family has a DSP extension.
• The 4KS is designed for security.
PowerPC:
• The PowerPC 400 series includes several embedded processors.
• Motorola and IBM offer superscalar versions of the PowerPC.

Embedded DSP Processors
• Embedded DSP processors are optimized to perform DSP algorithms such as speech coding, filtering, convolution, fast Fourier transforms, and discrete cosine transforms.
• A representative computation is the filter sum:

$$y(k) = \sum_{n=0}^{N} b_n \, x(k - n)$$

Embedded DSP Processors: example
• The AT&T DSP-16 was one of the first DSPs. It had an onboard multiplier and provided a multiply-accumulate instruction, dest = src1*src2 + src3, a common operation in digital signal processing.
• It was based on a Harvard architecture with separate data and instruction memories.
• Data accesses could rely on consistent bandwidth from the memory, which is particularly important for sampled-data systems.

Parallelism extraction
Static:
• Use compiler to analyze program.
• Simpler CPU.
• Can’t depend on data values.
• Very Long Instruction Word (VLIW).
Dynamic:
• Use hardware to identify opportunities.
• More complex CPU.
• Can make use of data values.
• Superscalar.

Very Long Instruction Word (VLIW)
• VLIW architectures are widely used in embedded systems because they provide instruction-level parallelism with relatively low hardware overhead.
• The execution unit includes a pool of function units connected to a large register file.
• The execution unit reads a packet of instructions; each instruction in the packet can control one of the function units in the machine.
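
The way a VLIW execution unit consumes a packet can be sketched as a small behavioral model. This is a sketch only: the tuple-based packet format, the `add` and `sub` operation names, and the `execute_packet` function are hypothetical, and the model assumes that every function unit reads its operands before any result is written back.

```python
# Minimal behavioral sketch of one VLIW execute step (illustrative, not a real ISA).
# A packet holds one operation per function unit; None stands for a NOP slot.


def execute_packet(regs, packet):
    """Execute all operations in `packet` against the same register snapshot."""
    snapshot = dict(regs)        # every function unit reads the old values
    results = {}
    for op in packet:
        if op is None:           # NOP: this function unit is idle this cycle
            continue
        name, dest, src1, src2 = op
        if name == "add":
            results[dest] = snapshot[src1] + snapshot[src2]
        elif name == "sub":
            results[dest] = snapshot[src1] - snapshot[src2]
        else:
            raise ValueError(f"unknown operation: {name}")
    regs.update(results)         # write-back happens after all reads
    return regs


if __name__ == "__main__":
    regs = {"r1": 0, "r2": 5, "r3": 7, "r4": 0, "r5": 9, "r6": 4}
    packet = [("add", "r1", "r2", "r3"), ("sub", "r4", "r5", "r6"), None]
    print(execute_packet(regs, packet))
```

The compiler, not this execution step, is responsible for filling the packet with operations that do not conflict. The model contains no hazard checking, which reflects the low hardware overhead described above.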
Simple VLIW architecture
• A large register file feeds multiple function units.
• Example instruction packet: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP
[Figure: execution box (E box) in which a register file feeds two ALUs, two load/store units, and one further function unit (FU)]

Clustered VLIW architecture
• Register file and function units are divided into clusters.
[Figure: two clusters, each with its own execution unit and register file, connected by a cluster bus]

Very Long Instruction Word (VLIW) Example 1: Trimedia family of processors
• Designed for use in video systems.
• Video algorithms often perform similar operations on several pixels at a time.

Very Long Instruction Word (VLIW) Example 2: Texas Instruments C6x VLIW DSP
• Onboard program and data RAM as well as standard devices and DMA.
• The processor core includes two clusters, each with the same configuration.
• Each register file holds 16 words.
• Each data path has eight function units: two load units, two store units, two data address units, and two register file cross paths.

Superscalar Processors
• Superscalar processors issue more than one instruction per clock cycle.
• Unlike VLIW processors, they check for resource conflicts on the fly to determine which combinations of instructions can be issued at each step.
• Superscalar processors are not as common in the embedded world, but they are used to some extent in embedded processors:
  - Embedded Pentium is two-issue in-order.
  - Some PowerPCs are superscalar.

SIMD Instructions and Parallelism
• Recent multimedia processors commonly support Single Instruction Multiple Data (SIMD) instructions.
• The same operation is performed on multiple data operands using a single instruction:

    A3      A2      A1      A0
  + B3      B2      B1      B0
  = A3+B3   A2+B2   A1+B1   A0+B0

• Exploits the low precision and high data parallelism of multimedia applications.

Thread-Level Parallelism
Hardware multithreading
• Alternately fetches instructions from separate threads.
• On one cycle, it fetches several instructions from one thread, fetching enough instructions to be able to keep the pipelines full.
• On the next cycle, it fetches instructions from another thread.

Simultaneous multithreading (SMT)
• Fetches instructions from several threads on each cycle rather than alternating between threads.

Better-Than-Worst-Case Design
• Digital systems are traditionally designed as synchronous systems governed by clocks.
• The clock period is determined by careful analysis so that values are stored into registers properly, with the clock period extended to cover the worst-case delay.
• The worst-case delay is relatively rare in many circuits, so the logic sits idle for some period most of the time.
• Better-than-worst-case design is an alternative design style in which logic detects and recovers from errors, allowing the circuit to run most of the time at a higher speed.
• The Razor architecture uses a specialized register that measures and evaluates errors:
  - The system register holds the latched value and is clocked at the higher-than-worst-case clock rate.
  - A separate register is clocked separately and slightly behind the system register.
  - If the results stored in the two registers are different, then an error occurred, probably due to timing.
  - An XOR gate detects that error and causes the value from the second register to replace the value in the system register.
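
The Razor-style register behavior described above can be approximated in software. The sketch below is a behavioral model under stated assumptions, not a description of the actual circuit: the logic output is assumed to switch from its old value to its new value exactly at `logic_delay`, the second (shadow) register is assumed to sample late enough to always capture the correct value, and all names and picosecond figures are hypothetical.

```python
# Behavioral sketch of a Razor-style double-sampling register (illustrative only).
# Times are integers in picoseconds; values are plain integers standing for bit patterns.


def sample(old_value, new_value, logic_delay, sample_time):
    """Value visible on the logic output at `sample_time`."""
    return new_value if logic_delay <= sample_time else old_value


def razor_capture(old_value, new_value, logic_delay, clock_period, shadow_lag):
    """Return (value held by the system register, whether an error was detected)."""
    # The system register samples at the aggressive (better-than-worst-case) clock edge.
    system = sample(old_value, new_value, logic_delay, clock_period)
    # The second register samples slightly later.
    shadow = sample(old_value, new_value, logic_delay, clock_period + shadow_lag)
    # XOR comparison: any differing bit means the system register latched too early.
    error = (system ^ shadow) != 0
    if error:
        system = shadow          # recover by replacing the stale value
    return system, error


if __name__ == "__main__":
    # Typical case: logic settles before the clock edge, so no error is flagged.
    print(razor_capture(0b1010, 0b0110, logic_delay=800,
                        clock_period=1000, shadow_lag=300))
    # Rare slow case: logic settles after the edge, so the error is detected.
    print(razor_capture(0b1010, 0b0110, logic_delay=1200,
                        clock_period=1000, shadow_lag=300))
```

In both calls the system register ends up holding the new value. Only the slow case reports an error, which in hardware would cost a recovery step. The design pays that cost rarely in exchange for a shorter clock period in the common case.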