RT and Embedded Systems, Lecture 6: ARM cores, part 2. Wrocław 2013

Pipelining

What Is A Pipeline?
• Pipelining is used by virtually all modern microprocessors to enhance performance by overlapping the execution of instructions.
• In terms of a pipeline within a CPU, each instruction is broken up into different stages.
• Ideally, if each stage is balanced (all stages are ready to start at the same time and take an equal amount of time to execute), the time taken per instruction (pipelined) is:
  Time per instruction (unpipelined) / Number of stages
• If the stages of a pipeline are not balanced and one stage is slower than another, the throughput of the entire pipeline is affected.
• In terms of a CPU, pipelining reduces the average instruction time and therefore the average CPI (Clocks Per Instruction).
• Example: if each instruction in a microprocessor takes 5 clock cycles (unpipelined) and we have a 4-stage pipeline, the ideal average CPI with the pipeline will be 1.25.

Classical 5-stage pipeline
• Usually we have a 5-stage pipeline:
  • Instruction Fetch Cycle
  • Instruction Decode/Register Fetch Cycle
  • Execution Cycle
  • Memory Access Cycle
  • Write-Back Cycle

Instruction Fetch (IF) Cycle
• During the IF cycle the processor fetches an instruction from instruction memory.

Instruction Decode (ID)/Register Fetch Cycle
• The instruction is decoded and, at the same time, the values of the registers involved are read.
• As the registers are being read, the equality test is done in case the instruction decodes as a branch or jump.
• The instruction can be decoded in parallel with reading the registers because the register addresses are at fixed locations in the instruction.

Execution (EX)/Effective Address Cycle
• If a branch or jump did not occur in the previous cycle, the arithmetic logic unit (ALU) can execute the instruction.
• At this point the instruction falls into three different types:
  • Memory Reference: the ALU adds the base register and the offset to form the effective address.
  • Register-Register: the ALU performs the arithmetic, logical, etc. operation as per the opcode.
  • Register-Immediate: the ALU performs the operation based on the register and the (sign-extended) immediate value.

Memory Access (MEM) Cycle
• If a load, the effective address computed in the previous cycle is used and memory is read. The actual data transfer to the register does not occur until the next cycle.
• If a store, the data from the register is written to the effective address in memory.

Write-Back (WB) Cycle
• Occurs with Register-Register ALU instructions or load instructions.
• Whether the operation is a register-register operation or a memory load, the resulting data is simply written to the appropriate register.

Problems With The Previous Figure
• The memory is accessed twice during each clock cycle. This problem is avoided by using separate data and instruction caches.
• It is important to note that if the clock period is the same for a pipelined and a non-pipelined processor, the memory of the pipelined one must work five times faster.
• Another problem is that the registers are accessed twice every clock cycle. To avoid a resource conflict we perform the register write in the first half of the cycle and the read in the second half.
• We write in the first half so that the written value can be read by another instruction further down the pipeline within the same cycle.
• A third problem arises from the interaction of the pipeline with the PC. We use an adder to increment the PC by the end of IF. Within ID we may branch and modify the PC. How does this affect the pipeline?

Pipeline Hazards
• The performance gain from pipelining occurs because we can start the execution of a new instruction each clock cycle.
In a real implementation this is not always possible.
• Another important note: in a pipelined processor, a particular instruction still takes at least as long to execute as in a non-pipelined one.
• Pipeline hazards prevent the next instruction from executing during its designated clock cycle.

Types Of Hazards
• There are three types of hazards in a pipeline:
  • Structural Hazards: created when the datapath hardware in the pipeline cannot support all of the overlapped instructions in the pipeline.
  • Data Hazards: occur when an instruction in the pipeline affects the result of another instruction in the pipeline.
  • Control Hazards: caused by the pipelining of branches and other instructions that change the PC.

A Hazard Will Cause A Pipeline Stall
• We can look at pipeline performance in terms of CPI and clock cycle time:
  Speedup = (CPI unpipelined / CPI pipelined) × (Clock cycle time unpipelined / Clock cycle time pipelined)
• In the ideal case, accounting for stalls:
  Speedup = Pipeline Depth / (1 + Pipeline stalls per instruction)

Dealing With Structural Hazards
• Structural hazards result from the CPU datapath not having enough resources to service all of the overlapping instructions.
• Suppose a processor can only either read or write the registers in one clock cycle. This would cause a problem during the ID and WB stages.
• Assume that there are no separate instruction and data caches, and only one memory access can occur during one clock cycle. A hazard would be caused during the IF and MEM cycles.
• A structural hazard is dealt with by inserting a stall or pipeline bubble into the pipeline. This means that for that clock cycle, nothing happens for that instruction. This effectively "slides" that instruction, and subsequent instructions, by one clock cycle.
• This effectively increases the average CPI.
Dealing With Structural Hazards (cont'd)
• Speedup = (CPI no hazard / CPI hazard) × (Clock cycle time no hazard / Clock cycle time hazard)
• Speedup = (1 / (1 + 0.4 × 1)) × (1 / (1/1.05)) = 1.05 / 1.4 = 0.75
• We can see that even though the clock of the processor with the hazard is a little faster, the speedup is still less than 1.
• Therefore the hazard has quite an effect on the performance.
• Sometimes computer architects will opt to design a processor that exhibits a structural hazard. Why?
  • A: The improvement to the processor datapath is too costly.
  • B: The hazard occurs rarely enough that the processor will still perform to specifications.

Data Hazards (A Programming Problem?)
• We haven't looked at assembly programming in detail at this point.
• Consider the following operations:
  DADD R1, R2, R3
  DSUB R4, R1, R5
  AND  R6, R1, R7
  OR   R8, R1, R9
  XOR  R10, R1, R11
• Pipeline registers: what are the problems?

Data Hazard Avoidance
• In this trivial example the programmer cannot be expected to reorder his/her operations, assuming this is the only code we want to execute.
• Data forwarding can be used to solve this problem.
• To implement data forwarding we need to bypass the pipeline register flow:
  – Output from the EX/MEM and MEM/WB stages must be fed back into the ALU input.
  – We need routing hardware that detects when the next instruction depends on the write of a previous instruction.

General Data Forwarding
• It is easy to see how data forwarding can be used by drawing out the pipelined execution of each instruction.
• Now consider the following instructions:
  DADD R1, R2, R3
  LD   R4, 0(R1)
  SD   R4, 12(R1)

Problems
• Can data forwarding prevent all data hazards? NO!
• The following operations will still cause a data hazard. This happens because the further down the pipeline we get, the less we can use forwarding:
  LD   R1, 0(R2)
  DSUB R4, R1, R5
  AND  R6, R1, R7
  OR   R8, R1, R9
• We can avoid the hazard by using a pipeline interlock.
• The pipeline interlock will detect when data forwarding will not be able to get the data to the next instruction in time.
• A stall is introduced until the instruction can get the appropriate data from the previous instruction.

Control Hazards
• Control hazards are caused by branches in the code.
• Remember that during the IF stage the PC is incremented by 4 in preparation for the IF cycle of the next instruction.
• What happens if a branch is performed and we aren't simply incrementing the PC by 4?
• The easiest way to deal with the occurrence of a branch is to perform the IF stage again once the branch occurs.

Performing IF Twice
• We take a big performance hit by performing the instruction fetch again whenever a branch occurs. Note that this happens whether or not the branch is taken. This guarantees that the PC will get the correct value.
  branch:           IF ID EX MEM WB
  next instruction:    IF IF ID EX MEM WB
• This method will work, but as always in computer architecture we should try to make the most common case fast and efficient.
• By performing IF twice we will encounter a performance hit of between 10% and 30%.
• Next class we will look at some methods for dealing with control hazards.

Control Hazards (other solutions)
• What if every branch is treated as "not taken"? Then not only are the registers read during ID, but an equality test is also performed to determine whether the branch must be taken or not.
• The performance can be improved by assuming that the branch will not be taken.
• The complexity arises when the branch evaluates and we end up needing to actually take the branch.
• If the branch is actually taken, then the pipeline needs to be cleared of any code loaded in from the "not-taken" path.
• Likewise, it can be assumed that the branch is always taken.
• The next method for dealing with a control hazard is to implement a "delayed" branch scheme.
• In this scheme an instruction that is useful and not dependent on whether the branch is taken or not is inserted into the pipeline. It is the job of the compiler to determine the delayed branch instruction.

How To Implement a Pipeline

Multi-clock Operations
• Sometimes operations require more than one clock cycle to complete. Examples are:
  • Floating Point Multiply
  • Floating Point Divide
  • Floating Point Add

Dependences and Hazards
• Types of data hazards:
  – RAW: read after write
  – WAW: write after write
  – WAR: write after read
• The RAW hazard was already shown. WAW hazards occur due to output dependence.
• WAR hazards do not usually occur in the simple pipeline because of the amount of time between the read cycle and the write cycle.

Dynamic Scheduling
• In a statically scheduled pipeline the instructions are fetched and then issued. If the user's code has a data or control dependence, it is hidden by forwarding.
• If the dependence cannot be hidden, a stall occurs.
• Dynamic scheduling is an important technique in which both the dataflow and the exception behavior of the program are maintained.
• Data dependence can cause stalling in a pipeline that has "long" execution times for instructions that have dependencies.
• Example (.D denotes floating point):
  DIV.D F0, F2, F4
  ADD.D F10, F0, F8
  SUB.D F12, F8, F14
• Longer execution times of certain floating-point operations give the possibility of WAW and WAR hazards. Example:
  DIV.D F0, F2, F4
  ADD.D F6, F0, F8
  SUB.D F8, F10, F14
  MUL.D F6, F10, F8
• If we want to execute instructions out of order in hardware (when they are not dependent, etc.), we need to modify the ID stage of the 5-stage pipeline.
• Split ID into the following stages:
  – Issue: decode instructions, check for structural hazards.
  – Read Operands: wait until there are no data hazards, then read operands.
• IF still precedes ID and will store the fetched instruction into a register or queue.
Branch Prediction In Hardware
• Data hazards can be overcome by dynamic hardware scheduling; control hazards also need to be addressed.
• Branch prediction is extremely useful for repetitive branches, such as loops.
• A simple branch predictor can be implemented using a small amount of memory indexed by the lower-order bits of the address of the branch instruction.
• The memory only needs to contain one bit, representing whether the branch was recently taken or not.
• If the branch is taken, the bit is set to 1. The next time the branch instruction is fetched we will know that the branch occurred and we can assume that the branch will be taken.
• This scheme adds some "history" to our previous discussion of "branch taken" and "branch not taken" control hazard avoidance.

2-bit Prediction Scheme
• This method is more reliable than using a single bit to represent whether the branch was recently taken or not.
• The use of a 2-bit predictor will allow branches that favor taken (or not taken) to be mispredicted less often than in the one-bit case.

Branch Predictors
• Increasing the size of a branch predictor memory only improves its effectiveness so much.
• We also need to address the effectiveness of the scheme used. Just increasing the number of bits in the predictor doesn't do very much either.
• Some other predictors include:
  – Correlating Predictors
  – Tournament Predictors
• Correlating predictors use the history of a local branch AND some global information on how branches are executing to decide whether the branch will be taken or not.
• Tournament predictors are even more sophisticated: they use multiple predictors, local and global, and enable them with a selector to improve accuracy.
ARM cores, part 2

Plan
• ARM9
• AMBA
• Cortex-M
• Cortex-R

ARM9 (Source: [2])

ARM9 – features
• Over 5 billion ARM9 processors have been shipped so far.
• The ARM9 family is the most popular ARM processor family ever.
• 250+ silicon licensees.
• 100+ licensees of the ARM926EJ-S processor.
• ARM9 processors continue to be successfully deployed across a wide range of products and applications.
• The ARM9 family offers proven, low-risk and easy-to-use designs which reduce costs and enable rapid time to market.
• The ARM9 family consists of three processors: ARM926EJ-S, ARM946E-S and ARM968E-S.

ARM9 – family features
• Main features:
  – Based on the ARMv5TE architecture
  – Efficient 5-stage pipeline for faster throughput and system performance
  – Fetch/Decode/Execute/Memory/Writeback
  – Supports both ARM and Thumb® instruction sets
  – Efficient ARM-Thumb interworking allows an optimal mix of performance and code density
  – Harvard architecture – separate instruction & data memory interfaces
  – Increased available memory bandwidth
  – Simultaneous access to I & D memory
  – Improved performance
  – 31 x 32-bit registers
  – 32-bit ALU & barrel shifter
  – Enhanced 32-bit MAC block

ARM9 – DSP enhancements
• Single-cycle 32x16 multiplier implementation
• Speeds up all multiply instructions
• Pipelined design allows one 16x16 or 32x16 multiply to start each cycle
• New 32x16 and 16x16 multiply instructions
• Allow independent access to the 16-bit halves of registers
• Give efficient use of 32-bit bandwidth for packed 16-bit operands
• The ARM ISA provides 32x32 multiply instructions
• Efficient fractional saturating arithmetic: QADD, QSUB, QDADD, QDSUB
• Count leading zeros instruction: CLZ, for faster normalisation and division

ARM9 – features comparison (Source: [3])

AMBA
• AMBA – Advanced Microcontroller Bus Architecture
• AMBA is an on-chip communications standard for designing high-performance embedded microcontrollers, introduced by ARM in 1996.
• A few versions of buses:
  – AHB (Advanced High-performance Bus)
  – ASB (Advanced System Bus)
  – APB (Advanced Peripheral Bus)
  – AXI (Advanced eXtensible Interface)

AMBA – first specification
• Buses defined:
  – Advanced System Bus (ASB)
  – Advanced Peripheral Bus (APB)

AMBA 2 – specification
• Buses defined:
  – Advanced High-performance Bus (AHB) – widely used on ARM7-, ARM9- and ARM Cortex-M-based designs
  – Advanced System Bus (ASB)
  – Advanced Peripheral Bus (APB2 or APB)

AMBA 3 – specification
• Buses defined:
  – Advanced eXtensible Interface (AXI3 or AXI v1.0) – widely used on ARM Cortex-A processors including the Cortex-A9
  – Advanced High-performance Bus Lite (AHB-Lite v1.0)
  – Advanced Peripheral Bus (APB3 v1.0)
  – Advanced Trace Bus (ATB v1.0)

AMBA 4 – specification
• Buses defined:
  – AXI Coherency Extensions (ACE) – widely used on the latest ARM Cortex-A processors including the Cortex-A7 and Cortex-A15
  – AXI Coherency Extensions Lite (ACE-Lite)
  – Advanced eXtensible Interface 4 (AXI4)
  – Advanced eXtensible Interface 4 Lite (AXI4-Lite)
  – Advanced eXtensible Interface 4 Stream (AXI4-Stream v1.0)
  – Advanced Trace Bus (ATB v1.1)
  – Advanced Peripheral Bus (APB4 v2.0)

APB
• APB – designed for low-power system modules, for example register interfaces on system peripherals
• Optimized for minimal power consumption and reduced interface complexity to support peripheral functions
• It has to support 32-bit and 66 MHz signals.
ASB
• ASB – designed for high-performance system modules
• An alternative system bus suitable for use where the high-performance features of AHB are not required
• Also supports the efficient connection of:
  – processors,
  – on-chip memories,
  – off-chip external memory interfaces
  with low-power peripheral macrocell functions

AHB
• AHB – designed for high-performance, high-clock-frequency system modules
• Acts as the high-performance system backbone bus
• Supports the efficient connection of:
  – processors,
  – on-chip memories,
  – off-chip external memory interfaces
  with low-power peripheral macrocell functions
• Features:
  – single-edge clock protocol
  – split transactions
  – several bus masters
  – burst transfers
  – pipelined operations
  – single-cycle bus master handover
  – non-tristate implementation
  – large bus widths (64/128 bit)

AHB-Lite
• AHB-Lite is a subset of AHB
• This subset simplifies the design for a bus with a single master

AXI
• AXI – designed for high-performance, high-clock-frequency system modules with low latency
• Enables high-frequency operation without using complex bridges
• Provides flexibility in the implementation of interconnect architectures
• Is backward-compatible with existing AHB and APB interfaces
• Features:
  – separate address/control and data phases
  – support for unaligned data transfers using byte strobes
  – burst-based transactions with only the start address issued
  – issuing of multiple outstanding addresses with out-of-order responses
  – easy addition of register stages to provide timing closure
Typical AMBA system

Cortex-M (Source: [2])

Cortex family
• The Cortex family is currently being strongly promoted on the market by ARM.
• The Cortex family consists of three subfamilies:
  – Cortex-M – cores for microcontrollers and cost-sensitive applications; Thumb-2 instructions supported
  – Cortex-R – cores for real-time system applications; ARM, Thumb and Thumb-2 instructions supported
  – Cortex-A – the most complex and the most powerful cores, for multimedia devices and application processors; ARM, Thumb and Thumb-2 instructions supported (Source: [4])

Cortex-M
• Main features:
  – 32-bit processor
  – 3-stage pipeline
  – Thumb-2 instruction set – concise and efficient code
  – Many power-saving modes and domains
  – Nested Vectored Interrupt Controller – well-defined interrupt timing and invocation methods, RTOS support
  – Debugger support (JTAG, SWD – Serial Wire Debug) (Source: [2])

Cortex-M0/M0+ (Source: [2])

Cortex-M0
• Main features:
  – The smallest version of the ARM cores
  – The most power-saving version of the ARM cores – only about 85 µW/MHz
  – Upward compatibility with Cortex-M3
  – Only 12000 gates
  – Only 56 C-optimized instructions
  – Support for low-power wireless communication: Bluetooth Low Energy (BLE), ZigBee, etc.
  – Performance: 0.9 DMIPS/MHz
  – Single-cycle 32x32 multiply instructions
  – Interrupt execution latency: 16 cycles (Source: [2])

Cortex-M0
• Processor modes:
  – Thread mode: used to execute application software. The processor enters Thread mode when it comes out of reset.
  – Handler mode: used to handle exceptions. The processor returns to Thread mode when it has finished all exception processing.
Cortex-M0 – core registers
Cortex-M0 – memory map
Cortex-M0 – vector table
Cortex-M0 – register stacking (Source: [2])

Cortex-M1
• Main features:
  – Core destined for FPGA applications
  – Support for Actel, Altera and Xilinx chips
  – Easy migration from FPGA (development) to ASIC (production) (Source: [2])

Cortex-M1
• Main features:
  – A general-purpose 32-bit microprocessor which executes the ARMv6-M subset of the Thumb-2 instruction set and offers high-performance operation and small size in FPGAs.
  – It has:
    • a three-stage pipeline
    • a three-cycle hardware multiplier
    • little-endian format for accessing all memory.
  – A system control block containing memory-mapped control registers.
  – An integrated system timer for Operating System (OS) extensions.
  – An integrated Nested Vectored Interrupt Controller (NVIC) for low-latency interrupt processing.
  – A memory model that supports accesses to both memory and peripheral registers.
  – Integrated and configurable Tightly Coupled Memories (TCMs).
  – Optional debug support. (Source: [2])

Cortex-M1
• Processor modes as in Cortex-M0 (Source: [2])

Cortex-M1 – memory map (Source: [2])

Cortex-M3
• Main features:
  – Introduced to the market in 2004
  – Destined for the most demanding microcontroller applications
  – High performance and many additional features
  – Low power consumption (12.5 DMIPS/mW)
  – Up to 240 interrupt sources!
  – Support for many serial protocols
  – Performance of 1.25 DMIPS/MHz
  – Support for bit operations
  – Single-cycle 32x32-bit multiply; 2-12 cycle division
  – Three-stage pipeline with branch prediction
  – Memory Protection Unit (MPU)
  – Max speed: up to 275 MHz / 340 DMIPS

Cortex-M3
• Core features:
  – Thumb-2 instruction set (ARMv7-M)
  – Banked Stack Pointer
  – Hardware integer divide instructions
  – Automatic processor state saving and restoration for low-latency Interrupt Service Routine (ISR) entry and exit.
Cortex-M3
• NVIC (Nested Vectored Interrupt Controller) features:
  – External interrupts, configurable from 1 to 240.
  – Bits of priority, configurable from 3 to 8.
  – Dynamic reprioritization of interrupts.
  – Priority grouping – selection of preempting and non-preempting interrupt levels.
  – Support for tail-chaining and late arrival of interrupts. This enables back-to-back interrupt processing without the overhead of state saving and restoration between interrupts.
  – Processor state automatically saved on interrupt entry, and restored on interrupt exit, with no instruction overhead.
  – Optional Wake-up Interrupt Controller (WIC), providing ultra-low-power sleep mode support.

Cortex-M3
• MPU features:
  – Eight memory regions.
  – Sub-Region Disable (SRD), enabling efficient use of memory regions.
  – The ability to enable a background region that implements the default memory map attributes.

Cortex-M3
• Bus interfaces:
  – Three Advanced High-performance Bus-Lite (AHB-Lite) interfaces: ICode, DCode, and System bus interfaces.
  – Private Peripheral Bus (PPB) based on the Advanced Peripheral Bus (APB) interface.
  – Bit-band support that includes atomic bit-band write and read operations.
  – Memory access alignment.
  – Write buffer for buffering of write data.
  – Exclusive access transfers for multiprocessor systems.

Cortex-M3
• The processor supports two modes of operation, Thread mode and Handler mode:
  – The processor enters Thread mode on reset, or as a result of an exception return. Privileged and unprivileged code can run in Thread mode.
  – The processor enters Handler mode as a result of an exception. All code is privileged in Handler mode.
• The processor can operate in one of two operating states:
  – Thumb state: normal execution, running 16-bit and 32-bit halfword-aligned Thumb instructions.
  – Debug state: the state when the processor is in halting debug.
Cortex-M3 – bit-band mapping (Source: [2])

Cortex-M4
• Main features:
  – The richest version of the Cortex-M subfamily
  – Destined for low-power digital signal processing applications
  – Integrated 32-bit CPU and DSP
  – Single-precision FPU
  – Other features as in Cortex-M3
  – DSP instructions
  – Max speed: up to 300 MHz / 375 DMIPS

Cortex-M4
• FPU features:
  – 32-bit instructions for single-precision (C float) data-processing operations.
  – Combined multiply-and-accumulate instructions for increased precision (Fused MAC).
  – Hardware support for conversion, addition, subtraction, multiplication with optional accumulate, division, and square root.
  – Hardware support for denormals and all IEEE rounding modes.
  – 32 dedicated 32-bit single-precision registers, also addressable as 16 double-word registers.
  – Decoupled three-stage pipeline.

Cortex-M4 – FPU
• FPU registers:
  – sixteen 64-bit doubleword registers, D0-D15
  – or thirty-two 32-bit single-word registers, S0-S31 (Source: [2])

Cortex-R4
• Main features:
  – A mid-range processor for use in deeply embedded, real-time systems
  – Includes Thumb-2 technology for optimum code density and processing throughput
  – Integrated 32-bit CPU and DSP
  – Single-precision FPU (in the R4F version)
  – ARM and Thumb instructions
  – Tightly-Coupled Memory (TCM) ports for low-latency and deterministic accesses to local RAM, in addition to caches for higher performance to general memory
  – High-speed Advanced Microcontroller Bus Architecture (AMBA) Advanced eXtensible Interfaces (AXI) for master and slave interfaces
  – Dynamic branch prediction with a global history buffer and a 4-entry return stack
  – The ability to implement and use redundant core logic, for example for fault detection
  – ECC (Error Correcting Codes) – optional single-bit error correction and two-bit error detection for cache and/or TCM memories with ECC bits
  – A Harvard L1 memory system with:
    • optional
Tightly-Coupled Memory (TCM) interfaces with support for error correction or parity checking memories
    • optional caches with support for optional error correction schemes
    • optional ARMv7-R architecture Memory Protection Unit (MPU)
    • optional parity and Error Checking and Correction (ECC) on all RAM blocks.
  – An L2 memory interface:
    • a single 64-bit master AXI interface
    • a 64-bit slave AXI interface to TCM RAM blocks and cache RAM blocks.

Cortex-R4
• Operating modes:
  – User (USR) mode – the usual mode for the execution of ARM or Thumb programs.
  – Fast interrupt (FIQ) mode – entered on taking a fast interrupt.
  – Interrupt (IRQ) mode – entered on taking a normal interrupt.
  – Supervisor (SVC) mode – a protected mode for the operating system, entered on taking a Supervisor Call (SVC), formerly SWI.
  – Abort (ABT) mode – entered after a data or instruction abort.
  – System (SYS) mode – a privileged user mode for the operating system.
  – Undefined (UND) mode – entered when an Undefined Instruction exception occurs.

Cortex-R4 – register set
Cortex-R4 – status register (Source: [2])

Cortex-R5
• Main features:
  – Improved (extended) version of the Cortex-R4 processor
  – Added hardware Accelerator Coherency Port (ACP) to reduce the requirement for slow software cache maintenance operations when sharing memory with other masters
  – Added Vector Floating-Point v3 (VFPv3)
  – Added Multiprocessing Extensions for multiprocessing functionality
  – Added Low Latency Peripheral Port for the integration of latency-sensitive peripherals with the processor

Cortex-R5
• Implementation example

Cortex-R5
• VFPv3-D16:
  – The FPU fully supports single-precision and double-precision add, subtract, multiply, divide, multiply-and-accumulate, and square root operations
  – Provides conversions between fixed-point and floating-point data formats, and floating-point constant instructions
  – Includes 16 double-precision registers

Cortex-R5
• Vector instructions (Source: [2])

Cortex-R7
• Main features:
  – The highest-performing Cortex-R processor
  – On a 40 nm G process the Cortex-R7 processor can be implemented to run at well over 1 GHz, when it delivers over 2700 Dhrystone MIPS of performance
  – On a 28 nm process the performance is estimated to reach 4600 Dhrystone MIPS
  – Eleven-stage pipeline with instruction prefetch, branch prediction, superscalar and out-of-order execution, hardware divide and floating point; 2.53 Dhrystone MIPS/MHz
  – Added LLRAM – a low-latency memory port designed specifically to connect to local memory (64-bit)

Thank you for your attention

References
[1] ARM7TDMI core documentation; www.arm.com
[2] www.arm.com
[3] ARM9 family documentation; www.arm.com
[4] Cortex family documentation; www.arm.com
[5] http://www.engr.mun.ca/~venky/Pipelining.ppt – Pipelining: Basic and Intermediate Concepts