Migrating from Cortex-M3 to Cortex-M4

Roy Luo
Global Technology Centre
element14 (Formerly Premier Farnell)
March 2011

1 Introduction

The ARM Cortex-M4 processor is the latest embedded processor from ARM, developed specifically for digital signal control markets that demand an efficient, easy-to-use blend of control and signal processing capabilities in microcontroller applications. It combines high-efficiency signal processing with the low-power, low-cost and ease-of-use benefits of the Cortex-M family, and targets the emerging category of flexible solutions for motor control, automotive, power management, embedded audio and industrial automation.

The Cortex-M4 processor extends the use of Cortex-M cores to applications that require more computational performance than the Cortex-M3 currently provides. The Cortex-M4 features a single-cycle multiply-accumulate (MAC) unit, optimized single instruction multiple data (SIMD) instructions, saturating arithmetic instructions and an optional single-precision Floating-Point Unit (FPU).

In short, the Cortex-M4 is a Cortex-M3 with DSP instruction add-ons, and migrating from Cortex-M3 to Cortex-M4 is very easy!

1.1 Why change to Cortex-M4?

• Higher Performance
Like the Cortex-M3, the Cortex-M4 provides an integer performance level of 1.25 DMIPS/MHz (Dhrystone 2.1), but the Cortex-M4 delivers higher performance on digital signal processing. Please refer to section 2, Cortex-M4 Features, for more information.

• Digital Signal Processing Capabilities
The Cortex-M4 integrates a single-cycle multiply-accumulate (MAC) unit supporting a variety of 16- and 32-bit multiplies with 32- and 64-bit accumulations, and a set of single-cycle SIMD (Single Instruction Multiple Data) instructions featuring dual 16-bit and quad 8-bit operations.
The Cortex-M4 FPU is an implementation of the single-precision variant of the ARMv7-M Floating-Point Extension (FPv4-SP). It provides floating-point computation functionality that is compliant with ANSI/IEEE Std 754-2008, IEEE Standard for Binary Floating-Point Arithmetic, referred to as the IEEE 754 standard. The FPU supports all single-precision data-processing instructions and data types described in the ARM Architecture Reference Manual.

• Satisfying the Requirements of Next-Generation Products
The ARM Cortex-M family is aimed at areas such as commercial electronics and low-cost industrial control, including motor control, power management, automotive electronics, and audio processing. The increasing computational loads in these areas consume an unacceptable portion of the CPU resources if all the digital signal processing tasks are handled in software. The ARM Cortex-M4 addresses this by integrating a single-cycle multiply-accumulate (MAC) unit and a set of single-cycle SIMD instructions, as well as an optional FPU, to satisfy the digital signal processing requirements of next-generation products.

A Cortex-M4 can be regarded as a Cortex-M3 with integrated DSP extensions. Software written for the Cortex-M3 therefore also runs on the Cortex-M4, and migration from M3 to M4 takes little effort. The relation between the two processors can be summarized as:

Cortex-M3 + DSP & Optional FPU = Cortex-M4

1.2 Reference Materials

Cortex-M3 Technical Reference Manual, ARM DDI0337G, ARM Ltd.
Cortex-M4 Technical Reference Manual, ARM DDI0439, ARM Ltd.
ARMv7-M Architecture Reference Manual, ARM DDI0403D, ARM Ltd.
Cortex Microcontroller Software Interface Standard (see www.onarm.com).
Application Note 179, Cortex-M3 Embedded Software Development, ARM DAI0179B, ARM Ltd.
2 Cortex-M4 Features

2.1 32-bit Multiply-Accumulate (MAC) Unit

The 32-bit hardware multiply-accumulate (MAC) unit added in the Cortex-M4 can complete an operation of up to 32 x 32 + 64 -> 64, or two 16 x 16 operations, in a single cycle. This high-performance unit makes digital signal processing more efficient and greatly reduces the consumption of CPU resources.

The 32-bit multiply-accumulate (MAC) unit has three main features:
• Wide range of multiply-accumulate instructions
• Choice of 16- or 32-bit multiply and 32- or 64-bit accumulate
• All instructions execute in a single cycle

2.2 Single Instruction Multiple Data (SIMD) Instructions

The Cortex-M4 integrates a set of single-cycle SIMD instructions. The set includes DSP instructions such as add, subtract, multiply, and multiply-accumulate, which are used to implement common DSP operations including FIR, IIR, complex FFT, PID, matrix addition, matrix subtraction, and matrix multiplication. With these instructions, a Cortex-M4 offers higher computational efficiency than a Cortex-M3 when running DSP programs.

The SIMD instructions have three main features:
• Quad (4 parallel) 8-bit adds or subtracts
• Dual (2 parallel) 16-bit adds or subtracts
• All instructions execute in a single cycle

2.3 Floating Point Unit (FPU)

The FPU is an optional unit of the Cortex-M4. Manufacturers decide whether to include it according to their own requirements. The FPU fully supports single-precision add, subtract, multiply, divide, multiply and accumulate, and square root operations. It also provides conversions between fixed-point and floating-point data formats, and floating-point constant instructions.
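As a rough illustration of the fixed-point to floating-point conversion mentioned above, the following portable C sketch converts between a 16-bit fixed-point value and a single-precision float. The Q15 format, the macro Q15_FRAC_BITS and the function names q15_to_float and float_to_q15 are assumptions made for this example and are not taken from the ARM documentation. On a Cortex-M4 with the FPU fitted, a compiler may be able to implement such conversions with FPU instructions, but that depends on the toolchain and its options.

```c
#include <stdint.h>

/* Assumed fixed-point format for this sketch: Q15, i.e. a signed 16-bit
 * value with 15 fraction bits, covering the range [-1.0, 1.0). */
#define Q15_FRAC_BITS 15

/* Hypothetical helper: fixed-point to single-precision float. */
static float q15_to_float(int16_t q)
{
    return (float)q / (float)(1 << Q15_FRAC_BITS);
}

/* Hypothetical helper: float to fixed-point, saturating at the ends of
 * the Q15 range and truncating toward zero elsewhere. */
static int16_t float_to_q15(float f)
{
    float scaled = f * (float)(1 << Q15_FRAC_BITS);

    if (scaled >= 32767.0f) {
        return INT16_MAX;
    }
    if (scaled <= -32768.0f) {
        return INT16_MIN;
    }
    return (int16_t)scaled;
}
```

For example, float_to_q15(0.5f) yields 16384, and q15_to_float(16384) yields 0.5f again.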
The FPU has four main features:
• FP extension registers that software can view as either 32 single-precision or 16 doubleword registers
• Single-precision floating-point arithmetic
• Conversions among integer, single-precision floating-point, and half-precision (16-bit) floating-point formats
• Data transfers of single-precision and doubleword registers

The remaining features, such as the NVIC (Nested Vectored Interrupt Controller), MPU (Memory Protection Unit), and DAP (Debug Access Port), are the same as in the Cortex-M3. Please refer to the Cortex-M3 Technical Reference Manual for detailed information.

3 Comparisons between Cortex-M3 and Cortex-M4

The table below compares the Cortex-M3 and the Cortex-M4.

| Feature | Cortex-M3 | Cortex-M4 |
|---|---|---|
| Architecture | ARMv7-M (Harvard) | ARMv7-M (Harvard) |
| ISA Support | Thumb/Thumb-2 | Thumb/Thumb-2 |
| DSP Extensions | NA | Single cycle 16,32-bit MAC; Single cycle dual 16-bit MAC; 8,16-bit SIMD arithmetic; Hardware Divide (2-12 Cycles) |
| Floating Point Unit | NA | Optional single precision floating point unit, IEEE 754 compliant |
| Pipeline | 3-stage + branch speculation | 3-stage + branch speculation |
| Dhrystone | 1.25 DMIPS/MHz | 1.25 DMIPS/MHz |
| Memory Protection | Optional 8 region MPU with sub regions and background region | Optional 8 region MPU with sub regions and background region |
| Interrupts | Non-maskable Interrupt (NMI) + 1 to 240 physical interrupts | Non-maskable Interrupt (NMI) + 1 to 240 physical interrupts |
| Interrupt Latency | 12 cycles | 12 cycles |
| Inter-Interrupt Latency | 6 cycles | 6 cycles |
| Interrupt Priority Levels | 8 to 256 priority levels | 8 to 256 priority levels |
| Wake-up Interrupt Controller | Up to 240 Wake-up Interrupts | Up to 240 Wake-up Interrupts |
| Sleep Modes | Integrated WFI and WFE Instructions and Sleep On Exit capability. Sleep & Deep Sleep Signals. Optional Retention Mode with ARM Power Management Kit | Integrated WFI and WFE Instructions and Sleep On Exit capability. Sleep & Deep Sleep Signals. Optional Retention Mode with ARM Power Management Kit |
| Bit Manipulation | Integrated Instructions & Bit Banding | Integrated Instructions & Bit Banding |
| Debug | Optional JTAG & Serial-Wire Debug Ports. Up to 8 Breakpoints and 4 Watchpoints. | Optional JTAG & Serial-Wire Debug Ports. Up to 8 Breakpoints and 4 Watchpoints. |
| Trace | Optional Instruction Trace (ETM), Data Trace (DWT), and Instrumentation Trace (ITM) | Optional Instruction Trace (ETM), Data Trace (DWT), and Instrumentation Trace (ITM) |

This table shows that most features of the Cortex-M3 and M4 are the same. The significant difference is that the Cortex-M4 has DSP extensions and an optional FPU. Migrating from M3 to M4 requires almost no modification of hardware or software. The next sections introduce the Cortex-M4 core in detail, with emphasis on its digital signal processing capability.

3.1 Programmers Model

3.1.1 Operating Modes

Like the Cortex-M3, the Cortex-M4 supports two modes of operation: Thread mode and Handler mode. The processor enters Thread mode on reset, or as a result of an exception return. Privileged and Unprivileged code can run in Thread mode. The processor enters Handler mode as a result of an exception. All code is privileged in Handler mode.

3.1.2 Operating States

Like the Cortex-M3, the Cortex-M4 can operate in one of two operating states: Thumb state and Debug state. Thumb state is the normal execution state, running 16-bit and 32-bit halfword-aligned Thumb instructions. Debug state is the state the processor is in during halting debug.

3.1.3 Instruction Set

The Cortex-M4 uses the same architecture as the Cortex-M3, the ARMv7-M architecture. The instructions of both processors come from the Thumb-2 instruction set, which includes 16-bit and 32-bit instructions.
Additionally, the Cortex-M4 integrates SIMD instructions and the optional floating-point instructions, which bring the total number of instructions to 291, compared with 186 instructions for the Cortex-M3. The figure shown above illustrates the relationship between the instructions of the Cortex-M family. The Cortex-M3 ISA is upwards compatible with the Cortex-M4 ISA, and the Cortex-M4F (a Cortex-M4 processor plus FPU) is built by adding FPU instructions to the baseline Cortex-M4.

3.1.4 System Address Map

Cortex-M3 and Cortex-M4 have the same system address map. The following figure shows the system address map:

[Figure: system address map]

3.1.5 Bit Banding

Like the Cortex-M3, the Cortex-M4 provides bit access to two 1MB regions of memory, one within the internal SRAM region and the other in the peripheral region. A further 32MB of address space is reserved as an alias for each of these regions, and each word within an alias region maps to a specific bit within the corresponding bit-band region. Reading from the alias region returns a word containing the value of the corresponding bit; writing to bit 0 of a word in the alias region results in an atomic read-modify-write of the corresponding bit within the bit-band region.

3.1.6 Core Register Comparison

Like the Cortex-M3, the Cortex-M4 has 16 core registers, R0-R15, all 32-bit. R0-R12 are generally available for essentially all instructions, R13 is used as the Stack Pointer, R14 as the Link Register (for subroutine and exception return) and R15 as the Program Counter. The following figure shows the core register comparison between Cortex-M3 and Cortex-M4:

[Figure: Cortex-M3 Core Registers and Cortex-M4 Core Registers]

3.2 MPU

As in the Cortex-M3, the MPU is an optional component for memory protection in the Cortex-M4. The processor supports the standard ARMv7 Protected Memory System Architecture model. You can use the MPU to enforce privilege and access rules, and to separate processes.
The MPU provides full support for:
• Protection regions
• Overlapping protection regions, with ascending region priority:
  7 = highest priority
  0 = lowest priority
• Access permissions
• Exporting memory attributes to the system

3.3 DSP Capability

The figures shown below compare the relative digital signal processing performance of the Cortex-M3 and Cortex-M4 when both processors operate at the same speed.

In the following figures, the y-axis represents the relative cycle count needed to execute the given function. Accordingly, the smaller the cycle count, the better the performance. Since the Cortex-M3 is used as the reference, the Cortex-M4 performance is calculated by taking the reciprocal of its relative cycle count. As an example, for the PID function, the Cortex-M4 cycle count is approximately 0.7x that of the Cortex-M3, so the relative performance is 1/0.7, or 1.4x.

[Figure: Cortex-M 16-bit functions cycle count]

[Figure: Cortex-M 32-bit functions cycle count]

The Cortex-M4 clearly has a great advantage over the Cortex-M3 in digital signal processing for both 16-bit and 32-bit operations. All the DSP instructions executed by the Cortex-M4 complete in a single cycle, while the Cortex-M3 needs multiple instructions and multiple cycles to complete the equivalent function. Even for PID, the most resource-consuming of these common DSP operations, the Cortex-M4 provides a 1.4x performance improvement. As another application example, an MP3 decode requiring 20-25 MHz on a Cortex-M3 would only require 10-12 MHz on a Cortex-M4.

3.3.1 32-bit Multiply-Accumulate (MAC)

The 32-bit multiply-accumulate (MAC) capability consists of new instructions and an optimized hardware execution unit in the Cortex-M4. It can complete a 32 x 32 + 64 -> 64 operation or two 16 x 16 operations in a single cycle.
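As a minimal sketch of what these two operations compute, the following portable C models a 32 x 32 + 64 -> 64 multiply-accumulate and a dual 16 x 16 multiply with 32-bit accumulation. The function names mac_32x32_64, dual_mac_16x16 and dot16 are hypothetical helpers introduced for illustration, not ARM or CMSIS APIs. The code only models the arithmetic; whether a compiler maps it to single-cycle MAC instructions on a Cortex-M4 depends on the toolchain and its options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32 x 32 + 64 -> 64: widen the operands before multiplying so the
 * product keeps all 64 bits, then add it to the 64-bit accumulator. */
static int64_t mac_32x32_64(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}

/* Two 16 x 16 products added to a 32-bit accumulator.
 * Saturation and overflow detection are not modelled in this sketch. */
static int32_t dual_mac_16x16(int32_t acc, int16_t a0, int16_t b0,
                              int16_t a1, int16_t b1)
{
    return acc + (int32_t)a0 * b0 + (int32_t)a1 * b1;
}

/* Dot product of two 16-bit vectors, consuming two samples per step,
 * which is the access pattern a dual 16-bit MAC is suited to. */
static int32_t dot16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;

    assert(n % 2 == 0); /* this sketch only handles an even sample count */
    for (size_t i = 0; i < n; i += 2) {
        acc = dual_mac_16x16(acc, a[i], b[i], a[i + 1], b[i + 1]);
    }
    return acc;
}
```

For instance, dual_mac_16x16(1, 2, 3, 4, 5) evaluates 1 + (2 x 3) + (4 x 5) = 27.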
The table shown below lists the operations that this unit can carry out.

| Operation | Instruction | Cycles |
|---|---|---|
| 16 x 16 = 32 | SMULBB, SMULBT, SMULTB, SMULTT | 1 |
| 16 x 16 + 32 = 32 | SMLABB, SMLABT, SMLATB, SMLATT | 1 |
| 16 x 16 + 64 = 64 | SMLALBB, SMLALBT, SMLALTB, SMLALTT | 1 |
| 16 x 32 = 32 | SMULWB, SMULWT | 1 |
| (16 x 32) + 32 = 32 | SMLAWB, SMLAWT | 1 |
| (16 x 16) ± (16 x 16) = 32 | SMUAD, SMUADX, SMUSD, SMUSDX | 1 |
| (16 x 16) ± (16 x 16) + 32 = 32 | SMLAD, SMLADX, SMLSD, SMLSDX | 1 |
| (16 x 16) ± (16 x 16) + 64 = 64 | SMLALD, SMLALDX, SMLSLD, SMLSLDX | 1 |
| 32 x 32 = 32 | MUL | 1 |
| 32 ± (32 x 32) = 32 | MLA, MLS | 1 |
| 32 x 32 = 64 | SMULL, UMULL | 1 |
| (32 x 32) + 64 = 64 | SMLAL, UMLAL | 1 |
| (32 x 32) + 32 + 32 = 64 | UMAAL | 1 |
| 32 ± (32 x 32) = 32 (upper) | SMMLA, SMMLAR, SMMLS, SMMLSR | 1 |
| (32 x 32) = 32 (upper) | SMMUL, SMMULR | 1 |

3.3.2 SIMD

The Cortex-M4 supports SIMD instructions, which were unavailable in the previous members of the Cortex-M family. Some of the instructions in the table above are SIMD instructions. Working with the optimized multiply-accumulate (MAC) hardware, all of these instructions execute in a single cycle. With SIMD instructions, the Cortex-M4 processor can carry out an operation of up to 32 x 32 + 64 -> 64 in a single cycle, freeing processor bandwidth for other tasks instead of spending it on sequences of multiplications and additions.

Consider the following complex arithmetic operation, in which two 16 x 16 multiplies plus a 32-bit accumulation are encoded and performed by a single instruction:

Sum = Sum + (A x C) + (B x D)

where Sum is a 32-bit value on both sides of the assignment.

3.3.3 FPU

The FPU is an optional unit of the Cortex-M4 dedicated to floating-point operations. It boosts performance by handling single-precision floating-point operations in hardware, and it is compliant with IEEE 754. It is an implementation of the single-precision variant of the ARMv7-M Floating-Point Extension (FPv4-SP).
The FPU extends the register programming model with a register file containing 32 single-precision registers. These can be viewed as:
• Sixteen 64-bit doubleword registers, D0-D15
• Thirty-two 32-bit single-word registers, S0-S31

The FPU provides three modes of operation to accommodate a variety of applications:

• Full-Compliance Mode
In full-compliance mode, the FPU processes all operations according to the IEEE 754 standard in hardware.

• Flush-to-Zero Mode
Setting the FZ bit of the Floating-point Status and Control Register, FPSCR[24], enables flush-to-zero mode. In this mode, the FPU treats all subnormal input operands of arithmetic CDP operations as zeros in the operation. Exceptions that result from a zero operand are signaled appropriately. VABS, VNEG, and VMOV are not considered arithmetic CDP operations and are not affected by flush-to-zero mode. A result that is tiny, as described in the IEEE 754 standard, where the destination precision is smaller in magnitude than the minimum normal value before rounding, is replaced with a zero. The IDC flag, FPSCR[7], indicates when an input flush occurs. The UFC flag, FPSCR[3], indicates when a result flush occurs.

• Default NaN Mode
Setting the DN bit, FPSCR[25], enables default NaN mode. In this mode, the result of any arithmetic data processing operation that involves an input NaN, or that generates a NaN result, returns the default NaN. Propagation of the fraction bits is maintained only by VABS, VNEG, and VMOV operations. All other CDP operations ignore any information in the fraction bits of an input NaN.

The following table shows the instruction set of the FPU.
| Operation | Description | Assembler | Cycles |
|---|---|---|---|
| Absolute value | of float | VABS.F32 | 1 |
| Addition | floating point | VADD.F32 | 1 |
| Compare | float with register or zero | VCMP.F32 | 1 |
| Compare | float with register or zero | VCMPE.F32 | 1 |
| Convert | between integer, fixed-point, half-precision and float | VCVT.F32 | 1 |
| Divide | Floating-point | VDIV.F32 | 14 |
| Load | multiple doubles | VLDM.64 | 1+2*N, where N is the number of doubles |
| Load | multiple floats | VLDM.32 | 1+N, where N is the number of floats |
| Load | single double | VLDR.64 | 3 |
| Load | single float | VLDR.32 | 2 |
| Move | top/bottom half of double to/from core register | VMOV | 1 |
| Move | immediate/float to float-register | VMOV | 1 |
| Move | two floats/one double to/from two core registers or one float to/from one core register | VMOV | 2 |
| Move | floating-point control/status to core register | VMRS | 1 |
| Move | core register to floating-point control/status | VMSR | 1 |
| Multiply | float | VMUL.F32 | 1 |
| Multiply | then accumulate float | VMLA.F32 | 3 |
| Multiply | then subtract float | VMLS.F32 | 3 |
| Multiply | then accumulate then negate float | VNMLA.F32 | 3 |
| Multiply | then subtract then negate float | VNMLS.F32 | 3 |
| Multiply (fused) | then accumulate float | VFMA.F32 | 3 |
| Multiply (fused) | then subtract float | VFMS.F32 | 3 |
| Multiply (fused) | then accumulate then negate float | VFNMA.F32 | 3 |
| Multiply (fused) | then subtract then negate float | VFNMS.F32 | 3 |
| Negate | float | VNEG.F32 | 1 |
| Negate | and multiply float | VNMUL.F32 | 1 |
| Pop | double registers from stack | VPOP.64 | 1+2*N, where N is the number of double registers |
| Pop | float registers from stack | VPOP.32 | 1+N, where N is the number of registers |
| Push | double registers to stack | VPUSH.64 | 1+2*N, where N is the number of double registers |
| Push | float registers to stack | VPUSH.32 | 1+N, where N is the number of registers |
| Square-root | of float | VSQRT.F32 | 14 |
| Store | multiple double registers | VSTM.64 | 1+2*N, where N is the number of doubles |
| Store | multiple float registers | VSTM.32 | 1+N, where N is the number of floats |
| Store | single double register | VSTR.64 | 3 |
| Store | single float register | VSTR.32 | 2 |
| Subtract | float | VSUB.F32 | 1 |

3.4 Debug

Like the Cortex-M3, Cortex-M4 devices are debugged via a standard JTAG or Serial-Wire Debug (SWD) connector.
A simple, standardized external connector is required to interface to a host system.

3.5 Power

3.5.1 Power Management

Like the Cortex-M3, the Cortex-M4 has four power modes: Active mode, Sleep mode, Standby mode, and Power off mode. The following table shows the four power modes:

| Power Mode | Power Consumption | Description |
|---|---|---|
| Active mode | Leakage + dynamic | Running Dhrystone 2.1 benchmark |
| Sleep mode | Leakage + some dynamic | CM4Core clock gated, NVIC awake |
| Standby mode | Leakage only | Power still on, all clocks off |
| Power off mode | Zero power | Power off |

3.5.2 Comparison Based on Power

The table shown below indicates that the Cortex-M4 achieves considerably higher power efficiency than the Cortex-M3: 38 DMIPS/mW against 12.5 DMIPS/mW for the area-optimized implementations. Note that the two cores are characterized on different processes.

| | Cortex-M3 | Cortex-M3 | Cortex-M4 | Cortex-M4 |
|---|---|---|---|---|
| Process | TSMC 90nm G | TSMC 90nm G | 65nm low power process | 65nm low power process |
| Optimization Type | Speed Optimized | Area Optimized | Speed Optimized | Area Optimized |
| Standard Cell Library | ARM SC9 | ARM SC9 | ARM SC12 | ARM SC9 |
| Integer Performance (Total DMIPS) | 344 | 63 | 375 | 188 |
| Frequency (MHz) | 275 | 50 | 300 | 150 |
| Power Efficiency (DMIPS/mW) | TBD | 12.5 | 24 | 38 |
| Area (mm2) | 0.083 | 0.047 | 0.21 | 0.11 |
| FPU Area (mm2) | NA | NA | 0.08 | 0.06 |

4 Migrating a Software Application

4.1 General Information

Since the Cortex-M4 ISA is a superset of the Cortex-M3 ISA, software, including system-level software, can be used on both platforms. Specifically, the stack, memory, code and data placement, and interrupts are the same in both processors, because they share the ARMv7-M architecture and the Thumb/Thumb-2 instruction set.

A software migration from the Cortex-M3 to the M4 can be done very easily with few modifications. If the code is written in C, no modifications are needed. Compilers targeting the Cortex-M4 automatically use the 32-bit multiply-accumulate (MAC) unit and SIMD instructions to execute DSP tasks. However, there are still some considerations despite the fully compatible code.
• Use word transfers only to access registers in the NVIC and System Control Space (SCS).
• Treat all unused SCS registers and register fields on the processor as Do-Not-Modify.
• Configure the following fields in the CCR:
  • STKALIGN bit to 1
  • UNALIGN_TRP bit to 1
  Leave all other bits in the CCR register at their original value.

4.2 Example Code

The example below takes a single high-level arithmetic statement that implements an IIR filter and lists the cycle counts that the Cortex-M3 and Cortex-M4 need for each step:

y[n] = b0 * x[n] + b1 * x[n-1] + b2 * x[n-2] - a1 * y[n-1] - a2 * y[n-2]

| Function | Cortex-M3 | Cortex-M4 |
|---|---|---|
| xN = *x++; | 2 | 2 |
| yN = xN * b0; | 3-7 | 1 |
| yN += xNm1 * b1; | 3-7 | 1 |
| yN += xNm2 * b2; | 3-7 | 1 |
| yN -= yNm1 * a1; | 3-7 | 1 |
| yN -= yNm2 * a2; | 3-7 | 1 |
| *y++ = yN; | 2 | 2 |
| xNm2 = xNm1; | 1 | 1 |
| xNm1 = xN; | 1 | 1 |
| yNm2 = yNm1; | 1 | 1 |
| yNm1 = yN; | 1 | 1 |
| Decrement loop counter | 1 | 1 |
| Branch | 2 | 2 |
| Total | 26 to 46 cycles | 16 cycles |

To execute the same source code, the Cortex-M3 needs 26 to 46 cycles (the execution time of the multiply operations is data dependent), while the Cortex-M4 needs only 16 cycles. The Cortex-M4 therefore provides a 1.6x to 2.9x performance improvement for this IIR filter calculation (26/16 and 46/16 respectively).

The difference comes from the code lines that perform the successive multiply-accumulate operations. To execute each of them, the Cortex-M3 requires multiple instructions and consumes 3-7 cycles, while the Cortex-M4 requires only a single 1-cycle instruction. This is a real-world signal processing example showing the ISA capabilities and microarchitecture strength of the Cortex-M4 core.

5 Cortex-M4 Products

Manufacturers including Freescale, NXP and STMicroelectronics are currently known to be offering MCUs based on the Cortex-M4 core. Among these suppliers, Freescale already launched its Kinetis Cortex-M4 product line, which includes the K10, K20, K30, K40 and K60 families, in 2010.
Designers can easily evaluate and develop Cortex-M4 products by using the TWR-K40X256-KIT and TWR-K60N512-KIT Tower kits from Freescale or its distributors.

6 Summary

The Cortex-M4 offers powerful digital signal processing capabilities that were unavailable in the previous members of the Cortex-M family. Because the two cores share the same hardware platform and a compatible instruction set, designers can migrate from the Cortex-M3 to the M4 with little effort, preserving their existing software developments. The easy migration not only reduces the workload of developing new products, but also enables the new products to handle digital signal processing more efficiently with lower power consumption, making the Cortex-M4 an ideal choice for next-generation products.
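For readers who want to try the migration on their own code, the IIR statement from section 4.2 can be sketched in portable C as follows. The document does not state the data types used in its cycle-count table, so this sketch assumes 32-bit fixed-point samples, coefficients scaled by 2^14, and a 64-bit accumulator. The names biquad_t, biquad_run and BIQUAD_FRAC_BITS are hypothetical, while the variable names inside the loop follow the listing in section 4.2. The same source should build unchanged for either core, with the target selected through the compiler options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed coefficient scaling for this sketch: 1.0 is stored as 1 << 14. */
#define BIQUAD_FRAC_BITS 14

typedef struct {
    int32_t b0, b1, b2, a1, a2;     /* coefficients, scaled by 2^14     */
    int32_t xNm1, xNm2, yNm1, yNm2; /* previous two inputs and outputs  */
} biquad_t;

/* Hypothetical helper: filters n samples from x into y using
 * y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]. */
static void biquad_run(biquad_t *f, const int32_t *x, int32_t *y, size_t n)
{
    assert(f != NULL && x != NULL && y != NULL);

    while (n-- > 0) {
        int32_t xN = *x++;

        /* Five multiply-accumulate steps into a 64-bit accumulator. */
        int64_t acc = (int64_t)xN * f->b0;
        acc += (int64_t)f->xNm1 * f->b1;
        acc += (int64_t)f->xNm2 * f->b2;
        acc -= (int64_t)f->yNm1 * f->a1;
        acc -= (int64_t)f->yNm2 * f->a2;

        /* Remove the coefficient scaling; division truncates toward zero.
         * Overflow of the 32-bit result is not handled in this sketch. */
        int32_t yN = (int32_t)(acc / ((int64_t)1 << BIQUAD_FRAC_BITS));
        *y++ = yN;

        /* Shift the delay line. */
        f->xNm2 = f->xNm1;
        f->xNm1 = xN;
        f->yNm2 = f->yNm1;
        f->yNm1 = yN;
    }
}
```

With b0 = 1.0 and a1 = -0.5 in this scaling (16384 and -8192), an input of {8, 0, 0} produces {8, 4, 2}, since each output adds half of the previous output.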