Best Practices for Real-Time Optimizations With the 12th Generation Intel® Core™ Processors

White Paper
December 2022
Document Number: 737973-1.1

Notices and Disclaimers

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or visit www.intel.com/design/literature.htm.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel, the Intel logo and Intel Core are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© Intel Corporation

Contents

1.0 Introduction
2.0 Architecture Overview
    2.1 Topology
    2.2 Architecture Implications
    2.3 Stock-Keeping Units (SKUs)
    2.4 CPU Topology Detection
        2.4.1 CPUID
    2.5 Thread Affinity
    2.6 Performance Optimizations
    2.7 Specific Optimizations and Solutions for Real-Time Systems
        2.7.1 Real-Time Thread
        2.7.2 Efficient-Cores (E-cores)
        2.7.3 Performance-Cores (P-cores)
    2.8 L2 Cache Allocation Technology (CAT)
        2.8.1 Resource Control & Heterogeneous systems
        2.8.2 Mitigations
    2.9 Summary and Conclusion

Figures

Figure 1. Comparison of Performance-cores and Efficient-cores
Figure 2. The “L2 QOS Enumeration” option in BIOS

Tables

Table 1. 12th Generation Intel® Core™ Processors Which Support Real-Time Applications
Table 2. CPUID Functions for the Hybrid Flag and Core Type

Revision History

Date            Revision  Description
December 2022   1.1       Added section on L2 Cache Allocation Technology (CAT)
August 2022     1.0       Initial release

1.0 Introduction

The 12th Generation Intel® Core™ processors (previously codenamed Alder Lake) incorporate a new performance hybrid architecture that combines two core types: Performance-cores or P-cores (previously codenamed Golden Cove) and Efficient-cores or E-cores (previously codenamed Gracemont). This white paper is meant for real-time software developers and provides the architecture brief and best practices for real-time optimizations using the new hybrid architecture.
The introduction of hybrid cores is a significant architecture change. Developers will need to understand the specifics, incorporate new design goals, and make key decisions around best practices to create efficient software that uses all available capabilities and features. Properly detecting CPU topology is critical for obtaining optimal real-time performance and compatibility for applications running on hybrid processors. This document provides information on architecture, programming paradigm, detection, and optimization strategies.

2.0 Architecture Overview

This section includes information about the performance hybrid architecture used in select 12th Generation Intel® Core™ processors.

2.1 Topology

The 12th Generation Intel® Core™ processor uses a new performance hybrid architecture that combines two core types: P-cores and E-cores. Topologically, each P-core has its own L2 cache, which is connected to the L3 cache. Each E-core module has 4 E-cores that share an L2 cache. The E-core modules connect to the L3 cache that is shared with the P-cores. All cores are exposed to the real-time operating system (RTOS) as individual logical processors.

Figure 1. Comparison of Performance-cores and Efficient-cores

The P-cores and E-cores feature separate L1 and L2 caches so they can run largely independently of each other. The L3 cache (LLC) is shared.

2.2 Architecture Implications

From a programming perspective, all cores appear functionally identical, differing only in performance and efficiency. The RTOS needs to leverage the Intel® Thread Director capability to distribute threads intelligently across the two core types.
Support for this is planned in the Linux* 5.18 kernel. The remainder of this document focuses on optimizations and considerations that apply in addition to RTOS thread scheduling. Real-time developers should be aware of these areas when writing code for a hybrid architecture:

• CPUID, by design, returns different values depending on the core it is executed on. On hybrid processors, more of the CPUID leaves have data that varies between cores. Software that expects CPUID values to be the same across cores must either not run across core types or rely only on CPUID leaves whose data does not change between core types.

• Bitwise operations (AND, OR, XOR, NOT, rotate) do not have a notion of signed overflow, so the value of the overflow flag after them varies between processor architectures: some clear the flag unconditionally, others leave it unchanged, and still others set it to an undefined value. Shifts and multiplies do permit a well-defined value, but it is not consistently implemented. For example, the x86 instruction set defines the overflow flag only for multiplies and 1-bit shifts. For shifts by more than one bit, the overflow flag is undefined and can vary between P-cores and E-cores.

2.3 Stock-Keeping Units (SKUs)

You can use selected 12th Generation Intel® Core™ processors to create real-time systems and other embedded applications. The following variants of 12th Generation Intel® Core™ processors are recommended for real-time applications and will be available with various combinations of cores and execution units (EUs):

Table 1. 12th Generation Intel® Core™ Processors Which Support Real-Time Applications

Processor                          Total Cores  P-cores  E-cores  Total Threads
Intel® Core™ i9-12900E processor   16           8        8        24
Intel® Core™ i7-12700E processor   12           8        4        20
Intel® Core™ i5-12500E processor   6            6        0        12
Intel® Core™ i3-12100E processor   4            4        0        8

2.4 CPU Topology Detection

Properly detecting CPU topology is critical for obtaining optimal performance and compatibility for applications running on hybrid processors such as 12th Generation Intel® Core™ processors. Previously, a developer could count the total number of physical processors on a system and safely assume that they all performed alike. On a hybrid processor that assumption no longer holds. There are several methods you can use to determine your target CPU’s topology; the general guidance is to use CPUID.

2.4.1 CPUID

As of March 2020, the Intel® Architecture Instruction Set Extensions And Future Features Programming Reference provides the details necessary for using the CPUID instruction to determine hybrid topology on a target system. There are two new flags defined:

• The Hybrid Flag can be obtained by calling CPUID with the value “07H” in the EAX register (and 0 in ECX) and reading bit 15 of the EDX register.

• The Core Type Flag can be obtained by calling CPUID on each logical processor with a value of “1AH” in the EAX register; this returns each processor’s type in bits 24-31 of the EAX register.

You can use the core type to determine whether a logical processor is an E-core, which is type 20H, or a P-core, which is type 40H. However, on P-cores the Core Type value does not differentiate between a physical core and its simultaneous multithreading (SMT) sibling logical processors. Both are reported as Intel® Core™ processors (40H).
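
As an illustration of the two CPUID queries described above, the following C sketch decodes the Hybrid Flag and the Core Type. It assumes a GCC- or Clang-compatible compiler that provides <cpuid.h> and __get_cpuid_count. The enum and function names (core_kind, hybrid_flag_from_edx, core_kind_from_eax, current_core_kind) are hypothetical and are not part of any Intel API. Because CPUID reports on the core it executes on, the calling thread is assumed to be pinned to the logical processor of interest before current_core_kind is called.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper: __get_cpuid_count */
#endif

/* Core type encodings from CPUID.1AH:EAX[31:24]; names are illustrative. */
enum core_kind { CORE_UNKNOWN = 0, CORE_E = 0x20, CORE_P = 0x40 };

/* Hybrid Flag: CPUID.(EAX=07H, ECX=0):EDX[15]. */
int hybrid_flag_from_edx(uint32_t edx)
{
    return (int)((edx >> 15) & 1u);
}

/* Core Type: CPUID.(EAX=1AH, ECX=0):EAX[31:24]. */
enum core_kind core_kind_from_eax(uint32_t eax)
{
    uint32_t type = (eax >> 24) & 0xFFu;

    if (type == 0x20u)
        return CORE_E;       /* Intel Atom (E-core) */
    if (type == 0x40u)
        return CORE_P;       /* Intel Core (P-core, including SMT siblings) */
    return CORE_UNKNOWN;     /* reserved encodings */
}

/*
 * Query the core type of the logical processor this thread is running on.
 * Returns CORE_UNKNOWN on non-hybrid parts, on non-x86 builds, or when the
 * required CPUID leaves are not available.
 */
enum core_kind current_core_kind(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid_count(0x07, 0, &eax, &ebx, &ecx, &edx))
        return CORE_UNKNOWN;
    if (!hybrid_flag_from_edx(edx))
        return CORE_UNKNOWN;
    if (!__get_cpuid_count(0x1A, 0, &eax, &ebx, &ecx, &edx))
        return CORE_UNKNOWN;
    return core_kind_from_eax(eax);
#else
    return CORE_UNKNOWN;
#endif
}
```

Repeating the query once per logical processor, with the thread pinned to each one in turn, yields a map of core types for the whole system.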
E-cores are not hyper-threaded and do not require special detection of logical processors per physical core.

Note: The width of the L2 CAT way mask differs between the P-cores and the E-cores.

Table 2. CPUID Functions for the Hybrid Flag and Core Type

Name         Function (EAX)  Leaf (ECX)  Register  BitStart  BitEnd
Hybrid Flag  07H             0           EDX       15        15
Core Type    1AH             0           EAX       24        31

Comments:

• Hybrid Flag: If 1, the processor is identified as a hybrid part. Additionally, on hybrid parts (CPUID.07H.0H:EDX[15]=1), software must consult the native model ID and core type from the Hybrid Information Enumeration Leaf.

• Core Type: Hybrid Information Sub-leaf (EAX = 1AH, ECX = 0). EAX bits 31-24 hold the core type:
    10H: Reserved
    20H: Intel Atom® Processors
    30H: Reserved
    40H: Intel® Core™ Processors

2.5 Thread Affinity

To get the best temporal determinism and performance utilization on a processor that supports a hybrid architecture, you must consider where your threads are executed, as discussed above. The developer can either explicitly dictate where critical threads run or allow the RTOS to schedule threads on the cores it deems appropriate. For instance, when working with a critical thread (for example, one processing an industrial Ethernet control frame), you may prefer it to run on the P-cores, while background worker threads may be better placed on the E-cores. In either case (explicit assignment or RTOS control), developers should use the mechanisms provided by the RTOS to control affinity. Historically, the guidance has been to avoid hard affinities. However, if the RTOS schedules a thread across core types, there can be significant run-to-run variation in the execution time.
As a result, using hard affinities prevents potential problems and eliminates any dependency on the RTOS scheduler and its support of hybrid architectures. When choosing your affinity strategy, it is important to consider the frequency of thread context switching and cache flushing that may be incurred by changing a thread’s affinity at runtime.

For the most demanding real-time applications, hard affinity alone might be insufficient, and a core should be dedicated to the real-time thread. In these scenarios, the hybrid architecture can provide an advantage by allowing a single E-core to be dedicated rather than a P-core. The P-core would then be available to run other threads. However, this assumes that an E-core satisfies the performance needs of the real-time thread and that a P-core is not required.

The E-cores share the level two (L2) cache, and as a result real-time applications can benefit from using Intel’s Cache Allocation Technology (CAT) on the shared L2 cache to minimize interference. On the P-cores, by contrast, the L2 cache is dedicated, and CAT can be used to further isolate threads running on the same core. It is also recommended to use CAT on the shared last level cache for optimal real-time performance.

The P-cores support per-core P-States via Hardware Power States, allowing targeted control of the processor’s frequency, so applications that benefit from higher frequencies can leverage this capability to achieve higher performance. By default, the system chooses its own strategy for power throttling unless you explicitly enable or disable power throttling. For more information about optimizing power management for real-time, refer to the real-time tuning guide.
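
The hard-affinity approach discussed in this section can be sketched as follows for a Linux-based system. This is a minimal example, assuming glibc and its sched_setaffinity call; other RTOSes provide their own affinity mechanisms. The helper name pin_current_thread_to_cpu is hypothetical, and the decision about which logical processor index corresponds to a P-core or an E-core is assumed to come from the CPUID-based detection in Section 2.4.

```c
#define _GNU_SOURCE   /* required by glibc for CPU_SET and sched_setaffinity */
#include <errno.h>
#include <sched.h>

/*
 * Pin the calling thread to a single logical processor (hard affinity).
 * Returns 0 on success, or -1 with errno set on failure.
 */
int pin_current_thread_to_cpu(int cpu)
{
    cpu_set_t set;

    /* Reject indices that cannot be represented in a cpu_set_t. */
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* A pid of 0 applies the mask to the calling thread. */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

A real-time thread would typically call such a helper once during initialization, before entering its cyclic loop, so that no affinity change (and no associated cache disturbance) occurs at runtime.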
2.6 Performance Optimizations

Since 12th Generation Intel® Core™ processors represent a significant change in CPU architecture, you may need to fine-tune applications to take full advantage of both core types. That is particularly important for multithreaded applications with complex job managers and multiple-use middleware. For other applications, if the RTOS scheduler (Linux* 5.18+ kernel) is optimized to account for the performance hybrid architecture of 12th Generation Intel® Core™ processors, the impact may be minimal.

Analysis of applications on hybrid architectures has shown that the majority perform well, with older single-threaded applications favoring the P-cores. Applications that were already built to heavily use multithreading, and that can scale to double-digit core counts, were found to benefit from the hybrid architecture due to better throughput. However, there are inevitable performance inversions, attributed to poor multithreading architectures, poor RTOS scheduling, or increased threading overhead. Such issues were addressed by introducing optimizations using the Intel® Thread Director technology, along with hardware hints to the RTOS scheduler (in the Linux* 5.18+ kernel).

Make sure you keep your development environment current to achieve the best performance on 12th Generation Intel® Core™ processors and other hybrid architectures. The RTOS, tools, and libraries should all be regularly updated.

2.7 Specific Optimizations and Solutions for Real-Time Systems

2.7.1 Real-Time Thread

A compute-bound real-time thread runs at its peak performance on P-cores. Such a thread could handle the logic of computing actuator commands. Depending on the number of axes, processing all the inputs and computing all set points could represent a tremendous workload.
Most real-time applications try to process the new set points as early as possible to avoid deadline violations in the case of jitter. For example, processing can begin as soon as the sensor input is available, but not before the RTOS has scheduled the real-time thread.

2.7.2 Efficient-Cores (E-cores)

E-cores run at a lower frequency, and thus execute instructions with lower performance. In 12th Generation Intel® Core™ processors configured with fewer P-cores, E-cores are likely to be a major workhorse. Thus, developers will want to move most non-critical parallel workloads to these cores.

2.7.3 Performance-Cores (P-cores)

P-cores run at a higher frequency, and thus execute instructions with higher performance. P-cores can offer the highest performance with dedicated L2 caches and individual control of the processor frequency. However, care should be taken when dedicating these cores to a real-time thread, because the cost in residual performance will be high.

2.8 L2 Cache Allocation Technology (CAT)

12th Generation Intel® Core™ processor systems with a BIOS that supports Intel® Time Coordinated Computing (Intel® TCC) will have a BIOS option to enable enumeration of L2 Cache Allocation Technology:

Figure 2. The “L2 QOS Enumeration” option in BIOS

When the “L2 QOS Enumeration” option is enabled, software will be able to detect support for L2 CAT via CPUID as described in Chapter 17.19.4.1 “Enumeration and Detection Support of Cache Allocation Technology”, Volume 3B, of the Intel® 64 and IA-32 Architectures Software Developer Manuals.
When this option is enabled on heterogeneous systems with a 12th Generation Intel® Core™ processor, such as the Intel® Core™ i9-12900E and Intel® Core™ i7-12700E processors, reduced E-core performance may be observed when running Linux with kernel support for Resource Control (CONFIG_X86_CPU_RESCTRL).

2.8.1 Resource Control & Heterogeneous systems

When Resource Control (CONFIG_X86_CPU_RESCTRL) is enabled in the kernel, the capacity bitmask (CBM) registers for each cache hierarchy that supports Intel® Resource Director Technology (Intel® RDT) Cache Allocation Technology (CAT) are reinitialized during the boot sequence. On heterogeneous systems with a 12th Generation Intel® Core™ processor, the capacity bitmask for the L2 cache may differ in length between core types (P-core and E-core). As a result, and depending on which core type was used for initial discovery, programming a mask of the same length to all core types may result in reduced performance or kernel error messages.

For example, consider the 12th Generation Intel® Core™ i9-12900E processor, which consists of 8 P-cores and 8 E-cores. The following table shows the capacity bitmasks out of reset and after reinitialization by Resource Control.

Model Specific Register (MSR) 0xD10: CBM for L2 Cache

Core Type  Initial value out of reset  Value after Resource Control reinitialization
P-core     0x3FF                       0x3FF
E-core     0xFFFF                      0x3FF (results in reduced performance)

The above values demonstrate the case when the initial probe of CPUID to obtain the capacity bitmask length was issued from a P-core. Reduced performance is a result of changing the E-core L2 CBM from 16 ways (0xFFFF) to 10 ways (0x3FF), effectively limiting the use of the L2 cache to 62.5% of its total capacity.
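
The 62.5% figure follows from counting the bits set in each mask (10 of 16 ways). A short C sketch of that arithmetic is given here. The function names cbm_ways and cbm_usable_percent are hypothetical helpers for illustration only, and are not part of the Linux kernel or of any Intel library.

```c
#include <stdint.h>

/* Count the cache ways enabled in a capacity bitmask (CBM). */
unsigned cbm_ways(uint32_t cbm)
{
    unsigned ways = 0;

    /* Clear the lowest set bit on each iteration until none remain. */
    for (; cbm != 0; cbm &= cbm - 1)
        ways++;
    return ways;
}

/*
 * Percentage of the cache that remains usable when programmed_cbm is
 * written to a core whose full mask out of reset is full_cbm.
 */
double cbm_usable_percent(uint32_t programmed_cbm, uint32_t full_cbm)
{
    unsigned full = cbm_ways(full_cbm);

    if (full == 0)
        return 0.0;
    return 100.0 * cbm_ways(programmed_cbm & full_cbm) / full;
}
```

With the values from the table, cbm_usable_percent(0x3FF, 0xFFFF) evaluates to 62.5 for an E-core, while cbm_usable_percent(0x3FF, 0x3FF) evaluates to 100 for a P-core.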
It is also possible that the initial CPUID probe is issued from an E-core, resulting in a CBM of 0xFFFF (16 ways). Since it is not possible to program a CBM value of 0xFFFF to a P-core (whose maximum value is 0x3FF), this results in an unchecked MSR access error being recorded in the kernel’s log messages.

2.8.2 Mitigations

There are multiple options for restoring full use of the L2 cache for the E-cores when L2 QOS Enumeration is enabled.

1. Disable Resctrl support of L2 CAT: Add rdt=!l2cat to the kernel boot parameters.

2. Manually reinitialize L2 CBM: Write the full waymask to the IA32_L2_QOS_MASK_0 to IA32_L2_QOS_MASK_15 MSRs (0xD10 to 0xD1F) on at least one E-core per cache_id.

3. Remove Resource Control support from the kernel: Recompile the kernel with CONFIG_X86_CPU_RESCTRL disabled.

2.9 Summary and Conclusion

Developers must fully enumerate the available logical processors on a system to optimize for temporal determinism and performance. You should design your software with performance hybrid architectures (such as the one in the 12th Generation Intel® Core™ processors) in mind. Prepare your code to support performance hybrid architectures by completing these steps:

1. Enable hybrid topology detection.

2. Implement architecture choices that optimize performance.

3. Plan your thread scheduling in advance.