Best Practices for Real-Time
Optimizations With the 12th Generation
Intel® Core™ Processors
White Paper
December 2022
Document Number: 737973-1.1
Notices and Disclaimers
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products
described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject
matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product
specifications and roadmaps.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or visit
www.intel.com/design/literature.htm.
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation.
Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system
manufacturer or retailer or learn more at intel.com.
Intel, the Intel logo and Intel Core are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation
Contents
1.0 Introduction
2.0 Architecture Overview
2.1 Topology
2.2 Architecture Implications
2.3 Stock-Keeping Units (SKUs)
2.4 CPU Topology Detection
2.4.1 CPUID
2.5 Thread Affinity
2.6 Performance Optimizations
2.7 Specific Optimizations and Solutions for Real-Time Systems
2.7.1 Real-Time Thread
2.7.2 Efficient-Cores (E-cores)
2.7.3 Performance-Cores (P-cores)
2.8 L2 Cache Allocation Technology (CAT)
2.8.1 Resource Control & Heterogeneous systems
2.8.2 Mitigations
2.9 Summary and Conclusion
Figures
Figure 1. Comparison of Performance-cores and Efficient-cores
Figure 2. The “L2 QOS Enumeration” option in BIOS
Tables
Table 1. 12th Generation Intel® Core™ Processors Which Support Real-Time Applications
Table 2. CPUID Functions for the Hybrid Flag and Core Type
Revision History
Date           Revision  Description
December 2022  1.1       Added section on L2 Cache Allocation Technology (CAT)
August 2022    1.0       Initial release
§
1.0
Introduction
The 12th Generation Intel® Core™ processors (previously codenamed Alder Lake)
incorporate a new performance hybrid architecture that combines two core types:
Performance-cores or P-cores (previously codenamed Golden Cove) and Efficient-cores
or E-cores (previously codenamed Gracemont). This white paper is meant for real-time
software developers and provides an architecture brief and best practices for real-time
optimizations using the new hybrid architecture.
The introduction of hybrid cores is a significant architecture change, which means
developers will need to understand the specifics, incorporate new design goals, and
make key decisions around best practices to create efficient software that utilizes all
available capabilities and features. Properly detecting CPU topology is critical for
obtaining optimal real-time performance and compatibility for applications running on
hybrid processors.
This document provides information on architecture, programming paradigm,
detection, and optimization strategies.
§
2.0
Architecture Overview
This section includes information about the performance hybrid architecture used in
select 12th Generation Intel® Core™ processors.
2.1
Topology
The 12th Generation Intel® Core™ processor is a new performance hybrid architecture
that combines two core types: P-cores and E-cores.
Topologically, each P-core has its own L2 cache, which is connected to the L3 cache.
Each E-core module has 4 E-cores that share an L2 cache. The E-core modules connect to
an L3 cache that is shared with the P-cores. All cores are exposed to the real-time
operating system (RTOS) as individual Logical Processors.
Figure 1. Comparison of Performance-cores and Efficient-cores
The P-cores and E-cores feature separate L1 and L2 caches so they can run largely
independently of each other. The L3 cache (LLC) is shared.
2.2
Architecture Implications
From a programming perspective, all cores appear functionally identical, differing only
in performance and efficiency. The RTOS needs to leverage the Intel® Thread Director
capability to distribute threads intelligently. Support for this is planned in the
Linux* 5.18 kernel. The remainder of this document focuses on optimizations and
considerations in addition to RTOS thread scheduling.
Real-time developers should be aware of these areas when writing code for a hybrid
architecture:
• CPUID, by design, returns different values depending on the core it is executed on.
On hybrid processors, more of the CPUID leaves have data that varies. Software that
expects CPUID values to be the same across cores must therefore either not run across
core types or rely only on CPUID leaves whose data does not change between core types.
• Bitwise operations (AND, OR, XOR, NOT, rotate) have no notion of signed overflow, so
the value of the overflow flag after them varies across processor architectures: some
clear the flag unconditionally, others leave it unchanged, and still others set it to an
undefined value. Shifts and multiplies do permit a well-defined value, but it is not
consistently implemented. For example, the x86 instruction set defines the overflow
flag only for multiplies and 1-bit shifts. For multi-bit shifts the flag is undefined,
and its value can vary between P-cores and E-cores.
2.3
Stock-Keeping Units (SKUs)
You can use selected 12th Generation Intel® Core™ processors to create real-time
systems and other embedded applications.
The following variants of 12th Generation Intel® Core™ processors are recommended
for real-time applications and will be available with various combinations of cores and
execution units (EUs):
Table 1. 12th Generation Intel® Core™ Processors Which Support Real-Time Applications
Processor                         Total Cores  P-cores  E-cores  Total Threads
Intel® Core™ i9-12900E processor  16           8        8        24
Intel® Core™ i7-12700E processor  12           8        4        20
Intel® Core™ i5-12500E processor  6            6        0        12
Intel® Core™ i3-12100E processor  4            4        0        8
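The Total Threads column in Table 1 can be reproduced from the core counts if one assumes that each P-core exposes two logical processors (SMT) and each E-core exposes one, which matches the statement in this paper that E-cores are not hyper-threaded. The following Python sketch shows only that arithmetic; the function name and the dictionary are illustrative and not part of any Intel tool.

```python
def logical_processors(p_cores: int, e_cores: int, threads_per_p_core: int = 2) -> int:
    """Logical processors seen by the OS, assuming SMT on P-cores only."""
    return p_cores * threads_per_p_core + e_cores


# (P-cores, E-cores) per SKU, copied from Table 1.
SKUS = {
    "i9-12900E": (8, 8),
    "i7-12700E": (8, 4),
    "i5-12500E": (6, 0),
    "i3-12100E": (4, 0),
}

for name, (p, e) in SKUS.items():
    print(name, logical_processors(p, e))
```

If SMT is disabled in firmware, passing `threads_per_p_core=1` models that case.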
2.4
CPU Topology Detection
Properly detecting CPU topology is critical for obtaining optimal performance and
compatibility for applications running on hybrid processors such as 12th Generation
Intel® Core™ processors. Previously, a developer could count the total number of
physical processors on a system and assume performance parity among them. On a hybrid
processor, that assumption no longer holds.
There are several methods you can use to determine your target CPU’s topology; the
general guidance is to use CPUID.
2.4.1
CPUID
As of March 2020, the Intel® Architecture Instruction Set Extensions And Future
Features Programming Reference provides the details necessary for using the CPUID
instruction to determine hybrid topology on a target system.
There are two new flags defined:
The Hybrid Flag can be obtained by calling CPUID with the value “07H” in the EAX
register (and 0 in ECX) and reading bit 15 of the EDX register.
The Core Type Flag can be obtained by calling CPUID on each logical processor with a
value of “1AH” in the EAX register; this will return each processor’s type in bits 24-31 of
the EAX register.
You can use the core type to determine whether a logical processor is an E-core (type
20H) or a P-core (type 40H). However, the Core Type value does not differentiate
between physical cores and simultaneous multithreading (SMT) siblings on a P-core; both
are reported as Intel® Core™ processors (40H). E-cores are not hyper-threaded and do
not require special detection of logical cores per physical processor.
Note: The width of the L2 CAT way mask differs between the P and the E cores.
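The bit positions described for the Hybrid Flag and the Core Type can be illustrated with a small decoding sketch. Python cannot execute CPUID directly, so the sketch assumes the raw EDX and EAX register values were already captured on each logical processor by some native helper (not shown). The function names are illustrative.

```python
# Core type encodings from CPUID leaf 1AH, EAX bits 31-24.
CORE_TYPES = {
    0x20: "E-core (Intel Atom)",
    0x40: "P-core (Intel Core)",
}


def is_hybrid(leaf07_edx: int) -> bool:
    """Hybrid Flag: CPUID.07H.0H:EDX bit 15."""
    return bool((leaf07_edx >> 15) & 1)


def core_type(leaf1a_eax: int) -> str:
    """Core type: CPUID.1AH.0H:EAX bits 31-24."""
    code = (leaf1a_eax >> 24) & 0xFF
    return CORE_TYPES.get(code, f"reserved/unknown ({code:#04x})")
```

Because CPUID reports the type of the core it executes on, the capture step has to run once per logical processor, with the calling thread pinned to that processor.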
Table 2. CPUID Functions for the Hybrid Flag and Core Type
Name         Function  Leaf   Register  BitStart  BitEnd  Comment
             (EAX)     (ECX)
Hybrid Flag  07H       0      EDX       15        15      If 1, the processor is identified as a hybrid
                                                          part. Additionally, on hybrid parts
                                                          (CPUID.07H.0H:EDX[15]=1), software must
                                                          consult the native model ID and core type
                                                          from the Hybrid Information Enumeration Leaf.
Core Type    1AH       0      EAX       24        31      Hybrid Information Sub-leaf (EAX = 1AH,
                                                          ECX = 0) EAX Bits 31-24: Core type
                                                          10H: Reserved
                                                          20H: Intel Atom® Processors
                                                          30H: Reserved
                                                          40H: Intel® Core™ Processors
2.5
Thread Affinity
To get the best temporal determinism and performance utilization on a
processor that supports a hybrid architecture, you must consider where your threads
are being executed, as discussed above. Specifically, the developer can explicitly dictate
where critical threads will run, or the developer can allow the RTOS to schedule the
threads on specific cores as it deems appropriate.
For instance, when working with a critical thread (for example, processing an industrial
ethernet control frame), you may prefer it to prioritize execution on the logical P-cores.
However, it may be more optimal to run background worker threads on the E-cores. In
either case (explicit assignment, or RTOS control), it is recommended that the
developer utilizes the mechanisms provided by the RTOS to control affinity.
Historically, the guidance has been to avoid hard affinities. However, if the RTOS
schedules a thread across core types, there can be significant run-to-run variation in
the execution time. Using hard affinities therefore prevents potential problems and
eliminates any dependency on the RTOS scheduler and its support of hybrid
architectures.
When choosing your affinity strategy, it is important to consider the frequency of
thread context switching and cache flushing that may be incurred by changing a
thread’s affinity at runtime.
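As one illustration of a hard affinity on Linux, the sketch below pins the calling thread to a fixed set of logical processors using the standard library's `os.sched_setaffinity`. The helper names are illustrative, and the CPU list has to come from your own topology detection (for example, the CPUID method in section 2.4.1); this paper does not specify which logical processor numbers map to which core type.

```python
import os


def parse_cpu_list(text: str) -> set[int]:
    """Parse a kernel-style CPU list such as "0-3,8,10-11" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def pin_calling_thread(cpus: set[int]) -> set[int]:
    """Restrict the calling thread to `cpus` (Linux only) and return the new mask."""
    # On Linux, pid 0 refers to the calling thread.
    allowed = cpus & os.sched_getaffinity(0)
    if not allowed:
        raise ValueError("none of the requested CPUs are available to this thread")
    os.sched_setaffinity(0, allowed)
    return os.sched_getaffinity(0)
```

A critical thread would call `pin_calling_thread(parse_cpu_list("0-3"))` once at startup, where `"0-3"` is a placeholder for the logical processors you identified as P-cores. Setting the affinity once, rather than changing it at runtime, avoids the migration costs described above.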
For the most demanding real-time applications hard affinity alone might be insufficient,
and one would dedicate a core to the real-time thread. In these scenarios the hybrid
architecture can provide an advantage by allowing a single efficiency core to be
dedicated, rather than a performance core. The performance core would then be
available to run other threads. However, this assumes the real-time thread performance
is satisfied by the efficiency core and does not require a performance core.
The E-cores share the level two (L2) cache, and as a result real-time applications can
benefit from utilizing Intel’s Cache Allocation Technology (CAT) on the shared L2 cache
to minimize interference. Alternatively, on the P-cores the L2 cache is dedicated, and
CAT can be used to further isolate between threads running on the core. It is also
recommended to use CAT on the shared last level cache for optimal real-time
performance.
The P-cores support per core P-States via Hardware Power States allowing targeted
control of the processor’s frequency, so applications that benefit from higher
frequencies can leverage this capability to achieve higher performance.
By default, the system will choose its own strategy for determining power throttling,
unless you explicitly choose to enable or disable power throttling. For more
information about optimizing power management for real-time refer to the real-time
tuning guide.
2.6
Performance Optimizations
Since 12th Generation Intel® Core™ processors represent a significant change in CPU
architecture, you may need to fine-tune applications to take full advantage of both core
types. That will be particularly important for multithreaded applications with complex
job managers and multiple-use middleware. Otherwise, if the RTOS scheduler (Linux*
5.18+ kernel) is optimized to account for the performance hybrid architecture of 12th
Generation Intel® Core™ processors, the impact may be minimal.
Analysis of applications on hybrid architectures has shown the majority perform well,
with older single threaded applications favoring the P-cores. Applications that were
already built to heavily utilize multithreading, and that can scale to double-digit core
counts, were found to benefit from hybrid architecture due to better throughput.
However, there are inevitable performance inversions, attributed to either poor
multithreading architectures, poor RTOS scheduling, or increased threading overhead.
Such issues were addressed by introducing optimizations using the Intel® Thread
Director technology, along with hardware hints to the RTOS scheduler (in Linux* 5.18+
kernel). Keep your development environment current to achieve the best
performance on 12th Generation Intel® Core™ processors and other hybrid
architectures. The RTOS, tools, and libraries should all be regularly updated.
2.7
Specific Optimizations and Solutions for Real-Time Systems
2.7.1
Real-Time Thread
A compute-bound real-time thread runs at its peak performance when running on P-cores.
For example, it could handle the logic of computing actuator commands. Depending on the
number of axes, processing all the inputs and computing all set points could represent a
tremendous workload. Most real-time applications try to process the new set points as
early as possible to avoid deadline violations in the case of jitter. For example,
processing can begin as soon as the sensor input is available, but only once the RTOS
has scheduled the real-time thread.
2.7.2
Efficient-Cores (E-cores)
E-cores run at a lower frequency, and thus execute instructions with lower
performance. Because 12th Generation Intel® Core™ processors are configured with
fewer P-cores, E-cores are likely to be a major workhorse. Thus, developers will want to
move most non-critical parallel workloads to these cores.
2.7.3
Performance-Cores (P-cores)
P-cores run at a higher frequency, and thus execute instructions with higher
performance. P-cores can offer the highest performance with dedicated L2 caches and
individual control of the processor frequency. However, take care when dedicating one
of these cores to a real-time thread, because the cost in residual performance will
be high.
2.8
L2 Cache Allocation Technology (CAT)
12th Generation Intel® Core™ processor systems with a BIOS that supports Intel® Time
Coordinated Computing (Intel® TCC) will have a BIOS option to enable enumeration of
L2 Cache Allocation Technology:
Figure 2. The “L2 QOS Enumeration” option in BIOS
When the “L2 QOS Enumeration” option is enabled, software will be able to detect
support for L2 CAT via CPUID as described in Chapter 17.19.4.1 “Enumeration and
Detection Support of Cache Allocation Technology”, Volume 3B, of the Intel® 64 and
IA-32 Architectures Software Developer Manuals.
When this option is enabled on heterogeneous systems with a 12th Generation Intel®
Core™ processor such as the Intel® Core™ i9-12900E and Intel® Core™ i7-12700E
processors, reduced E-core performance may be observed when running Linux with
kernel support for Resource Control (CONFIG_X86_CPU_RESCTRL).
2.8.1
Resource Control & Heterogeneous systems
When Resource Control (CONFIG_X86_CPU_RESCTRL) is enabled in the kernel, during
the boot sequence the capacity bitmask (CBM) registers for each cache hierarchy that
supports Intel® Resource Director Technology (Intel® RDT) Cache Allocation Technology
(CAT) are reinitialized. On heterogeneous systems with a 12th Generation Intel® Core™
processor, the length of the L2 capacity bitmask may differ between
core types (P-core and E-core). As a result, and depending on which core type was used
for initial discovery, programming a mask of one length to all core types may
result in reduced performance or kernel error messages.
For example, consider the 12th Generation Intel® Core™ i9-12900E processor which
consists of 8 P-cores and 8 E-cores. The following table represents the capacity
bitmasks out of reset and after reinitialization by Resource Control.
Model Specific Register (MSR) 0xD10 – CBM for L2 Cache
Core Type  Initial value out of reset  Value after Resource Control reinitialization
P-Core     0x3FF                       0x3FF
E-Core     0xFFFF                      0x3FF (results in reduced performance)
The above values demonstrate the case when the initial CPUID probe for the
capacity bitmask length was issued from a P-core. Performance is reduced because the
E-core L2 CBM changes from 16 ways (0xFFFF) to 10 ways (0x3FF), which limits
use of the L2 cache to 62.5% of its total capacity (10 of 16 ways).
It is possible that CPUID may instead be issued from an E-core, resulting in a 16-bit
CBM of 0xFFFF. Since it is not possible to program a CBM value of 0xFFFF to a P-core
(whose maximum value is 0x3FF), this will result in an unchecked MSR access error being
recorded in the kernel’s log messages.
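Both failure modes come down to simple bit arithmetic on the capacity bitmasks. The Python sketch below reproduces the 62.5% figure and the "mask does not fit" condition using the two full-mask values given for the Intel® Core™ i9-12900E example; the helper names are illustrative only.

```python
P_CORE_FULL_CBM = 0x3FF    # 10 ways, from the example above
E_CORE_FULL_CBM = 0xFFFF   # 16 ways, from the example above


def cbm_ways(mask: int) -> int:
    """Number of cache ways enabled by a capacity bitmask."""
    return bin(mask).count("1")


def usable_fraction(mask: int, full_mask: int) -> float:
    """Share of the cache that `mask` leaves usable, relative to `full_mask`."""
    return cbm_ways(mask & full_mask) / cbm_ways(full_mask)


def fits(mask: int, full_mask: int) -> bool:
    """True if `mask` sets no bits beyond the core type's full mask."""
    return (mask & ~full_mask) == 0


print(usable_fraction(P_CORE_FULL_CBM, E_CORE_FULL_CBM))  # P-core mask on an E-core
print(fits(E_CORE_FULL_CBM, P_CORE_FULL_CBM))             # E-core mask on a P-core
```

The first call corresponds to the reduced-performance case (a P-core mask applied to an E-core), and the second to the case that produces the MSR access error (an E-core mask applied to a P-core).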
2.8.2
Mitigations
There are multiple options for restoring full use of the L2 cache for the E-cores when
L2 QOS Enumeration is enabled.
1. Disable Resctrl support of L2 CAT: Add rdt=!l2cat to the kernel boot
parameters.
2. Manually reinitialize L2 CBM: Write the full waymask to the
IA32_L2_QOS_MASK_0 through IA32_L2_QOS_MASK_15 MSRs (0xD10 through 0xD1F) on at
least one E-core per cache_id.
3. Remove Resource Control support from the kernel: Recompile the kernel and
disable CONFIG_X86_CPU_RESCTRL.
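The second option could be scripted roughly as follows. This is a sketch under several assumptions that are not stated in this paper: the Linux `msr` driver is loaded and exposes `/dev/cpu/<n>/msr`, the script runs as root, and you have already identified one logical processor per E-core L2 cache_id. The helper names are illustrative.

```python
import os
import struct

L2_QOS_MASK_BASE = 0xD10   # first L2 mask MSR
L2_QOS_MASK_COUNT = 16     # 0xD10 through 0xD1F


def l2_mask_writes(full_mask: int) -> list[tuple[int, int]]:
    """(MSR address, value) pairs that restore the full way mask in every slot."""
    return [(L2_QOS_MASK_BASE + i, full_mask) for i in range(L2_QOS_MASK_COUNT)]


def write_msr(cpu: int, address: int, value: int) -> None:
    """Write one 64-bit MSR on logical processor `cpu` via the Linux msr driver."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        # The msr device uses the file offset as the MSR address.
        os.pwrite(fd, struct.pack("<Q", value), address)
    finally:
        os.close(fd)
```

For each chosen E-core `cpu`, the writes would be applied with `for address, value in l2_mask_writes(0xFFFF): write_msr(cpu, address, value)`. Separating the plan (`l2_mask_writes`) from the privileged write makes it possible to inspect the register list before touching hardware.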
2.9
Summary and Conclusion
Developers must fully enumerate the available logical processors on a system to
optimize for temporal determinism and performance. You should design your software
with performance hybrid architectures (such as the one in the 12th Generation Intel®
Core™ processors) in mind. Prepare your code to support performance hybrid
architectures by completing these steps:
1. Enable hybrid topology detection.
2. Implement architecture choices that optimize performance.
3. Plan your thread scheduling in advance.
§