Compiler Optimization: Getting the Most out of High-Level Code

Given the size of today’s projects, it is now imperative to write code in a high-level language, specifically C. But that makes code optimization all the more desirable. Here are some of the techniques and technologies a compiler uses to optimize your code for minimal footprint, highest performance, or a balanced combination of the two.

by Shawn A. Prestridge, Mentor Graphics

Microcontrollers used to be very simple. They had no pipelining, a limited set of registers, and the only peripherals were I/O ports that the hardware designer tied to other pieces of hardware to make them work. As such, writing assembly language code was a relatively straightforward task. These days, architectures can have multistage pipelines, banked registers and many on-chip peripherals. Because of the rising complexity of the devices, C has become the language of choice for writing microcontroller software. But can a compiler generate code as efficient as a human can with an assembler? Assuming the individual has unlimited time, the answer is no. However, in real-world conditions where you must meet schedules and achieve faster time-to-market, a compiler can generate code far more efficiently than any human can.

Conceptually, the operation of a compiler is simple: it takes C source code and compiles it into object code. The object code is later combined by a linker into an executable, but most optimizations are performed by the compiler. The compiler turns your source code into object code in several stages (Figure 1). The first stage runs the source code through a parser, which parses the C statements into a binary tree. The result of this parsing is referred to as “intermediate code.” The first stage of optimization is performed on this intermediate code by the high-level optimizer (HLO).
The HLO analyzes the code and performs transformations based on C language constructs, so it performs no target-specific optimizations. Even though the IAR Embedded Workbench products support over 30 different architectures, a large portion of the compiler code is shared from one IAR Embedded Workbench product to another because the parser and HLO are the same. After the HLO optimizes the intermediate code, the code generator translates the optimized intermediate code into target-specific code. This target code is then optimized by the low-level optimizer, which performs architecture-specific optimizations. The optimized target code is then transformed into object code by a compiler-internal assembler.

Optimization takes place in three phases: analysis, transformation and placement. The analysis phase tries to understand the intention of the source code that you wrote so that the compiler can make intelligent decisions about how to transform it into more efficient C language constructions while preserving the original meaning of the code. These transformations are based on heuristics and generally lead to much tighter code.

The compiler also performs register allocation, which is a key part of producing efficient code. Register allocation decides which variables should be held in registers rather than in RAM. Having a variable in a register allows you to perform mathematical operations on it quickly, without having to read it from or write it back to RAM. The problem is that the microcontroller has only a limited number of registers to hold these variables, so the code has to be analyzed carefully.

The analysis is split into two parts: control flow and data flow. Control flow analysis is performed first and is the basis for data flow analysis. Control flow analysis detects loops, optimizes jumps and finds “unreachable” code. Data flow analysis finds constant values, useless computations and “dead” code.
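As a minimal sketch of what these two analyses look for, the C fragment below contains one statement of each kind. It is illustrative only and is not taken from the article's figures. The function names and the `debug_enabled` flag are hypothetical, and "dead" is used in the article's sense of code that never runs because of the values of variables.

```c
#include <assert.h>

/* Hypothetical build-time switch whose value the compiler can see. */
static const int debug_enabled = 0;

/* Unreachable code: the statement after the unconditional return can never
   execute, whatever the argument is. The code structure alone shows this. */
int limit_to_255(int value)
{
    if (value > 255) {
        return 255;
    }
    return value;
    value = -1;               /* unreachable: no path leads here */
}

/* "Dead" code: the branch is structurally reachable, but the condition is
   known to be 0, so the body never runs. */
int scale_by_4(int value)
{
    int result = value * 4;
    if (debug_enabled) {
        result = -result;     /* never executes while debug_enabled == 0 */
    }
    return result;
}

/* Useless computation: the product is overwritten before it is ever read. */
int add_pair(int a, int b)
{
    int result = a * b;       /* value is never used */
    result = a + b;
    return result;
}

/* Confirms that removing the flagged statements would not change results. */
void check_analysis_examples(void)
{
    assert(limit_to_255(300) == 255);
    assert(limit_to_255(7) == 7);
    assert(scale_by_4(3) == 12);
    assert(add_pair(2, 5) == 7);
}
```

An optimizer that proves these properties can drop the three flagged statements without changing what any of the functions return.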
The difference between unreachable and dead code is that unreachable code can never be executed because of the structure of the code, while dead code can never be executed because of the values of variables.

The second stage of optimization is transformation. There are two levels of transformation: high-level (which is architecture-independent) and low-level (which takes advantage of the facilities provided by the architecture). Figure 2 shows some of the high-level transformations that can occur in the code. The first transformation is called “strength reduction” and replaces an operation with an equivalent one that takes fewer instructions and/or MCU cycles, for example a shift in place of a multiplication by a power of two. The other transformations in Figure 2 seek to eliminate code that is either redundant (common subexpression elimination) or unnecessary (constant folding and useless computations).

Loop transformations are also performed by the high-level optimizer and can be found in Figure 3. The first transformation in this figure is referred to as “loop-invariant code motion” and, as the name implies, moves code that is not affected by the loop operations outside of the loop. The second transformation is called “loop unrolling” and is used to amortize the overhead of the test-and-branch conditions associated with the loop at the expense of slightly larger code.

Lastly, the high-level optimizer decides whether or not to inline a function call based on the number of times the function is called and the size of the code contained within the function. Function calls are very costly, partly because of the branch instructions needed to jump to a function and return from it, but mostly because of the overhead that the microcontroller’s application binary interface (ABI) enforces on the compiler.
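The high-level and loop transformations named above can be sketched as before/after pairs in C. This is an illustrative sketch rather than a reproduction of Figures 2 and 3, and every function name in it is hypothetical. A compiler applies these rewrites to its intermediate code, so the "after" versions only show what the equivalent source would look like.

```c
#include <assert.h>
#include <stddef.h>

/* Strength reduction: an unsigned multiply by 8 becomes a left shift by 3. */
unsigned times8_source(unsigned x)  { return x * 8u; }
unsigned times8_reduced(unsigned x) { return x << 3; }

/* Common subexpression elimination: (a + b) is computed once and reused. */
int cse_source(int a, int b, int c)      { return (a + b) * c + (a + b); }
int cse_transformed(int a, int b, int c) { int t = a + b; return t * c + t; }

/* Constant folding: the product is evaluated at compile time. */
long seconds_per_day_source(void) { return 60L * 60L * 24L; }
long seconds_per_day_folded(void) { return 86400L; }

/* Loop-invariant code motion: k does not change inside the loop. */
void scale_source(int *data, size_t n, int gain, int offset)
{
    for (size_t i = 0; i < n; i++) {
        int k = gain * 2 + offset;   /* recomputed on every iteration */
        data[i] *= k;
    }
}

void scale_hoisted(int *data, size_t n, int gain, int offset)
{
    int k = gain * 2 + offset;       /* computed once, before the loop */
    for (size_t i = 0; i < n; i++) {
        data[i] *= k;
    }
}

/* Loop unrolling by four, with a cleanup loop for the leftover elements. */
int sum_source(const int *data, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

int sum_unrolled(const int *data, size_t n)
{
    int total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {     /* one test-and-branch per four adds */
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
    }
    for (; i < n; i++) {             /* remaining 0 to 3 elements */
        total += data[i];
    }
    return total;
}

/* Checks that each pair computes the same result on sample inputs. */
void check_transformations(void)
{
    int a[6] = {1, 2, 3, 4, 5, 6};
    int b[6] = {1, 2, 3, 4, 5, 6};

    assert(times8_source(5u) == times8_reduced(5u));
    assert(cse_source(2, 3, 4) == cse_transformed(2, 3, 4));
    assert(seconds_per_day_source() == seconds_per_day_folded());
    assert(sum_source(a, 6) == sum_unrolled(a, 6));

    scale_source(a, 6, 3, 1);
    scale_hoisted(b, 6, 3, 1);
    for (size_t i = 0; i < 6; i++) {
        assert(a[i] == b[i]);
    }
}
```

The unrolled loop shows the size-for-speed trade mentioned above: it needs a second cleanup loop, but it executes roughly a quarter as many loop tests.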
This ABI requires that certain registers be preserved across function calls, so a function call typically involves pushing those registers to the stack beforehand and popping them back off afterward to save and restore the context. If the function’s code is inlined, this overhead is eliminated and the function runs faster (and is sometimes smaller!) than if the function is actually called. Inlining gives you the functionality of a macro while keeping the code type-safe.

The low-level optimizer (LLO) uses the instruction set of the underlying architecture to find ways to optimize the code. The LLO examines the target code to find places where the architecture can accomplish the goal with a short sequence of assembler instructions. Figure 4 illustrates two such constructs that can be reduced to just a few fast-executing instructions. The LLO also looks at register allocation to decide which variables should be located in registers. Although this allocation is normally not considered an optimization per se, it has a dramatic effect on how fast the resulting code can execute, since operations can be performed directly on the data in the register rather than having to first read the value from some other memory source. The LLO also decides where to place the code and data using a technique referred to as “static clustering,” which collects the global and static variables into one place. This has two important benefits: it allows the compiler to use the same base pointer for many memory accesses, and it eliminates alignment gaps between the memory elements.

There are limits to the optimization that can be performed. For example, common subexpression elimination can only be applied to parts of expressions that do not involve function calls. The reason is that function calls may have side effects that cannot be determined at compile time, so the compiler must play it safe and preserve all function calls.
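The limit described above can be made concrete with a short sketch. It assumes two hypothetical functions: `next_sample`, which changes global state on every call, and `square`, which is small, visible to the compiler and free of side effects. None of these names come from the article.

```c
#include <assert.h>

/* Hypothetical global state modified by next_sample(). */
int call_count = 0;

/* Has a side effect: every call increments call_count. */
int next_sample(int base)
{
    call_count++;
    return base + call_count;
}

/* No side effects: the result depends only on the argument. */
static inline int square(int x)
{
    return x * x;
}

/* Both calls must be kept. Merging them into one call would change the
   returned value and leave call_count one short. */
int two_samples(int base)
{
    int first = next_sample(base);
    int second = next_sample(base);
    return first + second;
}

/* Once square() is inlined, the compiler sees x * x + x * x and is free to
   compute the product only once. */
int two_squares(int x)
{
    return square(x) + square(x);
}

/* What the inlined and simplified version is equivalent to. */
int two_squares_after_cse(int x)
{
    int t = x * x;
    return t + t;
}

/* Demonstrates that the two calls to next_sample() are not interchangeable. */
void check_side_effects(void)
{
    call_count = 0;
    assert(two_samples(10) == 23);   /* (10 + 1) + (10 + 2) */
    assert(call_count == 2);
    assert(two_squares(5) == two_squares_after_cse(5));
}
```

If the body of `next_sample` lived in another source file, the compiler would have no way of knowing whether it behaves like `square` or not, which is why it keeps every call.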
If the function is inlined, however, the compiler can examine the code more effectively and apply common subexpression elimination to avoid unnecessary computations, with the added benefit of avoiding needless function calls.

The C language provides for the concept of separate compilation units, which means that source code files in the project can be compiled individually. While this is a very handy feature for organizing source files into logical groups, it has the unfortunate side effect that the compiler may not be aware of what is happening in other source files, which forces it to generate extra code in order to be conservative in its assumptions. This is particularly true if you are calling small functions that are defined in other source files. IAR Embedded Workbench has a feature that allows you to choose “Multi-file compilation,” where the compiler treats several source files as one monolithic piece of code. This gives the compiler greater visibility into what the code is doing, so it can make better decisions about how to optimize the code effectively.

IAR Embedded Workbench allows you to control these optimizations at several different levels to give you fine granularity in your code development. The project-level setting is a global setting that becomes the default for all files in the project. Several source files can be contained within a group, and that group can override the inherited optimization settings. Similarly, optimization can be overridden at the file level or even at the function level by the use of pragma directives. Additionally, optimization can have different goals for the compiler to achieve: size, speed or a balanced approach. As the names of the first two imply, the compiler will optimize purely for size or speed, respectively.
When you use the balanced setting, the compiler tries to strike a healthy balance between size and speed, sometimes giving a little on one to achieve a little of the other. Moreover, IAR Embedded Workbench products also allow control over which transformations are applied to the code so that you can get exactly what you need.

Embedded compilers have evolved greatly over the last thirty years, especially in their optimization capabilities. Many years ago, developers had to be very careful to structure their C code in such a way that it could be easily optimized by the compiler. However, modern compilers employ many different techniques to produce very tight and efficient code so that you can focus on writing your source in a clear, logical and concise manner.

IAR Systems, Uppsala, Sweden. +46 18 16 78 00. [www.iar.com].