SIMG-706
02/15/16
Francois Alain
999-18-9888

Optimizing Computer Runtime Using Code Optimization and Data Prefetching

Abstract:

This report covers techniques available for optimizing the use of computer resources when doing image processing. Computer simulations and calculations are an ever larger part of our lives as engineers and scientists. However, the computers that run those simulation programs often find themselves waiting for the required data to be fetched from memory; for some scientific programs, up to 50% of the run time is spent waiting for data. Code optimization and data prefetching are two techniques for reducing run time.

Introduction:

Until the advent of fast microprocessors, most real-time image processing was done on custom-built hardware and signal processing chips. These processors did the job well and within the allocated time frame, but they have drawbacks. Such systems can only be used for the task they were designed for; whenever they are not in use, they sit in a corner of the lab gathering dust. They are also quite expensive, as is generally the case with custom-built equipment: because they are sold in low volumes, the makers must raise prices to cover research and development costs.

The fast pace of electronics advancement is changing (and has been changing) the way image processing is done. "Multi-purpose" processors, like the ones powering personal computers, are getting ever faster, and with those increased operating speeds, the range of image processing operations they can handle is growing rapidly. This rapid evolution in processor development also means that traditional equipment becomes obsolete a short time after its acquisition, as new and improved systems are introduced. Due to its design, traditional equipment can hardly be upgraded; a new system has to be purchased if an upgrade is desired.

Background:

When the cost of a system is the driving factor behind its purchase, the method of choice for image processing often turns out to be writing software that runs on a personal computer or a workstation. Although slower than a customized chip, the software solution has the advantage of being modifiable and reusable: a few simple modifications allow the user to adapt the program to other needs.

To solve the lack of upgradability of many imaging systems, many people call for an "open system" approach to designing these tools. Most of the parts used to build the equipment are "off the shelf" components, and the system's operation is determined by in-house software instead of by the unmodifiable custom-built hardware chip that was the heart of the system until recently. The hardware costs less than custom-built components, and the software can be created relatively quickly using the vast number of software libraries available to programmers. One advantage of this approach is that whenever the system is not in use for imaging, it can be used for other things, such as word processing. Another is that what once took two different specialized machines can now be done on a single computer, simply by using two different software components (and the appropriate peripherals). As long as one follows the open system approach, the hardware can be replaced as it becomes obsolete, and the software can easily be ported from the old station to the new one, or recompiled for the architecture of the new host machine. Writing your own software also has the advantage of fostering code reusability.
The code is also easily modifiable if it no longer fills the user's needs.

The problem with having people write their own software is that the results will vary greatly with the programmer's level of knowledge. The run time of a program depends heavily on the skills of the programmer and the optimization techniques used. That can make the difference between a system being too slow for some application and the same system (with different software) being acceptable for the task at hand.

Throughout our field of study at RIT, most of us will be required to do image processing, as part of research or as part of a class. We will use computer programs to simulate, process, or calculate image enhancement techniques. Time, yours or the processor's, is a precious resource, and we must do our best to use it well. Hardware and software techniques, as well as computer architectures, are covered in this report. Code optimization and data prefetching are two of a multitude of techniques that can enhance the performance of software. The following is part literature review and part sharing of my own experience in optimizing software for real-time systems.

Code Optimization:

The first subject I will approach is code optimization. It can be defined as writing code so that it runs as fast as possible on its host computer. The most direct way to achieve this goal is to write the code in a low-level language such as assembly. Assembly language is a non-intuitive way of writing code; by this I mean its structure is less "language-like" than that of high-level languages. Because it is non-intuitive, development time is longer, which drives up the cost of developing the software, and very few people are familiar with it. The best of both worlds is to embed assembly language instructions in high-level code: the programmer writes most of the code in an intuitive high-level language and uses assembly only for the small parts where optimization is required to improve the run time.

Most high-level language compilers offer options as to what type of code to generate at compile time, such as optimizing for run time or for code size. An action as simple as checking the "optimize for time" option box can generate notable improvements in a program's processing time.

Programming has been around for quite some time, so many of the problems a programmer encounters have likely already been solved by another programmer. Code libraries exist that contain routines that probably do what you are looking for; some people refer to them as "numerical recipes" libraries. Much of the code in these libraries was developed for scientific programs, so most of the routines have been optimized for run time. For those of us taking SIMG-782, Digital Image Processing, Dr. Rhody clearly made the point about using numerical recipes libraries or built-in routines when he mentioned that the routine he wrote for convolution ran 35 times slower than IDL's built-in routine CONVOL. Dr. Rhody's code appeared to be clean, with no unnecessary code or loops to slow it down; the difference most likely comes from the fact that CONVOL is written in a low-level language and optimized for run time, while Dr. Rhody's routine was written in IDL, the Interactive Data Language.
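To make the comparison concrete, the sketch below shows, in C, the kind of straightforward nested-loop convolution a user might write by hand. The function name and the fixed 3x3 kernel are my own illustration, not taken from the course material; the point is only that a hand-written loop like this, especially in an interpreted language, is usually no match for a compiled, tuned library routine such as CONVOL.

    /* Minimal hand-written 3x3 convolution sketch (illustrative only).
     * Border pixels are left untouched to keep the example short. */
    void convolve3x3(const float *in, float *out,
                     int rows, int cols, const float kernel[3][3])
    {
        for (int r = 1; r < rows - 1; r++) {
            for (int c = 1; c < cols - 1; c++) {
                float sum = 0.0f;
                /* Accumulate the weighted 3x3 neighbourhood around (r, c). */
                for (int kr = -1; kr <= 1; kr++)
                    for (int kc = -1; kc <= 1; kc++)
                        sum += kernel[kr + 1][kc + 1]
                             * in[(r + kr) * cols + (c + kc)];
                out[r * cols + c] = sum;
            }
        }
    }

Even in a compiled language, a plain loop like this leaves performance on the table compared with a routine tuned for the host processor; in an interpreted language the gap is much larger.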
If one doesn't have access to numerical recipe libraries, or if they don't exist for the language you are using, a vast number of numerical recipe books have been published. They contain the algorithms from which the libraries were built, and they sometimes offer several algorithms that achieve the same result: one will be faster, one will use less memory, or one will be more efficient in some other way. The user only has to choose the one that fits his needs. The Fast Fourier Transform (FFT) is an operation for which many such algorithms, or numerical recipes, exist.

The following are proven principles that apply to optimization in any computer language. They are not listed in any order of importance. (A short code sketch follows the list, illustrating the loop and input/output rules.)

Don't optimize as you go: Write your program without regard to possible optimizations, concentrating instead on making sure that the code is clean, correct, and understandable. If it's too big or too slow when you've finished, then you can consider optimizing it.

Remember the 80/20 rule: In many fields you can get 80% of the result with 20% of the effort (also called the 90/10 rule, depending on whom you talk to). Whenever you're about to optimize code, find out where that 80% of execution time is going, so you know where to concentrate your effort.

Always run "before" and "after" benchmarks: How else will you know that your optimizations actually made a difference? If your optimized code turns out to be only slightly faster or smaller than the original version, undo your changes and go back to the original, clear code.

Use the right algorithms and data structures: For example, don't use an O(n^2) bubble sort to sort a thousand elements when there's an O(n log n) quicksort available. Similarly, don't store a thousand items in an array that requires an O(n) search when you could use an O(log n) binary tree.

Use efficient loops: Since loops repeat themselves, any inefficiency is compounded. An error as simple as initializing a variable inside the loop when it would have worked just fine outside the loop can increase the run time dramatically.

Define variables that are used at the same time sequentially: Computers must fetch data from memory, and that memory is brought into the cache in blocks. If the variables are defined sequentially, there is a good chance that a single fetch will bring all of them into the cache. See the next topic, data prefetching, for more information.

Do only the necessary input/output: Input and output to peripherals take time and should be kept to a minimum. A counter that prints "XX% complete" is inefficient and should not be used inside a loop; an increase in run time of an order of magnitude can be expected with such messages. If a message to the user is required, use a general form like "please wait while this processes".
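To illustrate the loop and input/output rules, here is a minimal sketch in C. The function names, the gain calculation, and the array size are made up for the example; the point is only that the slow version repeats work that never changes and prints a message on every pass, while the fast version does each of those things once.

    #include <stdio.h>
    #include <math.h>

    #define N 1000000   /* illustrative array size */

    /* Slow version: invariant work and I/O are inside the loop. */
    void scale_slow(float *data, float gain_db)
    {
        for (int i = 0; i < N; i++) {
            float gain = powf(10.0f, gain_db / 20.0f);   /* recomputed N times   */
            data[i] *= gain;
            printf("\r%d%% complete", (100 * i) / N);    /* output on every pass */
        }
    }

    /* Fast version: the invariant is hoisted and the message printed once. */
    void scale_fast(float *data, float gain_db)
    {
        float gain = powf(10.0f, gain_db / 20.0f);       /* computed once */
        printf("Please wait while the data is scaled...\n");
        for (int i = 0; i < N; i++)
            data[i] *= gain;                             /* tight inner loop */
    }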
These principles are not a panacea, but they are a good indication of how well a program will perform. The extent of the optimization that can be done also depends on the processor that executes the code: a chip with a math coprocessor will execute calculations faster than one without it.

A more recent development is the advent of Intel's MMX(TM) technology, designed to accelerate multimedia and communications applications. The technology adds new instructions and data types that allow applications to reach a new level of performance. It exploits the parallelism inherent in many multimedia and communications algorithms, yet maintains full compatibility with existing operating systems and applications. As an example, image pixel data are generally represented as 8-bit integers, or bytes. With MMX technology, eight of these pixels are packed together into a 64-bit quantity and moved into an MMX register. When an MMX instruction executes, it takes all eight pixel values at once from the MMX register, performs the arithmetic or logical operation on all eight elements in parallel, and writes the result back into an MMX register.

To take advantage of MMX technology, the code has to be recompiled with an MMX-compatible compiler. Special tool libraries make this process transparent to the developer, unless the program is written in assembly language, in which case the programmer has more specialized instructions to worry about. The following table clearly demonstrates the run-time advantage MMX offers for image processing.

Intel Media Benchmark Performance Comparison

                        Pentium      Pentium       Pentium Pro   Pentium II    Pentium II
                        200 MHz      200 MHz MMX   200 MHz       233 MHz       266 MHz
                                                   256KB L2      512KB L2      512KB L2
    Overall             156.00       255.43        194.38        310.40        350.77
    Video               155.52       268.70        158.34        271.98        307.24
    Image Processing    159.03       743.92        220.75        1,026.55      1,129.01
    3D Geometry*        161.52       166.44        209.24        247.68        281.61
    Audio               149.80       318.90        240.82        395.79        446.72

    Note: the Pentium processor and the Pentium processor with MMX technology
    were measured with a 512KB L2 cache.
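The packed-pixel operation described above can be sketched with the C intrinsics that MMX-aware compilers expose through the <mmintrin.h> header. The function name and the brightness-adjustment example are my own illustration; a real program would also handle the pixels left over when the image size is not a multiple of eight.

    #include <mmintrin.h>

    /* Brighten an 8-bit image, eight pixels per MMX instruction.
     * The saturating add keeps pixel values from wrapping past 255. */
    void brighten_mmx(unsigned char *pixels, int n, unsigned char offset)
    {
        __m64 add8 = _mm_set1_pi8((char)offset);   /* offset copied into all 8 bytes */
        for (int i = 0; i + 8 <= n; i += 8) {
            __m64 px = *(__m64 *)(pixels + i);     /* load 8 packed pixels           */
            px = _mm_adds_pu8(px, add8);           /* 8 saturating adds in parallel  */
            *(__m64 *)(pixels + i) = px;           /* store 8 results                */
        }
        _mm_empty();                               /* clear MMX state before FPU use */
    }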
When an operation requires many calculations, one might consider a system that allows parallel processing, in which two or more chips perform calculations at the same time. It is this type of architecture that allowed IBM's Deep Blue to defeat Garry Kasparov in a chess match, calculating millions of possible moves in a short time. In fact, many servers and high-performance computers now come with two chips and can process information in parallel. Parallelism also depends on the operating system: Windows 95, for example, does not allow parallel processing, but Windows NT does.

To do image processing, one generally needs a lot of memory; a 256x256, 8-bit image needs roughly 65 KB. Data prefetching is a technique that has been developed to make better use of the memory system. Although not generally used at an "amateur" level, I want to bring it to your attention as yet another possibility.

Data prefetching:

During the past decade, CPU performance has outpaced that of dynamic RAM, the primary component of main memory. It is now not uncommon for scientific programs to spend more than half their run time stalled on memory requests. Data prefetching is one of the techniques used to reduce or hide the large latency of main-memory accesses: the memory system brings data into the cache before the processor needs it, while processor computation takes place.

For recent personal computers, cache sizes vary from 32 to 512 KB. As an example, if a program executes an FFT of an image of 300 by 400 pixels, 120 KB of data is required to store the image alone. Now suppose the result of the FFT is multiplied by a filter of the same size and the result is stored in a different array. It becomes obvious that most caches will not be large enough to hold all of that data, and that the computer will have to use main memory during the calculation.

The processor executes the calculations, getting the necessary information from the cache until the information cannot be found there. When the information is not found, the processor requests the data from the cache controller, which fetches it from main memory. While this fetch is being executed, the processor wastes precious cycles, increasing the total program run time. If it were possible to always have the required data in the cache, the run time of the program would improve.

One has to use caution with data prefetching: when one block of data is brought into the cache by a prefetch request, it is likely that another block will be evicted. If the evicted data is the data that is currently needed, the processor will have to wait for it to be brought back from main memory, and the program might actually run slower than it would have without the prefetch instruction. Prefetch timing is therefore critical for prefetching to show notable improvements in run time: done too late, the processor still waits for data; done too early, data that is still needed might be evicted from the cache.

Of the prefetching techniques available, I will discuss only software-initiated prefetching. The other technique is hardware-initiated prefetching, which I won't discuss because it doesn't involve programmer or compiler intervention; those who are interested can consult the reference. With software prefetching, before the processor requires the data, a fetch instruction specifies the required address to the memory system, which forwards the word to the cache. Because the processor does not need the data yet, it can continue computing while the memory system brings the requested data into the cache.

Before you plan to use data prefetching in your next program, you need to know whether your microprocessor supports a fetch instruction. Some compilers also have optimization schemes that insert prefetching statements for you. If you want to include your own prefetching statements, you should limit yourself to loops: predicting the memory access patterns of other code is unreliable and could even result in longer execution time, since a fetch instruction does consume processor time. If the compiler you are using doesn't offer prefetching optimization, designing your own data prefetching is likely not worth the effort; too much time will be spent on the code for only marginal improvements in run time.
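For a picture of what such a statement looks like in practice, here is a minimal sketch of a software-initiated prefetch inside a loop, assuming a compiler that provides GCC's __builtin_prefetch. The actual fetch instruction depends on the processor, and the prefetch distance used here is illustrative, not tuned.

    #define PREFETCH_AHEAD 16   /* illustrative distance, in elements */

    /* Sum an image, asking the memory system for data a few iterations
     * ahead so it arrives in the cache while the processor keeps computing. */
    float sum_pixels(const float *image, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            if (i + PREFETCH_AHEAD < n)
                __builtin_prefetch(&image[i + PREFETCH_AHEAD], 0, 1);
            sum += image[i];
        }
        return sum;
    }

If the "before" and "after" benchmarks show no improvement from the prefetch, the clearer loop without it should be kept.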
Conclusion:

Although you can achieve your results using customized processors, it is often more affordable to do so using "off the shelf" computers and customized software. Code optimization and data prefetching are two techniques that will enable those platforms to do image processing at acceptable speeds. Two programmers can produce the same image processing results, but one may achieve them 10 to 50 times faster than the other; code optimization, data prefetching, and smart use of caches, memory, and computer architecture can account for the difference. These techniques can't solve every problem, and some applications will still require custom-built signal processing hardware that can process more bandwidth faster. But I am confident that, as time goes on, more and more of the components used to build specialized equipment will grow in popularity and become low-cost, "off the shelf" components, or will be replaced by ever faster multi-purpose equipment optimized for the task at hand. One will then be able to get the same results at a fraction of the cost.

Acknowledgements:

Thank you to Kevin Ayer and Erich Hernandez-Baquero, who kindly accepted to review this document. Thank you to Dr. Gatley: hopefully I will at least have learned how to write an abstract. One really only learns through practice, and I believe this was good practice for an eventual thesis.

Glossary:

CPU: Central Processing Unit.
FFT: Fast Fourier Transform.
IDL: Interactive Data Language.
RAM: Random Access Memory.

Bibliography:

R.C. Dorf (editor-in-chief), Electrical Engineering Handbook, CRC Press, 1993.
S.P. Vander Wiel and D.J. Lilja, "When Caches Aren't Enough: Data Prefetching Techniques", Computer, July 1997, pp. 23-30.
Intel Corporation, MMX(TM) Technology Technical Overview, 1996. http://developer.intel.com/drg/mmx/Manuals/overview/index.htm
M. Mittal, A. Peleg, and U. Weiser, MMX(TM) Technology Architecture Overview, Intel Corporation, 1997. http://developer.intel.com/technology/itj/articles/art_2.htm
J. Hardwick, General Rules for Optimization, 1996. http://www.cs.cmu.edu/~jch/java/rules.html