SIMG-706
02/15/16
Francois Alain
999-18-9888
Optimizing Computer Runtime Using Code Optimization and Data Prefetching
Abstract:
This report covers techniques that are available to optimize the use of computer resources when doing
image processing. Computer simulations and calculations are more and more a part of our lives as
engineers and scientists. However, the computers that run those simulation programs often find themselves
waiting for the required data to be fetched from memory. For some scientific programs, up to 50% of the
run time is spent waiting for data. Code optimization and data prefetching are two techniques for run-time
optimization.
Introduction:
Until the advent of fast microprocessors, most real-time image processing was done on custom-built
hardware and signal processing chips. These processors did the job really well and within the allocated
time frame. A few drawbacks are that these systems can only be used for what they were designed to do.
Whenever not in use, they sit in a corner of the lab, gathering dust. They are also quite expensive, as is
generally the mark of custom-built equipment: sold in low volumes, their makers need to raise prices to
cover research and development costs.
The fast pace of electronics advancement is changing (and has been changing) the way image processing is
done in a few ways. “Multi-purpose” processors, like the ones powering personal computers, are getting
ever faster. With those increased operating speeds, the range of image processing operations that can be
done is increasing rapidly.
This rapid evolution in processor development also means that the traditional equipment becomes obsolete
a short time after its acquisition because a new and improved system is introduced. Due to its design
characteristics, the traditional equipment can hardly be upgraded. A new system has to be purchased if an
upgrade is desired.
Background:
When the cost of a system is the driving factor behind its purchase, the method of choice for image
processing often turns out to be writing software that will run on a personal computer or a workstation.
Although slower than the customized chip, the software solution has the advantage of being modifiable and
reusable. A few simple modifications will allow users to adapt the program to other needs.
To solve the lack of upgradability of many imaging systems, many people call for an “open system”
approach to designing these tools. Most of the materials used to build the equipment would be “off the
shelf” components. The system’s operation is now determined by in-house software instead of by the
unmodifiable, custom-built hardware chip that was the heart of such systems until recently. The hardware costs
are lower than custom-built components and the software can be created relatively quickly using the vast
amount of software libraries available to programmers.
An advantage to this technique is that whenever the system is not in use, it can be used for other things
such as word processing. Another advantage is that what once took two different specialized machines can
now be done on a single computer, simply by using two different software components and peripherals.
As long as one follows the open system approach, the hardware can be replaced as the old one becomes
obsolete. The software can easily be ported from the old station to the new one, or recompiled to fit the
architecture of the new host machine.
Writing your own software also has the advantage of fostering code reusability. The code is also easily
modifiable if it no longer fills your needs.
The problem with having people write their own software is that the results will vary a lot depending on the
level of knowledge of the programmer. The run-time of a program is highly dependent on the skills of the
programmer and the optimization techniques used. It can make the difference between a system being too
slow for some applications and the same system (with different software) being acceptable for the task at
hand. Throughout our field of study at RIT, most of us will be required to do image processing, as part of
research or as part of a class. We will use computer programs to simulate, process, or calculate image
enhancement techniques. Time, yours or the processor’s, is a precious resource. As such, we must do our
best to optimize it. Hardware and software techniques, as well as computer architectures, are covered in this
report. Code optimization and data prefetching are two of a multitude of techniques that can enhance the
performance of software. The following is part literature review and part sharing my own experience in
optimizing software for real-time systems.
Code Optimization:
The first subject that I will approach is that of code optimization. It can be defined as writing code so it
runs as fast as possible on its host computer. The best way to achieve this goal is to write the code in a
low-level language, such as assembly. Assembly language is a non-intuitive way of writing code. By this I
mean its structure is less “language-like” than that of high-level languages. Because it is non-intuitive, the
development time is longer, which drives up the costs of developing the software. Also, very few people
are familiar with assembly language. The best of both worlds is to embed assembly language instructions
in high level code. The programmer can then program most of the code in an intuitive high level language
and then use assembly for small parts of the code, where code optimization would be required to improve
the program run-time.
Most high-level language compilers offer options as to what type of code to generate at compile time. A
few options are optimization for run-time or for code size. An action as simple as checking the “optimize
for time” option box could generate notable improvements in the processing time of a program.
Programming has been around for quite some time. As such, many of the problems a programmer
encounters have likely been solved by another programmer at one time. There exist code libraries that
contain routines that probably do what you are looking for. Some people refer to these libraries as
“numerical recipes” libraries. Much of the code for these libraries was developed for scientific programs.
As such, most of the routines have been optimized for run-time. For those of us taking SIMG-782, Digital
Image Processing, Dr. Rhody clearly made a point about using numerical recipes libraries or built-in
routines when he mentioned that the routine he wrote for convolution ran 35 times slower than IDL’s
built-in routine CONVOL. Dr. Rhody’s code appeared to be clean, with no unnecessary code or loops to slow it
down. The difference most likely lies in the fact that CONVOL is written in a low-level language
and optimized for run-time, while Dr. Rhody’s routine was written in IDL, the Interactive Data Language.
If one doesn’t have access to numerical recipe libraries, or if they don’t exist for the language you are
using, many numerical recipe books have been published. They contain the algorithms from
which the libraries were built. The books sometimes contain several algorithms that achieve the same
result. One will be faster, one will use less memory or be more efficient in some other way. The user only has to
choose the one that fits his needs. The Fast Fourier Transform (FFT) is an operation for which many
algorithms, or numerical recipes, exist.
The following are proven principles that apply to optimization in any computer language. They are not
listed in any particular order of importance.
Don't optimize as you go: Write your program without regard to possible optimizations,
concentrating instead on making sure that the code is clean, correct, and understandable. If it's too
big or too slow when you've finished, then you can consider optimizing it.
Remember the 80/20 rule: In many fields you can get 80% of the result with 20% of the effort
(also called the 90/10 rule - it depends on whom you talk to). Whenever you're about to optimize
code, find out where that 80% of execution time is going, so you know where to concentrate your
effort.
Always run "before" and "after" benchmarks: How else will you know that your optimizations
actually made a difference? If your optimized code turns out to be only slightly faster or smaller
than the original version, undo your changes and go back to the original, clear code.
Use the right algorithms and data structures: For example, don't use an O(n²) bubblesort algorithm
to sort a thousand elements when there's an O(n log n) quicksort available. Similarly, don't store a
thousand items in an array that requires an O(n) search when you could use an O(log n) binary
tree.
Use efficient loops: Since loops repeat themselves, any inefficiency inside them is compounded. An
error as simple as initializing an invariant variable inside the loop when it would have worked just
fine outside the loop can increase the run-time dramatically.
Define variables that are used at the same time sequentially: Computers must fetch data from
memory, and that memory is brought into the cache in blocks. If the variables are defined
sequentially, there is a good chance that one data fetch will be sufficient to bring them all into the
cache. See the next topic – data prefetching – for more information.
Do only the necessary input/output: Input/output to peripherals takes time and should be kept to
a minimum. A counter that says “XX% complete” is inefficient and should not be updated inside a
loop. An increase in run-time of an order of magnitude can be expected with such messages. If a
warning to the user is required, use a general form like “please wait while this processes”.
These are not a panacea but are a good indication of how well a program will perform.
The extent of the optimization that can be done also depends on the processor used to execute the code. A
computer with a math coprocessor will execute floating-point calculations faster than one
without. A recent change is the advent of Intel’s MMX™ Technology. It is designed to accelerate
multimedia and communications applications. The technology includes new instructions and data types that
allow applications to achieve a new level of performance. It exploits the parallelism inherent in many
multimedia and communications algorithms, yet maintains full compatibility with existing operating
systems and applications.
As an example, image pixel data are generally represented as 8-bit integers, or bytes. With MMX
technology, eight of these pixels are packed together in a 64-bit quantity and moved into an MMX register.
When an MMX instruction executes, it takes all eight of the pixel values at once from the MMX register,
performs the arithmetic or logical operation on all eight elements in parallel, and writes the result into an
MMX register.
To take advantage of MMX technology, the code has to be recompiled with a compiler that is MMX
compatible. Special tool libraries will make this process transparent to the developer, unless the program is
written in assembly language. In that case, the programmer will have more specialized instructions to
worry about. The following table clearly demonstrates the run-time advantages MMX has for image
processing.
Intel Media Benchmark Performance Comparison

                    Pentium     Pentium      Pentium Pro   Pentium II   Pentium II
                    processor   processor    processor     processor    processor
                    200 MHz     200 MHz,     200 MHz,      233 MHz,     266 MHz,
                                MMX tech.    256KB L2      512KB L2     512KB L2
Overall             156.00      255.43       194.38        310.40       350.77
Video               155.52      268.70       158.34        271.98       307.24
Image Processing    159.03      743.92       220.75        1,026.55     1,129.01
3D Geometry*        161.52      166.44       209.24        247.68       281.61
Audio               149.80      318.90       240.82        395.79       446.72

Pentium processor and Pentium processor with MMX technology are measured with 512KB L2 cache.
When an operation requires many calculations, one might think about using a system that allows parallel
processing. With parallel processing, two or more chips are used to perform calculations in parallel. It is
this type of architecture that allowed IBM’s Deep Blue to defeat Garry Kasparov in a chess match,
calculating millions of possible moves in a short time. In fact, many servers or high performance
computers now come with two chips and can process information in parallel. Support for parallelism also
depends on the operating system. For example, Windows 95 doesn’t support multiprocessing but Windows NT does.
In order to do image processing, one generally needs a lot of memory. A 256×256, 8-bit image needs
65,536 bytes (64 KB) of memory. Data prefetching is a technique that has been developed to hide the
latency of memory accesses. Although not generally used at an “amateur” level, I wanted to bring your
attention to yet another possibility.
Data Prefetching:
During the past decade, CPU performance has outpaced that of dynamic RAM, the primary component of
main memory. It is now not uncommon for scientific programs to spend more than half their run-time
stalled on memory requests. Data prefetching is one of the techniques used to reduce or hide the large
latency of main-memory accesses. With data prefetching, the memory system brings data into the cache
before the processor needs it, while processor computation takes place. For recent personal computers,
cache sizes vary from 32 to 512 Kbytes.
As an example, if a program executes an FFT of an image of 300 by 400 pixels, 120 Kbytes of data are
required to store the image alone. Now suppose the result of the FFT is multiplied by a filter of the same
size and the result is stored in a different array. It becomes obvious that most caches will not be large enough to
hold all that data and that the computer will need to use main memory to store the data required for the
calculation. The processor will execute the calculations, getting the necessary information from the cache
until the information cannot be found there. When the information is not found, the
processor requests the data from the cache controller, which fetches it from main memory. While this fetch is
being executed, the processor is wasting precious cycles, thereby increasing the total program run-time. If it
were possible to always have the required data in the cache, the run-time of the program would
be improved.
One has to use caution when using data prefetching: when one block of data is brought into the cache
by a prefetch request, another block will likely be evicted. If the evicted data is the data
that is currently required, the processor will have to wait for it to be brought back. Because of
this, the program might actually run slower than it would have without the prefetch instruction. Prefetch
timing is critical for prefetching to show notable improvements in the run-time of a program:
done too late, the processor will still wait for data; done too early, the required data might be evicted
from the cache before it is used.
Of the prefetching techniques available, I will discuss only software-initiated prefetching. The other
class, hardware-initiated prefetching, I won’t discuss because it doesn’t involve
programmer or compiler intervention. Those who are interested can consult the references.
With software prefetching, before the processor requires the data, a fetch instruction specifies the required
address to the memory system, which forwards the word to the cache. Because the processor does not need
the data yet, it can continue computing while the memory system brings the requested data to the cache.
Before you plan to use data prefetching in your next program, you need to know if your microprocessor
contains a fetch instruction. Also, some compilers have optimization schemes that include prefetching
statements. If you want to include your own prefetching statements, you should limit yourself to loops.
Predicting the memory access patterns for code other than loops is unreliable and could even result in
longer execution time since a fetch instruction does utilize processor time.
If the compiler you are using doesn’t include prefetching optimization, designing for data prefetching by
hand might not be worthwhile: too much time will be spent writing the code for only marginal
improvements in run-time.
Conclusion:
Although you can achieve your results using customized processors, it is often more affordable to do so using
“off the shelf” computers and customized software. Code optimization and data prefetching are two
techniques that will enable those platforms to do image processing at acceptable speeds.
Two programmers can produce the same results in image processing, but one program could run 10 to
50 times faster than the other. The use of code optimization, data prefetching, and the smart use of caches,
memory, and computer architecture can account for this difference.
These techniques can’t solve all the problems. Some applications will still require custom-built signal
processing hardware that can handle more bandwidth. But I am confident that as time goes on, more
and more of the components that were used to build specialized equipment will gain popularity and
become low-cost, “off the shelf” components, or will be replaced by ever faster multi-purpose equipment
optimized for the task at hand. One will then be able to get the same results at a fraction of the cost.
Acknowledgements:
Thank you to Kevin Ayer and Erich Hernandez-Baquero, who kindly agreed to review this
document.
Thank you to Dr. Gatley: Hopefully I will at least have learned how to write an abstract. One really only
learns through practice. I believe this was good practice for an eventual thesis.
Glossary:
CPU: Central Processing Unit.
FFT: Fast Fourier Transform.
IDL: Interactive Data Language.
RAM: Random Access Memory.
Bibliography:
R.C. Dorf (editor-in-chief), Electrical Engineering Handbook, CRC Press, 1993
S.P. Vander Wiel, D.J. Lilja, “When Caches Aren’t Enough: Data Prefetching Techniques”, Computer, July
1997, pp.23-30.
Author unknown, MMX™ Technology Technical Overview, Intel Corporation, 1996.
http://developer.intel.com/drg/mmx/Manuals/overview/index.htm
M. Mittal, A. Peleg, U. Weiser, MMX™ Technology Architecture Overview, Internet, Intel Corporation,
1997.
http://developer.intel.com/technology/itj/articles/art_2.htm
J. Hardwick, General Rules for Optimization, Internet, 1996.
http://www.cs.cmu.edu/~jch/java/rules.html