Image Processing: A Study in Pixel Averaging
Building a Resolution Pyramid with Parallel Computing
By Denise Runnels & Farnaz Zand
December 12, 2000

Abstract:
A time comparison study of image processing via pixel averaging on a parallel system is presented. The time to average the pixels of an image using one processor is compared with the time taken by multiple processors. An abbreviated resolution pyramid is the product of this study. We do not emphasize the quality of the resultant images; we merely note the differences from one resolution to another caused by averaging pixel values. A Beowulf cluster is used, and the results of this study are presented below.

Introduction:
A resolution pyramid is a set of files, each containing the same image at a different resolution. Such a pyramid is very useful, for example, in virtual reality applications that call for real-time rendering of images such as a topographical scene. The image files in a pyramid are simply tiles of that scene. When the application is running, the tiles with lower resolution are loaded to render the horizon, while the tiles with higher resolution are used to render the point of focus. These image files with different resolutions can be produced using several image processing techniques. One such technique is to average adjacent pixels of an image to produce an image with a lower resolution.

Image processing can be a computationally expensive undertaking, especially when the original image is very large. Processing a digital image requires three basic steps: read the image data in, manipulate that data, and write the manipulated data back out. This process is readily suited to multiprocessing, which is a natural step in the progress of computing and image processing. This study compares the time needed to average the pixel data of an image using only one processor with the time needed to distribute and process the same data using several processors. We also observe the image quality as the resolution decreases to determine the smallest resolution with a discernible image.

Images in the monochrome Targa (.tga) format are used for this study. An image is loaded, the pixel data are averaged using a variable scalar value, and a new monochrome Targa image is generated from the averaged pixel data. The goal of this study is to determine whether multiple processors process an image more efficiently than one processor.

Experimental Setup:
The system used in this experiment is a Beowulf cluster consisting of 16 dual-CPU Pentium II (233 MHz) nodes. Each node is connected via a 3COM 3c905 10/100 PCI Ethernet card through a Bay Networks 350T fast Ethernet switch. The main program used in this study is written in C++ with calls to the MPI library for message passing.

When multiple processors are used, a master-slave relationship is established: the master processor reads in the original image, strips the header information, and passes the pixel data in a two-dimensional array to all of the slave processors. The master also sends each slave the upper-left corner coordinates of a particular tile so that the slave knows which tile to work on. A call to the system clock starts timing the process just before the master's first send. Upon receiving the data from the master, each slave processor copies the data from its assigned tile of the original image into a temporary two-dimensional array.
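The paper does not reproduce the program source, but the distribution step just described can be outlined as follows. This is a minimal sketch under stated assumptions: the image dimensions, tile coordinates, and tags are placeholders, MPI_Wtime is used in place of the system clock calls mentioned above, and the tile bookkeeping, result collection, and error handling of the real program are omitted.

// Sketch of the master/slave distribution step (hypothetical names and sizes).
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int width = 1024, height = 1024;          // original image dimensions (assumed)
    std::vector<unsigned char> image(width * height);

    if (rank == 0) {
        // Master: read the .tga file and strip its header into `image` (not shown),
        // then start the timer just before the first send.
        double start = MPI_Wtime();
        for (int slave = 1; slave < nprocs; ++slave) {
            int corner[2] = {0, 0};                  // upper-left x, y of this slave's tile (placeholder)
            MPI_Send(corner, 2, MPI_INT, slave, 0, MPI_COMM_WORLD);
            // The entire original image is sent to every slave, as in the study.
            MPI_Send(image.data(), width * height, MPI_UNSIGNED_CHAR,
                     slave, 1, MPI_COMM_WORLD);
        }
        // ... receive the averaged tiles, consolidate them, write the .tga, stop the timer ...
        std::printf("elapsed: %f s\n", MPI_Wtime() - start);
    } else {
        // Slave: receive the tile corner and the full image, then copy the assigned
        // tile into a temporary array before averaging it.
        int corner[2];
        MPI_Recv(corner, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(image.data(), width * height, MPI_UNSIGNED_CHAR,
                 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // ... average the tile and send the result array back to the master ...
    }
    MPI_Finalize();
    return 0;
}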
The averaging algorithm sums the values of the temporary array, in row-major order, in scalar-sized pieces, storing each partial sum in a result two-dimensional array at the position that will eventually hold the final averaged value. Once all of the sums are stored in their proper result array locations, every value in the result array is divided by the square of the scalar to give the mean grayscale pixel value. The slave processor then sends this result array to the master, and the master consolidates the result arrays from all of the slaves into the original image, now at a lower resolution.

The master processor then calls the makeTGA function from the genTGA.h header file, which is written in C and linked into the main program. The makeTGA function receives from the master a two-dimensional array of grayscale pixel values, the upper-left corner coordinates, the result resolution, and a string giving the prefix for the .tga filename to be generated. The function generates a monochrome .tga image file using the naming convention prefix_cornerX_cornerY_resolution.tga. Upon return from this function, another call to the system clock stops the timer, and the elapsed time is reported in microseconds.

When only one processor is used, that processor reads in the image and strips away the header. The same averaging algorithm that the slave processors apply to a portion of the original image is applied to the entire original image. The result of the averaging algorithm is passed to the makeTGA function, which produces a resultant image of lower resolution. The system clock is called just before entering the averaging algorithm to start the timer and again upon returning from the makeTGA function.

Each version is run five times for each of 1, 5, and 17 processors with three different image sizes: 1024 X 1024, 512 X 512, and 256 X 256. Each original image is scaled down to 128 X 128, beginning with a scalar value of 2 and doubling the scalar at each iteration until the final 128 X 128 resolution is reached. That is, with scalar = 2 the 1024 X 1024 image is reduced to a resolution of 512 X 512, with scalar = 4 it is reduced to 256 X 256, and finally with scalar = 8 the resolution is 128 X 128. The original images are constrained so that their dimensions are powers of two because the available graphics hardware requires this constraint. Also, to simplify the code and ensure that uneven tile edges are not a factor, we have limited the number of processors used in the multiple-processor runs to even powers of two plus one, i.e. 2^2 + 1 and 2^4 + 1. The maximum number of processors available is 32, so 17 processors is the maximum used for this project.
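For concreteness, the two core computational steps described above can be sketched in C++. These are illustrative reconstructions under stated assumptions, not the program used in the study; all function and variable names are hypothetical. First, the box-averaging step: each scalar X scalar block of the tile is accumulated into the result cell that will hold its average, and every accumulated sum is then divided by the square of the scalar.

// Sketch of the box-averaging step (hypothetical names).
#include <vector>

std::vector<std::vector<int>>
averageTile(const std::vector<std::vector<int>>& tile, int scalar) {
    int rows = tile.size() / scalar;
    int cols = tile[0].size() / scalar;
    std::vector<std::vector<int>> result(rows, std::vector<int>(cols, 0));

    // Accumulate each pixel of the tile into the result cell that will
    // eventually hold its block's average (row-major traversal).
    for (std::size_t r = 0; r < tile.size(); ++r)
        for (std::size_t c = 0; c < tile[r].size(); ++c)
            result[r / scalar][c / scalar] += tile[r][c];

    // Divide every accumulated sum by scalar^2 to obtain the mean grayscale value.
    for (auto& row : result)
        for (auto& v : row)
            v /= scalar * scalar;
    return result;
}

Applying such a function to the full 1024 X 1024 array with scalar values 2, 4, and 8 yields the 512 X 512, 256 X 256, and 128 X 128 resolutions used in the study. Second, the output step: the actual makeTGA code in genTGA.h is not reproduced in the paper, but an uncompressed 8-bit grayscale Targa file consists of an 18-byte header followed by the raw pixel bytes. The sketch below takes a flat pixel buffer rather than the two-dimensional array the paper describes, purely for brevity; only the filename convention comes from the paper.

// Sketch of an uncompressed 8-bit grayscale TGA writer in the spirit of makeTGA
// (hypothetical name and signature).
#include <cstdio>
#include <string>
#include <vector>

void writeGrayTGA(const std::string& prefix, int cornerX, int cornerY,
                  int resolution, const std::vector<unsigned char>& pixels) {
    // Naming convention from the paper: prefix_cornerX_cornerY_resolution.tga
    std::string name = prefix + "_" + std::to_string(cornerX) + "_" +
                       std::to_string(cornerY) + "_" +
                       std::to_string(resolution) + ".tga";
    unsigned char header[18] = {0};
    header[2]  = 3;                              // image type 3: uncompressed grayscale
    header[12] = resolution & 0xFF;              // width, little-endian
    header[13] = (resolution >> 8) & 0xFF;
    header[14] = resolution & 0xFF;              // height, little-endian
    header[15] = (resolution >> 8) & 0xFF;
    header[16] = 8;                              // 8 bits per pixel
    header[17] = 0x20;                           // top-left pixel origin
    std::FILE* f = std::fopen(name.c_str(), "wb");
    if (!f) return;
    std::fwrite(header, 1, sizeof(header), f);
    std::fwrite(pixels.data(), 1, pixels.size(), f);
    std::fclose(f);
}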
Results:
Following are representations of the data collected for each image processed.

[Chart: Original Image Size 256 X 256 -- time in milliseconds to reduce the image to a resolution of 128 X 128 using 1, 5, and 17 processors.]

The above chart indicates the results obtained when reducing a 256 X 256 image to an image with a resolution of 128 X 128 using 1, 5, and 17 processors. Even with this small an image, 5 processors are more efficient than 1, although 17 processors are far less efficient than either 1 processor or 5 processors. The images below are the original 256 X 256 image and the reduced-resolution 128 X 128 image produced. It can easily be seen that merely averaging pixel values is not an adequate means of reducing the resolution of an image; the reduced image quality is far from acceptable for a real application.

[Figure: Original image, 256 X 256, and processed image, 128 X 128.]

[Chart: Original Image Size 512 X 512 -- time in milliseconds to reduce the image to resolutions of 256 X 256 and 128 X 128 using 1, 5, and 17 processors.]

The original 512 X 512 image is reduced first to a resolution of 256 X 256 and then to a resolution of 128 X 128. Like the previous chart, this one indicates that 5 processors are more efficient than 1, and that 17 processors are comparatively very inefficient. It is also noted that in this case reducing the image to the lowest resolution, 128 X 128, is faster than reducing it to a resolution of 256 X 256, regardless of the number of processors.

[Chart: Original Image Size 1024 X 1024 -- time in milliseconds to reduce the image to resolutions of 512 X 512, 256 X 256, and 128 X 128 using 1, 5, and 17 processors.]

Again, with an original image resolution of 1024 X 1024 reduced to resolutions of 512 X 512, 256 X 256, and 128 X 128, the results indicate that 5 processors are more efficient than 1, and 17 processors are the least efficient. In the previous graph, for an original image size of 512 X 512, we observed a consistent decrease in time when generating consecutively lower resolution images. With an original image size of 1024 X 1024, however, this pattern is not observed: the time increases at the reduction to a resolution of 256 X 256 for both 5 and 17 processors.

Conclusion:
From these results we conclude that processing images with multiple processors can be more efficient than processing images with only one processor. We find, however, that there is a threshold number of processors beyond which efficiency decreases due to message passing overhead.

It is our intent to follow this study with a more in-depth look at certain aspects not considered here. We would like to compare the efficiency of sending the entire original image to each slave processor, as we have done in this study, with having the master processor divide the original image into partitions and send only the relevant partition to each slave. Perhaps the up-front bottleneck of the master sorting out partitions is still more efficient than sending such a large message (the entire original image) to all of the slave processors. We would also like to consider ways of utilizing different numbers of processors, by placing different constraints on the original image, so that we might narrow down the possibilities for the threshold number of processors. This research has shown promising results for implementing image processing techniques in a multiprocessing environment. Further research will undoubtedly reveal even more exciting outcomes.

References:
Benner, D. (2000). Unofficial Delphi developers FAQ: Graphics. http://www.gnomehome.demon.nl/uddf/pages/graphics.htm (13 October 2000).
Drakos, N. (1997). Date and time functions -- <time.h>. http://iel.ucdavis.edu/people/cperez/test13/node3.html (22 September 2000).
Seyfarth, R. Wiglaf, USM Beowulf cluster. Class handout.
Snir, M. (1994). MPI: The Complete Reference. http://netlib2.cs.utk.edu/utk/papers/mpi-book/mpi-book.html (13 September 2000).