Implementation and Quantitative Analysis of a Shared-Memory Based Parallel Server Architecture for Aerospace Information Exchange Applications

A. Alegre (alexalegre@gmail.com), S. Beltran (sbeltran00@gmail.com), J. Estrada (j_lestrada@yahoo.com), A. Milshteyn (tashkent04@hotmail.com), C. Liu (cliu@calstatela.edu), H. Boussalis (hboussa@calstatela.edu)
Department of Electrical and Computer Engineering, California State University, Los Angeles, 5151 State University Drive, Los Angeles, CA 90032

Abstract

This paper focuses on the implementation and quantitative analysis of a high-performance parallel processing aerospace information server. An innovative software architecture is presented that effectively utilizes the computational power of a parallel server platform for efficient, on-demand aerospace information exchange through the Internet. The server is a representative application whose features are common to the classical client-server model. The server architecture supports thread-, core-, and processor-level parallel processing for high-performance computing. Memory devices (i.e., cache memory, main memory, and secondary memory) are either shared or distributed among the computational units. These features facilitate our study of identifying and overcoming the architectural bottlenecks of current commercial server configurations.

1. Introduction

In 1994, the National Aeronautics and Space Administration (NASA) provided funding toward the establishment of the Structures, Pointing, and Control Engineering (SPACE) Laboratory at California State University, Los Angeles. Objectives for the laboratory include the design and fabrication of platforms that resemble the complex dynamic behavior of a segmented space telescope, the James Webb Space Telescope [3], and its components. In recent projects, the laboratory has applied current computer technologies to develop a prototype information server for the dissemination of multimedia FITS (Flexible Image Transport System) files. The target audience includes communities ranging from professional and amateur space scientists to students, educators, and the general public. Such efforts support NASA's mission to encourage space exploration and research through education: current digital technologies can be used to establish networks not only for the scientists and engineers of today, but for generations to come.

The Aerospace Information Server (AIS) [1] is a high-performance parallel processing server that supports efficient, on-demand information dissemination. The design of the Internet-based server is focused on, but not restricted to, astronomical image browsing. As such, the distributed tuple-space server must be capable of processing several simultaneous image requests, and the AIS must be able to service those requests at various transfer rates dictated by the clients' needs. In order to achieve these requirements, multiple server technologies have been incorporated into the design of the AIS, as listed below.

1. Tuple space programming paradigm for parallel processing and automatic load balancing [7] (a minimal sketch follows this list).
2. Search algorithms utilizing a hash table for expedited access to database files [5].
3. Wavelet transformation algorithms for progressive image (de)compression and image transmission [13], [14].
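To make the tuple-space paradigm of item 1 concrete, the following minimal sketch shows a mutex-protected pool of client request tuples from which Worker threads draw work. The names (tuple_t, tuple_put, tuple_take) and the fixed pool size are illustrative assumptions, not details of the AIS source code.

```c
/* Minimal tuple-space sketch: a mutex-protected pool of client
 * request tuples. Names and sizes are illustrative only. */
#include <pthread.h>

#define POOL_SIZE 64

typedef struct {
    int  client_fd;        /* socket of the requesting client */
    char image_name[128];  /* requested FITS image            */
} tuple_t;

static tuple_t pool[POOL_SIZE];
static int count = 0;                  /* tuples currently in the space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  nonfull  = PTHREAD_COND_INITIALIZER;

/* Controller: deposit a request tuple into the space. */
void tuple_put(const tuple_t *t)
{
    pthread_mutex_lock(&lock);
    while (count == POOL_SIZE)
        pthread_cond_wait(&nonfull, &lock);
    pool[count++] = *t;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

/* Worker: withdraw any pending tuple. */
void tuple_take(tuple_t *out)
{
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&nonempty, &lock);
    *out = pool[--count];
    pthread_cond_signal(&nonfull);
    pthread_mutex_unlock(&lock);
}
```

Because an idle Worker simply blocks in tuple_take until a request is available, load balancing falls out automatically; no explicit scheduler is needed.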
This paper focuses on the system performance analysis of the AIS in order to ensure its high-speed execution. Performance bottlenecks severely detract from server efficiency by limiting the flow of data through a given region of the system, which in turn impedes overall system execution. As such, it is imperative to locate and bypass known bottlenecks to alleviate data-flow restrictions.

The paper is organized as follows: Section 2 introduces the hardware of the server. Section 3 describes the software architecture of the server system. Section 4 describes the mapping of thread affinities. Section 5 details the runtime performance analysis of the parallel processing server. Section 6 concludes the paper.

2. Server Platform System Description

In order to implement the key technologies stated above and maintain real-time performance, a state-of-the-art distributed computing system must be utilized. The Dell PowerEdge 1855 Blade Server (Figure 1) was selected as the foundation for the AIS. The modular nature of this system allows for scalability while minimizing power consumption and physical footprint. Multiple server blades are housed in a chassis that contains power supplies, communication modules, and cooling fans shared by the entire system.

Each server blade contains two dual-core 64-bit Xeon processors with up to 16 GB of DDR2 shared memory. The two Xeon processors are interconnected by a dual front-side bus running at 667 MHz (Figure 2). Each core is outfitted with its own L1 cache, while the two cores on each chip share a 2 MB L2 cache. All four cores have shared access to main memory. The Xeon processor supports Hyper-Threading technology, and hence two software threads can execute simultaneously in each core of the processor [10]. As such, each adjacent thread pair shares the resources of an L1 cache.

Figure 1. Dell PowerEdge 1855 Blade Server chassis (left) and a single blade server (right)

Figure 2. Two Intel dual-core Xeon processors on a single blade server

3. Software Architecture

The Dell PowerEdge 1855 Blade Server offers many architectural features that optimize system performance, making it an ideal platform for the AIS. Dual-Core and Hyper-Threading technologies are used to implement parallelism and to share memory spaces within the server. The software architecture of the AIS was developed to exploit these features and optimize performance; Figure 3 displays a flowchart of the AIS.

With Hyper-Threading, each dual-core Xeon processor supports four software threads, and each thread is assigned an individual task depending on its role. As shown in Figure 3, one of these threads is designated as the Controller thread and the other three are designated as Worker threads. The Controller thread is responsible for initially connecting to and receiving requests from a client; it also manages the tuple space, which stores a pool of requests made by clients. The three Worker threads operate in parallel and handle client requests (i.e., searching the server's database for the image corresponding to a specific image query). When unoccupied, a Worker thread acquires a tuple request from the tuple space region [7] and searches the system's database for the requested FITS file. File searches are expedited by utilizing the CRC-32 hashing algorithm [5] to calculate hard-disk addresses rather than conducting time-consuming linear searches.
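As an illustration of this lookup scheme, the sketch below computes a CRC-32 checksum (standard reflected polynomial 0xEDB88320, as in [5]) over an image name and reduces it to a hash-table bucket. The bucket count and the modulo reduction are illustrative assumptions, not details taken from the AIS implementation.

```c
/* Sketch of hash-based file lookup with CRC-32. TABLE_BUCKETS and
 * bucket_for() are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

#define TABLE_BUCKETS 4096

uint32_t crc32(const char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint8_t)data[i];
        for (int b = 0; b < 8; b++)          /* one bit at a time   */
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return ~crc;
}

/* Map an image name directly to a hash-table bucket, replacing a
 * linear scan of the database index. */
unsigned bucket_for(const char *image_name)
{
    return crc32(image_name, strlen(image_name)) % TABLE_BUCKETS;
}
```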
After retrieving the high-resolution file, the Worker thread performs wavelet transformation algorithms [14] for image decomposition and packetization of the file for transmission.

As client usage of the server increases, the AIS must be able to handle the extra load of multiple, simultaneous requests within real-time constraints. The server accomplishes this through the scalable nature of the thread-assigning software: the number of Worker threads can be scaled up or down depending on the load conditions of the server. For testing purposes, we vary the number of Worker threads from one to three in order to analyze system performance.

Figure 3. Flowchart of thread assignments

4. Thread Affinity Mapping

Within the AIS platform, each thread is assigned an affinity numbered from zero to seven (Figure 4). The affinities are positioned on the dual-core Xeon processors so as to maximize processing efficiency through distribution; distribution is preferred because it eliminates resource sharing. As depicted in Figure 4, two adjacent threads within the same core share an L1 cache. Two threads on adjacent cores of the same processor share an L2 cache. Two threads on different Xeon processors each retain their own L1 and L2 caches; however, they still share main memory and the shared memory regions.

Figure 4. Map of thread affinities (affinities 0-7 distributed across the L1 and L2 caches of the two Xeon processors, with common access to main memory and the shared memory region)

The AIS is unique in that specific tasks are assigned to threads to maintain efficiency within the server. When allocating roles, however, the thread affinity architecture must be referenced in order to maximize system performance.
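On a Linux platform, such a placement can be expressed with the processor affinity API; the sketch below pins a thread to one of the eight logical CPUs of Figure 4. Whether the AIS uses this exact call is an assumption; the affinity numbers follow the figure.

```c
/* Sketch of pinning a thread to one of the eight logical CPUs
 * (affinities 0-7) using the Linux-specific affinity call. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_thread(pthread_t thread, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);          /* start with an empty CPU mask   */
    CPU_SET(cpu, &set);      /* allow exactly one logical CPU  */
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* Example: place two Workers on different Xeon processors so that
 * they share neither an L1 nor an L2 cache:
 *     pin_thread(worker_a, 6);
 *     pin_thread(worker_b, 7);                                  */
```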
5. Runtime Performance Analysis

Server request-handling runtimes were investigated to determine the response time of the server to client queries. The tests measured the time from when a client initially requested an image to when the image was received. This process included server acceptance of the request, insertion of the tuple request into tuple space, utilization of the CRC-32 hashing algorithm, database access, and wavelet-based decomposition and transmission.

Figure 5. Route of dataflow for runtime recording

5.1. Runtimes with Hard-Disk Access

Various thread affinity combinations were selected in order to view system performance when resources are shared. Tuple request-handling runtime tests were run on different Worker arrangements utilizing one, two, and three Worker threads. The purpose was to determine the amount of time necessary for the server to process the client request(s). Figure 6 shows data gathered from three experiments.

Figure 6. Hard-disk access runtimes comparing various numbers of Workers utilized

Figure 6 represents three different experiments, with Workers accessing FITS images on the server's database and transmitting them back to the client. Due to the extra processing power, utilizing two Worker threads would be expected to outperform a single Worker thread, since the server should theoretically be able to process and retrieve twice as many requests. Yet the obtained data does not reflect this hypothesis: as the number of Worker threads increased, there was only a minimal boost in server performance. It was concluded that this was due to bottlenecks within the system design.

Although there is more processing power in a system design that utilizes three Workers as compared with one, a system bottleneck lessens that advantage. The bottleneck in the AIS was thought to be located within the hard-drive access. The hard disk is the slowest device on the AIS due to its prolonged access times, and it is a frequent bottleneck in any server. The present AIS platform does not have a distributed hard-drive scheme. As such, Worker threads trying to retrieve an image must take turns accessing the hard disk. This one-by-one database access dramatically slows down system performance, since each thread must wait until another thread has relinquished its control of the hard drive.

5.2. Runtimes with Hard-Disk Emulation

To verify the bottleneck, tests were performed with hard-disk (HD) emulation within the shared memory region: the contents of the hard disk were placed into a shared memory region, enabling the Worker threads to bypass the bottleneck located at the hard-disk access point. The time between the client's request of an image and the completion of image retrieval was monitored as in the previous tests. With the hard-drive access bottleneck absent, the differences among combinations of one, two, and three Worker threads become more pronounced.

Figure 7. Hard-disk emulation runtimes comparing various numbers of Workers utilized
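A minimal sketch of such an emulation, assuming a POSIX shared-memory segment (the segment name and size are illustrative), is shown below: the database contents are staged once into the region so that Workers read images from RAM instead of contending for the disk.

```c
/* Sketch of hard-disk emulation: stage the image database into a
 * POSIX shared-memory region. Name and size are illustrative. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/ais_db_cache"
#define SHM_SIZE (256u * 1024u * 1024u)   /* 256 MB of FITS data */

void *map_database_cache(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, SHM_SIZE) != 0) {   /* size the segment */
        close(fd);
        return NULL;
    }
    void *base = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);               /* the mapping keeps the segment alive */
    return base == MAP_FAILED ? NULL : base;
}
```

For threads of a single process, an ordinary heap buffer would behave identically; a named shared-memory segment additionally lets several server processes reuse the same in-RAM copy of the database.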
5.3. Runtimes with Dual-Thread Combinations

The thread affinity locations in dual-thread combinations were also examined in order to verify the impact of resource sharing among different thread affinities. For example, it is believed that a combination of Worker threads with affinities 6 and 7, which reside on different Xeon processors, would provide better system performance than the combination of affinities 3 and 7, because in the latter case the two threads share both an L1 and an L2 cache. When the affinities are placed on different Xeon processors, the threads do not share any cache resources, thus expediting overall system performance.

Figure 8. Hard-disk emulation comparing Worker pairs on different affinities

Figure 8 displays differences in client request timings when distinct thread affinity combinations were used. In a two-Worker setup, the combination of thread affinities 3 and 7 (adjacent threads) resulted in the longest runtime (Figure 9). These time-consuming results, caused by the combination of adjacent threads, underline the local resource sharing of the L1 and L2 caches in Xeon Processor 2.

Figure 9. Worker combination which produced the longest request timings

6. Conclusion

Although various technologies have been integrated into the server, system bottlenecks severely limit performance. Theoretically, extra threads aiding in overall processing power should lower the iterative execution times of client request management. However, due to the undistributed nature of the hard-disk access, threads are only able to retrieve database information on a one-by-one basis. This slows down server performance, since Worker threads must wait in line to access the database.

Additionally, runtime tests on various Worker thread combinations have indicated that thread affinity plays a large role in determining system performance. Two Workers running within the same core share an L1 cache, and two Workers on the same processor share an L2 cache; this limits the resources available to each core. As such, system performance is degraded as compared with two Worker threads on different Xeon processors, each having its own L1 and L2 caches.

Future tests will include distributed databases for multiple hard-disk access and hard-disk RAM emulation for database hotspots. Performance registers will also be examined in order to analyze the performance of the hashing and wavelet-transformation technologies.

This work was supported by NASA under Grant URC NCC 4158. Special thanks go to the faculty and students associated with the SPACE Laboratory.

7. References

[1] A. Alegre, J. Estrada, B. Coalson, A. Milshteyn, H. Boussalis, C. Liu, "Development and Implementation of an Information Server for Web-based Education in Astronomy," Proceedings of the International Conference on Engineering Education, Instructional Technology, Assessment, and E-learning, December 2007.
[2] S. Balle, D. Palermo, "Enhancing an Open Source Resource Manager with Multi-Core/Multi-threaded Support," Hewlett-Packard Company, 2007.
[3] I. Dashevsky, V. Balzano, "JWST: Maximizing Efficiency and Minimizing Ground Systems," Proceedings of the 7th International Symposium on Reducing the Costs of Spacecraft Ground Systems and Operations (RCSGSO), June 2007.
[4] J. Dong, P. Thienphrapa, H. Boussalis, C. Liu, et al., "Implementation of a Robust Transmission System for Astronomical Images over Error-prone Links," Proceedings of SPIE, Multimedia Systems and Applications IX, 2006.
[5] Z. Genova, K. Christensen, "Efficient Summarization of URLs using CRC32 for Implementing URL Switching," Proceedings of the IEEE Conference on Local Computer Networks (LCN), 2002.
[6] S. Harris, J. Ross, Beginning Algorithms, Indianapolis, IN: Wiley Publishing, Inc., 2006.
[7] K. Hawick, H. James, L. Pritchard, "Tuple-Space Based Middleware for Distributed Computing," Technical Report DHPC-128, 2002.
[8] E. J. Kim, K. H. Yum, C. R. Das, "Introduction to Analytical Models," in Performance Evaluation and Benchmarking, L. K. John and L. Eeckhout, Eds., Taylor & Francis Group, 2006.
[9] C. Liu, J. Layland, "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment," Journal of the ACM (JACM), Vol. 20-1, pp. 46-61, January 1973.
[10] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, M. Upton, "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal, Vol. 6-1, pp. 4-15, February 2002.
[11] W. Martins, J. Del Cuvillo, F. Useche, K. Theobald, G. Gao, "A Multithreaded Parallel Implementation of a Dynamic Programming Algorithm for Sequence Comparison," Proceedings of the Pacific Symposium on Biocomputing, January 2002.
[12] A. Santosa, "Fast Mutual Exclusion Algorithms: The MPI Implementation," unpublished.
[13] J. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelet Coefficients," IEEE Transactions on Signal Processing, Vol. 41-12, pp. 3445-3462, December 1993.
[14] Y. Zhao, S. Ahalt, J. Dong, "Content-based Retransmission for Video Streaming System with Error Concealment," Proceedings of SPIE, 2004.