Implementation and Quantitative Analysis of a Shared-Memory Based
Parallel Server Architecture for
Aerospace Information Exchange Applications
A. Alegre
alexalegre@gmail.com
S. Beltran
sbeltran00@gmail.com
J. Estrada
j_lestrada@yahoo.com
A. Milshteyn
tashkent04@hotmail.com
C. Liu
cliu@calstatela.edu
H. Boussalis
hboussa@calstatela.edu
Department of Electrical and Computer Engineering
California State University, Los Angeles
5151 State University Drive
Los Angeles, CA 90053
Abstract
This paper focuses on the implementation and
quantitative analysis of a high-performance parallel
processing aerospace information server. An innovative model of software architecture is provided
to effectively utilize the computational power of a
parallel server platform for efficient, on-demand
aerospace information exchange through the Internet.
This is a representative application for servers whose
features are common to the classical client-server
model. The server architecture supports thread, core,
and/or processor-level parallel processing for high
performance computing. Memory devices (i.e. cache
memory, main memory, and secondary memory) are
either shared or distributed among the computational
units. Such features facilitate our study of identifying
and overcoming the architectural bottlenecks of
current commercial server configurations.
1. Introduction
In 1994, the National Aeronautics and Space
Administration (NASA) provided funding towards the
establishment of the Structures, Pointing, and Control
Engineering (SPACE) Laboratory at California State
University, Los Angeles. Objectives for the laboratory
include the design and fabrication of platforms which
resemble the complex dynamic behavior of a segmented space telescope, such as the James Webb Space Telescope [3], and its components.
In recent projects, the laboratory has made efforts to
use the most current computer technologies to develop
a prototype information server for the dissemination of
multimedia FITS (Flexible Image Transport System)
files. The target audience includes communities
ranging from professional and amateur space scientists
to students, educators, and the general public.
Such efforts meet NASA’s mission to encourage
space exploration and research through education. To
promote the awareness of NASA’s missions, current
digital technologies can be used to facilitate the
establishment of networks not only for scientists and
engineers of today, but for generations to come.
The Aerospace Information Server (AIS) [1] is a
high-performance parallel processing server which supports efficient, on-demand information dissemination. The design of the internet-based server
is focused on, but not restricted to, astronomical image
browsing. As such, it is necessary for the distributed
tuple space server to be capable of processing several
simultaneous image requests. The AIS must also be
able to distribute those requests at various transfer rates
dictated by the clients’ needs. In order to achieve these
requirements, multiple server technologies have been
incorporated into the design of the AIS as listed below.
1. Tuple space programming paradigm for
parallel processing and automatic load
balancing [7].
2. Search algorithms utilizing a hash table for
expedited access to database files [5].
3. Wavelet transformation algorithms for
progressive image (de)compression and
image transmission [13], [14].
This paper focuses on the system performance
analysis of the AIS in order to ensure its high speed
execution. Performance bottlenecks severely detract
from server efficiency by limiting the flow of data
which passes through a given region of the system.
This, in turn, impedes overall system execution. As
such, it is imperative to locate and bypass known
bottlenecks to alleviate data flow restrictions.
The paper is organized as follows: Section 2
introduces the hardware of the server. Section 3
describes the software architecture of the server
system. Section 4 describes the mapping of thread
affinities. Section 5 details the runtime performance
analysis of the parallel processing server. Section 6
concludes the paper.
2. Server Platform System Description
In order to implement the key technologies stated
above and maintain real-time performance, a state-of-the-art distributed computing system must be utilized.
The Dell PowerEdge 1855 Blade Server (Figure 1) was
selected as the foundation for the AIS. The modular
nature of this system allows for scalability while
minimizing power consumption and physical footprint.
Multiple server blades are housed in a chassis that contains power supplies, communication modules,
and cooling fans shared by the entire system. Each
server blade contains two dual-core 64-bit Xeon
Processors with up to 16 GB of DDR2 shared memory.
The two Xeon processors are interconnected by a dual
front-side bus running at 667 MHz (Figure 2). Each of
the cores is outfitted with its own L1 Cache, while the two cores on each chip share an L2 Cache (2 MB). All
four cores have shared access to main memory. The
Xeon processor supports Hyper-Threading technology,
and hence, two software threads can be established
simultaneously in each core of the processor [10]. As
such, each adjacent thread pair will share the resources
of the L1 Cache.
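The paper presents no source code, but as a brief, hedged illustration of this topology: on a Linux blade of this kind, software can confirm the eight logical processors exposed by the two Hyper-Threaded dual-core Xeons using standard glibc calls. All names below are illustrative, not taken from the AIS.

```c
/* Minimal sketch, assuming Linux/glibc: count the logical CPUs the
 * blade exposes. With Hyper-Threading enabled, each dual-core Xeon
 * presents four logical processors, i.e. eight per blade. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs online */

    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) == 0)  /* 0 = this process */
        printf("runnable on %d of %ld logical CPUs\n",
               CPU_COUNT(&set), online);
    return 0;
}
```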
Figure 1. Dell PowerEdge 1855 Blade Server
chassis (left) and single blade server (right)
Figure 2. Two Intel dual-core Xeon processors
on a single blade server
3. Software Architecture
The Dell PowerEdge 1855 Blade Server offers
many unique architectural features that optimize
system performance, making it the ideal platform for
the AIS. Dual-Core and Hyper-Threading technologies
are used to implement parallelism and share memory
spaces within the server. The software architecture of
the AIS was developed to exploit these features and
optimize performance. Figure 3 displays a flowchart
of the AIS. Each dual-core Xeon processor hosts four software threads, one per hardware thread. Each thread is assigned an individual task depending on its role.
In the figure below (Figure 3), one of these threads is
designated as a Controller thread and the other three
are designated as Worker threads.
The Controller thread is responsible for initially
connecting to and receiving requests from a client.
This processing thread also manages the tuple space,
which stores a pool of requests made by clients.
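The AIS source itself is not given in the paper; the sketch below shows one conventional way such a request pool can be realized, as a mutex/condition-variable queue in C. All names are illustrative: the Controller deposits request tuples with tuple_put(), and idle Workers withdraw them with tuple_take(), which is what yields the automatic load balancing cited from [7].

```c
/* Hedged sketch of the tuple-space request pool described above. */
#include <pthread.h>

#define POOL_SIZE 64

typedef struct {            /* one tuple: a pending image request */
    int  client_fd;
    char image_name[128];
} tuple_t;

static tuple_t pool[POOL_SIZE];
static int head, tail, count;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

/* Controller side: insert a request tuple into the space. */
void tuple_put(const tuple_t *t)
{
    pthread_mutex_lock(&lock);
    while (count == POOL_SIZE)
        pthread_cond_wait(&not_full, &lock);
    pool[tail] = *t;
    tail = (tail + 1) % POOL_SIZE;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Worker side: withdraw the next request, blocking while the pool
 * is empty; whichever Worker is idle takes the next tuple, so load
 * balances across Workers automatically. */
void tuple_take(tuple_t *out)
{
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &lock);
    *out = pool[head];
    head = (head + 1) % POOL_SIZE;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
}
```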
The three Worker threads operate in parallel and
handle client requests (i.e. searching the server’s
database for the corresponding image to a specific
image query). When unoccupied, a Worker thread will
acquire a tuple request from the tuple space region [7]
and search the system’s database for the requested FITS file. File
searches are expedited utilizing the CRC-32 [5]
hashing algorithm in order to calculate hard-disk
addresses rather than conducting time-consuming
linear searches. After retrieving the high-resolution
file, the Worker thread performs wavelet transformation algorithms [14] for image decomposition and file transmission packetization.
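As a hedged sketch of the CRC-32 lookup idea (the AIS's actual table layout is not given in the paper), the following hashes a requested file name to a slot holding its disk location, replacing a linear scan. The table structure, probing scheme, and names are assumptions.

```c
/* Illustrative CRC-32 hash lookup: file name -> disk location. */
#include <stdint.h>
#include <string.h>

#define TABLE_SLOTS 4096

/* Bitwise CRC-32 (IEEE polynomial, reflected form 0xEDB88320). */
static uint32_t crc32_of(const char *s)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (; *s; s++) {
        crc ^= (uint8_t)*s;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return ~crc;
}

typedef struct {
    char     name[128];    /* FITS file name, "" if slot unused */
    uint64_t disk_offset;  /* where the image lives in the database */
} slot_t;

static slot_t table[TABLE_SLOTS];

/* Return the stored offset, or -1 if absent; linear probing on
 * collision keeps the sketch short. */
static int64_t lookup(const char *name)
{
    uint32_t h = crc32_of(name) % TABLE_SLOTS;
    for (int probes = 0; probes < TABLE_SLOTS; probes++) {
        slot_t *s = &table[(h + probes) % TABLE_SLOTS];
        if (s->name[0] == '\0')
            return -1;              /* empty slot: file not present */
        if (strcmp(s->name, name) == 0)
            return (int64_t)s->disk_offset;
    }
    return -1;
}
```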
As client usage of the server increases, the AIS
must be able to handle the extra load of multiple,
simultaneous requests within real-time constraints. The
server accomplishes this through the scalable nature of
the thread-assigning software. The amount of Worker
threads can be scaled up or down depending on the
load conditions of the server. For testing purposes, we
vary the number of Worker threads from 1 to 3 in order
to analyze system performance.
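A minimal sketch of this scalability, assuming the tuple_take() pool sketched earlier and otherwise illustrative names: the Worker count is a runtime parameter, matching the one-to-three Worker experiments.

```c
/* Hedged sketch: spawn a configurable number of Worker threads. */
#include <pthread.h>

static void *worker_main(void *arg)
{
    (void)arg;
    for (;;) {
        /* tuple_take(&req);  withdraw the next request, then do the  */
        /* CRC-32 lookup, database read, wavelet decomposition, and   */
        /* packetized transmission (all elided in this sketch)...     */
        break;  /* placeholder so the sketch terminates */
    }
    return NULL;
}

/* Scale the Worker pool up or down by choosing n_workers (1-3 in
 * the experiments of Section 5). */
int spawn_workers(int n_workers)
{
    for (int i = 0; i < n_workers; i++) {
        pthread_t tid;
        if (pthread_create(&tid, NULL, worker_main, NULL) != 0)
            return -1;
        pthread_detach(tid);  /* Workers run for the server's lifetime */
    }
    return 0;
}
```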
Figure 3. Flowchart of Thread Assignments
4. Thread Affinity Mapping
Within the AIS platform, each thread is assigned an affinity numbered from zero to seven (Figure 4). The affinities are positioned on the dual-core Xeon processors so as to maximize processing efficiency through distribution. Distribution is preferred because it eliminates resource sharing. As depicted below, two adjacent threads within the same core share an L1 cache, and two threads on adjacent cores of the same processor share an L2 cache. Two threads on different Xeon processors each retain their own L1 and L2 caches; however, they still share main memory and the shared memory regions.
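The paper does not show how these affinities are set; on Linux/NPTL one common mechanism is pthread_setaffinity_np(), sketched below. Only the affinity numbering of Figure 4 is taken from the paper; the rest is illustrative. A Worker pinned this way stays on its assigned logical CPU, which makes cache-sharing effects such as those measured in Section 5.3 repeatable.

```c
/* Sketch, assuming Linux/NPTL: pin the calling Worker thread to one
 * of the eight logical CPUs numbered as in Figure 4. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_self_to(int affinity)  /* affinity in 0..7 */
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(affinity, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```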
Figure 4. Map of thread affinities (affinities 0-7 distributed across the L1 and L2 caches of the two Xeon processors, all sharing main memory and the shared memory region)
The AIS is unique in that specific tasks are assigned to threads to maintain efficiency within the server. However, when roles are allocated, the thread affinity architecture must be referenced in order to maximize system performance.
5. Runtime Performance Analysis
Server request-handling runtimes were investigated to determine the response time of the server to client queries. The tests measured the time from when a client initially requested an image to when the image was received. This process included server acceptance of the request, insertion of the tuple request into tuple space, utilization of the CRC-32 hashing algorithm, database access, and wavelet-based decomposition and transmission.
Figure 5. Route of dataflow for runtime
recording
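The paper does not list its instrumentation code; the following is a minimal sketch, assuming the POSIX monotonic clock, of how the request-to-completion interval described above could be stamped. The function names are illustrative, not the AIS's actual instrumentation.

```c
/* Hedged sketch of per-request runtime recording. */
#include <stdio.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

void handle_request_timed(void)
{
    double t0 = now_sec();  /* client request accepted */
    /* insert tuple, CRC-32 lookup, database access, wavelet
       decomposition, transmission ... (elided) */
    double t1 = now_sec();  /* image fully delivered */
    printf("request handled in %.3f ms\n", (t1 - t0) * 1e3);
}
```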
5.1. Runtimes with Hard-Disk Access
Various thread affinity combinations were selected in order to observe system performance under resource sharing. Tuple request-handling runtime tests were
run on different Worker arrangements utilizing one,
two, and three Worker threads. The purpose was to
determine the amount of time necessary for the server
to process the client request(s). Figure 6 below shows
data gathered from three experiments.
Figure 6. Hard-disk access runtimes
comparing various Workers utilized
Figure 6 represents three different experiments,
with Workers accessing FITS images on the server’s
database and transmitting them back to the client. With the extra processing power, two Worker threads would be expected to outperform a single Worker thread, since the server should theoretically be able to process and retrieve twice as many requests. Yet the obtained data does not reflect this hypothesis: as the number of Worker threads increased, there was only a minimal boost in server performance.
It was concluded that this was due to bottlenecks within the system design. Although a design that utilizes three Workers has more processing power than one that utilizes a single Worker, a system bottleneck lessens that advantage. The bottleneck in the AIS was thought to be located within the hard-drive access. The hard disk is the slowest performing device on the AIS due to its prolonged access times, and is a frequent bottleneck of any server. The present platform of the AIS does not have a distributed hard-drive scheme. As such, Worker threads trying to retrieve an image all have to take turns accessing the hard disk. This one-by-one database access dramatically slows down system performance, since each thread must wait until another thread has relinquished its control of the hard drive.
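As a hedged illustration of this serialization (not the AIS's actual code), the sketch below models the one-at-a-time behavior with a single lock guarding the shared disk; in the real system the serialization comes from the single mechanical drive itself, but the effect on N Workers is the same: reads proceed one by one.

```c
/* Model of the undistributed hard-disk bottleneck. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t disk_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each Worker holds disk_lock for the whole (slow) read, so the
 * reads of N Workers execute sequentially, not in parallel. */
size_t read_image(FILE *db, long offset, void *buf, size_t len)
{
    size_t got;
    pthread_mutex_lock(&disk_lock);
    fseek(db, offset, SEEK_SET);
    got = fread(buf, 1, len, db);
    pthread_mutex_unlock(&disk_lock);
    return got;
}
```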
5.2. Runtimes with Hard-Disk Emulation
To verify the bottleneck, tests were performed with hard-disk (HD) emulation within the shared memory region. Here, the contents of the hard disk were placed into a shared memory region, enabling the Worker threads to bypass the bottleneck located at the hard-disk access point. As in the previous tests, the time between the client request of an image and the completion of image retrieval was monitored. However, the differences between the utilization of one, two, and three Worker threads were expected to be more pronounced due to the absence of the hard-drive access bottleneck.
Figure 7. Hard-disk emulation runtimes comparing various Workers utilized
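The paper does not specify how the emulation was implemented; one plausible sketch, assuming POSIX shared memory, follows. The region name and size are illustrative assumptions.

```c
/* Hedged sketch of hard-disk emulation in shared memory: the
 * database is copied into RAM once, so Worker reads no longer
 * contend for the mechanical disk. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define DB_BYTES (512u * 1024 * 1024)  /* assumed database size */

static unsigned char *db_in_ram;

int emulate_disk_in_shm(void)
{
    int fd = shm_open("/ais_db", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, DB_BYTES) != 0)
        return -1;
    db_in_ram = mmap(NULL, DB_BYTES, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);
    if (db_in_ram == MAP_FAILED)
        return -1;
    /* At start-up the on-disk FITS files would be copied into
     * db_in_ram; Workers then serve requests from this region
     * instead of queuing for the hard drive. */
    return 0;
}
```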
5.3. Runtimes with Dual-Thread Combinations
The thread affinity locations in dual-thread combinations were also examined in order to verify the impact of resource sharing across the thread affinities utilized. For example, it is believed that utilizing a combination of Worker threads with affinities 6 and 7 would provide better system performance than a combination of affinities 3 and 7 (Figure 9). This is because in the second combination the affinities share both an L1 and an L2 cache, whereas with the affinities on different Xeon processors (as in the first combination), the threads do not have to share any cache resources, thus expediting overall system performance.
Figure 8. Hard-disk emulation comparing
Worker-pairs on different affinities
The graph above (Figure 8) displays differences in
the client request timings when distinct thread affinity
combinations were used. In a two-Worker setup, the
combination of thread affinities 3 and 7 (adjacent
threads) resulted in the longest runtime (Figure 9).
These time-consuming results, which are due to the
combination of adjacent threads, emphasize the local
resource sharing of L1 and L2 caches in Xeon
Processor 2.
Figure 9. Worker combination which produced the longest request timings (affinities 3 and 7, adjacent threads sharing the L1 and L2 caches of Xeon Processor 2)
6. Conclusion
Although various technologies have been integrated
into the server, system bottlenecks severely limit
performance. Theoretically, having extra threads aiding in overall processing power would lower iterative execution times of client request management. However, due to the undistributed nature of the hard-disk access, threads are only able to retrieve database information on a one-by-one basis. This slows down the performance of the server, since Worker threads must wait in line to access the database.
Additionally, runtime tests on various thread
combinations for Workers have indicated that thread
affinity plays a large role in the determination of
system performance. Two Workers running within the same core of the same processor share an L1 cache, which limits the resources available to each thread. As such, system performance is degraded compared with two Worker threads on different Xeon processors, each having its own L1 and L2 caches.
Future tests will include distributed databases for
multiple hard-disk access and hard-disk RAM
emulation for database hotspots. Also, performance
registers will be examined in order to analyze the
performance of hashing and wavelet-transformation
technologies.
This work was supported by NASA under Grant
URC NCC 4158. Special thanks go to the faculty and
students associated with the SPACE Laboratory.
7. References
[1] A. Alegre, J. Estrada, B. Coalson, A. Milshteyn, H. Boussalis, C. Liu, “Development and Implementation of an Information Server for Web-based Education in Astronomy,” Proceedings of the International Conference on Engineering Education, Instructional Technology, Assessment, and E-learning, December 2007.
[2] S. Balle, D. Palermo, “Enhancing an Open Source Resource Manager with Multi-Core/Multi-threaded Support,” Hewlett-Packard Company, 2007.
[3] I. Dasheysky, V. Balzano, “JWST: Maximizing Efficiency and Minimizing Ground Systems,” Proceedings of the 7th International Symposium on Reducing the Costs of Spacecraft Ground Systems and Operations (RCSGSO), June 2007.
[4] J. Dong, P. Thienphrapa, H. Boussalis, C. Liu, et al.,
“Implementation of a Robust Transmission System for
Astronomical Images over Error-prone Links,” Proceedings
of SPIE, Multimedia Systems and Applications IX, 2006.
[5] Z. Genova and K. Christensen, “Efficient Summarization
of URLs using CRC32 for Implementing URL Switching,”
Proceedings of IEEE Conference on Local Computer
Networks (LCN), 2002.
[6] S. Harris, J. Ross. Beginning Algorithms. Indianapolis,
IN: Wiley Publishing, Inc., 2006.
[7] K. Hawick, H. James, L. Pritchard, “Tuple-Space Based
Middleware for Distributed Computing,” Technical Report
DHPC-128, 2002.
[8] E. J. Kim, K. H. Yum, C. R. Das, “Introduction to
Analytical Models,” Performance Evaluation and Benchmarking, Ed. Lizy Kurian John and Lieven Eeckhout, Taylor & Francis Group, 2006.
[9] C. Liu, J. Layland, “Scheduling Algorithms for
Multiprogramming in a Hard-Real-Time Environment,”
Journal of the ACM (JACM), Vol. 20-1, pp. 46-61, January
1973.
[10] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J.
Miller, M. Upton, “Hyper-Threading Technology
Architecture and Microarchitecture,” Intel Technology
Journal, Vol. 6-1, pp. 4-15, February 2002.
[11] W. Martins, J. Del Cuvillo, F. Useche, K. Theobald, G.
Gao, “A Multithreaded Parallel Implementation of a Dynamic Programming Algorithm for Sequence Comparison,” Proceedings of the International Pacific Symposium on Biocomputing, January 2002.
[12] A. Santosa, “Fast Mutual Exclusion Algorithms: The
MPI Implementation,” unpublished.
[13] J. Shapiro, “Embedded Image Coding Using Zerotrees of Wavelet Coefficients,” IEEE Transactions on Signal Processing, Vol. 41-12, pp. 3445-3462, December 1993.
[14] Y. Zhao, S. Ahalt, and J. Dong, “Content-based
Retransmission for Video Streaming System with Error
Concealment,” Proceedings of SPIE, 2004.