Richard F. Freund, NRaD*
Howard Jay Siegel, Purdue University

June 1993

*Formerly the US Navy's Naval Ocean Systems Center.
A recurring problem with High-Performance Computing (HPC) is that advanced architectures generally achieve only a small fraction of their peak performance on many portions of real application sets. The Amdahl's Law corollary of this is that such architectures often spend most of their time on tasks (codes/algorithms and the data sets upon which they operate) for which they are unsuited. The problem of developing systems that provide usable as opposed to peak performance has been identified as a Grand Challenge for computer architecture.1 Furthermore, programmers routinely engage in heroic efforts to make a single advanced architecture achieve even this modest fraction of peak performance on an application task that exhibits a variety of computational requirements.
The underlying mind-set that there exists one "true" architecture (and concomitant true compiler, operating system, and programming tools) that will handle all tasks equally well is a key reason for these problems. Continuing efforts to coerce tasks (or subtasks) to run on ill-suited machines cause poor performance by the machine and great effort by the programmer. Part of the motivation for this mind-set may be that many researchers still implicitly believe that all tasks need the same kind of computation, with the only discriminant being how long they take to execute. This belief is probably left over from the days when most uniprocessor general-purpose machines provided similar types of computational power, with the main discriminant being execution speed.
Today, and in the foreseeable future, the situation for HPC is quite different, with supercomputers employing various parallel, pipelined, and special-purpose architectures. The performance of such machines is a function of the inherent structure of the computations to be executed and the data to be processed. This causes a need for new ways to discriminate among types of code, algorithms, and data to optimize the matching of tasks to machines.
Figure 1. Hypothetical example of the advantage of using heterogeneous processing. (Profiling a task on a baseline serial system identifies vector, MIMD, SIMD, dataflow, and special-purpose portions; executing the whole task on a vector supercomputer is two times faster than the baseline, while executing it on a heterogeneous suite is 20 times faster.)

Figure 2. Variations in architectural performance on a common algorithm, SAXPY (Z(I) = S*X(I) + Y(I)), depending on vector size. (The machines shown include a Convex 210, an 8K Connection Machine, and a 4K DAP; the horizontal axis is vector size, ranging from 1 to 1,000,000.)
Researchers in the field of heterogeneous processing believe that there is not, should not, and will not be one single all-encompassing architecture. This is primarily because different tasks can have very different computational characteristics that result in different processor requirements. Such variations in processing needs can also occur among subtasks within a single application task. Thus, heterogeneous processing researchers deem forcing all problem sets to the same fixed architecture to be unnatural. A more appropriate approach for HPC is a heterogeneous processing environment.
For the purposes of this special issue, heterogeneous processing is defined as the "tuned" use of diverse processing hardware to meet distinct computational needs, in which codes or code portions are executed using processing approaches that maximize overall performance. Implicit in this definition is the idea that different tasks or subtasks indicate different architectural requirements or different balances of requirements. The processors themselves are generally parallel or other HPC machines. Typically, the processors are used concurrently, and it is often the case that tools and methodologies used for parallel and vector processing (such as code profiling, flow analysis, and data dependencies) can be extended to the larger, metaparallel processor. Heterogeneous processing may use a variety of parallel, vector, and special architectures together as a suite of machines.
Figure 1 shows a hypothetical example of a task whose subtasks are best suited for execution on different types of architectures. Executing the whole task on a pipelined supercomputer can give twice the performance achieved by some serial baseline machine, but the use of five different machines, each matched to the computational requirements of the subtasks for which it is used, can result in an execution that is 20 times faster than the serial baseline.
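To make the arithmetic behind such a comparison concrete, the short sketch below computes a whole-task speedup from per-subtask time fractions and the speedup of whichever machine each subtask is assigned to. The fractions, per-machine speedups, and labels are hypothetical values chosen for illustration, not the actual numbers behind Figure 1.

# A minimal sketch of the arithmetic behind Figure 1: overall speedup is the
# baseline time divided by the sum of each subtask's time on its assigned
# machine. All numbers below are hypothetical.

def overall_speedup(fractions, speedups):
    """fractions: each subtask's share of baseline time; speedups: speedup of
    the machine assigned to that subtask. Returns the whole-task speedup."""
    assert abs(sum(fractions) - 1.0) < 1e-6
    return 1.0 / sum(f / s for f, s in zip(fractions, speedups))

# Hypothetical subtask shares: vector, MIMD, SIMD, dataflow, special purpose.
fractions = [0.30, 0.25, 0.20, 0.15, 0.10]

# One vector supercomputer: great on the vector subtask, mediocre elsewhere.
print(overall_speedup(fractions, [20, 1.4, 1.4, 1.4, 1.4]))   # ~1.9x over the baseline

# Five matched machines: each subtask runs on hardware suited to it.
print(overall_speedup(fractions, [20, 15, 25, 15, 30]))       # ~19x over the baseline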
Optimal performance requires consideration of the properties of the data sets to be processed as well as the characteristics of the algorithm to be executed. For example, Figure 2 displays the results of executing a 32-bit SAXPY code on a pipelined Cray X-MP, a pipelined Convex 210, a SIMD CM-2 (with 8K processors and floating-point coprocessor chips), and a SIMD DAP (with 4K processors and coprocessors). It should be noted that this experiment was conducted in 1989 using the Fortran compiler in each case and reflects the state of the art at that time (optimized library routines were not called). This experiment used the same code on different machines and showed that evaluating the algorithm alone is insufficient: attributes of the data set to be processed (in this case, vector size) must also be considered for optimal matching of task to processor.
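As a minimal illustration of this point, the sketch below pairs the SAXPY computation itself with a data-driven choice of machine class based only on vector size. The crossover thresholds and machine labels are hypothetical, not the measured crossovers from the 1989 experiment.

# A minimal sketch of the point made by Figure 2: the same SAXPY computation
# is best placed on different machines depending on an attribute of the data
# (the vector size). Thresholds and machine classes are hypothetical.

def saxpy(s, x, y):
    """Z(I) = S*X(I) + Y(I), element by element."""
    return [s * xi + yi for xi, yi in zip(x, y)]

def choose_machine(vector_size):
    """Pick a machine class for SAXPY from the size of its operands."""
    if vector_size < 1_000:          # startup and pipeline overhead dominates
        return "scalar workstation"
    elif vector_size < 100_000:      # long enough to keep pipelines full
        return "pipelined vector machine"
    else:                            # massive data parallelism pays off
        return "SIMD array"

print(choose_machine(500), choose_machine(50_000), choose_machine(2_000_000))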
To provide a framework for viewing the field of heterogeneous computing, a general four-stage evolution of system technologies can be applied:

• developing point systems,
• defining dimensions,
• characterizing relationships, and
• articulating models and laws.
In this context, point systems (or existing implementations of application tasks) are machines that support a single mode of processing (such as SIMD) to perform an application.
Next, by comparing the performance of different systems on different applications, a set of dimensions (or features) of machines and computational tasks that affect execution time can be hypothesized. The next stage, characterizing relationships, is an attempt to analyze and quantify these dimensions in such a way as to facilitate a description of the correspondence between machine attributes and computational requirements. The final stage is articulating models and laws that govern the performance of different types of machines on tasks that have different types of computational needs. The goal is to codify these laws so that an application task can be automatically decomposed into components that are assigned to different machines to maximize performance based upon some pre-established criteria.
In general, the field of heterogeneous
computing is approaching the final stage
of this evolutionary process.5 This special issue of Computer overviews some
research that is being done to move
from the third stage into the fourth. The
lead article, by Khokhar, Prasanna, Shaaban, and Wang, is an excellent introduction to the current status of heterogeneous processing.
Heterogeneous computing versus load-balancing. A number of projects and products have described themselves as heterogeneous processing. Many of these employ clusters of workstations using the principle of opportunistic load-balancing. In these cases the term heterogeneous is partially justified because the workstations can be somewhat different and the connectivity between them can vary in either topology or bandwidth. However, this approach, while often useful for cycle-soaking, consists of processing components that are essentially homogeneous with respect to their computational facilities (the different workstations). It thus lacks the full computational spectrum of heterogeneous processing (as defined above). This approach has often been suggested as an alternative to traditional HPC, that is, the use of a monolithic vector machine or parallel processor.
Overview

Figure A illustrates an overall perspective of heterogeneous processing. The bottom portion of the figure shows how to select the machines for the heterogeneous suite on the basis of expected algorithm classes. The tasks and the data upon which they operate are first profiled, as are the capabilities of the various processor types. These profiled components are the ingredients for the optimal selection of machines to include in the heterogeneous suite.

The top part of the figure shows the process of matching machines to particular code segments and specific data for processing. The code is profiled at compile time, and the input parameters are profiled at runtime. This information is used along with machine and network availability for optimal assignment of systems from the selected machine suite.

Figure A. An overall perspective of heterogeneous processing: computational requirements and architectures are profiled to select the heterogeneous suite, and tasks are then matched to machines within that suite.

The cycle-soaking philosophy of using opportunistic load-balancing among clusters of workstations is effective only
because the workstations are essentially homogeneous; that is, it makes no
significant difference in performance
which workstation is selected for any
particular process. However, the opportunistic load-balancing approach is clearly inadequate on its own in a very heterogeneous environment, such as one that contains a SIMD CM-2 and a pipelined Convex. For example, if the CM-2 were the first available machine, one might assign an inherently sequential process to it. Such an assignment would cause the process to limp along on a few (of its 64K) processors!
Although opportunistic load-balancing is simple and leads to busy machines,
it does not necessarily lead to faster
computation of any program or set of
programs. In a heterogeneous environment, the matching of machine features
to task (or subtask) computational requirements is the primary concern; opportunistic load-balancing is only of secondary importance.
This leads to our “Fundamental Heterogeneous Processing Heuristic”:
First evaluate suitability of tasks to
processor types, then load-balance
among selected machines for the final
assignment.
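A minimal sketch of this heuristic, assuming hypothetical machine names, suitability scores, and a suitability threshold, might look as follows: suitability filters the suite first, and only then does current load decide the final assignment.

# A minimal sketch of the Fundamental Heterogeneous Processing Heuristic:
# first evaluate suitability of tasks to processor types, then load-balance
# among the selected machines. Machine names, scores, and the threshold
# are hypothetical.

SUITABILITY = {          # score of each machine type for each task type
    "vector": {"cray-like": 0.9, "simd-array": 0.3, "workstation": 0.2},
    "simd":   {"cray-like": 0.4, "simd-array": 0.9, "workstation": 0.2},
    "scalar": {"cray-like": 0.5, "simd-array": 0.1, "workstation": 0.8},
}

def assign(task_type, machines, load, threshold=0.5):
    """Pick a machine for a task: suitability first, load-balancing second."""
    # Step 1: keep only machines whose architecture suits the task's computation.
    suitable = [m for m in machines if SUITABILITY[task_type][m] >= threshold]
    if not suitable:
        suitable = machines          # fall back to the full suite if none qualify
    # Step 2: among the suitable machines, choose the least loaded one.
    return min(suitable, key=lambda m: load[m])

machines = ["cray-like", "simd-array", "workstation"]
load = {"cray-like": 3, "simd-array": 0, "workstation": 1}
print(assign("vector", machines, load))   # suitability rules out the idle SIMD array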
Goals of heterogeneous processing. Mathematical programming offers one way to look at heterogeneous processing. In a highly abstract sense, the main goal is to minimize the objective function of computational time, subject to a fixed constraint, such as cost:

minimize Σ t_ij
such that Σ c_i ≤ C

where t_ij equals the time for machine i on code segment j, c_i equals the cost for machine i, and C equals the specified overall cost constraint. If C is set at some medium level, say, C < $1 million, it adjusts the heterogeneous processing environment for cost-effective computing. Typically, this would stem from using a few minisupercomputers and small parallel machines at one site, in place of one mid-sized supercomputer. Alternatively, if C is very large, say C > $10 million, it tunes the environment for the most effective supercomputing. Typically, this would take the form of several different HPC machines at different sites, linked by gigabit networks.
Another performance dimension to consider is whether the heterogeneous processing system is being tuned to achieve
maximal throughput for a set of processes, or maximum speed on a single
job.
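The following sketch illustrates the formulation above by brute force: it enumerates candidate machine suites, discards those whose total cost exceeds C, assigns each code segment to the fastest machine in the remaining suite, and keeps the suite with the smallest total time. The machine names, costs, and per-segment times are hypothetical, and exhaustive enumeration is for illustration only; a real system would rely on the kinds of profiling and analytic benchmarking discussed in this issue.

# A minimal brute-force sketch of the cost-constrained formulation above.
# All machine names, costs, and per-segment times are hypothetical.
from itertools import combinations

COST = {"vector": 8.0, "simd": 6.0, "mimd": 5.0}          # $M per machine
TIME = {                                                  # t_ij: time of segment j on machine i
    "vector": {"seg1": 10, "seg2": 90, "seg3": 40},
    "simd":   {"seg1": 80, "seg2": 15, "seg3": 60},
    "mimd":   {"seg1": 50, "seg2": 70, "seg3": 12},
}

def best_suite(segments, budget):
    """Return (total time, suite) minimizing time subject to sum(c_i) <= C."""
    best = (float("inf"), None)
    machines = list(COST)
    for r in range(1, len(machines) + 1):
        for suite in combinations(machines, r):
            if sum(COST[m] for m in suite) > budget:
                continue                                  # violates the cost constraint
            # Each segment goes to the fastest machine available in this suite.
            total = sum(min(TIME[m][s] for m in suite) for s in segments)
            best = min(best, (total, suite))
    return best

print(best_suite(["seg1", "seg2", "seg3"], budget=12.0))  # -> (77, ('simd', 'mimd')) under a $12M budget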
Connectivity, bandwidth, and granularity. Heterogeneous processing is an abstract model with potentially many different goals and instantiations. Within this model, no specific connectivity, bandwidth, or granularity is either required or excluded. Clearly, higher bandwidth or less data to transmit implies a smaller transfer-latency penalty and therefore greater ease in sharing subprocesses among different processors, and thus potentially a finer granularity of effort. For example, a heterogeneous environment on a global wide area network (WAN) with only medium bandwidth may have to settle for coarse granularity of effort, therefore achieving only throughput effectiveness rather than increased performance on most individual tasks. Increasing the bandwidth dramatically could also change the optimal strategy, letting the system decompose tasks into subtasks and match each one to the machine that can execute it the fastest, leading to the best overall performance for an entire individual task. This is also typically the case for heterogeneous local area networks (LANs). Because data is generally far larger than instructions, another level of heuristic is to first determine (when possible and by whatever means) where to place data initially so that it is least often moved.
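A minimal sketch of this data-placement heuristic, with hypothetical machine names, data-set sizes, and access estimates, is shown below; it places each data set on the machine expected to reference it most, so that the fewest bytes cross the network.

# A minimal sketch of the data-placement heuristic above: put each data set on
# the machine expected to touch it most, so that it is least often moved.
# Machine names, data-set sizes, and access estimates are hypothetical.

ACCESSES = {            # estimated number of times each machine reads a data set
    "grid":    {"vector-super": 120, "simd-array": 400, "workstation": 5},
    "logfile": {"vector-super": 2,   "simd-array": 1,   "workstation": 90},
}
SIZE_MB = {"grid": 4096, "logfile": 32}

def place(data_sets):
    """Choose an initial home for each data set to minimize expected bytes moved."""
    placement = {}
    for d in data_sets:
        # Bytes moved if d lives on machine m: every access from any *other*
        # machine transfers the data set once (a deliberately crude model).
        def moved(m):
            return SIZE_MB[d] * sum(n for other, n in ACCESSES[d].items() if other != m)
        placement[d] = min(ACCESSES[d], key=moved)
    return placement

print(place(["grid", "logfile"]))   # {'grid': 'simd-array', 'logfile': 'workstation'}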
One specialized form of heterogeneous processing (distinct from WAN- and LAN-based forms) is mixed-mode processing, in which more than one type of computation is incorporated within a single unit and it is possible to switch among computational modes at instruction-level granularity with generally negligible overhead. Such capabilities can be obtained from machines explicitly designed to provide mixed-mode functionality. Examples include PASM, the Partitionable SIMD/MIMD system, and workstations containing different types of HPC board sets. Mixed-mode machines are ideal for demonstrating the viability of heterogeneous algorithms because the different computational capabilities are made available by using units that are tightly coupled to avoid the possible performance deficits that can obscure results in loosely coupled systems. Studies have clearly indicated the value of mixed-mode algorithms, including cases in which the mode changes at an extremely low level, as within Do-loops.
High-level orchestration tools. Heterogeneity relies heavily on the development of network orchestration tools like PVM (Parallel Virtual Machine). PVM is described in the article by Beguelin, Dongarra, Geist, and Sunderam. However, according to the "Fundamental Heuristic," these tools need augmentation before they can provide effective heterogeneous solutions. A missing ingredient is the rationale for assigning different subtasks to different processing components, that is, the theoretical and empirical understanding of what kinds of computation fit which architectures. The network orchestration tools and the rationale together satisfy the two components of the Fundamental Heuristic. Such a tool, DHSMS (distributed heterogeneous supercomputing management system), is described in the article by Ghafoor and Yang. DHSMS-like tools are necessary in any case to satisfy the anticipated future explosion in HPC requirements. Today, most implementations of applications requiring HPC rely on one or more computer scientists in addition to applications programmers. DHSMS would aid applications experts in developing HPC programs. This step is clearly a necessity for processing the HPC projects of the nineties, which are expected to be 10 to 100 times more computationally complex.
Programming paradigms. Heterogeneous environments require heterogeneous programming paradigms. One level of this is the adaptation of existing languages for heterogeneous environments. Other paradigms expressly designed with heterogeneity in mind have also been proposed. The article by Silberman and Ebcioglu details the development of a heterogeneous instruction set supporting seamless migration of tasks among various architectures. Jade, discussed in the article by Rinard, Scales, and Lam, is a new, machine-independent language emphasizing data objects. Nicol, Wilkes, and Manola discuss an object-oriented approach to orchestrating heterogeneous, autonomous, and distributed resources.
Full-fledged heterogeneous environments will soon require a menu of potential paradigms. The combination of
orchestration tools and new programming paradigms presents great technical challenges. However, they will substantially ease the burden of applications
development in future highly networked
and heterogeneous software/hardware
environments. Researchers in the field
of heterogeneous computing believe that
such networked environments are an
important and highly effective way to
solve application problems requiring
HPC. The article by Rinaldo and Fausey discusses a heterogeneous environment and its potential for physics applications.
Is heterogeneity necessary? Heterogeneity aims to increase the efficiency
of computation and thereby the effectiveness and/or cost-effectiveness of both
machines and programmers. The kind
of efficiency that heterogeneity brings,
at the expense and effort of building
distributed operating systems, orchestration tools, profiling methodologies,
and so forth, is only needed in the face
of the scarce commodity of supercomputer power. Perhaps advances in architectures alone will someday supply
an excess of computational power at a
reasonable cost, like a $50,000 teraflop
workstation. Certainly such power would
supply an excess of capability for the
problems being computed today. However, these problems are currently limited by the computing resources available. Based on the history of computing
and reasoned expectations, demand will
always grow faster than capability, and
computational demands will always far
exceed capacity, at least for Grand Challenge problems.9 Thus, heterogeneity is, and will always be, necessary for wide classes of HPC problems.
Acknowledgments
We thank all who submitted manuscripts
to this special issue and the many referees
who helped select the articles. We also wish
to thank Jon Butler, the previous editor-in-chief of Computer, under whose tenure the
concept for this issue arose, as well as Ted
Lewis, the current EIC, with whom we have
worked for the last several months. Jerry L.
Potter and Anthony A. Maciejewski provided us with useful comments about this introduction.
We also acknowledge support from each
of our institutions, the Naval Command, Control and Ocean Surveillance Center, Research, Development, Test and Evaluation Division,
and the School of Electrical Engineering at
Purdue University.
Upcoming events
The Journal of Parallel and Distributed Computing (JPDC) has scheduled a
special issue on heterogeneous processing for January 1994.
References

1. H.J. Siegel and S. Abraham et al., "Report of the Purdue Workshop on Grand Challenges in Computer Architecture for the Support of High-Performance Computing," J. Parallel and Distributed Computing, Vol. 16, No. 3, Nov. 1992, pp. 199-211.

2. R.F. Freund, "SuperC or Distributed Heterogeneous HPC," Computing Systems Eng., Vol. 2, No. 4, 1991, pp. 349-355.
3. G.M. Olson, L.J. McGuffin, E. Kuwana, and J.S. Olson, "Designing Software for a Group's Needs: A Functional Analysis of Synchronous Groupware," in User Interface Software, L. Bass and P. Dewan, eds., John Wiley & Sons, New York, 1993.

4. L.H. Jamieson, "Characterizing Parallel Algorithms," in The Characteristics of Parallel Algorithms, L.H. Jamieson, D.B. Gannon, and R.J. Douglass, eds., MIT Press, Cambridge, Mass., 1987, pp. 65-100.

5. Proc. WHP 92: Workshop on Heterogeneous Processing, R.F. Freund and M. Eshaghian, eds., IEEE Computer Society Press, Los Alamitos, Calif., 1992.

6. R.F. Freund and J.L. Potter, "Hasp Heterogeneous Associative Processing," Computing Systems Eng., Vol. 3, Nos. 1-4, 1992, pp. 25-31.

7. J.B. Armstrong, D.W. Watson, and H.J. Siegel, "Software Issues for the PASM Parallel Processing System," in Software for Parallel Computation, J.S. Kowalik, ed., Springer-Verlag, Berlin, 1993.

8. H.J. Siegel, J.B. Armstrong, and D.W. Watson, "Mapping Computer-Vision-Related Tasks onto Reconfigurable Parallel Processing Systems," Computer, Vol. 25, No. 2, Feb. 1992, pp. 54-63.
9. Grand Challenges: High-Performance Computing and Communications, Committee on Physical, Mathematical, and Engineering Sciences, National Science Foundation, Washington, D.C., 1991.

Richard F. Freund has been a senior computer scientist at the NRaD facility (formerly the US Navy's Naval Ocean Systems Center) in San Diego since 1987. His research interests in heterogeneous computing include crossovers, code profiling, network profiling, optimal matching, and analytic benchmarking. Freund is the technical founder of the Heterogeneous Computing Project at NRaD and a leader in the application of heterogeneous processing to military command and control.

Freund is a subject area editor for the Journal of Parallel and Distributed Computing (JPDC). He is also the foundational organizer of the annual Heterogeneous Processing Workshop, held in conjunction with the International Parallel Processing Symposium.

Howard Jay Siegel is a professor and the coordinator of the Parallel Processing Laboratory in the School of Electrical Engineering at Purdue University in West Lafayette, Indiana. His current research focuses on interconnection networks, heterogeneous computing, and the use and design of the PASM dynamically reconfigurable partitionable mixed-mode parallel computer system.

Siegel received two BS degrees from the Massachusetts Institute of Technology (both in 1972), MA and MSE degrees from Princeton University (both in 1974), and the PhD degree from Princeton (in 1977). He has coauthored over 150 technical papers, edited or coedited five volumes, and authored one book. Siegel was the coeditor-in-chief of the Journal of Parallel and Distributed Computing from 1989 to 1991. He is an IEEE fellow and is on the editorial board of IEEE Transactions on Parallel and Distributed Systems.
Readers can contact Richard F. Freund
at (619) 553-4071 (voice), (619) 553-5136
(fax), or freund@superc.nosc.mil (e-mail).