Richard F. Freund, NRaD*
Howard Jay Siegel, Purdue University

*Formerly the US Navy's Naval Ocean Systems Center

A recurring problem with High-Performance Computing (HPC) is that advanced architectures generally achieve only a small fraction of their peak performance on many portions of real application sets. The Amdahl's Law corollary of this is that such architectures often spend most of their time on tasks (codes/algorithms and the data sets upon which they operate) for which they are unsuited. The problem of developing systems that provide usable as opposed to peak performance has been identified as a Grand Challenge for computer architecture.1 Furthermore, programmers routinely engage in heroic efforts to make a single advanced architecture achieve even this modest fraction of peak performance on an application task that exhibits a variety of computational requirements.

The underlying mind-set that there exists one "true" architecture (and concomitant true compiler, operating system, and programming tools) that will handle all tasks equally well is a key reason for these problems. Continuing efforts to coerce tasks (or subtasks) to run on ill-suited machines cause poor performance by the machine and great effort by the programmer. Part of the motivation for this mind-set may be that many researchers still implicitly believe that all tasks need the same kind of computation, with the only discriminant being how long they take to execute. This belief is probably left over from the days when most uniprocessor general-purpose machines provided similar types of computational power, with the main discriminant being execution speed.

Today, and in the foreseeable future, the situation for HPC is quite different, with supercomputers employing various parallel, pipelined, and special-purpose architectures. The performance of such machines is a function of the inherent structure of the computations to be executed and the data to be processed. This causes a need for new ways to discriminate among types of code, algorithms, and data to optimize the matching of tasks to machines.

Figure 1. Hypothetical example of the advantage of using heterogeneous processing. (The figure profiles a task on a baseline serial system into vector, SIMD, MIMD, dataflow, and special-purpose portions; executed entirely on a vector supercomputer the task runs two times faster than the baseline, while executed on a heterogeneous suite it runs 20 times faster.)

Figure 2. Variations in architectural performance on a common algorithm, SAXPY (Z(I) = S*X(I) + Y(I)), depending on vector size. (The figure compares a 4K-processor DAP, an 8K-processor Connection Machine, and a Convex 210 over vector sizes up to 1,000,000.)

Researchers in the field of heterogeneous processing believe that there is not, should not, and will not be one single all-encompassing architecture. This is primarily because different tasks can have very different computational characteristics that result in different processor requirements. Such variations in processing needs can also occur among subtasks within a single application task. Thus, heterogeneous processing researchers deem forcing all problem sets to the same fixed architecture to be unnatural. A more appropriate approach for HPC is a heterogeneous processing environment. For the purposes of this special issue, heterogeneous processing is defined as the "tuned" use of diverse processing hardware to meet distinct computational needs,
in which codes or code portions are executed using processing approaches that maximize overall performance. Implicit in this definition is the idea that different tasks or subtasks indicate different architectural requirements or different balances of requirements. The processors themselves are generally parallel or other HPC machines. Typically, the processors are used concurrently, and it is often the case that tools and methodologies used for parallel and vector processing (such as code profiling, flow analysis, and data dependencies) can be extended to the larger, metaparallel processor. Heterogeneous processing may use a variety of parallel, vector, and special architectures together as a suite of machines.

Figure 1 shows a hypothetical example of a task whose subtasks are best suited for execution on different types of architectures.2 Executing the whole task on a pipelined supercomputer can give twice the performance achieved by some serial baseline machine, but the use of five different machines, each matched to the computational requirements of the subtasks for which it is used, can result in an execution that is 20 times faster than the serial baseline. (A brief numeric sketch of this calculation appears below.)

Optimal performance requires consideration of the properties of the data sets to be processed as well as the characteristics of the algorithm to be executed. For example, Figure 2 displays the results of executing a 32-bit SAXPY code on a pipelined Cray X-MP, a pipelined Convex 210, a SIMD CM-2 (with 8K processors and floating-point coprocessor chips), and a SIMD DAP (with 4K processors and coprocessors). It should be noted that this experiment was conducted in 1989 using the Fortran compiler in each case and reflects the state of the art at that time (optimized library routines were not called). This experiment used the same code on different machines and showed that evaluating the algorithm alone is insufficient: attributes of the data set to be processed (in this case, vector size) must also be considered for optimal matching of task to processor.

To provide a framework for viewing the field of heterogeneous computing, a general four-stage evolution of system technologies can be applied:

• developing point systems,
• defining dimensions,
• characterizing relationships, and
• articulating models and laws.

In this context, point systems (or existing implementations of application tasks) are machines that support a single mode of processing (such as SIMD) to perform an application. Next, by comparing the performance of different systems on different applications, a set of dimensions (or features) of machines and computational tasks that affect execution time can be hypothesized. The next stage, characterizing relationships, is an attempt to analyze and quantify these dimensions in such a way as to facilitate a description of the correspondence between machine attributes and computational requirements. The final stage is articulating models and laws that govern the performance of different types of machines on tasks with different types of computational needs. The goal is to codify these laws so that an application task can be automatically decomposed into components that are assigned to different machines to maximize performance based upon some pre-established criteria. In general, the field of heterogeneous computing is approaching the final stage of this evolutionary process.5 This special issue of Computer overviews some research that is being done to move from the third stage into the fourth.
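The arithmetic behind the Figure 1 example reduces to a weighted harmonic mean over the subtask fractions. The short sketch below is only an illustration: the fractions and per-machine speedups are assumptions chosen to roughly reproduce the two-times and 20-times figures quoted above, not values taken from the figure.

# Hypothetical illustration of the arithmetic behind a Figure 1-style example.
# Subtask fractions and per-machine speedups are assumptions, not measured data.

def overall_speedup(fractions, speedups):
    # Weighted harmonic mean: relative total time is the sum of each
    # fraction divided by the speedup that fraction receives.
    assert abs(sum(fractions) - 1.0) < 1e-9
    return 1.0 / sum(f / s for f, s in zip(fractions, speedups))

# Baseline serial profile (vector, dataflow, SIMD, MIMD, special purpose).
fractions = [0.30, 0.25, 0.20, 0.15, 0.10]

# Everything forced onto one vector supercomputer: only the vector-suited
# fraction speeds up well (assumed 10x); the rest gains little (assumed 1.5x).
print(overall_speedup(fractions, [10, 1.5, 1.5, 1.5, 1.5]))   # about 2x

# Each fraction matched to a well-suited machine (assumed 15-25x each).
print(overall_speedup(fractions, [25, 20, 20, 15, 25]))       # about 20x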
The lead article, by Khokhar, Prasanna, Shaaban, and Wang, is an excellent introduction to the current status of heterogeneous processing.

Overview

Figure A illustrates an overall perspective of heterogeneous processing. The bottom portion shows how to select the machines for the heterogeneous suite on the basis of expected algorithm classes. The tasks and the data upon which they operate are first profiled, as are the capabilities of various processor types. These profiled components are the ingredients for the optimal selection of machines to include in the heterogeneous suite. The top part of the figure shows the process of matching machines to particular code segments and specific data for processing. The code is profiled at compile time, and the input parameters are profiled at runtime. This information is used along with machine and network availability for optimal assignment of systems from the selected machine suite.

Figure A. Matching computational requirements to architectures to form the heterogeneous suite.

Heterogeneous computing versus load-balancing. A number of projects and products have described themselves as heterogeneous processing. Many of these employ clusters of workstations using the principle of opportunistic load-balancing. In these cases the term heterogeneous is partially justified because the workstations can be somewhat different and the connectivity between them can vary in either topology or bandwidth. However, this approach, while often useful for cycle-soaking, consists of processing components that are essentially homogeneous with respect to computational facilities (the different workstations), and it thus lacks the full computational spectrum of heterogeneous processing as defined above. This approach has often been suggested as an alternative to traditional HPC, that is, the use of a monolithic vector machine or parallel processor. The cycle-soaking philosophy of using opportunistic load-balancing among clusters of workstations is effective only because the workstations are essentially homogeneous; that is, it makes no significant difference in performance which workstation is selected for any particular process. However, the opportunistic load-balancing approach is clearly inadequate on its own in a very heterogeneous environment, such as one that contains a SIMD CM-2 and a pipelined Convex. For example, if it was the first available machine, one might assign an inherently sequential process to the CM-2. Such an assignment would cause it to limp along on a few (of its 64K) processors! Although opportunistic load-balancing is simple and leads to busy machines, it does not necessarily lead to faster computation of any program or set of programs. In a heterogeneous environment, the matching of machine features to task (or subtask) computational requirements is the primary concern; opportunistic load-balancing is only of secondary importance. This leads to our "Fundamental Heterogeneous Processing Heuristic": first evaluate the suitability of tasks to processor types, then load-balance among the selected machines for the final assignment.

Goals of heterogeneous processing. Mathematical programming offers one way to look at heterogeneous processing. In a highly abstract sense, the main goal is to minimize the objective function of computational time, subject to a fixed constraint, such as cost:

minimize Σ t_ij    such that    Σ c_i ≤ C

where t_ij equals the time for machine i on code segment j, c_i equals the cost for machine i, and C equals the specified overall cost constraint.
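A minimal sketch of this formulation follows, with hypothetical machine names, costs, and per-segment times (none taken from this article): enumerate the candidate suites whose total cost stays within C and, within each suite, run every code segment on the machine that executes it fastest.

# Minimal sketch of: minimize sum of t_ij subject to sum of c_i <= C.
# All machines, costs, and times are hypothetical placeholders.
from itertools import combinations

machines = {              # machine -> cost (same units as the budget C)
    "vector":  4.0,
    "simd":    3.0,
    "mimd":    2.5,
    "special": 1.5,
}
# t[i][j]: estimated time of machine i on code segment j (from profiling
# and analytic benchmarking); three segments in this toy example.
t = {
    "vector":  [1.0, 8.0, 6.0],
    "simd":    [5.0, 1.5, 7.0],
    "mimd":    [6.0, 4.0, 2.0],
    "special": [9.0, 9.0, 1.0],
}
segments = range(3)

def best_suite(budget):
    # Enumerate candidate suites within the cost constraint; within a suite,
    # each segment is assigned to the machine that executes it fastest.
    best_time, best_set = float("inf"), None
    names = list(machines)
    for r in range(1, len(names) + 1):
        for suite in combinations(names, r):
            if sum(machines[m] for m in suite) > budget:
                continue
            total = sum(min(t[m][j] for m in suite) for j in segments)
            if total < best_time:
                best_time, best_set = total, suite
    return best_time, best_set

print(best_suite(budget=6.0))    # a modest two-machine suite wins
print(best_suite(budget=12.0))   # a larger C admits a richer, faster suite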
If C is set at some medium level, say C < $1 million, it adjusts the heterogeneous processing environment for cost-effective computing. Typically, this would stem from using a few minisupercomputers and small parallel machines at one site in place of one mid-sized supercomputer. Alternatively, if C is very large, say C > $10 million, it tunes the environment for the most effective supercomputing. Typically, this would take the form of several different HPC machines at different sites, linked by gigabit networks.

Another performance dimension to consider is whether the heterogeneous processing system is being tuned to achieve maximal throughput for a set of processes or maximum speed on a single job.

Connectivity, bandwidth, and granularity. Heterogeneous processing is an abstract model with potentially many different goals and instantiations. Within this model, no specific connectivity, bandwidth, or granularity is either required or excluded. Clearly, higher bandwidth or less data to transmit implies less transfer-latency penalty and therefore greater ease in sharing subprocesses among different processors - thus potentially a finer granularity of effort. For example, a heterogeneous environment on a global wide area network (WAN) with only medium bandwidth may have to settle for a coarse granularity of effort, therefore achieving only throughput effectiveness rather than increased performance on most individual tasks. Increasing the bandwidth dramatically could also change the optimal strategy to letting the system decompose tasks into subtasks for the purpose of matching each one to the machine that can execute it the fastest - leading to the best overall performance for an entire individual task. This is also typically the case for heterogeneous local area networks (LANs). Because data is generally far larger than instructions, another level of heuristic is to first determine (when possible and by whatever means) where to place data initially so that it is least often moved.

One specialized form of heterogeneous processing (distinct from WAN- and LAN-based forms) is mixed-mode processing, in which more than one type of computation is incorporated within a single unit and it is possible to switch among computational modes at instruction-level granularity with generally negligible overhead. Such capabilities can be obtained from machines explicitly designed to provide mixed-mode functionality. Examples include PASM, the Partitionable SIMD/MIMD system,7 and workstations containing different types of HPC board sets. Mixed-mode machines are ideal for demonstrating the viability of heterogeneous algorithms because the different computational capabilities are provided by units that are tightly coupled, avoiding the possible performance deficits that can obscure results in loosely coupled systems. Studies have clearly indicated the value of mixed-mode algorithms, including cases in which the mode changes at an extremely low level, such as within Do-loops.

High-level orchestration tools. Heterogeneity relies heavily on the development of network orchestration tools like PVM (Parallel Virtual Machine). PVM is described in the article by Beguelin, Dongarra, Geist, and Sunderam. However, according to the "Fundamental Heuristic," these tools need augmentation before they can provide effective heterogeneous solutions.
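As a hedged sketch of the kind of augmentation this implies (the machine names, suitability scores, and load figures below are hypothetical), a dispatcher would first filter machines by their suitability for a subtask's computational type and only then load-balance among the survivors; producing trustworthy suitability scores is precisely the rationale discussed next.

# Hypothetical dispatcher sketch of the Fundamental Heterogeneous Processing
# Heuristic: suitability first, load-balancing second. All data is invented.

suitability = {   # machine -> {computation type: relative suitability, 0 to 1}
    "cm2":    {"simd": 0.9, "vector": 0.3, "serial": 0.05},
    "convex": {"simd": 0.2, "vector": 0.9, "serial": 0.5},
    "sparc":  {"simd": 0.1, "vector": 0.2, "serial": 0.8},
}
current_load = {"cm2": 0.2, "convex": 0.7, "sparc": 0.4}   # fraction busy

def assign(subtask_type, threshold=0.5):
    # (1) Keep only machines suited to this type of computation.
    suited = [m for m, s in suitability.items() if s[subtask_type] >= threshold]
    if not suited:                  # no well-matched machine in the suite
        suited = list(suitability)  # degrade to plain load-balancing
    # (2) Load-balance among the machines that remain.
    return min(suited, key=lambda m: current_load[m])

print(assign("serial"))   # "sparc", not the lightly loaded CM-2
print(assign("vector"))   # "convex", despite its heavier load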
A missing ingredient is the rationale for assigning different subtasks to different processing components, that is, the theoretical and empirical understanding of what kinds of computation fit which architectures. The network orchestration tools and the rationale together satisfy the two components of the Fundamental Heuristic. Such a tool, DHSMS (distributed heterogeneous supercomputing management system), is described in the article by Ghafoor and Yang. DHSMS-like tools are necessary in any case to satisfy the anticipated future explosion in HPC requirements. Today, most implementations of applications requiring HPC rely on one or more computer scientists in addition to applications programmers. DHSMS would aid applications experts in developing HPC programs. This step is clearly a necessity for processing the HPC projects of the nineties, which are expected to be 10 to 100 times more computationally complex.

Programming paradigms. Heterogeneous environments require heterogeneous programming paradigms. One level of this is the adaptation of existing languages for heterogeneous environments. Other paradigms expressly designed with heterogeneity in mind have also been proposed. The article by Silberman and Ebcioglu details the development of a heterogeneous instruction set supporting seamless migration of tasks among various architectures. Jade, discussed in the article by Rinard, Scales, and Lam, is a new, machine-independent language emphasizing data objects. Nicol, Wilkes, and Manola discuss an object-oriented approach to orchestrating heterogeneous, autonomous, and distributed resources. Full-fledged heterogeneous environments will soon require a menu of potential paradigms.

The combination of orchestration tools and new programming paradigms presents great technical challenges. However, they will substantially ease the burden of applications development in future highly networked and heterogeneous software/hardware environments. Researchers in the field of heterogeneous computing believe that such networked environments are an important and most effective way to solve application problems requiring HPC. The article by Rinaldo and Fausey discusses a heterogeneous environment and its potential for physics applications.

Is heterogeneity necessary? Heterogeneity aims to increase the efficiency of computation and thereby the effectiveness and/or cost-effectiveness of both machines and programmers. The kind of efficiency that heterogeneity brings, at the expense and effort of building distributed operating systems, orchestration tools, profiling methodologies, and so forth, is only needed in the face of the scarce commodity of supercomputer power. Perhaps advances in architectures alone will someday supply an excess of computational power at a reasonable cost, like a $50,000 teraflop workstation. Certainly such power would supply an excess of capability for the problems being computed today. However, these problems are currently limited by the computing resources available. Based on the history of computing and reasoned expectations, demand will always grow faster than capability, and computational demands will always far exceed capacity, at least for Grand Challenge problems.9 Thus, heterogeneity is - and will always be - necessary for wide classes of HPC problems.

Acknowledgments

We thank all who submitted manuscripts to this special issue and the many referees who helped select the articles.
We also wish to thank Jon Butler, the previous editor-in-chief of Computer, under whose tenure the concept for this issue arose, as well as Ted Lewis, the current EIC, with whom we have worked for the last several months. Jerry L. Potter and Anthony A. Maciejewski provided us with useful comments about this introduction. We also acknowledge support from each of our institutions, the Naval Command, Control and Ocean Surveillance Center, Research Development Test and Evaluation Division, and the School of Electrical Engineering at Purdue University.

Upcoming events

The Journal of Parallel and Distributed Computing (JPDC) has scheduled a special issue on heterogeneous processing for January 1994.

References

1. H.J. Siegel, S. Abraham, et al., "Report of the Purdue Workshop on Grand Challenges in Computer Architecture for the Support of High-Performance Computing," J. Parallel and Distributed Computing, Vol. 16, No. 3, Nov. 1992, pp. 199-211.

2. R.F. Freund, "SuperC or Distributed Heterogeneous HPC," Computing Systems Eng., Vol. 2, No. 4, 1991, pp. 349-355.

3. G.M. Olson, L.J. McGuffin, E. Kuwana, and J.S. Olson, "Designing Software for a Group's Needs: A Functional Analysis of Synchronous Groupware," in User Interface Software, L. Bass and P. Dewan, eds., John Wiley & Sons, New York, 1993.

4. L.H. Jamieson, "Characterizing Parallel Algorithms," in The Characteristics of Parallel Algorithms, L.H. Jamieson, D.B. Gannon, and R.J. Douglass, eds., MIT Press, Cambridge, Mass., 1987, pp. 65-100.

5. Proc. WHP 92: Workshop on Heterogeneous Processing, R.F. Freund and M. Eshaghian, eds., IEEE Computer Society Press, Los Alamitos, Calif., 1992.

6. R.F. Freund and J.L. Potter, "Hasp: Heterogeneous Associative Processing," Computing Systems Eng., Vol. 3, Nos. 1-4, 1992, pp. 25-31.

7. J.B. Armstrong, D.W. Watson, and H.J. Siegel, "Software Issues for the PASM Parallel Processing System," in Software for Parallel Computation, J.S. Kowalik, ed., Springer-Verlag, Berlin, 1993.

8. H.J. Siegel, J.B. Armstrong, and D.W. Watson, "Mapping Computer-Vision-Related Tasks onto Reconfigurable Parallel Processing Systems," Computer, Vol. 25, No. 2, Feb. 1992, pp. 54-63.

9. Grand Challenges: High-Performance Computing and Communications, Committee on Physical, Mathematical, and Engineering Sciences, National Science Foundation, Washington, D.C., 1991.

Richard F. Freund has been a senior computer scientist at the NRaD facility (formerly the US Navy's Naval Ocean Systems Center) in San Diego since 1987. His research interests in heterogeneous computing include crossovers, code profiling, network profiling, optimal matching, and analytic benchmarking. Freund is the technical founder of the Heterogeneous Computing Project at NRaD and a leader in the application of heterogeneous processing to military command and control. Freund is a subject area editor for the Journal of Parallel and Distributed Computing (JPDC). He is also the foundational organizer of the annual Heterogeneous Processing Workshop, held in conjunction with the International Parallel Processing Symposium.

Howard Jay Siegel is a professor and the coordinator of the Parallel Processing Laboratory in the School of Electrical Engineering at Purdue University in West Lafayette, Indiana. His current research focuses on interconnection networks, heterogeneous computing, and the use and design of the PASM dynamically reconfigurable partitionable mixed-mode parallel computer system. Siegel received two BS degrees from the Massachusetts Institute of Technology (both in 1972), MA and MSE degrees from Princeton University (both in 1974), and the PhD degree from Princeton (in 1977). He has coauthored over 150 technical papers, edited/coedited five volumes, and authored one book. Siegel was the coeditor-in-chief of the Journal of Parallel and Distributed Computing from 1989 to 1991. He is an IEEE fellow and is on the editorial board of IEEE Transactions on Parallel and Distributed Systems.

Readers can contact Richard F. Freund at (619) 553-4071 (voice), (619) 553-5136 (fax), or freund@superc.nosc.mil (e-mail).