MASSACHUSETTS INSTITUTE OF TECHNOLOGY
ARTIFICIAL INTELLIGENCE LABORATORY

A.I. Memo No. 1487                                          May, 1994

Partial Evaluation for Scientific Computing: The Supercomputer Toolkit Experience

Andrew Berlin (berlin@parc.xerox.com)
Rajeev Surati (raj@martigny.ai.mit.edu)

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. The pathname for this publication is: ai-publications/1994/AIM-1487.ps.Z

Abstract

We describe the key role played by partial evaluation in the Supercomputer Toolkit, a parallel computing system for scientific applications that effectively exploits the vast amount of parallelism exposed by partial evaluation. The Supercomputer Toolkit parallel processor and its associated partial evaluation-based compiler have been used extensively by scientists at M.I.T., and have made possible recent results in astrophysics showing that the motion of the planets in our solar system is chaotically unstable.

Copyright (c) Massachusetts Institute of Technology, 1994

This report describes research done at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support for the laboratory's artificial intelligence research is provided in part by the Advanced Research Projects Agency of the Department of Defense under Office of Naval Research contract N00014-92-J-4097 and by the National Science Foundation under grant number MIP-9001651. Andrew Berlin was also supported in part by an IBM Graduate Fellowship in Computer Science.

1 Introduction

In 1989, researchers at M.I.T. and Hewlett-Packard began a joint effort to create the Supercomputer Toolkit, a set of hardware and software building blocks to be used for the construction of special-purpose computational instruments for scientific applications. Earlier work ([6], [7]) had shown that partial evaluation of numerical programs that are mostly data-independent converts a high-level, abstractly specified program into a low-level, special-purpose program, providing order-of-magnitude performance improvement and exposing vast amounts of low-level parallelism. A central focus of the Supercomputer Toolkit project was to find a way to exploit this extremely fine-grained parallelism. By combining the performance improvements available from partial evaluation with novel parallel compilation techniques and a parallel processor architecture specifically designed to execute partially evaluated programs, the Supercomputer Toolkit system enabled scientists to run an important class of abstractly specified programs approximately three orders of magnitude faster than a conventionally compiled program executing on the fastest available workstation.

This paper presents an overview of the role played by partial evaluation in the Supercomputer Toolkit system, describes the novel parallelism grain-size adjustment technique that was developed to make effective use of the fine-grained parallelism exposed by partial evaluation, and summarizes the various real-world scientific projects that have made use of the Supercomputer Toolkit system.
2 Motivation

Scientists are faced with a dilemma: they need to be able to write programs in a high-level language that allows them to express their understanding of a problem, but at the same time they need their programs to execute very quickly, as their problems often require weeks or even months of computation time. In the astrophysics community, the situation had become critical: programs would be written in a few days in a high-level language, only to have weeks or even months invested in re-expressing the problem so that it could make better use of a vectorizing subroutine library, rewriting the entire program in assembly language, or, in extreme cases, constructing special-purpose hardware to solve the problem ([16]). Although partial evaluation promised to provide a solution to this dilemma for an important class of numerically intensive programs, the parallel hardware and compilation technology required to take full advantage of the potential of partial evaluation did not exist.

Much of the design of the Supercomputer Toolkit was based on the observation (see [7]) that numerical applications are special in that they are for the most part data-independent, meaning that the sequence of numerical operations that will be performed is independent of the actual numerical values being manipulated. For instance, matrix multiplication performs the same sequence of numerical operations regardless of the actual numerical values of the matrix elements. Partial evaluation of a data-independent program has the effect of removing all data abstractions and program structure, producing a purely numerical program that fully exposes the low-level parallelism inherent in the underlying computation. For the scientific applications we were targeting, such as orbital mechanics calculations, partial evaluation of data-independent calculations produced purely numerical programs containing several thousand floating-point operations, with the potential for parallel execution of 50 to 100 operations simultaneously. However, the parallelism exposed by partial evaluation is difficult to exploit, because it is extremely fine-grained, at the level of individual numerical operations.
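To make this concrete, consider a small illustrative example (ours, not code from the Toolkit itself): a dot product written abstractly over lists. When the lengths of the vectors (here, three elements) are known at compile time, but the element values themselves (x0 ... y2 below) will not arrive until run time, all of the list structure and recursion disappears, leaving only numerical operations.

    ;; Abstract source program: data-independent, since the sequence of
    ;; numerical operations depends only on the length of the lists.
    (define (dot u v)
      (if (null? u)
          0
          (+ (* (car u) (car v))
             (dot (cdr u) (cdr v)))))

    ;; The kind of residual program partial evaluation would produce when
    ;; the vectors are known to hold three run-time values (after folding
    ;; the final "+ 0"): no lists, no recursion, only floating-point work.
    (define (dot-3 x0 x1 x2 y0 y1 y2)
      (+ (* x0 y0)
         (+ (* x1 y1)
            (* x2 y2))))

The three multiplications in the residual program are mutually independent and could execute simultaneously; it is exactly this kind of operation-level parallelism, multiplied across thousands of operations in a real application, that the Toolkit was built to exploit.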
3 The Supercomputer Toolkit System

The Supercomputer Toolkit is a parallel processor consisting of eight independent processors connected by two independent communication busses. The Toolkit system makes effective use of the parallelism exposed by partial evaluation in two ways. First, within each processor, fine-grain parallelism is used to keep the pipeline of a floating-point chip set fully utilized. Second, multiple operations can execute in parallel on multiple processors.

The compilation process consists of four major phases. The first phase begins by using partial evaluation to convert each data-independent section of a program into a data-flow graph that consists entirely of numerical operations; this is followed by traditional compiler optimizations, such as constant folding and dead-code elimination. The second phase analyzes locality constraints within the data-flow graph and groups fine-grain operations together to form higher grain-size instructions known as regions. In the third phase, critical-path based heuristic scheduling techniques are used to assign each coarse-grain region to a processor. Finally, the region boundaries are broken down, and instruction-level scheduling is performed to assign computational resources to the fine-grain operations that have been assigned to each processor. A very detailed discussion of the compiler and all of its phases can be found in [3] and [5].

Before discussing the details of the Supercomputer Toolkit architecture and compilation techniques, we present a set of measurements intended to provide an idea of the relative importance of the various sources of performance improvement achieved by the Toolkit system, using a 9-body orbital mechanics program as an example (specifically, five time-steps of a 12th-order Stormer integration of the gravity-induced motion of a 9-body solar system).

1. The performance improvement provided by using partial evaluation to convert a high-level, data-independent program into a low-level, purely numerical data-flow graph was measured by expressing the data-flow graph as an RTL-style program in the C programming language, using a C vector to store the numerical value produced by each node in the data-flow graph. Comparison of this low-level (partially evaluated) C program to the original Scheme program (compiled by the LIAR Scheme compiler) revealed speed-ups that typically ranged from 10 to 100 times. In the case of the 9-body program, partial evaluation provided a speedup factor of 38x. This speedup can be realized through execution in C on traditional sequential machines as well as through execution on the Supercomputer Toolkit.

2. The performance improvement provided by the ability of each Toolkit processor to make effective use of fine-grain parallelism to keep the floating-point pipeline full was measured by comparing the sustained rate attained by each Toolkit processor (12.9 Mflops) to the sustained rate attained by the fastest workstation available at the time (an HP9000/835, which happened to use the same floating-point chip set as the Supercomputer Toolkit processor) executing hand-optimized code expressed in Fortran (2 Mflops). Thus the Toolkit's processor architecture achieved approximately a 6x performance improvement by enabling multiple fine-grained instructions to execute in parallel within the floating-point chip set.

3. The effectiveness of the static scheduling and grain-size adjustment parallel compilation techniques at making use of multiple Toolkit processors simultaneously was measured by comparing the execution time of the 9-body program executing on eight Toolkit processors in parallel to a virtually optimal uniprocessor implementation of the 9-body program. A factor of 6.2x performance improvement was attained by making use of eight processors in parallel.

The speedups available from partial evaluation, from the use of fine-grain parallelism within each processor, and from multiprocessor execution are orthogonal. Thus, from the "black box" point of view of our scientific user community, the 9-body program executed in parallel on the Supercomputer Toolkit 1413x faster than the traditionally compiled high-level Scheme program executed on a high-performance workstation. Of this speedup, a factor of 38 resulted directly from partial evaluation and could have been achieved by executing the partially evaluated program in C on a workstation, while a factor of 37.2 resulted from the ability of the Supercomputer Toolkit hardware to make use of the parallelism exposed by partial evaluation.
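Because the three contributions are orthogonal, the overall figure quoted above is simply their product; the following arithmetic (our reconstruction from the numbers given above) shows how the factors compose:

    38 (partial evaluation) x 6 (pipeline utilization) x 6.2 (eight processors)
        = 38 x 37.2
        = 1413.6, i.e. approximately 1413x overall,

with the factor of 37.2 being the portion contributed by the Toolkit hardware's exploitation of the exposed parallelism.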
4 Design Goal: Optimization of Data-Independent Programs

The Supercomputer Toolkit system was designed based on the observation that in the scientific applications we were most interested in, such as the integration of ordinary differential equations, the data-dependent portions of a program tend to be very small, typically taking the form of error checks or "Is it good enough yet?" style loops, with the vast majority of the computation occurring in the data-independent portions of the program.

This focus on data-independent programs was carried to an extreme, leading to a system that provided extraordinary performance on data-independent code, but which required that code containing data-dependent branches be left residual. In most partial evaluation systems, the partially evaluated program is expressed in the same programming language as the source program, allowing code that is left residual to intermingle with code that is partially evaluated. In our system, however, partially evaluated code is executed on a specialized numerical processor that does not support the original source language. Each piece of code that is not partially evaluated must be converted (either by hand or by an application-specific program generator) into the low-level assembly language of each Toolkit processor. Thus, in order to use the Supercomputer Toolkit compiler on a data-dependent program, the program must first be divided up into data-independent subprograms, each of which is then compiled (via partial evaluation and parallel scheduling) to form a high-performance subroutine.

For the numerical applications the Toolkit was intended to be used for, such as the integration of ordinary differential equations, the division of programs into data-independent subprograms did not pose a major problem, as the complexity inherent in these problems tends to be isolated in one or two well-defined data-independent subprograms. However, when people from communities outside of the Toolkit's originally intended user base began to use the Toolkit for problems exhibiting greater data dependence, the poor handling of data-dependent branches posed a serious obstacle.

It is important to note that no technical obstacle prevented better handling and limited partial evaluation of data-dependent branches. Indeed, our original intention was to implement a compilation process that combined aggressive partial evaluation-based optimization of data-independent subprograms with traditional code generation techniques that would handle the data-dependent branches. However, this integration with traditional techniques was never completed: as soon as the portion of the compiler that handles data-independent programs became operational, the allure of the dramatic performance increases motivated scientists to start using the system immediately, using a few lines of assembly language to implement the residual data dependencies and invoking the compiled data-independent subprograms from assembly language as subroutines. Eventually, a number of the users built their own application-specific program generators on top of the Toolkit compiler; these generators automatically created the few lines of assembly-language instructions required to implement the data-dependent branches of their programs.
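The division of labor described above can be pictured with a short sketch (our own illustration; the procedure names are hypothetical and the numerical step shown is a toy forward-Euler update, not one of the Toolkit's integrators). The step procedure is data-independent and is the kind of code the Toolkit compiler partially evaluates and schedules; the surrounding "Is it good enough yet?" driver is data-dependent and, on the Toolkit, was implemented as a few lines of assembly language that repeatedly invoke the compiled subroutine.

    ;; Data-independent step: a fixed sequence of numerical operations
    ;; (here a toy forward-Euler update of a unit harmonic oscillator).
    ;; This is the part that partial evaluation turns into straight-line
    ;; floating-point code.
    (define (integration-step state dt)
      (let ((x (car state))
            (v (cdr state)))
        (cons (+ x (* v dt))        ; x' = x + v*dt
              (- v (* x dt)))))     ; v' = v - x*dt

    ;; Data-dependent driver: the branch depends on computed values, so it
    ;; is left residual (on the Toolkit, hand-coded or emitted by an
    ;; application-specific program generator).
    (define (integrate state dt done?)
      (if (done? state)
          state
          (integrate (integration-step state dt) dt done?)))

    ;; Example: integrate until the position first becomes negative.
    (integrate (cons 1.0 0.0) 0.001 (lambda (s) (negative? (car s))))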
5 The Partial Evaluator

The Supercomputer Toolkit compiler performs partial evaluation of data-independent programs expressed in the Scheme dialect of Lisp by using the symbolic execution technique described in previously published work by Berlin ([6]). Using this technique, the input data structures for a particular problem are provided at compile time, using placeholders to represent those numerical values that will not be available until execution time. Partial evaluation occurs by executing the program symbolically at compile time, creating and accessing data structures as necessary, and performing numerical operations whenever possible. The partial evaluator leaves residual only those operations whose numerical input values will not be available until execution time. The partially evaluated program consists entirely of numerical operations: the execution of all loops, data-structure references and creations, and procedure manipulations occurs at compile time.

Our partial evaluation strategy proved quite effective on the ordinary-differential-equation style applications we originally envisioned the Toolkit would be used for. As a wider scope of applications began to develop, the most serious deficiency in our system proved to be the lack of support for leaving selected data-structure operations residual in the partial evaluation process. For instance, although users might want an operation such as matrix multiply to be completely unrolled, they might still want the resulting data to be stored in a particular matrix format. Our system eliminated all data structures, making it difficult to perform certain programming tricks that rely on the location of a piece of data in memory, and requiring a data rearrangement when interfacing with subroutines that had particular memory-storage expectations.
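The following sketch illustrates the placeholder/symbolic-execution idea in miniature (our own illustration of the mechanism described above, not the Toolkit's actual implementation, which is described in [6]): operations whose inputs are ordinary numbers are performed at compile time, while any operation that touches a placeholder is recorded as a residual instruction in the emerging data-flow graph.

    (define residual-code '())        ; residual instructions, most recent first
    (define counter 0)

    (define (make-placeholder name) (list 'placeholder name))
    (define (placeholder? x) (and (pair? x) (eq? (car x) 'placeholder)))

    ;; Record a residual operation and return a fresh placeholder that
    ;; stands for its (not yet available) result.
    (define (emit! op args)
      (set! counter (+ counter 1))
      (let ((result (make-placeholder
                     (string->symbol
                      (string-append "t" (number->string counter))))))
        (set! residual-code (cons (list result op args) residual-code))
        result))

    ;; Lift a numerical operator: fold it at compile time when both inputs
    ;; are known, otherwise leave it residual.
    (define (lift op concrete)
      (lambda (a b)
        (if (or (placeholder? a) (placeholder? b))
            (emit! op (list a b))
            (concrete a b))))

    (define s+ (lift '+ +))
    (define s* (lift '* *))

    ;; Symbolically execute (+ (* x 2) (* 3 4)) where x is a run-time value:
    ;; (* 3 4) folds to 12, and the remaining operations become residual.
    (define x (make-placeholder 'x))
    (s+ (s* x 2) (s* 3 4))

After the final expression is executed, (reverse residual-code) holds the straight-line numerical program: the multiplication of x by 2, followed by the addition of the already-folded constant 12.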
6 The Toolkit Processor Architecture

Each Supercomputer Toolkit processor is a Very Long Instruction Word (VLIW) computer. The processor architecture is designed to make effective use of the fine-grain parallelism exposed by partial evaluation by keeping a pipelined high-performance floating-point chip set fully utilized. In general, the floating-point chip set produces a 64-bit result during every cycle, and requires two 64-bit inputs during each cycle. Constructing a processor that can move around enough data to keep the floating-point chips busy required the inclusion within each processor of two independent memory systems, as illustrated in Figure 1. Each memory system has its own dedicated integer ALU and register file for generating memory addresses, while a third integer ALU handles program-counter sequencing operations. To support interprocessor communication, each processor has two high-speed input/output ports attached directly to its main register files. For a more detailed description of the Supercomputer Toolkit processor architecture, see [2].

[Figure 1: The overall architecture of a Supercomputer Toolkit processor node, consisting of a fast floating-point chip set, a 5-port register file (32 x 64 bits), two memories (each 16K x 64 bits) with their own address-generation ALUs, two I/O ports, a sequencer, and a control store (16K x 168 bits).]

Since partial evaluation eliminated all data structures and higher-order procedure calls, the compiler was able to predict the data needs of the floating-point chips at compile time, giving it the freedom to decide which of the two memory systems each result would be stored in, and to begin the data movement necessary to support a particular floating-point operation many cycles in advance of the actual start of the operation. Due to the pipeline structure of the floating-point chip set, it is possible to initiate an operation during each cycle, but the result of that operation is often not available for use by the next operation. By utilizing the parallelism exposed by partial evaluation, the Toolkit compiler was able to schedule operations during these intermediate cycles, thereby keeping the floating-point chip set fully utilized. Indeed, on a wide variety of applications, the Supercomputer Toolkit compiler was able to sustain floating-point unit usage rates in excess of 99%.

In theory, up to twelve Toolkit processors may be combined to form a parallel computing system, although the largest system ever constructed is an eight-processor system. Each Toolkit processor has its own program counter and is capable of independent operation. Special synchronization and branch-control hardware allows the program counters of the various processors to track one another, effectively allowing a single program to make use of multiple processors simultaneously. The experimental results presented in this paper were obtained on an eight-processor Supercomputer Toolkit, configured so that two independent interprocessor communication channels were shared by all eight processors.
7 Parallel Compilation Technology

We have developed parallel compilation software that automatically distributes a data-independent computation for parallel execution on multiple processors. Dividing up the computation at compile time is practical only because partial evaluation eliminates the uncertainty about which numerical operations the compiled program will perform, by evaluating conditional branches related to data structures and strategy selection at compile time. In other words, all branches of the form "Have we reached the end of the vector yet?" and "Have we been through this loop 5 times yet?" are eliminated at compile time, leaving for run-time execution only those branches that actually depend on the numerical values of the results being computed. Thus the partial evaluation process is similar to loop unrolling, but is much more extensive, as partial evaluation also eliminates inherently sequential procedural abstractions and data structures, such as lists, that would otherwise act as barriers to parallel execution.

In the compiler community, a sequence of computation instructions ending in a conditional branch is known as a basic block. The largest basic blocks produced by traditional compilers are usually around 10-30 instructions in length, and reflect the calculations expressed within the innermost loop of a program. In contrast, the basic blocks of a partially evaluated program are usually several thousand instructions in length. For example, the basic block associated with the 9-body program mentioned earlier consisted of 2208 floating-point instructions. A limitation of the partial evaluation approach is that for programs that manipulate large amounts of data, the basic blocks may actually get too long to fit in memory, at which point it is necessary for the programmer to declare that certain data-independent branches, such as outermost loops, should be left intact, limiting the scope of partial evaluation.

Each basic block produced by partial evaluation may be represented as a data-independent (static) data-flow graph whose operators are all low-level numerical operations. Previous work ([6]) has shown that this graph contains large amounts of low-level parallelism. For instance, the parallelism profile for the 9-body program, illustrated in Figure 2, indicates that partial evaluation exposed so much low-level parallelism that, in theory, parallel execution could speed up the computation by a factor of 69x over a uniprocessor execution. However, achieving this theoretical maximum speedup would require 516 non-pipelined processors capable of instantaneous communication with one another.

(We originally chose the 9-body program as an example to ease comparison with previously published work that also studied this program, including [11], [6], and [4]. There are numerical discrepancies between the theoretical speedup factors reported in this paper and those presented in our previously published work, due to improvements that were made to the constant-folding phase of our compiler; as a result of these improvements, the data-flow graph of the 9-body program discussed in this paper has fewer operations than the data-flow graph used in [6] and [4]. All graphs and statistics presented in this paper, including the parallelism profile, have been updated to account for this change.)

In practice, much of the available parallelism must be used within each processor to keep the floating-point pipeline full, and it takes time (latency) to communicate between processors. As the latency of interprocessor communication increases, the maximum possible speedup decreases, as some of the parallelism must be used to keep each processor busy while awaiting the arrival of results from neighboring processors. Bandwidth limitations on the interprocessor communication channels further restrict how parallelism may be used, by requiring that most numerical values used by a processor actually be produced by that processor.

[Figure 2: Operation-level parallelism profile of the 9-body problem (operations available for parallel execution, plotted against cycles). This graph represents all of the parallelism available in the problem, taking into account the varying latency of numerical operations.]
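To show what a parallelism profile of this kind measures, here is a small sketch (ours; the Toolkit's actual profiling and scheduling machinery is described in [3] and [5]) that computes the profile of a static data-flow graph. Each node is written as (name latency input-names ...), nodes are listed so that producers precede their consumers, and every operation is assumed to start as soon as all of its inputs have finished, with unlimited processors and free communication, which is exactly the idealization behind the 69x figure quoted above.

    ;; Finish time of every node under the "unlimited processors,
    ;; instantaneous communication" idealization.
    (define (finish-times graph)
      (let loop ((nodes graph) (done '()))
        (if (null? nodes)
            done
            (let* ((node (car nodes))
                   (latency (cadr node))
                   (inputs (cddr node))
                   (ready (apply max 0 (map (lambda (i) (cdr (assq i done)))
                                            inputs))))
              (loop (cdr nodes)
                    (cons (cons (car node) (+ ready latency)) done))))))

    ;; Number of operations in flight during each cycle.
    (define (parallelism-profile graph)
      (let* ((finish (finish-times graph))
             (horizon (apply max (map cdr finish)))
             (profile (make-vector horizon 0)))
        (for-each
         (lambda (node)
           (let* ((end (cdr (assq (car node) finish)))
                  (start (- end (cadr node))))
             (do ((c start (+ c 1)))           ; mark every cycle the op occupies
                 ((= c end))
               (vector-set! profile c (+ 1 (vector-ref profile c))))))
         graph)
        (vector->list profile)))

    ;; Example: t1 and t2 are independent; t3 consumes both.
    (parallelism-profile '((t1 2) (t2 1) (t3 1 t1 t2)))   ; => (2 1 1)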
8 Parallel Scheduling Techniques

Previously published work by Berlin and Weise ([4]) suggested the use of critical-path based parallel scheduling techniques to take advantage of the low-level parallelism exposed by partial evaluation. Critical-path based techniques, which give priority to the longest computations in a program, are very effective at overcoming latency limitations, but do not consider bandwidth limitations at all. In other words, a critical-path based scheduler will seek to schedule a non-critical-path operation on any processor that happens to be available, without regard to the fact that the operands and result of that operation may need to be transmitted between processors. This approach is only effective in situations where a large amount of interprocessor communication bandwidth is available, making it feasible for many results to be transmitted between processors.

Each of the Supercomputer Toolkit's two interprocessor communication channels can accept one result every other cycle. As a result of this communication bandwidth limitation, on an eight-processor system only one out of every eight results produced by a processor can be transmitted to other processors. Thus, on the Toolkit system, roughly seven out of every eight numerical results used by a processor must be produced by that processor.
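The one-result-in-eight figure follows from a simple capacity argument (our arithmetic, assuming that each of the eight processors initiates roughly one floating-point result per cycle, as the pipelined architecture of Section 6 permits):

    communication capacity = 2 channels x 1/2 result per cycle = 1 result per cycle
    aggregate production   = 8 processors x 1 result per cycle = 8 results per cycle

so at most one of every eight results produced can be transmitted, and the remaining seven out of eight must be consumed on the processor that produced them.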
We first attempted to generate parallel code for the Supercomputer Toolkit using critical-path based scheduling techniques similar to those suggested by Berlin and Weise. Due to communication bandwidth limitations, the results were dismal: on the 9-body program, a speedup factor of only 2.5x was achieved using eight processors.

9 Grain-Size Adjustment

To overcome the scheduling difficulties associated with limited communication bandwidth, we developed a technique that adjusts the grain size of the fine-grain parallelism exposed by partial evaluation to match the interprocessor communication capabilities of the architecture. Prior to initiating critical-path based scheduling, we perform a locality analysis that groups together operations that depend so closely on one another that it would not be practical to place them in different processors. Each group of closely interdependent operations forms a larger grain-size instruction, which we refer to as a region. (The name region was chosen because we think of the grain-size adjustment technique as identifying "regions" of locality within the data-flow graph. The process of grain-size adjustment is closely related to the problem of graph multisection, although our region-finder is somewhat more particular about the properties (shape, size, and connectivity) of each region subgraph than are typical graph multisection algorithms.) In essence, grouping operations together to form a region is a way of simplifying the scheduling process by deciding in advance that certain opportunities for parallel execution will be ignored due to limited communication capabilities.

Critical-path based scheduling is then performed, and works effectively, at the region level, assigning regions to processors rather than assigning fine-grain instructions to processors. Since all operations within a region are guaranteed to be scheduled onto the same processor, the maximum region size must be chosen to match the communication capabilities of the target architecture. For instance, if regions were permitted to grow too large, a single region might encompass the entire data-flow graph, forcing the entire computation to be performed on a single processor! Although strict limits are therefore placed on the maximum size of a region, regions need not be of uniform size. Indeed, some regions are large, corresponding to localized computation of intermediate results, while other regions are quite small, corresponding to results that are used globally throughout the computation.

We have experimented with several different heuristics for grouping operations into regions. The optimal strategy for grouping instructions into regions varies with the application and with the communication limitations of the target architecture. However, we have found that even a relatively simple grain-size adjustment strategy dramatically improves the performance of the scheduling process. For instance, as illustrated in Figure 3, when a value is used by only one instruction, the producer and consumer of that value are grouped together to form a region, thereby ensuring that the scheduler will not place the producer and consumer on different processors in an attempt to use spare cycles wherever they happen to be available. Provided that the maximum region size is chosen appropriately, grouping operations together based on locality prevents the scheduler from making gratuitous use of the communication channels, forcing it to focus on scheduling options that make more effective use of the limited communication bandwidth. (The maximum region size must be chosen such that the computational latency of the operations grouped together is well matched to the communication bandwidth limitations of the architecture; if regions are made too large, communication bandwidth will be underutilized, since the operations within a region do not transmit their results.)

[Figure 3: A simple region-forming heuristic. A region is formed by grouping together operations that have a simple producer/consumer relationship. This process is invoked repeatedly, with the region growing in size as additional producers are added. The region-growing process terminates when no suitable producers remain, or when the maximum region size is reached. A producer is considered suitable to be included in a region if it produces its result solely for use by that region. (The numbers shown within each node reflect the computational latency of the operation.)]
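The heuristic of Figure 3 can be sketched as follows (our own illustration, reusing the (name latency input ...) node format of the parallelism-profile sketch above; the Toolkit compiler's region-finder is additionally particular about the shape, size, and connectivity of each region). A region grows from a seed operation by repeatedly absorbing any producer whose result is used only inside the region, and stops when no suitable producer remains or when the region's accumulated latency would exceed the chosen maximum.

    ;; Does every consumer of `name' lie inside `region', and is there at
    ;; least one consumer?  (A producer with no consumers is the final
    ;; output and is never absorbed.)
    (define (used-only-by? name region graph)
      (let loop ((nodes graph) (found-use #f))
        (cond ((null? nodes) found-use)
              ((memq name (cddr (car nodes)))        ; this node consumes `name'
               (and (memq (car (car nodes)) region)
                    (loop (cdr nodes) #t)))
              (else (loop (cdr nodes) found-use)))))

    ;; Total computational latency of the operations already in the region.
    (define (region-latency region graph)
      (apply + (map (lambda (n) (cadr (assq n graph))) region)))

    ;; Grow a region from `seed' by absorbing suitable producers until none
    ;; remain or the latency budget would be exceeded.
    (define (grow-region seed graph max-latency)
      (let grow ((region (list seed)) (candidates graph))
        (cond ((null? candidates) region)
              ((let ((node (car candidates)))
                 (and (not (memq (car node) region))
                      (used-only-by? (car node) region graph)
                      (<= (+ (region-latency region graph) (cadr node))
                          max-latency)))
               ;; absorb it and rescan: its own producers may now qualify
               (grow (cons (car (car candidates)) region) graph))
              (else (grow region (cdr candidates))))))

    ;; Example in the spirit of Figure 3: mul feeds add, which feeds div.
    (grow-region 'div '((mul 1) (add 1 mul) (div 5 add)) 8)   ; => (mul add div)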
Exploiting locality by grouping operations into regions forces closely related operations to occur on the same processor. Although this reduces interprocessor communication requirements, it also eliminates many opportunities for parallel execution. Figure 4 shows the parallelism remaining in the 9-body problem after operations have been grouped into regions. Comparison with Figure 2 shows that increasing the grain size eliminated about half of the opportunities for parallel execution. The challenge facing the parallel scheduler is to make effective use of the limited parallelism that remains, while taking into consideration such factors as communication latency, memory traffic, pipeline delays, and allocation of resources such as processor buses and interprocessor communication channels.

10 Performance Measurements

The final result of compiling the 9-body program using the Supercomputer Toolkit compiler is shown in Figure 5. Notice how the compiler was able to take the available parallelism shown in Figure 4 and spread it across the processors. By utilizing eight processors in parallel, the compiler was able to achieve a speedup factor of approximately 6.2x over a nearly optimal implementation of this program running on a single Toolkit processor. (Figure 5 represents a single time step of the integration, on which the compiler achieved a speedup factor of 6.5x using eight processors. The more conservative speedup factor quoted throughout this document for the 9-body problem refers to five integration time steps, thereby including the overhead of moving data around to restart the computation after each time step.)

[Figure 4: Parallelism profile of the 9-body problem after operations have been grouped together to form regions. Comparison with Figure 2 clearly shows that increasing the grain size significantly reduced the opportunities for parallel execution. In particular, the maximum speedup factor dropped from 69 times faster to only 34.5 times faster than a single processor.]

[Figure 5: Processor utilization per cycle resulting from scheduling the 9-body program onto eight Supercomputer Toolkit processors. Comparison with the region-level parallelism profile (Figure 4) illustrates how the scheduler spread the coarse-grain parallelism across the processors. A total of 340 cycles is required to complete the computation. On average, 6.5 of the 8 processors are utilized during each cycle.]
11 Applications

A variety of scientific applications made use of the Supercomputer Toolkit system, ranging from numerical integration of the solar system to clinical genetic counseling. Some applications utilized only a single Toolkit processor, while others ran the same program on multiple processors simultaneously, or used the automatic parallelization features of the compiler to execute a single program on eight processors in parallel. We present an overview of these applications, focusing on the role played by partial evaluation, and on the advantages and difficulties encountered.

Chaos in the Solar System: The Supercomputer Toolkit application having the most scientific importance was a 100-million-year integration of the entire Solar System, incorporating a post-Newtonian approximation to General Relativity and corrections for the quadrupole moment of the Earth-Moon system. The longest previous such integration ([21]) was for about 3 million years. The integration performed on the Supercomputer Toolkit confirmed that the evolution of the Solar System as a whole is chaotic, with a remarkably short time scale of exponential divergence of about 4 million years. A complete analysis of the integration results appears in [1].

A novel type of symplectic integration strategy was developed by Wisdom and Holman for use in this application, and was expressed in the Scheme language using an abstract programming style. Partial evaluation specialized this integration strategy for use on the solar system problem with a particular force law (gravitation) and a particular solar system configuration. The 100-million-year integration used eight Toolkit processors running in parallel. The computation was arranged so that each processor simulated a single solar system, but with each processor starting from slightly different initial conditions. Chaos was observed by comparing the differences between the states that evolved from the slightly varying initial conditions. The Toolkit compiler was used to generate code for each processor independently. The compiled code for a single processor contains almost 10,000 Toolkit instructions for each integration step, more than 98% of which correspond to floating-point operations.

This application posed somewhat of a challenge to our partial evaluation system, as it violated our simple model of programs as consisting of data-independent inner loops surrounded by data-dependent branches. Specifically, the new integration strategy took advantage of the elliptical nature of the planetary orbits, making extensive use of selection operations and scientific subroutines, some of which were heavily data-dependent. Thus this program had data dependencies at the very core of its innermost loops. We chose to handle these innermost data dependencies by providing a mechanism for leaving subroutines residual. In our hybrid system, this amounted to allowing a partially evaluated program to include a call to a data-dependent hand-coded routine, such as sin. By developing a small library of code that could be left residual, including the trigonometric functions as well as a few selection operations such as "return the second argument if the first argument is greater than 0", we were able to abstract away these innermost data dependencies, effectively burying them inside rather simple subroutines. Note that an alternative approach would have been to use techniques for extending the placeholder-based partial evaluation strategy to allow it to generate code that contains selection-style conditional branches, as described in [7]. We did indeed add these techniques to our front-end partial evaluator, but we did not extend the code-generation back end to handle conditional branches, primarily because demand for this functionality from our scientific users dropped off once the subroutine library of selection operations became available.
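At the source level, the selection operations in this small residual library were as simple as the sketch below (our own illustration; in the actual system such routines were hand-coded Toolkit subroutines that the partial evaluator treated as opaque, residual calls, exactly like sin).

    ;; "Return the second argument if the first argument is greater than 0."
    ;; Left residual: the partial evaluator emits a call to this routine
    ;; rather than trying to evaluate the comparison at compile time.
    (define (select-if-positive x a b)
      (if (> x 0) a b))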
Orrery Verification Experiment: Another astrophysics application involved verifying results that had been obtained in 1988 by G. Sussman and J. Wisdom using the Digital Orrery to demonstrate that the long-term motion of the planet Pluto, and by implication the dynamics of the Solar System, is chaotic ([15]). The Digital Orrery was a special-purpose parallel computer designed explicitly to integrate the solar system. Computations run on the Orrery were parallelized and programmed in microcode by hand, with one processor devoted to each planet. In contrast, the program that executed on the Supercomputer Toolkit was written in Scheme and automatically compiled using the Toolkit's partial evaluation-based compiler. The Orrery computation required integrating the positions of the outer planets for a simulated time of 845 million years (note that this involves only 6 planets, rather than the 9 of the whole solar system), which had required running the Orrery continuously for more than three months. The same integrations, utilizing a 6-body Stormer integrator, were performed on a single Toolkit processor, showing that each Toolkit processor coupled with the compiled, partially evaluated code was about 3 times faster than the entire multiple-processor Digital Orrery.

This program mapped nearly perfectly onto the Toolkit system. The only data-dependent branches were located at the outermost "is it done yet?" loop. With the exception of this single-instruction end test, the entire program was partially evaluated. The abstract programming style enabled by partial evaluation permitted quad-precision floating-point operations to be substituted for double-precision operations with the simple replacement of a few procedure definitions. The Orrery verification experiment ran on a single Toolkit processor, since the automatic parallelization portion of the Toolkit compiler was not yet operational at the time the experiment was performed. Once the automatic parallelizer was completed, we compiled a Stormer integration of a full 9-planet solar system, generating a program that utilized eight processors in parallel to achieve a factor of 6.2x speedup over the single-processor Toolkit program. This program, used as an example earlier in this paper, was the first to take full advantage of the parallelism exposed by partial evaluation, and to the best of our knowledge constituted the fastest integration of the solar system ever achieved.

Circuit Simulation: Hal Abelson, Jacob Katzenelson, and Ognen Nastov wrote several programs that utilized the Toolkit to perform simulation of circuits such as phase-locked loops. Some of the problems they studied, including a voltage-controlled oscillator and a phase-locked loop, utilized a Runge-Kutta integrator, which was well suited to the Toolkit environment. Both simulations, when compiled by the Toolkit compiler, were shown to run approximately 6 times faster on a Toolkit processor than on the best floating-point workstation available at the time, an HP835 running a Fortran version of the same program. Partial evaluation was used to specialize the circuit simulator and integration method for the particular circuit being simulated. When a straightforward integration strategy such as 4th-order Runge-Kutta was used, the application was almost entirely data-independent, mapping very well onto the Toolkit architecture. However, simulation of many of the circuits studied required the integration of a stiff system of differential equations, using a complex and highly data-dependent Gear integration technique. The Gear integration technique uses a sparse linear-equation solver, which involves significant data-dependent control flow. It was possible to use the Toolkit compiler to produce code for the data-independent portions of these simulations, including the code that implements the dynamic equations of the circuit itself, but implementation of the highly data-dependent portions of the Gear integrator had to be performed by hand in assembly language. This required the assembly-language programmer to have knowledge of the storage-allocation strategy used by the compiler to store results in memory, which led to a fairly complex and not very well organized set of interactions. A much-needed enhancement to our system would be a way for the programmer to request that the compiler adhere to a particular data-storage strategy, such as maintaining a particular data representation for a matrix, rather than the strategy used by our current implementation, which leaves the compiler free to store data values in any place that is convenient, including processor registers.

Interestingly, despite the use of partial evaluation, circuit simulations involving the Gear integrator ran slowly compared to other circuit simulators. Later investigation revealed that this was primarily because this simulator, and the Gear integrator in particular, did not employ some implementation tricks that are used by other circuit simulators such as SPICE. Another factor limiting the performance of this application was that the interface between the compiled code implementing the circuit dynamics and the hand-written code implementing the Gear integrator involved a great deal of copying of data. A better interface, one that allows the compiler to take the ultimate destination of a value into account, would provide a noticeable performance improvement.
Clinical Genetic Counseling: Finally, a program to calculate the probabilistic relationships over a Bayesian network such as a pedigree was written by Minghsun Liu. This program was designed to answer the "What if?" types of questions that arise in genetic counseling when determining the probability that a potential child may have a particular defect. The computation time grows exponentially with the number of "unknown" nodes in the probability tree. However, if certain assumptions are made about the relative independence of some of these "unknown" nodes, partial evaluation can play an important role, significantly reducing the size of the computation, as described in more detail in [17] and [18]. For any particular program invocation, this program performed well. However, for successive invocations, execution speed was hampered by the lack of the ability to perform incremental partial evaluation, which would have allowed the structure of the network to be locally changed without triggering recompilation of the entire probability network.

Computation of Lyapunov Exponents: The Toolkit was used in an experiment by Shyam Parekh to compute the Lyapunov exponents of non-linear systems. Lyapunov exponents characterize the rate at which nearby trajectories of a dynamical system diverge, and can serve as an indicator of chaotic behavior. The Supercomputer Toolkit system was used to do parameter-space scans of chaotic circuits such as the double-scroll circuit. These theoretical scans were compared against actual scans performed using a real circuit. The results and implementation details of these experiments can be found in [19].

An Integration System for Ordinary Differential Equations: Sarah Ferguson built a software system on top of the Toolkit compiler that takes an equation as input and automatically generates a Scheme program to integrate it. Her system uses the partial-evaluation features of the Toolkit compiler to specialize the integrator for the particular equation being integrated, and to generate code for the main body of the integration. The system also generates a few lines of Toolkit assembly language that implement a data-dependent branch that adjusts the integration step size based on how much integration error is being encountered. This system performed quite well, with the data-dependent branches playing a minor role that did not significantly affect system performance.

Elizabeth Bradley used Sarah Ferguson's integration system to perform dynamical simulations of chaotic systems as part of her research on control of chaotic systems ([20]), including the Lorenz system and the double pendulum. These systems were a perfect match for both our partial evaluation technology and the Toolkit architecture, and executed extremely quickly. Unfortunately, the Toolkit was designed to support applications that run for a long time before producing a result, whereas Elizabeth Bradley needed to capture the intermediate results that were being produced rapidly. Although the computationally expensive integration routines mapped very well onto the Toolkit architecture, the symbolic routines that analyzed the numerical results could not be executed on the numerically oriented Toolkit system and had to be run on the workstation host. The program thus became I/O limited, with the Toolkit computer producing data far more quickly than it could be transferred to the workstation host. A faster I/O connection to the Toolkit that would have solved this problem was designed, but never constructed.
12 Conclusions and Suggestions for Future Work

To the best of our knowledge, the Supercomputer Toolkit system is the first to make effective use of the vast amount of low-level parallelism exposed by partial evaluation. Partial evaluation proved effective in virtually all of the applications encountered during the Supercomputer Toolkit project. In some cases, the Toolkit and its compiler created new opportunities to produce important results in science. In other cases, mostly due to shortcomings in the implementation of the compilation system, the applications did not map well onto the Toolkit.

The range of applications that could be run on the Supercomputer Toolkit would have been greatly expanded had the Toolkit's compiler provided a way of leaving selected data-dependent branches and data structures residual. In this way, heavily data-dependent applications such as the Gear integrator, which require the existence of data structures in a particular format (sparse matrices), could have been written on the Toolkit itself without the need for hand-coding in Toolkit assembly language.

The symbolic execution technique for performing partial evaluation of data-independent programs was simple to implement and worked well. We have already developed some ways (see [7]) to extend this technique to handle certain types of data-dependent branches, and can envision extending it to permit certain data structures to be left residual. In light of recent developments in partial evaluation technology, the Toolkit's partial evaluator for data-independent programs may appear somewhat primitive. However, a key design goal of our system was to be able to take existing, highly complex and abstract Scheme programs from scientists, unaltered, and run them on the Supercomputer Toolkit. These programs often included global state, side effects, manipulation of complex data structures such as streams, and the storing of higher-order procedures within data structures. Such program features pose serious challenges to partial evaluation technology. It is remarkable that a partial evaluation system such as ours, capable of handling only data-independent programs, could have so large an impact on science.

As hardware technology evolves, the use of partial evaluation to expose parallelism will play an increasingly important role. As processor clock speeds increase, pipeline lengths will grow longer and will require significant amounts of parallelism to keep them full. More importantly, as it becomes possible to build multiple processors on a single chip, the vast amount of parallelism exposed by partial evaluation will play a key role in computation, affecting programming language and library design as well as the compilation process itself.

13 Acknowledgements

The Supercomputer Toolkit project was the result of a joint effort between M.I.T. and Hewlett-Packard. The following people were all involved in the effort to create and utilize this system: Hal Abelson, Andrew A. Berlin, Joel Birnbaum, Jacob Katzenelson, William H. McAllister, Guillermo J. Rozas, Gerald Jay Sussman, Jack Wisdom, Dan Zuras, Rajeev J. Surati, Karl Hassur, Dick Vlach, Albert Chun, David Fotland, Marlin Jones, John Shelton, Sam Cox, Robert Grimes, Bruce Weyler, Elizabeth Bradley, Shyam Parekh, Sarah Ferguson, Ognen Nastov, Minghsun Liu, Peter Szolovits, Carl Heinzl, Darlene Harrell, and Rosemary Kingsley.

References

[1] G. J. Sussman and J. Wisdom, "Chaotic Evolution of the Solar System," Science, Vol. 257, pp. 256-262, July 1992.

[2] H. Abelson, A. Berlin, J. Katzenelson, W. McAllister, G. Rozas, and G. Sussman, "The Supercomputer Toolkit and its Applications," MIT Artificial Intelligence Laboratory Memo No. 1249, Cambridge, MA.

[3] R. Surati and A. Berlin, "Exploiting the Parallelism Exposed by Partial Evaluation," in Proc. of the IFIP WG10.3 International Conference on Parallel Architectures and Compilation Techniques, Elsevier Science, 1994. Also available as MIT Artificial Intelligence Laboratory Memo No. 1414, April 1993.

[4] A. Berlin and D. Weise, "Compiling Scientific Code Using Partial Evaluation," IEEE Computer, December 1990.

[5] R. Surati, "A Parallelizing Compiler Based on Partial Evaluation," S.B. Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1992. Also available as TR-1377, MIT Artificial Intelligence Laboratory, July 1993.

[6] A. Berlin, "Partial Evaluation Applied to Numerical Computation," in Proc. 1990 ACM Conference on Lisp and Functional Programming, Nice, France, June 1990.
[7] A. Berlin, "A Compilation Strategy for Numerical Programs Based on Partial Evaluation," MIT Artificial Intelligence Laboratory Technical Report TR-1144, Cambridge, MA, July 1989.

[8] E. Ruf and D. Weise, "Opportunities for Online Partial Evaluation," Technical Report CSL-TR-92-516, Computer Systems Laboratory, Stanford University, Stanford, CA, 1992.

[9] E. Ruf and D. Weise, "Avoiding Redundant Specialization During Partial Evaluation," in Proceedings of the 1991 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation, New Haven, CT, June 1991.

[10] L. Nagel, SPICE2: A Computer Program to Simulate Semiconductor Circuits, Electronics Research Laboratory Report No. ERL-M520, University of California, Berkeley, May 1975.

[11] J. Miller, "MultiScheme: A Parallel Processing System Based on MIT Scheme," MIT Laboratory for Computer Science Technical Report No. TR-402, September 1987.

[12] M. Katz and D. Weise, "Towards a New Perspective on Partial Evaluation," in Proceedings of the 1992 ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Directed Program Manipulation, San Francisco, June 1992.

[13] J. Ellis, Bulldog: A Compiler for VLIW Architectures, MIT Press, Cambridge, MA, 1986.

[14] J. A. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction," IEEE Transactions on Computers, No. 7, pp. 478-490, 1981.

[15] G. J. Sussman and J. Wisdom, "Numerical Evidence that the Motion of Pluto is Chaotic," Science (to appear). Also available as MIT Artificial Intelligence Laboratory Memo No. 1039, April 1988.

[16] J. Applegate, M. Douglas, Y. Gursel, P. Hunter, C. Seitz, and G. J. Sussman, "A Digital Orrery," IEEE Transactions on Computers, September 1985.

[17] P. Szolovits, "Compilation for Fast Calculation Over Pedigrees," Cytogenet. Cell Genet., Vol. 59, pp. 136-138, 1992.

[18] P. Szolovits and S. Pauker, "Pedigree Analysis for Genetic Counseling," in Medinfo 92.

[19] S. Parekh, "Parameter Space Scans for Chaotic Circuits," S.B. Thesis, MIT, 1992.

[20] E. Bradley, "Taming Chaotic Circuits," Ph.D. Thesis, MIT, 1992.

[21] T. Quinn, S. Tremaine, and M. Duncan, "A Three Million Year Integration of the Earth's Orbit," Astronomical Journal, Vol. 101, No. 6, June 1991, pp. 2287-2305.

[22] C. Heinzl, "Functional Diagnostics for the Supercomputer Toolkit MPCU Module," S.B. Thesis, MIT, 1990.