PARALLEL ALGORITHM DEVELOPMENT FOR A SPECIFIC MACHINE (INMOS T800 TRANSPUTER)
BY WILLIAM STAUB

This paper describes, from start to finish, the development of a parallel program for a specific transputer (Inmos T800) network. Before writing a parallel program, it is essential to understand how the hardware manages parallel tasks and how information is exchanged between the transputers.

A transputer (Figure 1) is a circuit containing a processor, some memory to store programs and data, and several ports for exchanging information with other transputers or with the outside world. The transputer was born from the idea of designing these circuits so that they could be connected together as simply as transistors are connected in a computer. One of the most important factors was the introduction of a high-level language, occam [MAY83], whose features were directly supported by the transputer's hardware and which made the transputer a building block for parallel computers. A Logical Systems C compiler was later developed to reach a wider audience.

Another prominent factor in the use of this circuitry was the ease with which transputers could be connected to each other, with as little as a few electrical wires. The four bidirectional input/output (I/O) ports of the transputer are designed to interface directly with the ports of other transputers. This feature allows several transputers to fit in a small footprint with very little extra logic circuitry, making it possible to fit four transputers with some memory on a PC daughter board (ISA bus).

Figure 1: Transputer block diagram and examples of interconnection networks.

Figure 1 shows the basic block diagram of a transputer, along with some interconnection schemes: a linear chain (a), offering the simplest of connections, a ring (b), and a mesh (c).

The processor does not have dedicated registers, but a stack of registers, which allows for an implicit selection of the registers.
The net result is a smaller instruction format, which leads to denser program code. The transputer adopts the RISC philosophy and supports a small set of instructions executed in a few cycles each.

Multitasking is supported in microcode within the processor. The actions necessary for the transputer to swap from one task to another are executed at the hardware level, freeing the system programmer of this task and resulting in fast swap operations. This improves the performance of the processor by allowing it to start running another program if the one that it is currently executing cannot continue for a while.

Multitasking is the first form of parallelism supported by the transputer, and is accomplished by having the processor maintain a list of tasks that must be executed. Each task executes for a small amount of time, called a quantum, before it is stopped and swapped for another task (if one awaits). For the transputer, a quantum is typically on the order of two milliseconds (2 × 10^-3 seconds). Tasks are executed in a round-robin fashion. When a task eventually terminates, it is removed from the list.

The transputer maintains the active tasks chained in a linked list, and two of its internal registers point to the front and rear of the list. The actual list is stored in memory, and the registers contain the memory addresses of the cells defining the tasks. At any time, new tasks may be created and added to this list.

A transputer task may be in one of two states:

Active: This state refers to a task that is being executed, or that is in the list of tasks waiting to be executed.

Inactive: This state refers to a task that is not in the list of active tasks because one of three conditions prevents it from continuing execution: the task is waiting for an input from one of the I/O ports, the task is waiting to output to one of the I/O ports, or the task has been asked to stay idle until a specified time in the future.

As far as RAM is concerned, the transputer can access a linear address space of 4 Gbytes. Of these 4 Gbytes, 4 Kbytes are built into the T805 transputer circuit and correspond to the lower part of the memory address space. Since the registers are 32 bits in length, the transputer accesses 4 bytes at a time when accessing memory.

The combination of the architecture of the serial I/O ports and the way the transputer manages them makes the transputer a unique circuit, especially well tailored for multiprocessing. I/O ports are memory mapped. This means that programming a port and passing it the address and length of a message is done by writing these numbers to memory locations that are mapped to the registers of the port.

Each of the four bidirectional serial links comprises two wires. When two transputers are connected together, they exchange information through one of their links. The processes (or tasks) running on each of the transputers are then free to communicate by exchanging data or messages over the link. The term link refers to the physical connection between two transputers, and the term channel describes the software connection between the two processes.

The transfer of data over a serial link is synchronized and unbuffered. Synchronization refers to the fact that if Process P1 running on Transputer T1 needs to exchange (send or receive) information with Process P2 on Transputer T2, then it must wait until P2 is ready to participate in the exchange. The processes are said to be synchronized, since the sender cannot get rid of its data until the receiver is ready to accept them, and conversely. Unbuffered communication means that no temporary storage is needed to store incoming or outgoing messages. The transfer takes place directly between the memory of the sender and the memory of the receiver.

The I/O port is a Direct Memory Access Controller (DMAC), which is given the memory address of the first byte of the incoming or outgoing message and the number of bytes contained in that message. Once initialized, the link is autonomous and will wait for its partner to signal its readiness to participate in the communication. During the communication, the processes that initiated the transfer are blocked. Each process is placed at the rear of the list of inactive tasks. Because the processor and the links operate independently, the processor is free to run another process when one is blocked by communication. As a result, the channels require no message queues or message buffers.

Data are transferred one byte at a time, each byte generating an acknowledge from the receiving transputer. The typical transfer speed is 10 Mbits/sec, although speeds of 5 and 20 Mbits/sec are also supported. When two transputers exchange data in both directions, the bandwidth on the link can reach a maximum of 2.35 Mbytes/sec.

A transputer system operates as an attached processor for the PC. As such, it does not replace the 80x86 processor in the PC, but works in conjunction with it. The PC is referred to as the host machine. None of the software running on the PC is affected by the presence of the transputer network. Only software written by the user with the Logical Systems C compiler is able to make use of the multiprocessing capability. Figure 2 illustrates the general architecture of a PC with an attached transputer network.

Figure 2: The transputer as an attached processor.

One of the transputers in the network is the designated driver (or root) and connects to the PC through one of the 8-bit ISA I/O slots. The other transputers in the network only require a power source (+5V) and thus need not be connected to the system bus.
The root transputer handles the transfer of information from the PC to the transputer network (program and input data), and from the transputer network to the host (result data). Unlike the other transputers in the network, which use their links to connect to other transputers, the root transputer dedicates one of its links (often Link 0) to communicating with the PC. This link is connected to an 8-bit I/O controller serving as an interface between the PC and its resources on one side, and the transputer network on the other.

This I/O controller is quite simple and consists of two 8-bit buffers, one for data input, the other for data output. Each buffer is associated with a 1-bit status indicating whether the buffer it represents is full or empty. The 80x86 host thus sees the interface as four 8-bit registers. The root transputer sees the port as another transputer with which it exchanges data serially. The I/O controller manages the parallel-to-serial conversion (and conversely) between the PC and the transputer.

Consequently, the transputer cannot directly control the resources of the host system: user memory, file system, keyboard, or video display. Every access to the PC resources must go through the 8-bit I/O port, which is maintained, in software, by the 80x86. If the transputer needs to display a string of characters on the video screen, for example, it must send the 80x86 the string, along with a command telling the 80x86 what to do with it. Although the Logical Systems C compiler allows us to program the root transputer to execute a printf() statement, this statement cannot be carried out unless the 80x86 is running a program that can carry out the operation. For this reason, the execution of parallel programs on the transputer network will always involve the 80x86 host in some respect.

Sometimes the host will become a simple subordinate of the root transputer, attending to its every need and making all its resources available to the transputer: keyboard, disk, video display, speaker, etc. Logical Systems provides a special program, called CIO, for this mode of computation. At other times, though, it will be more advantageous to divide the computation between the transputer network and the 80x86, and leave all I/O-related operations to the latter. This results in a better utilization of the system.

Remember that running a program on the transputer requires running two programs: one on the host, and the other on the transputer network. Batch files can automate these steps and make writing and running programs a smooth task.

The software is an ANSI C compiler augmented with libraries written by Logical Systems to support parallel constructs. The package contains the standard preprocessor, compiler, assembler and linker programs that most C programmers will be familiar with, but also new utilities, such as a loader and a host driver, required by the parallel environment. The process (Figure 3) takes the original file first.c and passes it through a preprocessor (1), a compiler (2), an assembler (3), a linker (4), and a loader (5). The loader places the code in the transputer network and starts its execution.

Figure 3: The compilation process.

Loading the transputer program on the transputer network and launching it requires ld-net to perform several tasks. It must load the program into the root transputer. Once the parallel program is loaded, the loader must give the root transputer the signal to start execution. It must also start the execution of a program on the PC host computer so that input and output requests by the root transputer can be carried out.

Network information file

The loader gets the information about how it should perform these three steps from a file, the network information file, which typically carries a .nif extension. The contents of the network information file consist of two parts. The first section contains four lines controlling the loading process. The second section contains information describing the graph representing the transputer network. This information is used by ld-net to determine which program or programs must be loaded onto the transputers. An example network information file for our program is shown in the listing below.

    buffer_size     200;
    host_server     cio.exe;
    level_timeout   1000;
    decode_timeout  1000;

    1, first, R0, 0, , ,;

Listing 3-5: contents of nif file first.nif for program first.c.

Commands

The first four lines of the file form the command part of the network information file.

Buffer_size

Buffer_size defines the number of bytes of transputer internal memory used when downloading programs. When a program running on several transputers is loaded, it is first passed to the root transputer, which passes it on to neighboring transputers, which, in turn, pass the program down to their neighbors, and so on. A buffer is used to hold the data transferred. The default value is 255 bytes, and results in the fastest loading time possible. Smaller buffer sizes are possible, but can substantially slow down the loading of programs in large transputer networks.

Host_server

Host_server identifies the program that is to run on the PC host. We will first use the cio.exe program provided by Logical Systems.

Level_timeout

To understand how level_timeout and decode_timeout (see below) operate, one has to understand the way a transputer network is loaded. The nodes are organized as a tree, with the root transputer coinciding with the root of the tree. Each node is loaded by its parent, and is responsible for passing on modules to be loaded on its child(ren).
The level_timeout quantity represents the number of milliseconds allowed for a node to send a message (program to be loaded) down to its child nodes and to receive an acknowledge of successful execution. The range is 25 to 1000 milliseconds. If the root transputer or any other transputer in the network does not receive a "transputer loaded" acknowledge from all its children during the allotted time, a time-out condition occurs and the loader stops.

Decode_timeout

The decode_timeout quantity is similar to level_timeout, except that it defines the maximum amount of time required by a single node for getting a message and executing it. An order to clear the memory, for example, is one that requires a non-negligible amount of time. The default value is 1000 milliseconds (one second), but the allowed range is from 25 to 20000.

Most programs running on average-size networks will load without problems using the default values. Larger programs or complex networks may require experimenting with different values.

Node Description

The second part of the network information file simply describes the physical configuration of the transputer network, including neighbor-to-neighbor links, and the allocation of programs to transputers. The format for each line is the following:

    Node#, Program, Parent, [Link0], [Link1], [Link2], [Link3];

which, in the context of our program first.c, results in:

    1, first, R0, 0, , ,;

Node#

Because first.c is written for one transputer only, our network has only one node, and therefore one line is sufficient to describe it. Its first field, 1, indicates that we are using the root transputer. All transputers are assigned a different Id, with the root transputer always assigned Id 1.

Program

The second field, first, indicates that the program to be loaded is first (the .tld extension is implicit).

Parent

The parent field R0 indicates that the parent of this transputer is Node 0 (the PC host) and that the parent will reset this transputer through the master reset signal.
This makes better sense when the network is a tree, for example, with the root transputer at the root of the tree. Then every transputer node has a parent. The parent of the root node is the PC host, which is always given Id 0. The R symbol indicates how the parent resets the current node. The letter R is used when the reset is passed from parent to child through the normal reset channel. The letter S indicates that the reset signal is generated by the parent, rather than passed from an ancestor.

Finally, the remainder of the line, 0, , ,; defines how Transputer 1 connects to the rest of the network through its links. Here, Link0 is connected to the host, with Id 0. The other links are not connected to any transputer involved in the parallel program, and are left blank.

To run the program, the command ld-net "first.tld" reads the file first.nif, loads the executables into the transputer network, loads cio.exe on the host, and starts execution.

The source file prime_v1.c, listed below, finds prime numbers on a linear-chain network topology. It shows transputer-to-transputer communication by examining how programs running on neighboring transputers exchange information. But first we need some labels for the link hardware (Figure 4).

Figure 4: The transputer chain. Notation and convention.

The prime_v1 algorithm works as follows. To find out whether an integer x is prime, the program attempts to divide x by every integer from 2 up to sqrt(x). Testing up to sqrt(x) is sufficient, because a composite x always has a divisor no larger than its square root, but it requires a square root or a multiplication operation (the listing uses the test i*i <= x). Our current goal is not to compute primes efficiently, but to explore different ways to parallelize the program with Logical Systems' parallel library.

    /* =======================================================================
    prime_v1.c

    DESCRIPTION: Computes primes on 2 nodes. Node 2 computes the primes and
    sends them to Node 1. Node 1 (Root) collects the numbers and prints them.

    TO COMPILE AND RUN:
        make -f prime_v1
        chainnif -# 2 -1 prime_v1 -nif prime_v1.nif

    ASSOCIATED NIF-FILE
        buffer_size     200;
        host_server     CIO.EXE;
        level_timeout   400;
        decode_timeout  2000;

        1, prime_v1, R0, 0, 2, ,;
        2, prime_v1, R1, 1, , ,;
    ====================================================================== */
    #include <stdio.h>
    #include <stdlib.h>
    #include <conc.h>                 /* transputer library */

    /* ============================= GLOBALS ============================== */
    #define INTSIZE  ((int) sizeof(int))
    #define INTERVAL 100
    #define SENTINEL (-1)

    int IsPrime(int x);               /* defined after main() */

    /* --------------------------------------------------------------------*/
    /*                                MAIN                                 */
    /* --------------------------------------------------------------------*/
    int main(void)
    {
        int NoPrimes = 0, x;

        /* ===============================================================*/
        /*                            NODE #1                             */
        /* ===============================================================*/
        /*--- receives information and prints it ---*/
        if (_node_number == 1)
        {
            /*--- while numbers are coming, get them and print them ---*/
            do
            {
                /*--- receive one integer from Node 2 over Link 1 ---*/
                ChanIn(LINK1IN, &x, INTSIZE);
                if (x != SENTINEL)
                {
                    printf("%8d", x);
                    NoPrimes++;
                }
            } while (x != SENTINEL);
            printf("\nReceived %d primes \n", NoPrimes);
            exit(0);
        }
        else
        /* ===============================================================*/
        /*                            NODE #2                             */
        /* ===============================================================*/
        {
            /*--- scan interval ---*/
            for (x = 1; x < INTERVAL; x++)
            {
                /*--- if prime then send number to Node 1 ---*/
                if (IsPrime(x))
                    ChanOut(LINK0OUT, &x, INTSIZE);
            }
            /*--- signal Node 1 that we are done ---*/
            x = SENTINEL;
            ChanOut(LINK0OUT, &x, INTSIZE);
        }
        return 0;
    }

    /* --------------------------------------------------------------------*/
    /*                              ISPRIME                                */
    /* Semi-efficient prime finding function which, given x, returns 1 if */
    /* it is prime, and 0 otherwise                                        */
    /* --------------------------------------------------------------------*/
    int IsPrime(int x)
    {
        int i;
        /*   0 1 2 3 4 5 6 7 8 9 */
        static int SmallPrimes[10] =
            {0,0,1,1,0,1,0,1,0,0};

        if (x < 10)     return SmallPrimes[x];
        if (x % 2 == 0) return 0;
        if (x % 3 == 0) return 0;
        if (x % 5 == 0) return 0;
        for (i = 2; i*i <= x; i++)
            if (x % i == 0) return 0;
        return 1;
    }

The output of the program is shown below.

    2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
    Received 25 primes

The first section of main() contains the code that runs on the root transputer (Transputer 1), while the second contains the code running on Transputer 2. The same program is loaded on both transputers, but a different section of the code is actually used by each.

ChanIn()

The ChanIn function stands for Channel Input. It receives information from a channel, here LINK1IN (a constant predefined in conc.h), and stores it at the address specified by the pointer &x. As described earlier, the I/O ports are memory-mapped; LINK1IN is thus a macro representing the memory address of the I/O port corresponding to the input side of the link. INTSIZE is a macro defined at the beginning of the program as the size of an integer. Hence the function of ChanIn is to receive some amount of information from a channel and to store that information at a location defined by a pointer. The size of the information received is defined by an integer representing the number of bytes contained in the message. The prototype for ChanIn is:

    void ChanIn(Channel *, void *, int);

ChanOut()

ChanOut stands for Channel Output. It is the companion of the ChanIn function executed by Node 1. Its syntax is exactly the same as that of ChanIn:

    void ChanOut(Channel *, void *, int);

It needs a channel pointer indicating which channel the information is sent over. In our case, the predefined constant LINK0OUT is used. It corresponds to Link 0 and specifies the outgoing direction. This information is then followed by a pointer to the area of memory containing the information to send.
Here again a void pointer identifies the memory address of the information to be sent, and the third argument specifies how many bytes constitute the message.

Blocking

Blocking means that no transfer can take place unless both tasks exchanging data are ready to do so. In the context of our example, this means that if Node 2 is the first one to reach the ChanOut(LINK0OUT,...) statement, then it will not send any information until Node 1 executes the matching ChanIn(LINK1IN,...) statement. Moreover, the task running on Node 2 goes to sleep as soon as it executes the ChanOut statement, and does not awake until the transfer has completed. This was implemented by Inmos to allow other tasks to run on a transputer while one is awaiting a transfer through a channel. Figure 5 shows the timing of a ChanOut operation (Node 2) occurring before its matching ChanIn (Node 1, or root).

Figure 5: Timing of a ChanOut operation occurring before its matching ChanIn.

CONCLUSION:

The transputer is a very flexible and cost-effective way of implementing a powerful multicomputer network. It is flexible in that the two-wire, high-speed bidirectional links allow easy configuration of various network topologies such as linear chains, rings, meshes, and hypercubes. Its availability as an ISA or PCI card allows an easy interface to a PC-compatible system bus. The cost effectiveness stems from the fact that the transputer is a very low-cost, self-contained computer. Its use allows multicomputer concepts to be mastered with only a limited expenditure.

Bibliography:

[COX91] Ronald Cox, Parallel Programs for the Transputer. Englewood Cliffs, NJ: Prentice Hall, 1991.

[INMO88b] Inmos, IMS T800 Transputer, Document No. 42 1082 00, March 1988.

[Thiebaut] D. Thiebaut, Parallel Programming in C for the Transputer, http://cs.smith.edu/~thiebaut/transputer/biblio/referenc.html, March 1995.