AMLAPI: Active Messages Over Low-level Application Program Interface
CS262B Semester Project, Spring 2001
Simon Yau, smyau@cs.berkeley.edu

Introduction:
Modern large-scale parallel high-performance machines use many different libraries for communication. Examples include Active Messages (AM) from Berkeley, the Virtual Interface Architecture (VIA) from Compaq and Intel, the Low-level Application Program Interface (LAPI) from IBM, GM for Myrinet, and others. For application writers, the need to port their communication layer to different platforms can be a hassle. Therefore, communication-layer "glue-ware" has been developed that emulates one communication layer using another. Using the emulated communication layer, parallel applications can run on more machines. Earlier work includes AM over VIA [1], AM over UDP [2], and Myricom's VIA-over-GM project. This project emulates AM using IBM's LAPI on the SP3 at the San Diego Supercomputer Center.

Active Messages and the Low-level Application Program Interface
Active Messages (AM) is a communication protocol developed at Berkeley [3][4]. It is a "RISC-style" communication layer that aims to provide the minimal functionality a parallel application needs. The AM model is based on lightweight Remote Procedure Calls (RPC): processors communicate by sending network messages that cause a remote processor to execute a designated handler. [3] argues that this minimal set of communication functions is enough to support the variety of communication needs of parallel applications. AM-2 [5] extends the model with endpoints, which are virtual network interfaces that give each application-level thread the illusion of having its own network interface. Each message is sent from one endpoint to another, and the handler executes in the context of an endpoint. This messaging model has proven useful: many parallel systems, such as MPI, Split-C, and Titanium, use AM as their communication layer or offer it as an option.

The Low-level Application Program Interface (LAPI) is an AM-like messaging layer developed at IBM [6]. Its design philosophy is very similar to that of AM: LAPI aims to provide an extensible, flexible, lightweight communication layer for parallel programs. As a result, AM and LAPI have very similar semantics: in both, a node communicates by sending a message that causes a handler to be executed on another node. However, unlike AM, LAPI does not virtualize the network interface for threads, and LAPI handlers execute in the context of a thread dedicated to running handlers, whereas AM requires the application thread to periodically poll for AM messages. These differences proved crucial to the implementation of an AM emulator on top of LAPI.
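To make the model concrete, the following toy sketch shows AM-style communication from the application's point of view: a handler is registered under an index, a request names that index plus an argument, and the handler runs on the receiving side only when the application polls. The names (am_set_handler, am_request, am_poll) are simplified stand-ins rather than the actual AM-2 calls, and delivery is emulated with a local queue so the example is self-contained.

/* Toy, single-process illustration of the AM "lightweight RPC" model:
 * a request names a handler index plus an argument; the handler runs on
 * the receiving side only when the application polls.  Names are
 * simplified stand-ins for the AM-2 API; delivery uses a local queue. */
#include <stdio.h>

#define MAX_HANDLERS 16
#define QUEUE_LEN    64

typedef void (*am_handler_t)(int src_node, int arg);

static am_handler_t handler_table[MAX_HANDLERS];        /* index -> function */
static struct { int index, src, arg; } queue[QUEUE_LEN];
static int q_head, q_tail;

static void am_set_handler(int index, am_handler_t h) { handler_table[index] = h; }

/* "Send" a request: in this toy it simply lands in the local delivery queue. */
static void am_request(int dest_node, int index, int arg)
{
    (void)dest_node;
    queue[q_tail].index = index;
    queue[q_tail].src   = 0;
    queue[q_tail].arg   = arg;
    q_tail = (q_tail + 1) % QUEUE_LEN;
}

/* Handlers run only here, in the polling (application) thread. */
static void am_poll(void)
{
    while (q_head != q_tail) {
        handler_table[queue[q_head].index](queue[q_head].src, queue[q_head].arg);
        q_head = (q_head + 1) % QUEUE_LEN;
    }
}

static void hello_handler(int src, int arg)
{
    printf("handler ran for request from node %d with arg %d\n", src, arg);
}

int main(void)
{
    am_set_handler(1, hello_handler);
    am_request(/*dest_node=*/0, /*index=*/1, /*arg=*/42);
    am_poll();                       /* nothing runs until the application polls */
    return 0;
}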
Motivation:
This project is motivated by the desire to run Titanium [7] on the SP3 [8]. Titanium is a high-performance parallel programming language with Java semantics based on the SPMD model. The current Titanium compiler uses AM as its communication layer. To run Titanium programs on platforms that use other communication layers, either a new communication backend must be written [9][10], or communication glue-ware must be written [2] that maps AM onto a communication layer available on that machine.

The Titanium group has been looking to run Titanium programs on IBM's SP3 Blue Horizon, which uses LAPI as its communication layer. While a previous attempt was made to write a communication library specifically for Titanium in LAPI [9], AM and LAPI have such similar specifications that this project instead opts to emulate AM functionality using LAPI.

Implementation
There are two main differences between AM and LAPI:
1) AM virtualizes the network interface for threads: each thread communicates with other threads through an interface called an endpoint, which creates the illusion that each thread has its own network interface. Communication takes place between endpoints rather than between nodes.
2) LAPI handlers execute outside the context of the application thread. A LAPI handler is executed by a separate LAPI thread that is spawned when LAPI is initialized. In AM, each endpoint belongs to a bundle, and messages sent to the endpoints in a bundle are executed only when a thread polls that bundle for incoming active messages.

To bridge the gap, AMLAPI must maintain AM's semantics on top of LAPI. To implement endpoints in software, each node maintains a vector of endpoints. The vector is protected by a latch, so only one thread on a node can access it at a time. When the application creates an endpoint, it is appended to the end of the vector. When an endpoint is removed, the resulting "hole" is never filled, so each position in the vector is occupied by at most one endpoint during a run of the program. An endpoint can therefore be globally identified by a <node, vector position> tuple. When the application calls an AM send, AMLAPI piggybacks the endpoint identifier on the AM message and sends it to the destination LAPI node. The LAPI handler on the destination node unpacks this information and demultiplexes the message to the appropriate endpoint.

To ensure that AM handlers execute in the context of application threads, a task queue is associated with each endpoint bundle. When the LAPI handler determines the destination endpoint of an AM message, it appends the handler to the end of the task queue of the bundle to which that endpoint belongs. When the application polls the bundle for AM messages, it extracts a handler from the task queue and executes it.
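The sketch below models the node-local bookkeeping just described: an append-only endpoint vector guarded by a latch, a per-bundle task queue that the LAPI handler appends to, and a poll routine that runs the queued handlers in the application thread. All type and function names are invented for illustration, and message arrival is simulated by calling the delivery routine directly; the real AMLAPI code and the LAPI handler signatures differ.

/* Sketch of AMLAPI's node-local bookkeeping: an append-only endpoint
 * vector guarded by a latch, and a per-bundle task queue filled by the
 * LAPI handler and drained by the application's poll.  All names are
 * hypothetical; the real AMLAPI and LAPI interfaces differ. */
#include <pthread.h>
#include <stdio.h>

#define MAX_ENDPOINTS 32
#define MAX_TASKS     128
#define MAX_HANDLERS  16

typedef void (*am_handler_t)(int arg);

typedef struct {                  /* one pending AM handler invocation      */
    am_handler_t fn;
    int arg;
} task_t;

typedef struct {                  /* an endpoint bundle with its task queue */
    pthread_mutex_t lock;
    task_t tasks[MAX_TASKS];
    int head, tail;
} bundle_t;

typedef struct {                  /* a software endpoint                    */
    int in_use;
    bundle_t *bundle;
    am_handler_t handlers[MAX_HANDLERS];
} endpoint_t;

/* The endpoint vector; holes left by removed endpoints are never reused,
 * so <node, vector position> identifies an endpoint for the whole run.    */
static endpoint_t ep_vector[MAX_ENDPOINTS];
static int ep_count;
static pthread_mutex_t ep_latch = PTHREAD_MUTEX_INITIALIZER;

static int endpoint_create(bundle_t *b)
{
    pthread_mutex_lock(&ep_latch);
    int idx = ep_count++;                  /* append; never fill holes      */
    ep_vector[idx].in_use = 1;
    ep_vector[idx].bundle = b;
    pthread_mutex_unlock(&ep_latch);
    return idx;                            /* node-local half of the ID     */
}

/* What the LAPI handler does on the destination node: demultiplex by
 * endpoint index and queue the AM handler for later execution.            */
static void lapi_deliver(int ep_index, int handler_index, int arg)
{
    endpoint_t *ep = &ep_vector[ep_index];
    bundle_t *b = ep->bundle;
    pthread_mutex_lock(&b->lock);
    b->tasks[b->tail].fn  = ep->handlers[handler_index];
    b->tasks[b->tail].arg = arg;
    b->tail = (b->tail + 1) % MAX_TASKS;
    pthread_mutex_unlock(&b->lock);
}

/* AM poll equivalent: run queued handlers in the application thread.      */
static void bundle_poll(bundle_t *b)
{
    pthread_mutex_lock(&b->lock);
    while (b->head != b->tail) {
        task_t t = b->tasks[b->head];
        b->head = (b->head + 1) % MAX_TASKS;
        pthread_mutex_unlock(&b->lock);    /* do not hold the lock in handlers */
        t.fn(t.arg);
        pthread_mutex_lock(&b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

static void print_handler(int arg) { printf("AM handler ran, arg=%d\n", arg); }

int main(void)
{
    bundle_t b = { .head = 0, .tail = 0 };
    pthread_mutex_init(&b.lock, NULL);
    int ep = endpoint_create(&b);
    ep_vector[ep].handlers[1] = print_handler;
    lapi_deliver(ep, 1, 7);                /* as if a LAPI message arrived  */
    bundle_poll(&b);
    return 0;
}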
Evaluation:
(a) Evaluation platform:
The SP3 Blue Horizon is a cluster of symmetric multiprocessors (SMPs). The nodes are connected by a Colony network switch with an advertised latency of 17 microseconds and a bandwidth of 350 MB/s. Each SMP node contains eight Power3 processors and 4 GB of memory. Each processor is a superscalar, pipelined, 64-bit RISC processor issuing up to 8 instructions per clock at 375 MHz, with a 64 KB L1 cache and an 8 MB L2 cache. Processors on the same node communicate through shared memory using the pthread interface; processors on different nodes communicate using LAPI. Each node runs AIX with the parallel environment. The AMLAPI layer has been implemented on this platform, and its round-trip latency and bandwidth have been measured.

(b) Latency:
Not surprisingly, the software overhead significantly increases the round-trip latency of a one-byte message: the LAPI round-trip latency is 32 microseconds, while the AMLAPI round-trip latency is 470 microseconds. Several factors contribute to the delay:
1) Message size. The message sent by AMLAPI is larger than the corresponding LAPI message: in addition to the 1 byte of payload, AMLAPI piggybacks the endpoint information, the AM handler index, and the AM token on the LAPI message.
2) Message packing overhead. Piggybacking the AM information onto the LAPI message costs additional CPU cycles to pack the information with the message.
3) Context-switch and queuing overhead. Because AM handlers can only execute in the context of an application thread, each AM message arriving at the destination node must be queued, wait for a context switch to the application thread, and wait for that thread to poll before it can execute.
The next section attempts to quantify each of these factors for long messages.

(c) Bandwidth:
Since AMLAPI adds a fixed amount of overhead to each AM message, regardless of its size, one would expect the bandwidth of AMLAPI and LAPI to converge once the network is saturated. However, that is not the case.

Figure 1. LAPI and AMLAPI bandwidth on the SP3 (bandwidth in KB/s versus message size in bytes).

To find out where the extra time went, we profiled the time spent emulating an AM call with LAPI for various message sizes.

Figure 2. Time spent transmitting a message versus message size, broken into LAPI communication, context switch and polling, packing AM information, and copying to the endpoint VM segment.

As shown in Figure 2, in addition to the overheads that increase latency (larger messages, context switching and queuing, and message packing), there is an overhead for copying the message into the endpoint's associated memory segment. The AM specification states that an AM transfer request copies a contiguous array of data from one node into a designated virtual memory segment associated with an endpoint. The current implementation of AMLAPI, however, piggybacks the AM information and the bulk data in one contiguous array, so the LAPI handler at the destination must unpack the information and copy the data into the endpoint's virtual memory segment. (It is unknown why the time spent context switching grows with the message size; it should remain constant.) The figure below gives a percentage breakdown of the significant pieces of an AM bulk memory transfer.

Figure 3. Percentage breakdown of overhead for a 262144-byte message: LAPI 51%, copying to endpoint VM segment 22%, packing AM information 17%, context switch 10%.

As Figure 3 shows, packing the AM information and copying the message into the endpoint's VM segment account for the bulk of the emulation overhead. Since each SP3 node is an SMP, the LAPI thread and the application thread may run on different processors. After the LAPI thread unpacks the data, the data must be flushed from the LAPI thread's processor cache so that it can move to the application thread's processor, which contributes a significant amount of overhead.
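The packing and copying costs above stem from the wire layout the current implementation uses: the AM metadata and the bulk data travel as a single contiguous buffer, so the destination must parse the metadata and then copy the payload a second time into the endpoint's virtual memory segment. The sketch below illustrates that layout and the extra copy; the field names and helper functions are hypothetical, not the actual AMLAPI structures.

/* Illustration of the contiguous wire format and the extra copy it forces
 * on the destination: AM metadata is packed in front of the bulk data, so
 * the payload must be copied again into the endpoint's VM segment after
 * unpacking.  Field and function names are hypothetical. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint32_t dest_node;        /* global endpoint id: <node, vector position> */
    uint32_t dest_ep_index;
    uint32_t handler_index;    /* which AM handler to queue on arrival        */
    uint32_t payload_bytes;
    uint64_t dest_offset;      /* offset into the endpoint's VM segment       */
} am_meta_t;

/* Source side: pack metadata + payload into one buffer for the LAPI send.   */
static void *pack_am_message(const am_meta_t *meta, const void *payload,
                             size_t *out_len)
{
    *out_len = sizeof(*meta) + meta->payload_bytes;
    char *buf = malloc(*out_len);
    memcpy(buf, meta, sizeof(*meta));                       /* packing cost   */
    memcpy(buf + sizeof(*meta), payload, meta->payload_bytes);
    return buf;
}

/* Destination side (LAPI handler): unpack and copy the payload into the
 * endpoint's VM segment -- the copy measured above for the 256 KB case.     */
static void unpack_am_message(const void *wire, char *ep_vm_segment)
{
    am_meta_t meta;
    memcpy(&meta, wire, sizeof(meta));
    memcpy(ep_vm_segment + meta.dest_offset,                /* extra copy     */
           (const char *)wire + sizeof(meta), meta.payload_bytes);
    /* ...then queue handler meta.handler_index on the endpoint's bundle.     */
}

int main(void)
{
    char payload[256] = "bulk data";
    static char vm_segment[1024];                 /* stand-in VM segment      */
    am_meta_t m = { 0, 0, 1, sizeof(payload), 0 };
    size_t len;
    void *wire = pack_am_message(&m, payload, &len);
    unpack_am_message(wire, vm_segment);
    free(wire);
    return 0;
}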
Future work:
a) Possible optimizations:
To decrease the AMLAPI overhead, we propose the following implementation strategies:
1) Run AM handlers in the LAPI thread context. This violates the AM specification, but if we block the progress of a polling AM thread while the LAPI thread executes the handler, the effect is the same as executing the AM handler in the application thread context (the LAPI thread and the user thread share the same address space). This eliminates the context-switching and queuing overhead, but is more difficult to code.
2) More efficient piggybacking. LAPI has a two-phase handling protocol: a LAPI message is divided into a header and a body. The header is delivered first and passed to a header handler, which allocates the memory needed to hold the body; the body is then copied into the allocated memory and the completion handler is called. The current implementation piggybacks all AM information in the body. For small messages, however, the whole message could be packaged into the header, and the header handler could schedule a task on the queue directly. This removes the overhead of switching between LAPI handlers.
3) Eliminate the extra copy. Since all AM information is currently piggybacked in the body, the bulk data must be copied out of the LAPI body into the endpoint's virtual memory segment. If the AM information is packaged in the header instead, the header handler can tell LAPI the correct location in the virtual memory segment to copy the bulk data into, and LAPI will deposit the data there directly, eliminating the extra copy.
4) Return from sends earlier. AM semantics state that an AM send call cannot return until the network has accepted the message, but LAPI calls are asynchronous. To guarantee that the network has accepted the message, the current implementation waits until LAPI notifies it that the header handler has executed on the remote node before returning. Instead of waiting, we could return immediately after the LAPI call and use a buffer pool to hold the LAPI messages (so the user thread cannot deallocate them prematurely).
5) Defer unpacking. To eliminate the cache flush needed to move unpacked AM data from the LAPI thread's processor to the application thread's processor, the unpacking of AM data can be postponed to the application thread.

b) Higher-level language support:
Using AMLAPI, we have run several Titanium programs on the SP3. However, the performance of these programs is still constrained by the performance of the LAPI layer. Existing Titanium programs can indicate the typical communication workload of a high-level scientific application. Table 1 shows the message-size breakdown of two Titanium programs, an adaptive mesh refinement code (AMR) and a gas dynamics simulation (GAS), on four processors.

Application   Small (0-32B)   Medium (32-544B)   Large (>544B)   Average size
GAS           131400          122640             0               31.982 bytes
AMR           31434           2574               0               93.760 bytes
Table 1. Message-size breakdown for two Titanium programs.

As the table shows, nearly all messages are under 544 bytes, so the communication layer should be optimized for short messages. In particular, since most messages are small, AMLAPI should package all AM information in the header to avoid the context-switch overhead.
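A sketch of that header-packing strategy appears below. It assumes, consistent with the two-phase LAPI protocol described earlier, that the header handler sees the packed header and tells LAPI where any remaining body should be deposited; the 544-byte threshold is taken from Table 1, and all names and structures are illustrative assumptions rather than the LAPI API.

/* Sketch of the proposed small-message optimization: pack the entire AM
 * message into the LAPI header when it fits, so the header handler can
 * queue the AM task immediately; for larger transfers, put only metadata
 * in the header and have the header handler point LAPI at the endpoint's
 * VM segment so the body lands there without an extra copy.  The
 * threshold and all names are illustrative assumptions, not the LAPI API. */
#include <stddef.h>
#include <string.h>

#define SMALL_MSG_MAX 544          /* Table 1: nearly all messages fit here  */

typedef struct {
    unsigned dest_ep_index;
    unsigned handler_index;
    size_t   payload_bytes;
    size_t   dest_offset;          /* where in the endpoint VM segment       */
    char     inline_payload[SMALL_MSG_MAX];   /* used only when it fits      */
} am_header_t;

/* Called on the destination when the header arrives (first phase).
 * Returns the address where LAPI should place the body, or NULL if the
 * whole message was carried in the header.                                  */
static void *header_handler(am_header_t *hdr, char *ep_vm_segment)
{
    if (hdr->payload_bytes <= SMALL_MSG_MAX) {
        /* Small message: payload is already here; queue the AM task now and
         * skip the second phase entirely (no body, no completion handler).  */
        memcpy(ep_vm_segment + hdr->dest_offset,
               hdr->inline_payload, hdr->payload_bytes);
        /* enqueue_task(hdr->dest_ep_index, hdr->handler_index);  (hypothetical) */
        return NULL;
    }
    /* Large message: let LAPI deposit the body directly into the endpoint's
     * VM segment, eliminating the unpack-and-copy step measured earlier.    */
    return ep_vm_segment + hdr->dest_offset;
}

int main(void)
{
    static char vm[1 << 20];                      /* stand-in VM segment     */
    am_header_t small = { 0, 1, 16, 0, "hello" };
    am_header_t big   = { 0, 2, 100000, 4096, "" };
    void *dst1 = header_handler(&small, vm);      /* NULL: handled inline    */
    void *dst2 = header_handler(&big, vm);        /* points into vm          */
    return (dst1 == NULL && dst2 == vm + 4096) ? 0 : 1;
}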
Conclusion:
This project shows that communication "glue-ware" can be a viable way to increase the portability of parallel programs, but it must be done with great care. Although AM and LAPI have similar interfaces, the fine print (such as the endpoint API and the requirement that AM handlers execute only in application-thread context) can pose performance problems. Beyond portability, there are other advantages to using communication glue-ware for parallel programs. For example, Titanium's communication library is written using AM; any change made to that library would otherwise have to be redone on every platform that does not use AM as its communication layer. By emulating AM on the SP3, such changes are reflected on the SP3 for free. Communication glue-ware represents a maintainability/portability versus performance trade-off: if the communication layer can be emulated with relatively little performance loss, glue-ware is the right way to port a parallel application.

Acknowledgements:
The author would like to thank Dan Bonachea for the data on Titanium message sizes and the micro-benchmark programs, Tyson Condie for his help with the implementation of AMLAPI, and Chang-Sun Lin, Jr. for his help in setting up the SP3 environment.

References:
[1] Andrew Begel, Philip Buonadonna, David Culler, David Gay. An Analysis of VI Architecture Primitives in Support of Parallel and Distributed Communication.
[2] Dan Bonachea and Daniel Hettena. AMUDP: Active Messages Over UDP. CS294-8 Semester Project, UC Berkeley, Fall 2000.
[3] Thorsten von Eicken, David Culler, Seth Copen Goldstein, Klaus Erik Schauser. Active Messages: a Mechanism for Integrated Communication and Computation. Proceedings of the 19th International Symposium on Computer Architecture, ACM Press, May 1992.
[4] Thorsten von Eicken. Active Messages: An Efficient Communication Architecture for Multiprocessors. Ph.D. Thesis, University of California, Berkeley, 1993.
[5] Alan Mainwaring, David Culler. Active Message Applications Programming Interface and Communication Subsystem Organization. Draft Technical Report.
[6] Gautam Shah, Jarek Nieplocha, Jamshed Mirza, Chulho Kim, Robert Harrison, Rama K. Govindaraju, Kevin Gildea, Paul DiNicola, Carl Bender. Performance and Experience with LAPI - a New High-Performance Communication Library for the IBM RS/6000 SP. IPPS '98.
[7] Yelick, Semenzato, Pike, Miyamoto, Liblit, Krishnamurthy, Hilfinger, Graham, Gay, Colella, Aiken. Titanium: a High-Performance Java Dialect. Workshop on Java for High-Performance Network Computing, ACM, 1998.
[8] NPACI Blue Horizon User Guide: http://www.npaci.edu/BlueHorizon
[9] Chang-Sun Lin, Jr. The Performance Limitations of SPMD Programs on Clusters of Multiprocessors. Masters Project Report, UC Berkeley, 2000.
[10] Carleton Miyamoto and Chang-Sun Lin, Jr. Evaluating Titanium SPMD Programs on the Tera MTA. Supercomputing '99.
[11] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Schauser, Eunice Santos, Ramesh Subramonian, Thorsten von Eicken. LogP: Towards a Realistic Model of Parallel Computation. ACM SIGPLAN Notices, 28(7):1-12, July 1993.