Tech. Rep. 723, Dec. '99

The Implementation of Cashmere

Robert J. Stets, DeQing Chen, Sandhya Dwarkadas, Nikolaos Hardavellas, Galen C. Hunt, Leonidas Kontothanassis, Grigorios Magklis, Srinivasan Parthasarathy, Umit Rencuzogullari, Michael L. Scott

www.cs.rochester.edu/research/cashmere
cashmere@cs.rochester.edu

Department of Computer Science
University of Rochester
Rochester, NY 14627-0226

This work was supported in part by NSF grants CDA-9401142, CCR-9702466, and CCR-9705594; and an external research grant from Compaq.

Abstract

Cashmere is a software distributed shared memory (SDSM) system designed for today's high performance cluster architectures. These clusters typically consist of symmetric multiprocessors (SMPs) connected by a low-latency system area network. Cashmere introduces several novel techniques for delegating intra-node sharing to the hardware coherence mechanism available within the SMPs, and also for leveraging advanced network features such as remote memory access. The efficacy of the Cashmere design has been borne out through head-to-head comparisons with other well-known, mature SDSMs and with Cashmere variants that do not take advantage of the various hardware features.

In this paper, we describe the implementation of the Cashmere SDSM. Our discussion is organized around the core components that comprise Cashmere. We discuss both component interactions and low-level implementation details. We hope this paper provides researchers with the background needed to modify and extend the Cashmere system.

Contents

1 Introduction
2 Background
3 Protocol Component
  3.1 Page Faults
  3.2 Protocol Acquires
  3.3 Protocol Releases
4 Services Component
  4.1 Page Directory
  4.2 Twins
  4.3 Write Notices
  4.4 Memory Allocation
5 Message Component
  5.1 Messages and Handlers
  5.2 Round-Trip Message Implementation
  5.3 Adding a New Message Type
6 Synchronization Component
  6.1 Cluster-wide Synchronization
  6.2 Node-level Synchronization
7 Miscellaneous Component
8 Software Engineering Issues
9 Debugging Techniques
A Known Bugs
B Cache Coherence Protocol Environment
  B.1 CCP Code Macros
  B.2 CCP Makefile Macros
  B.3 Cashmere-CCP Build Process

1 Introduction

Cashmere is a software distributed shared memory (SDSM) system designed for a cluster of symmetric multiprocessors (SMPs) connected by a low-latency system area network. In this paper, we describe the architecture and implementation of the Cashmere prototype. Figure 1 shows the conceptual organization of the system. The root Cashmere component is responsible for program startup and exit. The Protocol component is a policy module that implements the Cashmere shared memory coherence protocol. The Synchronization, Messaging, and Services components provide the mechanisms to support the Protocol component.
The Miscellaneous component contains general utility routines that support all components. (In this paper, we will refer interchangeably to components by their prose name and their implementation class name.)

[Figure 1: Cashmere organization. The root Cashmere class is supported by the _csm_protocol (Protocol), _csm_services (Services), _csm_message (Messaging), _csm_synch (Synchronization), and _csm_misc (Miscellaneous) components.]

The Cashmere prototype has been implemented on a cluster of Compaq AlphaServer SMPs connected by a Compaq Memory Channel network [3, 4]. The prototype is designed to leverage the hardware coherence available within SMPs and also the special network features that may lower communication overhead. In this paper, we will discuss the implementation of two Cashmere versions that are designed to isolate the impact of the special network features. The “Memory Channel” version of Cashmere leverages all of the special network features, while the “Explicit Messages” version relies only on low-latency messaging. As shall become clear, the difference between the two Cashmere versions is largely confined to well-defined operations that access the network.

In the next section we will provide a general overview of the Cashmere protocol operation. The following sections will then describe the Protocol, Services, Messaging, Synchronization, and Miscellaneous components in more detail. Additionally, Section 8 discusses a few of the software engineering measures followed during development, and Section 9 discusses several Cashmere debugging techniques. This document also contains two appendices. Appendix A discusses known bugs in the system, and Appendix B details the Cache Coherence Protocol (CCP) application build environment. This build environment provides a high-level macro language that can be mapped to a number of shared memory systems.

2 Background

Before studying the details of this paper, the reader is strongly encouraged to read two earlier Cashmere papers [8, 7].
These papers provide good overviews of the Cashmere protocol and details of its implementation. A good understanding of the Cashmere protocol and of the Memory Channel network [3, 4] is essential to understanding the implementation details. In the rest of this Section, we will briefly introduce terms and concepts that the reader must understand before proceeding.

Memory Channel

Cashmere attempts to leverage the special features of the Memory Channel network. The Memory Channel employs a memory-mapped, programmed I/O interface, and provides remote write capability, inexpensive broadcast, and totally ordered message delivery. The remote write capability allows a processor to modify memory on a remote node without requiring assistance from a processor on that node. The inexpensive broadcast mechanism combines with the remote write capability to provide very efficient message propagation. Total ordering guarantees that all nodes will observe broadcast messages in the same order: the order in which the messages reach the network.

Cashmere

The Cashmere SDSM supports a “moderately lazy” version of release consistency. This consistency model requires processors to synchronize in order to see each other’s data modifications. From an implementation point of view, it allows Cashmere to postpone most data propagation until synchronization points. Cashmere provides several synchronization operations, all of which are built from Acquire and Release primitives. The former signals the intention to access a set of shared memory locations, while the latter signals that the accesses are complete. Cashmere requires that an application program contain “enough” synchronization to eliminate all data races. This synchronization must be visible to Cashmere. To manage data sharing, Cashmere splits shared memory into page-sized coherence blocks, and uses the virtual memory subsystem to detect accesses to these blocks.
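The virtual-memory-based access detection just described can be illustrated with a small POSIX sketch. This is not Cashmere code (Cashmere's actual fault path is described in Section 3.1); the names here are invented for the example. It merely shows the underlying mechanism: a page mapped without permissions, a user-level fault handler that records the access and upgrades the protection, and a restart of the faulting instruction.

```cpp
#include <csignal>
#include <cstdlib>
#include <sys/mman.h>
#include <unistd.h>

static char* g_page = nullptr;               // one page-sized coherence block
static volatile sig_atomic_t g_faults = 0;   // faults observed on that block

// User-level fault handler: an SDSM protocol would fetch an up-to-date
// copy of the page here. We simply count the fault and upgrade the
// page's protection so the faulting instruction can be restarted.
static void fault_handler(int, siginfo_t* info, void*) {
    long ps = sysconf(_SC_PAGESIZE);
    char* addr = static_cast<char*>(info->si_addr);
    if (addr < g_page || addr >= g_page + ps)
        _exit(1);                            // not our block: a real crash
    ++g_faults;
    mprotect(g_page, ps, PROT_READ | PROT_WRITE);
}

// Map a block with no permissions, install the handler, and touch the
// block twice. Only the first access faults. Returns the fault count.
int run_demo() {
    long ps = sysconf(_SC_PAGESIZE);
    g_page = static_cast<char*>(mmap(nullptr, ps, PROT_NONE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    struct sigaction sa = {};
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);
    sigaction(SIGBUS, &sa, nullptr);         // some systems raise SIGBUS instead

    volatile char* p = g_page;
    p[0] = 42;                               // faults; handler upgrades permissions
    p[1] = 43;                               // no further fault
    return g_faults;
}
```

A real SDSM would, of course, do more in the handler than flip permissions; the point is only that the fault is vectored to user level and the access can be resumed transparently.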
Each page of shared memory has a single distinguished home node and an entry in a global (replicated) page directory. The home node maintains a master copy of the page, while the directory entry maintains information about the page’s sharing state and home node location. Each page may exist in either Invalid, Read, or Read-Write state on a particular node. An access to an Invalid page will result in a page fault that is vectored to the Cashmere library. Cashmere will obtain an up-to-date copy of the page from the home node via a page update request. In this paper and the Cashmere code itself, a page update is also referred to as a “page fetch” operation.

If the fault was due to a read access, Cashmere will upgrade the page’s sharing state for the node and its virtual memory permissions to Read, and then return control to the application. In the event of a fault due to a write access, Cashmere may move the page into Exclusive mode if the node is the only sharer of the page. Exclusive pages are ignored by the protocol until another sharer for the page emerges. If there is more than one node sharing the page, Cashmere will make a pristine copy, called a twin, of the page and place the page ID into the processor’s dirty list. These two steps will allow modifications to the page to be recovered later. After these steps, the Cashmere handler will upgrade the sharing state for the node and the VM permissions to Read-Write and return control to the application.

During a Release operation, Cashmere will traverse the processor’s dirty list and compare the working copy of each modified page to its twin. This operation will uncover the page’s modified data, which is collectively called a diff. (In the rest of this paper, we refer to the entire operation as a “diff”.) The diff is then sent to the home node for incorporation in the master copy. After the diff, Cashmere will send write notices to all sharers of the page.
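The twin comparison just described can be sketched as follows. The names (make_diff, apply_diff) and the structure are hypothetical simplifications, not the Cashmere code; as in Cashmere, the comparison is done at a 32-bit word granularity.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One (offset, value) pair per modified 32-bit word of the page.
struct DiffEntry {
    size_t   offset;   // word index within the page
    uint32_t value;    // new value at that offset
};

// Compare the working copy of a page against its twin and collect
// the modified words -- the "diff" that is shipped to the home node.
std::vector<DiffEntry> make_diff(const uint32_t* working,
                                 const uint32_t* twin,
                                 size_t words) {
    std::vector<DiffEntry> diff;
    for (size_t i = 0; i < words; ++i)
        if (working[i] != twin[i])
            diff.push_back({i, working[i]});
    return diff;
}

// At the home node, incorporate an incoming diff into the master copy.
void apply_diff(uint32_t* master, const std::vector<DiffEntry>& diff) {
    for (const DiffEntry& e : diff)
        master[e.offset] = e.value;
}
```

Only the words that actually changed cross the network, which is what makes the twin worth its memory cost on pages with sparse write sets.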
At each Acquire operation, processors will invalidate all pages named by the accumulated write notices. The twin/diff operations are not necessary on the home node, because processors on the home work directly on the master copy of the page. To reduce twin/diff overhead, some Cashmere variants (and in particular both versions described in this report) migrate the page’s home node to active writers.

In the next Section, we will provide a more detailed explanation of the Cashmere Protocol component. The following sections will then discuss the components that support the Protocol.

3 Protocol Component

The organization of the Protocol component is pictured in Figure 2. The component consists of a set of support routines and four aggregated classes. The _prot_acquire and _prot_release classes perform the appropriate protocol operations for the Acquire and Release synchronization primitives. The _prot_fault class implements Cashmere page fault handling. The _prot_handlers class provides a set of handlers for protocol-related messages.

[Figure 2: Protocol component structure. The _csm_protocol class aggregates _prot_acquire (Acquire), _prot_release (Release), _prot_fault (fault handlers), and _prot_handlers (message handlers).]

To manage its actions, the Protocol component maintains an area of metadata on each node. Some of this metadata is private to each processor, but most of the metadata is shared by processors within the node. The _csm_node_meta_t type in prot_internal.h describes a structure containing:

- the node’s twins
- node-level write notices
- node-level page directory
- node-level dirty list (also referred to as the No-Longer-Exclusive list; the Cashmere code also refers to Exclusive mode as Sole-Write-Sharer mode)
- per-page timestamps indicating the last Update and Flush (Diff) operations and the last time a Write Notice was received
- timestamp of the last Release operation on the node
- a Stale vector kept for each page homed on this node, indicating which remote nodes have stale copies of the page

The node-level dirty list contains pages that have left Exclusive mode. In addition to the node-level dirty list, each processor also maintains a private dirty list containing pages that have been modified. The two lists are maintained separately in order to allow the common case, modifications to the per-processor dirty list, to proceed without synchronization. The timestamps are based on a logical clock local to the node. These timestamps are used to determine when two protocol operations can safely be coalesced. The Stale vector is used by the home migration mechanism to ensure that the new home node is up-to-date after migration.

In the following sections, we will discuss the main protocol entry points and the corresponding implementation of the Cashmere Protocol component.

3.1 Page Faults

All page faults are vectored into the Cashmere csm_segv_handler handler. This handler first verifies that the faulting data address belongs to the Cashmere shared data range. Control is then passed to either the _prot_fault::ReadPageFault or _prot_fault::WritePageFault handler, depending on the type of the faulting access. Both of these handlers begin by acquiring the local node lock for the affected page. This node lock is managed by the Page Directory service and is described in Section 4.1. The lock acquire blocks other local processors from performing protocol operations on the page and is necessary to serialize page update and diff operations on the page. Both handlers then update the sharing set entry and transfer control to the _prot_fault::FetchPage routine.
[Figure 3: Flow chart of _prot_fault::FetchPage() and _prot_fault::FetchHomeNode(). On the home node, FetchHomeNode enters Exclusive mode if this node has write access and is the only sharer. Off the home node, the protocol fetches the page if needed (PerformFetch, making a twin if necessary); otherwise, on a write access, it either sends a migration request or simply makes a twin, depending on whether the home node is a writer.]

Figure 3 shows the control flow through FetchPage. If this node is the home for the page, then the fetch process is relatively simple. In _prot_fault::FetchHomeNode, the protocol simply checks the current sharing status and enters Exclusive mode if this node is both writing the page and the only sharer. The page will leave Exclusive mode when another node enters the sharing set and sends a page update request to the home node. (As part of the transition out of Exclusive mode, the page will be added to the node-level dirty list of all writers at the home node. This addition will ensure that the writers eventually expose their modifications to the cluster.)

If this node is not the home, then the fetch process is more complicated. The protocol must ensure that the node has an up-to-date copy of the page before returning control to the application. It first checks to see if a fetch is needed; an up-to-date copy of the page may already have been obtained by another processor within the node. The protocol simply compares the last Update and last Write Notice timestamps. If the last Write Notice timestamp dominates, then a fetch is required and control transfers to the _prot_fault::PerformFetch routine. This routine will send a page update request to the home node and copy the page update embedded in the reply into the working copy of the page.

The PerformFetch routine must take care when updating the working copy of the page. If there are concurrent local writers to the page, the routine cannot blindly copy the page update to the working copy. This blind copy could overwrite concurrent modifications. Instead, the protocol compares the new page data to the twin, which isolates the modifications made by other nodes.
These modifications can then be copied into the working copy of the page without risk of overwriting local modifications. This routine will also create a twin if the fault is due to a write access.

Returning to Figure 3, if a page fetch is not necessary, then the protocol will check to see if the fault is due to a write access. If so, the protocol will either send a migration request or simply make a twin. The migration request is sent if the home node is not currently writing the page. The home node’s sharing status is visible in the Page Directory. (In the Explicit Messages version of Cashmere, the home node sharing status is visible on remote nodes; however, the status may be stale and can only be considered a hint.)

A migration request is sent directly to the home node. A successful request results in a small window of time during which the home node is logically migrating between the nodes. During this window, page update requests cannot be serviced; the update requests must instead be buffered. (As described in Section 5, the Message component provides a utility for buffering messages.) Before sending a migration request, Cashmere will set a flag indicating that a migration request is pending by calling _csm_protocol::SetPendingMigr. The _prot_handlers::PageReqHandler handler will buffer all page update requests that arrive while a migration is pending for the requested page.

The home node grants a migration request if it is not actively writing the page. The _prot_handlers::MigrReqHandler handler is triggered by the migration request. If the request is to be granted, the handler will first set the new home PID indication on its node. (This step begins the logical migration, opening the window of vulnerability described above.) Then, the handler will formulate an acknowledgement message with a success indication and also possibly the latest copy of the page.
The latest copy of the page is necessary if the Stale vector indicates the new home node does not have an up-to-date page copy. Upon receipt of the acknowledgement message in MigrReplyHandler, the requestor will update its node home PID indication, if the migration was successful, and also copy any page update to the working copy. If the migration request was not successful, the requestor will create a twin.

After completion of FetchPage, control will return to the appropriate page fault routine (ReadPageFault or WritePageFault). The ReadPageFault routine will release the node lock, change the VM permissions to Read-only, and then return control to the application. The completion of WritePageFault requires a few more steps. First, the processor will release the associated node lock. Then, the processor will flush the message store in order to process any messages that may have been buffered during the fault handler’s migration attempt. Then the processor will add the page ID to its local dirty list. Finally, the processor can upgrade the VM permissions to Read/Write and return control to the application.

3.2 Protocol Acquires

Protocol acquires are relatively simple operations. During an acquire, Cashmere will invalidate all pages listed by the accumulated write notices. Before the invalidation can occur, however, the write notice structure must be specially processed. As will be described in Section 4.3, write notices are stored in a two-level structure consisting of a global, per-node level and a local, per-processor level. To begin an Acquire, Cashmere will first distribute write notices from the global write notice list to the local write notice lists. (This step is not necessary in the Explicit Messages version of Cashmere, where distribution occurs when the message is received.) Then, the processor will invalidate all pages listed in its local write notice list. The invalidation process is straightforward except in the case where the target page is dirty (i.e.
modified) locally. In this case, the protocol will flush the modifications to the home node before invalidating the page.

3.3 Protocol Releases

In response to a Release operation, Cashmere will expose all local modifications to the entire cluster. Beginning in _prot_release::Release, Cashmere will first distribute any pages in the node-level dirty list to the local dirty list. Then, the protocol can traverse the local dirty list and expose all modifications.

At the home node, all processors work directly on the master copy. Therefore, all modifications are already in the home node copy, so a processor must only send out write notices to expose the changes. The write notices are sent to all nodes sharing the modified page.

At a remote node, the modifications must be sent to the home for incorporation into the master copy. The modifications are sent to the home through a diff operation. If there are local concurrent writers, the protocol will also update the twin to reflect the current state of the working page. This update will eliminate the possibility of sending the same modifications twice. If there are no local concurrent writers, the protocol will release the twin, freeing it for use in some subsequent write fault on another page (see Section 4.2). The modifications can be pushed to the network either by using the Memory Channel’s programmed I/O interface or by copying them into a contiguous local buffer, which is then streamed to the network. The former is faster, while the latter (poorly) emulates a DMA-based interface. This choice is controlled by the CSM_DIFF_PIO macro in csm_internal.h.

Cashmere leverages the hardware shared memory inside the SMPs to reduce the number of diff operations. First, if a page has multiple writers within the node and the Release operation is part of a barrier, the diff will be performed only by the last writer to enter the barrier.
Second, before performing a diff, Cashmere always compares the last Release and the page’s last Flush timestamps. If the Flush timestamp dominates, another processor is either currently flushing or has already flushed the necessary modifications. In this case, Cashmere must wait for the other processor to complete the flush, and then it can send the necessary write notices. Cashmere can determine when the flush operations are complete by checking the diff bit in the page’s Directory entry (see Section 4.1). After the diff operation is complete, the processor can send out the necessary write notices and downgrade the page sharing and VM permissions to Read. This downgrade will allow future write accesses to be trapped.

Ordering between Diffs and Write Notices

Cashmere requires that the diff operation be complete before any write notices can be sent and processed. The Memory Channel version of Cashmere relies on the Memory Channel’s total ordering guarantee: Cashmere can establish ordering between the protocol events simply by sending the diff before sending the write notices. The Explicit Messages version of Cashmere does not leverage the network total ordering, and so diffs must be acknowledged before the write notices can be sent. To reduce the acknowledgement latency, Cashmere pipelines diff operations to different processors. The code in Release cycles through the dirty list and sends out the diffs. If a diff cannot be sent because the appropriate message channel is busy, the page is placed back in the dirty list to be retried later. The code keeps track of the number of outstanding diff messages and stalls the Release operation until all diff operations are complete.

4 Services Component

The Services component provides many of the basic mechanisms to support the Protocol component. Figure 4 shows the structure of the Services component. In this Section, we will discuss the core classes that compose the Services component.
The Page Directory component implements the global page directory. The Twins component provides a mechanism that manages and performs twin operations. The Write Notices component enables processors to send write notices throughout the system, and the Memory Allocation component provides an implementation of malloc. In the following sections, we discuss the Page Directory, Twins, Write Notices, and Memory Allocation components.

4.1 Page Directory

The Page Directory maintains sharing information for each page of shared memory. The Directory has one entry per page, and is logically split into two levels. The global level maintains sharing information on a per-node basis, while the node level is private to each node and maintains sharing information for the local processors. The global level is implemented to allow concurrent write accesses without requiring locking, while the node level serializes accesses via fast hardware shared memory locks.

[Figure 4: Services component structure. The _csm_services class aggregates _csm_twin (Twins), _csm_write_notices_t (Write notices), _csm_pagedir_t (Page directory), and _csm_malloc (Memory allocation).]

[Figure 5: Logical structure of the global directory entry. The entry consists of a Shared word (bits 0-4: Home PID; bit 5: Exclusive status; bits 6-10: Last Holder; bit 11: Home Node Bound Flag; bit 30: Entry Locked Flag; bit 31: Untouched Flag) followed by one word per node (bits 0-2: Sharing status).]

Global Page Directory Entry

The logical structure of a global page directory entry is pictured in Figure 5. The entry consists of a Shared word and a set of per-node words. As described below, the separation between the Shared word and the per-node words is necessary to support home node migration. The Shared word contains the following information:

Home PID: Processor ID of the home.
Exclusive Flag: Boolean flag indicating if the page is in Exclusive mode.
Last Holder: The Processor ID that last acquired the page. This item is provided for Carnival [6] (performance modeling) support.
Home Node Bound Flag: Boolean flag indicating if the home node is bound and cannot migrate.
Entry Locked Flag: Boolean flag indicating if the entry is locked.
Untouched Flag: Boolean flag indicating if the page is still untouched.

The per-node words contain only the node’s sharing status (Invalid, Read, or Read/Write) and a bit indicating if a Diff operation is in progress.

The global directory entry is structured to allow concurrent write accesses. First, the Shared word is only updated by the home node, and the other words are only updated by their owners. The entry can be updated concurrently under this strategy; however, a processor must read each of the per-node words in order to determine the global sharing set. We have evaluated this strategy against an alternative that uses cluster-wide Memory Channel-based locks to serialize directory modifications, and found that our “lock-free” approach provides a performance improvement of up to 8% [8].

The Shared word provides a single location for storing the home PID. A single location is necessary to provide a consistent directory view in the face of home node migration. It also provides a natural location for additional information, such as the Exclusive flag.

Node Page Directory Entry

The node level of the page directory entry maintains sharing information for processors within the node. The logical structure is very simple: the entry contains the sharing status (Invalid, Read, Read/Write) for each of the local processors. Instead of picturing the logical structure of this entry, it is more informative to examine the implementation. Figure 6 shows the type definition. The state field is split into bit fields that indicate the sharing state for each local processor. The other fields are used during protocol operation:

pValidTwin: A pointer to the twin, if a twin is attached.
bmPendingMigr: A bitmask indicating which local processors have a migration operation pending.
NodeLock: A local node lock used to provide coarse-grain serialization of protocol activities.
DirLock: A local node lock used only in low-level routines that modify the directory, thus allowing additional concurrency over the NodeLock.
Padding: A measure to eliminate false sharing. The padding fields are unused by the protocol; however, they can be used to store information during debugging. Several access functions are included in the Cashmere code.

// Note: Paddings are calculated with other members of
// _csm_pagedir_entry_t considered.
typedef struct {
    // First cache line (see _csm_pagedir_entry_t)
    csm_64bit_t state;           // Page state
    csm_64bit_t *pValidTwin;     // If twin is currently valid, points to twin, else NULL
    csm_32bit_t bmPendingMigr;   // per-cid bitmask stating whether the processor
                                 // is waiting on a pending migration. This could
                                 // be integrated into the state field.
    csm_32bit_t padding2[5];

    // Second cache line
    csm_lock_t NodeLock;         // Locks page inside a node
    csm_lock_t DirLock;
    csm_64bit_t padding3[5];
} _csm_node_pagedir_entry_t;

Figure 6: Type implementation of the Node Page Directory entry. The csm_32bit_t and csm_64bit_t types represent 32-bit and 64-bit integers, respectively. The csm_lock_t is a 64-bit word.

Directory Modifications

The Page Directory is global and replicated on each node. The method used to propagate modifications depends on the specific Cashmere version. The Memory Channel version of Cashmere uses the Memory Channel’s inexpensive broadcast mechanism to broadcast changes as they occur. As described above, the structure of the directory entry allows concurrent write accesses. The Explicit Messages version of Cashmere, by necessity, avoids any broadcasts. Instead, a master copy of each entry is maintained at the entry’s home node. Remote nodes simply cache copies of the directory entry. This Cashmere version must ensure that the master directory entry on the home node is always kept up-to-date.
Also, Cashmere must ensure that up-to-date information is either passed to remote nodes when needed or that the actions of a remote node can tolerate a stale directory entry. The three key pieces of information in the directory are the sharing set, the home PID indication, and the Exclusive flag. In the following paragraphs, we examine how the Explicit Messages version of Cashmere propagates changes to these pieces of information.

The global sharing set needs to be updated when a new node enters the sharing set via a Read or Write fault, or exits the sharing set via an Invalidation operation. In the former case, the node will also require a page update from the home node. Cashmere simply piggybacks the sharing set update onto the page update request, allowing the home node to maintain the master entry. In the latter case, however, there is no existing communication with the home node to leverage. Instead, as part of every Invalidation operation, Cashmere sends an explicit directory update message to the home node. These steps allow both the master copy at the home node and the copy at the affected node to be properly maintained.

Unfortunately, with this strategy, other nodes in the system may not be aware of the changes to the sharing set. In the Cashmere protocol, a remote node only needs the sharing set during one operation: the issuing of write notices. Fortuitously, this operation follows a diff operation that communicates with the home node. The home node can simply piggyback the current sharing state on the diff acknowledgement, allowing the processor to use the up-to-date sharing set information to issue write notices.

The home PID indication also needs to be updated as the home node is initially assigned (or subsequently migrated). We use a lazy technique to update this global setting. A processor with a stale home PID indication will eventually send a request to the incorrect node.
This node will forward the request to the home node indicated in its cached directory entry. The request will be repeatedly forwarded until it reaches the home node. The home node always piggybacks the latest sharing set and home PID indication onto its message acknowledgements. The message initiator can then use this information to update its cached entry. The _csm_packed_entry_t type is provided to pack a full directory entry (home PID, sharing set, bound flag) into a 64-bit value, and this 64-bit value can be attached to the acknowledgements. It is beneficial to pack the entry because Memory Channel packets on our platform are limited to 32 bytes. (The packet size is limited by our Alpha 21164 microprocessors, which cannot send larger writes to the PCI bus [3].) An unpacked directory entry would result in an increased number of packets and higher operation latency.

Management of the Exclusive flag is simplified by the Cashmere design, which only allows the home node to enter Exclusive mode. This design decision was predicated on the base Cashmere version, which allows home node migration. A page in Exclusive mode has only one sharer, and that sharer has Read/Write access to the page. By definition then, the home node will have migrated to the single writer, and so Exclusive mode can only exist on the home node. When the home node is entering Exclusive mode, it can simply update the master entry. The remote nodes do not need to be aware of the transition; if they enter the sharing set, they will send a page update request to the home node, at which point the home can leave Exclusive mode.

The remaining pieces of information in the directory entry are the Last Holder value, the Home Node Bound flag, the Entry Locked flag, and the Untouched flag. The Last Holder value is currently not maintained in the Explicit Messages version of Cashmere. (Carnival, in fact, has not been updated to run with any Cashmere versions.)
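As a rough illustration of such packing, the sketch below folds a home PID, a bound flag, and per-node sharing states into one 64-bit word. The field widths and helper names here are invented for the example; the actual _csm_packed_entry_t layout is defined in the Cashmere sources.

```cpp
#include <cstdint>

// Hypothetical layout for illustration only (not the real
// _csm_packed_entry_t): bits 0-4 hold the home PID, bit 5 the
// home-node-bound flag, and each node's 2-bit sharing state
// (Invalid, Read, or Read/Write) is stacked starting at bit 6.
constexpr int kMaxNodes = 8;

uint64_t pack_entry(unsigned homePid, bool bound,
                    const unsigned (&sharing)[kMaxNodes]) {
    uint64_t e = (homePid & 0x1F) | (uint64_t(bound) << 5);
    for (int n = 0; n < kMaxNodes; ++n)
        e |= uint64_t(sharing[n] & 0x3) << (6 + 2 * n);
    return e;
}

unsigned unpack_home_pid(uint64_t e) { return unsigned(e & 0x1F); }
bool     unpack_bound(uint64_t e)    { return (e >> 5) & 0x1; }
unsigned unpack_sharing(uint64_t e, int node) {
    return unsigned((e >> (6 + 2 * node)) & 0x3);
}
```

Because the packed word fits in a single Memory Channel write, the entry can ride on an acknowledgement without adding packets.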
The Home Node Bound flag is propagated lazily on the message acknowledgements from the home node. The remaining two flags are only accessed during initialization (which is not timed in our executions) and still use the Memory Channel. Future work should change these to use explicit messages.

4.2 Twins

The Twins component manages the space reserved for page twins, and also performs basic operations to manage twins and diffs. The Twin component interface exports two sets of functions to handle the management and the basic operations. The exported interface and underlying implementation are found in csm_twins.h and svc_twins.cpp. The twin space is held in shared memory to allow processors within the node to share the same twins. Upon allocation, the twins are placed into a list (actually a stack) of free twins. The stack is stored inside the twins themselves: the _csm_twin::m_pHead value points to the first twin in the stack, and the first word of each twin in the stack points to the next twin in the stack. When a processor needs to attach a twin to a page, it simply pops a twin off the top of the free stack. If the stack is empty, the Twin component automatically allocates additional space. When finished using the twin, the processor releases the twin, placing it at the top of the free stack. The two relevant functions in the external interface are AttachTwin and ReleaseTwin. The interface also exports several functions to perform twinning and diffing of pages. The diffing functions include variants that handle incoming diffs (see Section 3). Diffing is performed at a 32-bit granularity, and is limited by the granularity of an atomic LOAD/STORE operation. The most complex part of the Twin component implementation is the management of the twin space. Due to the on-demand allocation of twin space and the sharing of twins between processors, each processor must ensure that it has mapped a particular twin before it can be accessed.
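The free-twin stack described above can be sketched as an intrusive list threaded through the first word of each free twin. The class and member names below are illustrative stand-ins, not the actual Twin component interface.

```cpp
#include <cstdint>

// Sketch of a free list threaded through the twins themselves:
// the first word of each free twin points to the next free twin.
struct TwinPool {
    void *m_pHead = nullptr;           // top of the free stack

    void Release(void *twin) {         // push: link the twin in at the head
        *(void **)twin = m_pHead;
        m_pHead = twin;
    }
    void *Attach() {                   // pop: take the twin at the head
        if (!m_pHead) return nullptr;  // real code would grow the twin space here
        void *twin = m_pHead;
        m_pHead = *(void **)twin;
        return twin;
    }
};
```

Note that both operations dereference the twin space itself, which is why (as described next) even a release must first ensure the twin space is mapped.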
Inside the Twin component, all accesses to a twin are preceded by an AttachTwinSpace call that ensures proper mappings are maintained. Note that this call to AttachTwinSpace is required even when releasing a twin, because the twin space is used to implement the free list. (Our diffing implementation is based on an XOR operation, but is still limited to a 32-bit granularity due to conflicting accesses between multiple concurrent writers within the node.)

4.3 Write Notices

The Write Notices component provides a mechanism to send and store write notices. The component uses a two-level scheme where write notices are sent from one node to another node, and then later distributed to processors within the destination node. The Memory Channel version of Cashmere uses a two-level data structure to implement the write notice support. The top level is a global structure that allows nodes to exchange write notices. The structure is defined by _csm_write_notices_t in csm_wn.h. Each node exports a set of write notice bins into Memory Channel space. Each bin is owned uniquely by a remote node, and only written by that node. Each bin also has two associated indices, a RecvIdx and a SendIdx. These indices mark the next pending entry on the receiver and the next open entry for the sender. The two indices enable the bins to be treated as a wrap-around queue. To send a write notice in the Memory Channel version of Cashmere, a processor first acquires a node lock corresponding to the destination bin, and then uses remote write to deposit the desired write notice in the destination bin. The write notice is written to the entry denoted by the current SendIdx value, and then SendIdx is incremented. Finally, the sender can release the associated node lock. Periodically, the receiver will process each bin in the global write notice list. The receiver begins with the entry denoted by the RecvIdx, reads the write notice, sets the entry to null, and then increments the RecvIdx.
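The SendIdx/RecvIdx discipline on a bin can be sketched as a single-producer, single-consumer ring. The entry count and names here are illustrative assumptions; the real bins live in Memory Channel space, with the sender writing entries via remote write.

```cpp
#include <cstdint>
#include <cstddef>

// Sketch of one write-notice bin treated as a wrap-around queue.
// The sender advances SendIdx; the receiver advances RecvIdx.
const size_t kBinEntries = 8;          // illustrative; not Cashmere's bin size

struct WriteNoticeBin {
    uint64_t entries[kBinEntries] = {0};
    size_t   SendIdx = 0;              // next open entry for the sender
    size_t   RecvIdx = 0;              // next pending entry for the receiver

    bool Full() const  { return SendIdx - RecvIdx == kBinEntries; }
    bool Empty() const { return SendIdx == RecvIdx; }

    bool Send(uint64_t wn) {           // deposit a write notice
        if (Full()) return false;      // real code asks the receiver to drain
        entries[SendIdx % kBinEntries] = wn;
        ++SendIdx;
        return true;
    }
    bool Receive(uint64_t *wn) {       // drain one pending write notice
        if (Empty()) return false;
        *wn = entries[RecvIdx % kBinEntries];
        entries[RecvIdx % kBinEntries] = 0;  // "set the entry to null"
        ++RecvIdx;
        return true;
    }
};
```

This sketch uses monotonically increasing counters taken modulo the bin size; the actual implementation may wrap the indices themselves, but the full/empty logic is equivalent.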
The write notice is then distributed to node-level write notice lists (defined in csm_wn.h) corresponding to the processors on the node. This node-level list also contains an associated bitmap with one entry per page. Each bitmap entry is asserted when a write notice is in the list and pending for that page. This bitmap allows Cashmere to avoid placing duplicate entries in the list. Both SendIdx and RecvIdx are placed in Memory Channel space, and modifications to these indices are reflected to the network. The sender and receiver therefore always have current notions of these variables and can detect when the bin is full. In this case, the sender will explicitly request the receiver to empty its write notice bin. The explicit request is sent as a message via the RemoteDistributeWN function. The Explicit Messages version of Cashmere uses the messaging subsystem to exchange write notices between nodes, and therefore does not use the global-level write notices. The write notices are simply accumulated into a local buffer, and then the caller is required to perform a _csm_write_notices_t::Flush. This call passes the local buffer filled with write notices to the Message component, which then passes the write notices to a processor on the destination node. The message handler on the destination simply distributes the write notices directly to the affected node-level write notice lists.

4.4 Memory Allocation

The Memory Allocation component provides memory allocation and deallocation routines. This component is only available in the latest version of Cashmere. In our current environment, we perform frequent comparisons between the latest Cashmere and old legacy versions. For this reason, the new Memory Allocation component is not enabled; instead, we use an older malloc routine (found in csm_message/msg_assist.cpp) that is consistent with the malloc in our legacy versions.
The new Malloc component has been tested, however, and is available for use in programs that require frequent allocation and deallocation. The component can be enabled by asserting the CSM_USE_NEW_MALLOC macro in csm_internal.h.

5 Message Component

The Message component implements a point-to-point messaging subsystem. Each processor has a set of bi-directional links that connect it to all other processors in the system. The link endpoints are actually buffers in Memory Channel space that are accessed through remote write.

Figure 7: Message component structure. (The component comprises the core _csm_message routines and the _csm_rtag (Reply Tags), _msg_store (Message Buffering), _msg_send_data (Data Payload), and CListPool<_csm_dmsg_data_t> (Data Buffer Pool Management) support classes.)

The component is an aggregate of a set of core routines and a number of support components (see Figure 7). The core routines perform the message notification, detection, and transfer. The support is comprised of the Reply Tags, the Message Buffering, the Data Payload, and the Data Buffer Pool Management components. The Reply Tags component provides a mechanism to track pending message replies. The Message Buffering component maintains a queue of buffered messages that could not be handled in the current Cashmere state. The Data Payload component contains a set of routines to construct the data payloads of each message, and the Data Buffer Pool Management component maintains a pool of data buffers to be used in assembling message replies. The core routines can be found in msg_ops.cpp, and the support classes can be found in msg_utl.cpp. The msg_assist.cpp file contains routines that serve as helpers for clients accessing the Message component. These helper routines translate incoming parameters into Message component structures and invoke the appropriate messaging interface. The message subsystem is based on a set of pre-defined messages and their associated message handlers.
In the following section, we discuss the message types and the structure of the associated handlers. Section 5.2 then discusses the implementation of a round-trip message, and Section 5.3 describes the steps necessary to add a new message.

5.1 Messages and Handlers

The Message component defines a number of message types, each distinguished by a unique Message ID (MID) (see msg_internal.h). Message types exist for requesting page updates, performing diff operations, and requesting migration, among other tasks. Messages are categorized into three delivery classes:

Sync: Synchronous message. The Messaging subsystem will send the message and wait for a reply before returning control to the sender.

ASync: Asynchronous message. The subsystem will return control to the sender immediately after sending the message. The receiver is expected to send a reply, but the sender must periodically poll for it with _csm_message::ReceiveMessages().

Very-ASync: Very Asynchronous message. This class is the same as ASync, except that the receiver will not reply to the message.

All messages have an associated set of handlers, as defined by the _csm_msg_handler_t structure in msg_internal.h (see Figure 8).

    typedef csm_64bit_t (*_csm_handler_t)(const _csm_message_t *msg,
                                          _csm_message_t *reply,
                                          _csm_msg_post_hook_t *pPostHook);

    typedef int (*_csm_post_handler_t)(csm_64bit_t handlerResult,
                                       const _csm_message_t *reply,
                                       _csm_msg_post_hook_t *pPostHook);

    typedef int (*_csm_reply_handler_t)(const _csm_message_t *reply,
                                        _csm_msg_reply_hook_t *phkReply);

    typedef struct {
        csm_64bit_t          mid;
        _csm_handler_t       pre;
        _csm_post_handler_t  post;
        _csm_reply_handler_t reply;
    } _csm_msg_handler_t;

Figure 8: Type definitions for the structure used to associate handlers with a message.

The “Pre” Handler

Incoming messages are initially handled by the Pre handler. This handler is responsible for reading the message body, performing the appropriate action, and then formulating a reply if necessary.
The handler’s return code instructs the core messaging utility on how to proceed. A Pre handler has the choice of four actions:

Reply: return the formulated reply to the requestor.

Store: this message cannot be processed currently, so store it in a message buffer for later processing.

Forward: this message cannot be handled at this processor, so forward it to the specified processor.

No-Action: no further action is necessary.

The “Post” Handler

Like the Pre handler, the Post handler is executed at the message destination. The core messaging utility will act on the Pre handler’s return code (for example, send the reply formulated by the handler) and then invoke the Post handler. This handler is responsible for performing any necessary cleanup. The handler returns an integer value; currently, however, the value is ignored by the core messaging utility.

The “Reply” Handler

The Reply handler is called at the message source, when the reply is ultimately received. This handler is necessary to support the pipelining of messages. Pipelining across messages to multiple processors can be accomplished by using Asynchronous messages. As described above, the Message component returns control to the caller immediately after sending an Asynchronous message. The Reply handler provides a call-back mechanism that the Message component can use to alert the caller that a reply has been returned. After the Message component executes the Reply handler, it can then clear the reply buffer. The Reply handler thus serves the dual purpose of notifying the caller when a reply has arrived and notifying the Message component when a reply buffer can be cleared. The handlers are called directly from the core messaging routines, and so they must adhere to a well-known interface. At times, however, the handlers must be passed some type of context. For example, a Pre handler may need to pass information to the Post handler, or a Reply handler may need to be invoked with certain contextual information.
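A minimal sketch of the handler-table dispatch follows, using illustrative types; the real _csm_msg_handler_t records and MID encoding live in msg_internal.h and are richer than shown here.

```cpp
#include <cstdint>
#include <vector>

// Illustrative Pre-handler result codes and handler records; these are
// stand-ins, not the actual Cashmere encoding.
enum PreAction : uint64_t { kReply, kStore, kForward, kNoAction };

struct Message { uint64_t mid; uint64_t payload; };

struct Handler {
    uint64_t mid;
    PreAction (*pre)(const Message &msg, Message *reply);
    void (*post)(PreAction preResult);   // cleanup after the reply is sent
};

// Sketch of the core utility's dispatch: run the Pre handler, act on its
// result (here, just note whether a reply must be sent), then run Post.
bool HandleMessage(const std::vector<Handler> &table, const Message &msg,
                   Message *reply, bool *replied) {
    for (const Handler &h : table) {
        if (h.mid != msg.mid) continue;
        PreAction a = h.pre(msg, reply);
        *replied = (a == kReply);        // core utility would send the reply here
        if (h.post) h.post(a);
        return true;
    }
    return false;                        // no handler registered for this MID
}
```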
Cashmere solves this problem by including a hook parameter in each handler call. There are two types of handler hooks. The _csm_msg_reply_hook_t structure is passed to the Reply handler; a messaging client is responsible for allocating and destroying it. The _csm_msg_post_hook_t structure is allocated as a temporary variable by the core messaging routines and passed from the Pre to the Post handler. The reply hook structure uses a more concise format, since its members are dependent on the type of the hook. The message post hook structure should eventually be updated to use the same type of structure. The two types are shown in Figure 9.

Example

A page update message obtains the latest copy of a page and copies that data into the local working copy of the page. The Pre handler of this message first ensures that the local processor has read permission for the page, and then formulates a reply with the appropriate page data. The core message utility executes the Pre handler, returns the reply, and then calls the associated Post handler. Based on information stored in the message post hook, the Post handler restores the VM permission for the page, thus allowing the permission operation to overlap the reply handling on the source node. The Reply handler on the source node simply parses the reply and copies the page data to the local working copy.

    // _csm_msg_post_hook_t
    // This structure is used to pass information from the pre-hook to
    // the post-hook. The pre-hook only needs to fill in the information
    // which its corresponding post-hook will need. (Note that most of
    // this information is available in the incoming packet, but there
    // is no guarantee that the packet will be available at post time.)
    typedef struct {
        csm_64bit_t vaddr;    // Virtual address
        csm_64bit_t pi;       // Page index
        int         pid;      // Processor ID
        csm_64bit_t MsgMask;  // Message mask
        csm_64bit_t hook;     // Function-defined
        csm_64bit_t hook2;    // Function-defined
    } _csm_msg_post_hook_t;

    // _csm_msg_reply_hook_t
    // Reply handlers are called from various places by the messaging
    // subsystem. Clients of the messaging subsystem can pass arguments
    // to the reply handler through this structure. The structure simply
    // contains a type field and a void *. The void * should point to
    // the argument structure, as defined by the client. "rhk" stands
    // for reply hook.
    const int _csm_rhk_null      = 0;
    const int _csm_rhk_req_page  = 1;
    const int _csm_rhk_req_qlock = 2;
    const int _csm_rhk_malloc    = 3;
    const int _csm_rhk_diff      = 4;

    typedef struct {
        int   type;     // one of the above _csm_rhk_* consts
        void *psArgs;   // Points to a structure determined by the type
    } _csm_msg_reply_hook_t;

    // Each type of reply hook has a special rhk type. A simple example:
    typedef struct {
        csm_64bit_t   vaddr;
        csm_64bit_t   tsStart;  // Timestamp at start of fetch
        int           bWrite;   // Is this a write access?
        csm_64bit_t   action;   // Action taken by home node
        unsigned long start;
    } _csm_rhk_req_page_t;

Figure 9: Message hook types used to pass context to the handlers.

    struct _csm_dmsg_header_t {
        csm_v64bit_t mid;          // Message ID
        csm_v64bit_t reqno;        // Request number
        csm_v64bit_t size;         // Size of message
        csm_v64int_t iSrcPid;      // Source of current msg (reqs and replies
                                   //   are separate msgs)
        csm_v64bit_t bInUse;       // True if this link is in-use
        csm_v64bit_t options[11];
    };

    const int _csm_dmsg_data_size =
        (2 * _csm_page_size) - sizeof(_csm_dmsg_header_t);

    typedef struct {
        csm_v64bit_t data[_csm_dmsg_data_size/sizeof(csm_v64bit_t)];
    } _csm_dmsg_data_t;

    struct _csm_dmsg_t {       // 2 Pages
        _csm_dmsg_header_t hdr;
        _csm_dmsg_data_t   data;
    };

Figure 10: Message buffer structure.

5.2 Round-Trip Message Implementation

Message buffers are implemented by the _csm_dmsg_t type illustrated in Figure 10. Each buffer consists of a header and a data section. The header contains the message ID (MID), the message request number, the size of the message, and the processor ID of the message source.
The bInUse flag is asserted if the link is currently in use. The options field is used only during initialization. The following text describes the implementation of a round-trip message. To begin, the sender will call a helper function in msg_assist.cpp. The helper function will package the intended message data into a Message component structure and call the appropriate SendMessage wrapper function (MessageToProcessor, AsyncMessageToProcessor, or VeryAsyncMessageToProcessor). (The MessageToProtocol function is a holdover from earlier, protocol processor versions of Cashmere. In those versions, this helper function mapped the desired processor ID to the appropriate protocol processor ID. Without protocol processors, the mapping is an identity function.) In SendMessage, the core messaging utility will first create an rtag to mark the pending message reply. Then the messaging utility will copy the message header and data to the request channel’s Transmit region. Different methods are used to copy data (i.e., 64-bit copy, 32-bit copy, or a diff operation), depending on the MID. The different copy routines are contained in the _msg_send_data class. In the final step of sending a message, the messaging utility will assert the recipient’s polling flag by writing to the associated Transmit region. The process of receiving a message begins when an application’s polling instrumentation detects an asserted polling flag. The instrumentation transfers control to the ReceiveMessages function. This function examines all the incoming request and reply channels and begins processing a message (or reply). The first step in processing is to compare the MID against the current message mask. Any MIDs that are currently masked are moved from the message buffer to a message store managed by the _msg_store class. The message must be removed from the buffer in order to avoid possible deadlock (especially in Cashmere versions that may require forwarding).
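The round-trip path just described might be condensed as follows. All structures and names here are simplified stand-ins for the Cashmere internals; rtags, message masks, and the actual Memory Channel mappings are not modeled.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative stand-in for one message channel (the real link is a
// buffer in Memory Channel space written via remote write).
struct Channel {
    uint64_t header[4];          // mid, reqno, size, src pid
    uint8_t  data[256];
    bool     bInUse   = false;   // link busy until the receiver drains it
    bool     pollFlag = false;   // recipient's polling flag
};

// Sketch of SendMessage: claim the link, copy header and data into the
// "Transmit region", then assert the recipient's polling flag.
bool SendMessage(Channel &ch, uint64_t mid, uint64_t reqno,
                 const void *payload, uint64_t size) {
    if (ch.bInUse || size > sizeof(ch.data)) return false;
    ch.bInUse = true;
    ch.header[0] = mid;
    ch.header[1] = reqno;
    ch.header[2] = size;
    std::memcpy(ch.data, payload, size);
    ch.pollFlag = true;          // polling instrumentation will notice this
    return true;
}

// Sketch of ReceiveMessages: notice the flag, consume the message, then
// reset bInUse so the sender can push another message through the link.
bool ReceiveMessages(Channel &ch, void *out, uint64_t *size) {
    if (!ch.pollFlag) return false;
    *size = ch.header[2];
    std::memcpy(out, ch.data, *size);
    ch.pollFlag = false;
    ch.bInUse   = false;
    return true;
}
```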
When the mask is reset, the caller is responsible for explicitly flushing the message store via a call to FlushMsgStore. If the message is not masked, then the core messaging utilities will process the message via the appropriate Pre and Post handlers. This step of processing depends on the MID, the message type, and the installed handlers (see _csm_message::HandleMessage()). Both requests and replies are processed through ReceiveMessages, although requests are handled through HandleMessage and replies are handled through ReceiveReply. A messaging client is responsible for calling ReceiveMessages so that the incoming request or reply can be handled or stored. Once the message is removed from the buffer, either because processing is finished or because the message is stored, the channel’s bInUse flag will be reset. The sender can then push another message through the channel.

5.3 Adding a New Message Type

There are four steps to adding a new message type:

1. Add a new MID to msg_internal.h. The new MID must not share any asserted bits (aside from the 0x0feed prefix) with existing MIDs.

2. Create a helper function that builds the Message component data structures and invokes the appropriate SendMessage wrapper. This function should be placed in msg_assist.cpp.

3. Add the appropriate Pre, Post, and/or Reply handlers to msg_assist.cpp. The necessary handlers will be determined by the type of message and the necessary processing.

4. Register the handlers in the csm_daemon_register_handlers function.

The most difficult step is to determine the necessary handlers. Pre handlers are mandatory for all messages. Post handlers are only necessary when some type of “cleanup” is needed after the reply is sent. Reply handlers are needed only when the reply has data to be processed.

6 Synchronization Component

The _csm_synch component provides for cluster-wide locks, barriers, and flags, and also for per-node locks and specialized atomic operations.
Figure 11 shows the structure of _csm_synch. The _synch_flags, _synch_locks, and _synch_barriers classes are responsible for the implementation of the cluster-wide synchronization operations.

Figure 11: Synchronization component structure. (The _csm_synch component comprises the _synch_barriers, _synch_locks, and _synch_flags classes, plus the _synch_segs class, which manages the lock/flag segments.)

Cluster-wide synchronization operations are tied directly to the protocol; specifically, the cluster-wide synchronization operations trigger protocol Acquire and Release operations. Lock Acquire and Release operations trigger the associated protocol operation. Barriers are implemented as a protocol Release, followed by global synchronization, and then a protocol Acquire. Waiting on a flag triggers an Acquire; updating a flag triggers a Release. In the following discussion, we focus only on the implementation of the synchronization mechanisms, beginning with cluster-wide synchronization and ending with the per-node synchronization mechanisms.

6.1 Cluster-wide Synchronization

The cluster-wide synchronization operations take advantage of the SMP-based cluster by using a two-level synchronization process. A processor works at the node level first, using hardware shared memory mechanisms (see Section 6.2); it then completes its synchronization at the global level. The global synchronization operations have been developed in two versions: one version leverages the full remote-write, broadcast, and total ordering features of the Memory Channel; the other version is built with explicit messages only. The implementation of barriers and flags is very similar in the two versions. In the case of locks, however, the implementations are very different. We consider the individual synchronization operations below. We then detail the _synch_segs class, which manages the memory backing of the lock and flag structures.
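The mapping from synchronization operations to protocol actions can be sketched as follows; ProtocolAcquire, ProtocolRelease, and GlobalBarrierWait are illustrative stand-ins for the real Protocol and global-synchronization entry points.

```cpp
#include <string>
#include <vector>

// Event trace used only to make the ordering visible in this sketch.
std::vector<std::string> trace;

void ProtocolAcquire()   { trace.push_back("acquire"); }      // stand-in
void ProtocolRelease()   { trace.push_back("release"); }      // stand-in
void GlobalBarrierWait() { trace.push_back("global-sync"); }  // stand-in

// A barrier is a protocol Release, then global synchronization,
// then a protocol Acquire.
void Barrier() {
    ProtocolRelease();
    GlobalBarrierWait();
    ProtocolAcquire();
}
```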
Barriers

Barriers are implemented with a single manager that gathers entrance notifications from each node and then toggles a sense variable when all nodes have arrived at the barrier.

    class _csm_global_barrier_t {
    public:
        csm_v64bit_t arrival_flags[csm_max_nid];
        csm_v64bit_t episode_number;
        csm_v64bit_t global_sense;
    };

    class _csm_barrier_t {
    public:
        csm_64bit_t            m_bmArrivalsNode;
        csm_v64bit_t           m_EntranceCtrl;
        csm_v64bit_t           m_ExitCtrl;
        _csm_global_barrier_t *m_pGlobalBarrier;
    };

Figure 12: Type implementation of Cashmere barriers.

The type definitions for a barrier are shown in Figure 12. Again, the structure follows a two-level implementation. The _csm_barrier_t is held in shared memory, while the _csm_global_barrier_t is mapped as a broadcast Memory Channel region. The m_bmArrivalsNode and m_EntranceCtrl fields in _csm_barrier_t track the processors within the node that have entered the barrier. The former field is a bitmap identifying the particular processors; it is passed into the _prot_release::Release function so that the code can determine when the last sharer of a page has arrived at the barrier. (During a barrier, only the last sharer needs to perform a diff.) The latter field counts the local processors that have arrived at the barrier. By analogy, the m_ExitCtrl field counts the number of processors that have left the barrier. The last processor on a node to arrive at the barrier notifies the manager of its node’s arrival. In _csm_global_barrier_t, the manager maintains an array of per-node arrival flags. In the Memory Channel version of Cashmere, these arrival flags are kept in Memory Channel space, so they can be updated with remote write. In the explicit messages version, the flags are updated via an explicit CSM_DMSG_BARRIER_ARRIVAL message. When all nodes have arrived at the barrier, the master toggles the global_sense field in _csm_global_barrier_t.
Again, this field is updated either via remote write or by an explicit CSM_DMSG_BARRIER_SENSE message, depending on the Cashmere version. In the explicit messages version, the master sends individual CSM_DMSG_BARRIER_SENSE messages to the other nodes. Our prototype platform has eight nodes, so this simple broadcast mechanism is reasonable. If the number of nodes were to increase, a tree-based barrier scheme might provide better performance.

Locks

The _synch_locks class provides mutual exclusion lock operations. An application may perform an Acquire, a conditional Acquire, or a Release operation. The Memory Channel-based locks take full advantage of the special network features. In the code, these locks are referred to as ilocks. The locks are represented by an array with one element per node, and are mapped on each node into both Receive and Transmit regions. The regions are mapped for broadcast and loopback. A process begins by acquiring a local node lock. It then announces its intention to acquire an ilock by asserting its node’s element of the array via a write to the appropriate location in the Transmit region. The process then waits for the write to loop back to the associated Receive region. When the write appears, the process scans the array. If no other elements are asserted, then the lock is successfully acquired. Otherwise, a collision has occurred. In the case of a normal Acquire, the process will reset its entry, back off, and then try the acquire again. In the case of a conditional Acquire, the operation will fail. The lock can be released simply by resetting the node’s value in the lock array. This implementation depends on the Memory Channel. First, it uses remote writes with broadcast to efficiently propagate the intention to acquire the lock. It also relies on total ordering so that each node sees the broadcast writes in the same order. The explicit messages version of locks does not leverage any of these features.
This version is borrowed from TreadMarks [1]. It has been modified only to optimize for the underlying SMP-based platform. The locks are managed by a single processor that maintains a distributed queue of lock requestors. (The name ilock was introduced in the first version of Cashmere, where these locks were the internal locks. In the current Cashmere version, a more appropriate name may be mc-locks, since the implementation depends on the Memory Channel.)

    // Queue-lock values for the "held" field
    const int _csm_ql_free = 0;
    const int _csm_ql_held = 1;
    const int _csm_ql_wait = 2;

    // _csm_qlock_t: Queue-based locks.
    // Does not rely on bcast or total ordering.
    typedef struct {
        int          tail;
        int          next[csm_max_cid];
        volatile int held[csm_max_cid];
        int          mgr;
        csm_lock_t   lNode;
        csm_64bit_t  padding[2];
    } _csm_qlock_t;

Figure 13: Definition of the queue-based lock type used in the Explicit Messages version of Cashmere.

The type definition for queue-based locks is pictured in Figure 13. The tail field is valid on the manager node and points to the last processor in the request queue. The next array has one entry per processor on the node; each entry points to the succeeding PID in the request queue. The held field is a flag indicating the processor’s request status: free, held, or wait. The mgr field indicates the manager PID of the lock, while the lNode field is a node-level lock that allows any processor on the manager node to perform management operations. This optimization is allowed by the SMP-based platform. The steps to acquire a queue-based lock are as follows:

1. Place the requestor’s PID at the tail of the request queue.

2. Register the requestor’s intention to acquire the lock at the site of the old tail.

3. Wait for a lock handoff response from the old tail processor.

If the requestor and lock manager are on different nodes, then the first step is accomplished by sending an explicit CSM_DMSG_REQ_QLOCK message.
Upon receipt of this message, the manager will append the requestor’s PID to the request queue by updating the tail field of its _csm_qlock_t structure. If the requestor and lock manager are on the same node, however, the requestor can update the tail field directly, using the lNode lock to serialize access to shared memory. The second step can also be accomplished through either an explicit message or shared memory, depending on whether the manager and the tail are on the same node. Through either mechanism, the _csm_qlock_t structure on the tail node should be updated so that the tail’s next field points to the requestor. In the third step, the requestor will simply spin until its held field changes to _csm_ql_held. If the tail is on the same node as the requestor, the tail can set the requestor’s held field directly through shared memory; otherwise, the tail will send an explicit CSM_DMSG_ANS_QLOCK message to the requestor. The associated message handler will set the held field.

Flags

The implementation of Cashmere flags is very straightforward. In the Memory Channel version, the flags are simply mapped as broadcast regions. In the explicit messages version, the broadcast is accomplished via explicit messages. In the Memory Channel version of Cashmere, both lock and flag structures are mapped as broadcast regions. And while the size of the structures varies, they both require the same VM support. The _synch_segs class manages the associated Memory Channel mappings and provides a mechanism to map lock or flag IDs to a physical location in memory. The class was designed to allow for dynamic allocation of lock and flag segments.

6.2 Node-level Synchronization

Cashmere also provides a number of functions for synchronizing within a node. The functions provide general mutual exclusion locks and also atomic operations. The functions can be split into three categories: “ll”, “local”, and “safe”. A function’s category is specified in its name.
The functions can be found in synch_lnode.cpp and synch_ll.s.

The “ll” functions are low-level functions written in assembly language using the Alpha’s Load Linked/Store Conditional instructions. The Load Linked instruction loads a memory address and sets a guard flag on that address. A Store operation to that address (by any local processor) automatically resets the guard flag. A Store Conditional operation succeeds only if the guard flag is still set. These two instructions can then be used to implement various atomic operations. The “ll” functions attempt to perform their atomic operation once, but may fail because of a concurrent Store. The higher-level “local” functions repeatedly call the respective “ll” function until the operation succeeds. The “local” functions also call _csm_message::ReceiveMessages after a failed “ll” function in order to avoid deadlock. The “safe” functions are associated with locks. A safe acquire will succeed if the caller already owns a lock. These functions are used to help allow re-entrancy in message handlers.

7 Miscellaneous Component

The Miscellaneous component contains support routines for optimized data copy operations, exception handlers, and statistics. The component also contains routines to allocate shared memory and Memory Channel space. See the code for more details.

8 Software Engineering Issues

The Cashmere code is organized in a directory hierarchy reflecting the internal components. Files and directories in the hierarchy follow a naming convention that identifies their scope in the hierarchy. All files prefixed with “csm_” are globally visible; other files are prefixed with the name of the component in which they are visible. To help with code readability, Cashmere uses a modified Hungarian Notation convention, which prefixes each variable name with a type abbreviation. Table 1 lists many of the prefixes used in Cashmere code.
    Prefix    Type
    p         pointer
    a         array
    i, n      integer
    b, f      flag (boolean) value
    m         class member
    sc, s     static class member
    sql       sequential list
    _         internal to enclosing scope
    bm        bitmask
    dbg       debug
    ts        timestamp
    prx       pointer to a Receive region
    ptx       pointer to a Transmit region
    phk       message post hook
    rhk       message reply hook
    idx       index variable
    pid       processor ID
    nid       node ID
    cid       compute processor ID (ID within the node)

Table 1: Cashmere modified Hungarian Notation.

In addition to naming conventions, it is recommended that a Cashmere installation employ RCS or some other source code control system. At the University of Rochester, we have written a local script (bl_csm) that builds a level of code. This script attaches a symbolic name to the current version of each Cashmere source file, thereby capturing a snapshot of the development process. The script also prompts for a set of comments describing the build level and stores these comments in the levels.txt file stored under the root RCS directory. By convention, a programmer builds a level when a new stable code version is reached or when the Cashmere system is benchmarked.

9 Debugging Techniques

Cashmere bugs fall into two main categories based on their impact. Bugs in the first category cause Cashmere to fail immediately, either by crashing or by failing an assert check. Bugs in the second category are caused by races; they may not show on all executions, or may not manifest themselves until some time after the bug occurs. The first category of bugs can be tracked by using a debugger. Unfortunately, Cashmere currently is not able to open each application process under its own debugger. There are a few workarounds to help with debugging, however. First, most debuggers can attach to running processes. If Cashmere is invoked with the “-D” flag, then the process will print a message and enter an infinite loop if an assert fails. The user can then attach a debugger to the failed process and check the process state.
More simply, the Cashmere application itself can be started inside a debugger; process 0 can then be controlled by the debugger. These workarounds are admittedly weak, but they can be helpful in limited cases.

The second category of bugs can be tracked only by post-mortem analysis of an event log. The event log is usually created by the CSM_DEBUGF macro. (This macro is ignored in the optimized executable.) The macro takes a debugging level and additional, printf-style arguments; it prints those additional arguments if the program's debugging level is greater than or equal to the level specified by the initial argument. The debugging level can be set with the "-d:X" command line switch. Cashmere contains many CSM_DEBUGF statements; the number has grown as the code has developed. Currently, the statements are too verbose to trace a lengthy program; they are mainly useful in small test cases.

CSM_DEBUGF tracing is especially ineffective when debugging difficult races. In most cases, the statements significantly perturb execution and eliminate the race. To help debug races, we have developed two test programs and a special debugging log function designed to minimize logging overhead. The test.cpp program is a relatively simple test of Cashmere. The wtest.cpp program is a much more complicated test. Both applications can be found in the CCP hierarchy, which is described in Appendix B. Both operate on a small array, which is partitioned according to the number of processors. The processors operate on all partitions, and the programs verify the data at each computational step. The sharing patterns are relatively easy to follow, but they strongly exercise the Cashmere sharing mechanism. A programmer can view the event logs and track down problems based on the known (and readily visible) sharing patterns. As an alternative to CSM_DEBUGF statements, in the test applications the programmer can rely on a special csm_dbg_print_test_data function, defined in misc_ops.cpp.
The csm_dbg_print_test_data function is designed to dump a large amount of useful information in as little time as possible, thereby minimizing the perturbation due to logging. The function prints out the first data value in each processor partition, along with the barrier and lock IDs, the local logical clock value, and information specific to the call site. These calls have been placed at key points throughout the protocol, and can be enabled by asserting CSM_DBG_TEST in csm_internal.h. The macro should be set to "1" in wtest or to "2" in test. To enable the full set of these calls, also assert the DBG macro in prot_internal.h.

Calls to csm_dbg_print_test_data are especially effective in debugging the wtest application. The main computation phase in this application has each processor visit each array element and assert the bit corresponding to its processor ID. Races are avoided by first obtaining a per-partition lock. This access pattern creates a large amount of very fine-grain sharing. When the application fails, it prints out the incorrect array element. The programmer can then determine which bit is incorrectly set, and walk backwards through the event log to determine the cause of the failure. This technique is at times tedious, but it is often the only way to discover and correct a subtle race condition in the implementation.

In the rare case when even csm_dbg_print_test_data creates too much perturbation, a programmer can use the COutputBuffer class in misc_uout.[h,cpp] to keep a log directly in memory. After a failure, the programmer can arrange for the log to be dumped to the screen or to disk.

Another useful tool is dis. This tool disassembles an executable, allowing the programmer to associate a program counter value with a line of source code (assuming the executable includes debugging information). Most Cashmere exception handlers print a program counter value so that the exception can be traced to the source code.

References

[1] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R.
Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. Computer, 29(2):18–28, February 1996.

[2] A. Eustace and A. Srivastava. ATOM: A Flexible Interface for Building High Performance Program Analysis Tools. In Proceedings of the USENIX 1995 Technical Conference, New Orleans, LA, January 1995.

[3] M. Fillo and R. B. Gillett. Architecture and Implementation of Memory Channel 2. Digital Technical Journal, 9(1):27–41, 1997.

[4] R. Gillett. Memory Channel: An Optimized Cluster Interconnect. IEEE Micro, 16(2):12–18, February 1996.

[5] L. Kontothanassis, G. Hunt, R. Stets, N. Hardavellas, M. Cierniak, S. Parthasarathy, W. Meira, S. Dwarkadas, and M. L. Scott. VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks. In Proceedings of the Twenty-Fourth International Symposium on Computer Architecture, pages 157–169, Denver, CO, June 1997.

[6] W. Meira. Understanding Parallel Program Performance Using Cause-Effect Analysis. Ph.D. dissertation, TR 663, August 1997.

[7] R. Stets, S. Dwarkadas, L. I. Kontothanassis, U. Rencuzogullari, and M. L. Scott. The Effect of Network Total Order, Broadcast, and Remote-Write Capability on Network-Based Shared Memory Computing. In Proceedings of the Sixth International Symposium on High Performance Computer Architecture, Toulouse, France, January 2000.

[8] R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, St. Malo, France, October 1997.

[9] R. Stets. Leveraging Symmetric Multiprocessors and System Area Networks in Software Distributed Shared Memory. Ph.D. dissertation, Computer Science Department, University of Rochester, August 1999.

A Known Bugs

This Appendix covers known bugs in the current Cashmere system.

- Cashmere with first-touch home nodes (i.e.
no migration) crashes in barnes, clu, lu, and water-spatial. This version of the code will run, however, when the home node assignments used in an earlier evaluation [7] are specified via the "-l" switch.

- The wtest application does not run on 32 processors.

- The CC compiler outputs a large number of warnings when compiling the Cashmere library.

B Cache Coherence Protocol Environment

In our research environment we commonly use several different shared memory protocols, including different versions of Cashmere. The Cache Coherence Protocol (CCP) environment provides a single set of standard code and makefile macros for application development. These macros are mapped at build time to the target protocol. Currently, the CCP environment supports Cashmere, TreadMarks [1], Broadcast Shared Memory [9], and a plain hardware shared memory system. The environment is designed in a modular manner such that a new protocol can be incorporated simply by creating a makefile macro definition file and by adding appropriate code macros to an existing program. In the following sections, we discuss the CCP code and makefile macros, including some usage examples. As one specific example, we describe the Cashmere-CCP build environment.

B.1 CCP Code Macros

The CCP environment contains a set of macros for both C and Fortran source code. These macros provide access to the basic operations exported by shared memory protocols, such as initialization, memory allocation, and synchronization. To ensure portability, a CCP program must use these macros and avoid any direct invocations of a specific protocol's interface. The program's entry point, i.e. main (C) or program (Fortran), must also be named CCP_MAIN or PROGRAM, respectively. At build time, CCP replaces these labels with internal labels that allow the protocol to provide the program's entry point. The complete set of CCP code macros is contained in the ccp.h header file.
The ccp.h file contains the appropriate macro mappings for each supported protocol. The specific mappings are selected by the CCP_NAME macro, which is set on the build command line. Figure 14 contains a small subset of ccp.h. At the top of the file, each supported protocol is given a unique macro setting. The remainder of the file is split into sections containing the code macros for the associated protocol. The figure shows a subset of the macros in the Cashmere section, including the include statements and the initialization and lock operations. Sections for the TreadMarks protocol are also partially shown in order to provide a flavor of the file's organization. As noted above, CCP currently supports four distinct protocols, as well as several variants of each protocol.

B.2 CCP Makefile Macros

CCP provides a set of makefile macros that make the build process portable. These macros can be separated into two groups. Build macros allow the programmer to specify the desired system, along with other build parameters. Description macros allow the targets, dependencies, and commands to be specified in a manner that is portable across the various underlying protocols.

Build Macros

There are five build macros that control the selection of underlying protocols and other build options. Normally, these macros are set at the top of the application's makefile or on the make command line. Based on the build macro settings, the description macros are automatically set.
The five build macros are listed below:

    PROT      controls the choice of shared memory protocol.
    LIB_VERS  specifies the protocol library version.
    C_VERS    specifies protocol-dependent processing that must be
              performed on the application source.
    CLVL      controls the compilation level (optimized or debug).
    LIB_XTRA  specifies extra libraries that should be linked.

The file ccp_build.txt in the online documentation directory describes the supported settings for these macros. Figure 16 shows a simple example that uses the CCP makefile macros, both the build and description macros, to build a Cashmere application. We will discuss this example further below.

    /*-------------------------------------------------------*
     * ccp.h
    [lines omitted]
    #ifndef CCP_H
    #define CCP_H

    #define CCP_CASHMERE    00
    #define CCP_TREADMARKS  01
    [lines omitted]

    #if CCP_NAME == CCP_CASHMERE
    #  include "csm.h"
    #  include <stdlib.h>
    #elif CCP_NAME == CCP_TREADMARKS
    [lines omitted]
    #endif

    #if CCP_NAME == CCP_CASHMERE
    #  define CCP_MAIN                      app_main_
    #  define ccp_init_memory_size(s)       csm_init_memory_size((s))
    #  define ccp_init_start(argc,argv)     csm_init_start(&(argc), &(argv))
    #  define ccp_init(size,argc,argv)      { csm_init_memory_size((size)); \
                                              csm_init_start(&(argc), &(argv)); }
    #  define ccp_init_fstart(b,e)          csm_init_fstart((b), (e))
    #  define ccp_init_complete()           csm_init_complete()
    #  define ccp_distribute(pGlobal,size)  csm_distribute((pGlobal),(size))
    #  define ccp_time                      csm_time
    #  define ccp_lock_acquire              csm_lock_acquire
    #  define ccp_lock_release              csm_lock_release
    [lines omitted]
    #endif /* CCP_NAME == CCP_CASHMERE */

    #if CCP_NAME == CCP_TREADMARKS
    #  define ccp_init_memory_size(size)
    #  define ccp_init_start(argc,argv)     Tmk_startup((argc), (argv))
    [lines omitted]

Figure 14: The basic structure of ccp.h. Many macro definitions have been omitted due to space constraints. The online version has the complete macro definitions.

Description Macros

The description macros are used to set up the targets, dependencies, and commands.
The macros ensure that the proper support files are included or linked and that the proper tools are invoked with the proper flags. These macros are defined in shared.mk, which is shown in Figure 15. The description macros specify the compiler, linker, tool flags, file locations, and protocol-specific targets. The application makefile should use the macros prefixed with CCP. These macros are set to the protocol-specific settings by a low-level, protocol-specific makefile, $(PROT).mk, that is included at the top of shared.mk. PROT, of course, is a build macro set directly by the programmer. The protocol-specific makefiles are responsible for exporting all of the macros required by shared.mk. This requirement allows the same makefile to be used for any of the supported CCP protocols.

Figure 16 shows an example CCP makefile for the sor (successive over-relaxation) application. The makefile is divided into two main sections. In the first section, the CCP makefile sets the five build macros and then includes the shared.mk definition file. (In most versions of make, command-line assignments will override any makefile settings.) As described in the previous paragraph, the build macros drive the description macro settings defined in shared.mk. In its second section, the CCP makefile defines the targets, dependencies, and commands necessary to build the application.

B.3 Cashmere-CCP Build Process

The primary tool in the Cashmere-CCP build process is the csm_cc script. This script is a wrapper around the compiler and the linker, and is meant to be called in place of those two tools. The wrapper script performs a number of pre- and post-processing steps required to create the final Cashmere executable. Figure 17 shows the script's basic control flow. As mentioned, the script performs either compilation or linking. The initial step in the script is to scan the command line for source files.
If source files are present, the script will compile the files according to the steps shown on the left of the flow chart.

    ############################################
    # shared.mk
    [lines omitted due to space concerns]

    SHARED     = /s23/cashmere/ccp
    SHARED_INC = $(SHARED)/include

    include $(SHARED_INC)/$(PROT).mk

    ###########################################
    # Exported macros

    # Include paths
    INC = -I$(SHARED_INC)

    # C/C++ compiler and linker
    CC      = $($(PROT)_CC)
    CCP_CC  = $($(PROT)_CC)
    LD      = $($(PROT)_LD)
    CCP_LD  = $($(PROT)_LD)
    CCFLAGS = $(INC) $($(PROT)_INC) $($(PROT)_CCFLAGS)
    LDFLAGS = $($(PROT)_LDFLAGS)

    # Fortran compiler and flags
    CCP_FC      = $($(PROT)_FC)
    CCP_FCFLAGS = $($(PROT)_FCFLAGS)

    # Fortran linker, flags, and extra libraries
    CCP_FL      = $($(PROT)_FL)
    CCP_FLFLAGS = $($(PROT)_FLFLAGS)
    CCP_FLLIBS  = $($(PROT)_FLLIBS)

    # CCP include paths, C/C++ flags, linker flags
    CCP_INC     = $($(PROT)_INC)
    CCP_CCFLAGS = $($(PROT)_CCFLAGS)
    CCP_LDFLAGS = $($(PROT)_LDFLAGS)

    # CCP C/C++ optimized, debug, profile settings
    CCP_CCOPT = $($(PROT)_CCOPT)
    CCP_CCDBG = $($(PROT)_CCDBG)
    CCP_CCPRF = $($(PROT)_CCPRF)

    # Paths for application, CCP, fortran support objects, binaries
    CCP_APPS_OBJ = $($(PROT)_APPS_OBJ)
    CCP_APPS_BIN = $($(PROT)_APPS_BIN)
    CCP_OBJS     = $($(PROT)_OBJS)
    CCP_FC_OBJS  = $($(PROT)_FC_OBJS)

    # Path for library and library specification
    CCP_LIB_PATH = $($(PROT)_LIB_PATH)
    CCP_LIB      = $($(PROT)_LIB)

    # Additional protocol-specific targets
    CCP_TARGETS = $($(PROT)_TARGETS)

Figure 15: The shared.mk file sets the macros used in CCP makefiles.

    # SOR Makefile
    #
    # Choose the desired protocol.
    PROT = CSM_2L
    # Choose code level: optimized
    CLVL = OPT
    # Code version: polling
    C_VERS = _POLL
    # Protocol library selection: normal
    LIB_VERS =
    # Extra Library Selection: none
    LIB_XTRA =

    base: appl

    # Load in protocol-independent macros
    include ../../../include/shared.mk

    TARGET = $(CCP_APPS_BIN)/sor
    OBJS   = $(CCP_APPS_OBJ)/sor.o $(CCP_OBJS)

    appl: banner $(TARGET) $(CCP_TARGETS)

    $(TARGET): $(OBJS) \
               $(CCP_OBJS) \
               $(CCP_LIB_PATH)/lib$(CCP_LIB).a
            if [ ! -e $(CCP_APPS_BIN) ] ; then mkdir $(CCP_APPS_BIN); fi
            $(LD) $(OBJS) -o $(TARGET) $(LDFLAGS)

    $(CCP_APPS_OBJ)/sor.o: sor.c $(SHARED_INC)/ccp.h
            if [ ! -e $(CCP_APPS_OBJ) ] ; then mkdir $(CCP_APPS_OBJ); fi
            $(CC) -c $(CCFLAGS) -o $@ sor.c

    clean:
            rm -f $(TARGET) $(TARGET)_s *.S $(CCP_APPS_OBJ)/*.o *.s *~

    tags:
            etags *.c *.h

Figure 16: Example makefile for the SOR application. The build macros combine with the shared.mk include file to define the CCP description macros.

Figure 17: Control flow for csm_cc. Intermediate files are denoted by italicized labels on the transitions. [Flow-chart graphic not reproduced in this text version.]

Compilation

The csm_cc script can compile both C and Fortran files. For a C file, the script first uses the C compiler to create an assembly file that can be instrumented with Cashmere-related code. The dup_writes executable adds polling and/or write doubling instrumentation to the assembly file.
The executable also sets a number of variables, called application variables by our Cashmere system, in the application's data segment. Cashmere checks the application variables at runtime to determine how the application was built, i.e. the type of polling or write doubling instrumentation. The final step in a C source compilation is to run the instrumented assembly file through the compiler to produce the final object file.

The compilation process for Fortran source is slightly different, due to differences in the capabilities of the C and Fortran compilers. Our base C compiler (gcc) allows registers to be reserved, while none of our Fortran compilers have this capability. The polling instrumentation performed by dup_writes requires two reserved registers, and therefore this instrumentation cannot be used in our Fortran programs. Instead, our Fortran programs use Shasta-style polling, which walks through the binary executable and performs live register analysis to determine which registers are available at polling sites. This difference in compiler capabilities creates a significant difference in the build processes.

A Fortran source file is first run through the f2ccpf script. This script locates the main entry point and changes the entry point into an app_main subroutine, which is called directly by the Cashmere library. (The Cashmere library hijacks the program's main entry point in order to perform special initialization.) Once the application's main entry point is identified, the csm_cc script also creates a simple assembly file containing the application variables. This assembly file is compiled into an object file and ultimately linked into the final binary executable. The final step in the compilation process is to compile the instrumented Fortran source file into an object file. In addition, the Fortran code requires a special interface file to the Cashmere C code.
The fort_if.c file in the ccp/src directory provides simple wrapper routines that make the Cashmere library calls available to the Fortran source. In fact, fort_if.c is written using CCP calls, so this interface file makes all CCP-supported protocol libraries available to Fortran applications. As described in the previous section, the protocol is chosen through the PROT build macro.

Once all the C or Fortran application object files are built, the files can be linked together using the csm_cc script. After linking the executable, Shasta-style polling can also be applied. The following text describes the linkage operation.

Linkage

To begin the linkage process, the csm_cc script uses the ccp_common script to parse the object files for common blocks. (Common blocks are used extensively in Fortran, but the assembly language declaration for common blocks can be used by any high-level language.) The ccp_common script creates a small assembly file with definitions for each common block. These definitions ensure that the common block addresses fall within any address range required by the underlying protocol library. In the next step, the csm_cc script simply links together the specified object files. This step produces a binary executable file. For certain polling choices, i.e. Shasta polling, a final post-processing step is required to add the polling instrumentation. This step uses the xatom binary modification tool to place inline polling instrumentation in the binary. (Write doubling is used by an early Cashmere version [5] to propagate shared data modifications as they occur. The xatom tool is a version of atom [2] that has been modified to allow the insertion of single instructions.) As mentioned above, the instrumentation tool uses only dead registers in the polling instrumentation.