DRAFT

Advanced Simulation Technology Thrust
Parallel Processing & Efficient Time Management Techniques
Final Report

CDRL A0036
December 1999
MDA972-97-C-0017

Prepared for:
U.S. Department of Defense
Defense Advanced Research Projects Agency

Prepared by:
Science Applications International Corporation
1100 N. Glebe Road, Suite 1100
Arlington, VA 22201

The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official positions, either express or implied, of the Defense Advanced Research Projects Agency or the U.S. Government.

TABLE OF CONTENTS

Introduction & Overview
   Architecture
   Technical Objectives
   Key Results from Research
   Overview of Report
Topic 1 – The Investigation of Cluster Computing Hardware
   General
   Findings and Results
   Application
Topic 2 – Distributed Access to Models via Remote Controllers
   General
   Findings and Results
   Application
Topic 3 – Data Distribution Management Experimentation
   General
      Global Addressing Knowledge (GAK): Functional Description
      GAK Algorithms
      Static Mapping Algorithms
      Dynamic Mapping Algorithms
      Performance Metrics
      GAK Natural Measure of Effectiveness (MOE)
      Experimental Hypotheses
   Generation of Data Sets
      Experimental Scenarios
   Findings and Results
   Application
Extensions to Future Work
REFERENCES

Introduction & Overview

The size and complexity of the distributed simulation achieved as part of the DARPA/USACOM Synthetic Theater of War (STOW97) Advanced Concept Technology Demonstration's participation in the Unified Endeavor 98-1 training exercise required many years of R&D and hundreds of millions of dollars. STOW97 was the largest distributed simulation to date. Technology limits were stretched in both network and host platform hardware. Furthermore, the exercise itself was scripted to fit within the capabilities of the network, with no ability for free play, and modifications during the exercise itself were severely constrained. Combined, these factors make it quite difficult to insert a distributed simulation of that size and scale into an operational training environment. The problem is further complicated for systems such as the Joint Simulation System (JSIMS) that mandate much greater size and complexity than that achieved by STOW97 [Gajkowski98: STOW97 Lessons Learned]. To fit within JSIMS operational guidelines, a shift to more scalable approaches is necessary.

Under the DARPA Advanced Simulation Technology Thrust (ASTT) Efficient Processing project, the utility of cluster-based distributed simulation was investigated for large-scale distributed simulation. The particular goal was to assess the applicability of clustered computing techniques to JSIMS, as it was believed that cluster-based distributed simulation should map well to the JSIMS requirement for very large scale simulations with less manpower required to create and run an exercise. While modeling techniques expanded beyond what was used for STOW97 are likely to be employed in JSIMS (making a direct comparison difficult), a baseline of 100,000 discrete entities was used to evaluate the scalability of networking technologies to a system of JSIMS' size and scale.

Specifically, this project's research focused on the situation in which the bulk of a simulation is executed within a resource-efficient central cluster environment. The environment consists of low cost workstations and a high speed, low latency Local Area Network (LAN) using off-the-shelf hardware. Distributed user interaction occurs through standard low bandwidth point-to-point communication links. Latency effects are minimized in much the same manner as in a fully distributed simulation. Overall, avoidance of the customized high-bandwidth networks and hardware required to support previous large-scale distributed simulations should result in a system with a substantially lower cost to field and operate, thereby meeting a primary requirement for JSIMS.

Lower fielding and operational costs are based on an analysis of STOW97. Also considered for analytical purposes were the partially distributed training exercises conducted via the Army Joint Precision Strike Demonstration (JPSD) simulation system. As compared to the large-scale scenarios postulated for JSIMS, STOW97 simulated two to six thousand entities at the platform level. JPSD exercises ranged in size from six thousand to over twenty thousand entities, using both platform level and aggregate level models.
From an operational standpoint, a tremendous level of effort was required to bring up the Wide Area Network (WAN) used for STOW97. Specifically, bandwidth and latency requirements as well as multicast usage required special hardware and operating system modifications, in addition to detailed tailoring of the exercise to the infrastructure. Extrapolating from simulation sizes reached via the STOW97 fully distributed infrastructure to JSIMS-sized exercises shows a lack of scalability in the network infrastructure and a level of effort to design and execute an exercise that is contradictory to the JSIMS requirement for lower manpower costs.

As depicted in Figure 1, a cluster-based simulation maintains the distributed nature of users, but replaces the complex WAN with simple, low bandwidth links from the centralized models to their distributed users. Cluster-based simulation obtains its significant reduction in WAN cost and complexity from two areas: less traffic and replacing the heavy use of multicast with standard point-to-point links. Network traffic analysis shows that model to model communication (e.g., changes to simulation state, entity movements, fire events, etc.) composes the vast majority of network traffic in a distributed simulation. By executing models within the resource-efficient cluster, such traffic is eliminated from the WAN. The relatively lower network traffic generated by model to user and user to model interactions is easily supported via standard networking technology: no multicast support in the WAN is required. Further, interactions with users are the least sensitive to latency in terms of the overall network traffic set. Similar or superior latency behaviors can be expected from cluster-based simulation over fully distributed simulation.

Figure 1: Cluster-based Distributed Simulation

Simulations larger than STOW97 have been conducted via a partially distributed simulation system at the JPSD facility. The JPSD approach of using clusters of workstations connected via a high speed, low latency LAN with a hierarchical gateway architecture has considerable potential for the JSIMS problem as well. This research extends the JPSD approach [Powell96: JPSD Simulation Architecture] into a fully clustered architecture.

ARCHITECTURE

While scaling distributed simulation requires advancements in both infrastructure and modeling techniques, this program focused strictly on network-level infrastructure. Using abstractions as defined within the early JSIMS architectures and the RTI 1.3NG internal architecture, all modeling characteristics and data distribution management characteristics were abstracted into simple tagged packets requiring delivery at the network layer (shown in Figure 2).

Figure 2: Experimental Architecture Components (local models and Data Transmission Optimizations pass tagged local state changes to the Inter-Host Data Manager; the DD: Assigner, DD: GAK and DD: Obtainer components exchange tags, tag-to-channel maps and packets over Channel In / Channel Out connections to the Communication Fabric)

Hardware characteristics of various cluster options were explored, and dynamic routing schemes using point to multi-point network protocols were simulated. The architecture consists of:

Entities. Entities are the modeling components of the simulation.
Entities have both public and private state, and (in this abstract architecture) communicate with 4 DRAFT each other via changes to their public states. The union of all entities’ public state is referred to as the simulation shared state. Entities do not ‘pull’ shared state from other entities. Interest expressions are used by each entity to describe what types (and values) of shared state is currently required. The simulation infrastructure is responsible for ‘pushing’ all shared state matching an entity’s interest set to that entity. Entities describe their public state to the infrastructure in the same syntax as interest expressions, and notify the infrastructure when the state is modified. The infrastructure uses this information in ‘pushing’ state updates to the relevant consuming entities. A push-based approach is considered central to achieving scalability. Data Transmission Optimizations. A (non-infrastructure) component of the simulation is responsible for minimizing the number of required inter-host data transmissions caused by changes to the simulation’s shared state. A broad set of techniques to accomplish this task have been identified by the community and are grouped in this architecture under the heading of Data Transmission Optimizations (DTOs). DTOs range from load balancing (where entities which frequently communicate are allocated to the same host) to variable resolution data and predictive contracts. Key to successful DTOs is the notion of ‘sufficiently accurate representation of the global shared state,’ where slightly imprecise views of shared state are much cheaper to maintain but do not invalidate the models using the shared state. DTOs are not modeled under this program, but their effects are estimated by a simple reduction in the number of required inter-host data transmissions. Inter-Host Data Manager. The Inter-Host Data Manager is responsible for bringing data required by local models to the local host. It uses interest statements from its local clients to determine what shared state is required locally. These entity-level interest statements are translated by the Data Manager into some form of network tags that are abstracted representations of the interest statements. Tags are expected to be efficient to evaluate and require no knowledge of the application's data. During the abstraction process, tags are further expected to be low enough resolution such that tag subscriptions change infrequently, easing the load on Data Distribution [Mellon96. Hierarchical Filtering in the STOW System]. These tags are given to the GAK as descriptions of what data is required at this host. Using the same abstractions as in the translation of interest statements to network tags, the Data Manager also tags the local state changes by its client entities. Tagged state changes are then sent to the Data Distributor for assignment to a transmission channel. The Data Manager component is modeled, not implemented, in ASTT experiments. It is assumed that abstract tags are created by some exterior mechanism, such as routing spaces or predicates. Data Distribution: Global Addressing Knowledge (GAK). The GAK is responsible for an efficient mapping of tagged data to the available set of network channels. Static mappings may be used, or mappings may vary based on feedback from current network conditions. A range of mapping schemes may be 5 DRAFT found in GAK Algorithms under Test, below. 
Network factors that must be considered include raw costs of a transmission; number of channels effectively supportable by the hardware; cost of joining or leaving a channel; unnecessary receptions; and unnecessary transmissions. DD: Obtainer. Using the mapping provided by the GAK, the Obtainer simply attaches to the receiving end of all channels that may contain a tagged state update required at this host. Note that due to multiple tags being assigned to a single channel, state updates may arrive that are not required locally. Such updates are considered ‘false hits’ and are not handed up to the application. DD: Addresser. Using the mapping provided by the GAK, the Addresser simply places a tagged state update on the correct channel for transmission to other hosts. Channels are the communication abstraction for distributed hosts. Channels may have 1...N producers and 1...N consumers. Channels may restrict the number of producers and/or consumers to best match a given hardware system. Consequently, the GAK mechanism must account for this restriction in its use of a channel to interconnect hosts. Channels may bundle data packets for extra efficiency. Channels present a multiple recipient API to clients, that is then implemented in whatever manner the hardware efficiently supports. Due to hardware restrictions, there may also be a limit on the number of available channels. These details may be factored into a GAK algorithm through parameterization, and the algorithm will work within the limitations. Communication Fabric: The Communication Fabric is the underlying connection hardware between hosts in a distributed system. It may be shared memory, point to point messaging, etc. It may or may not support point to multipoint or IP multicast. The fabric is used by Channels, which implement their multiple recipient API in terms of the fabric's best possible delivery mechanism. Note that the GAK and Channel components were the primary focus areas for the research. A more detailed view of the architecture may be found in [ASTT-PP Final IPR Briefing] and [Mellon98, Clustered Computing in Distributed Simulation]. TECHNICAL OBJECTIVES The goal of this program was to see if distributed simulation could be supported via a clustered computing approach. The technical objectives established from this were to: (1) determine the relative WAN requirements for a clustered simulation versus a fully distributed simulation; (2) investigate the feasibility of masking the latency from a remote user to the clustered models; and (3) make efficient use of the LAN resources within the cluster (specifically, multicast groups and similar mechanisms). An experimental framework was implemented to investigate critical questions. The fundamental hypothesis, so called cluster hypothesis, of the research was: 6 DRAFT Clustered model execution will increase system scalability by greatly reducing the volume and complexity of WAN connectivity within a distributed simulation. No significant change will occur to the quality of user interaction or simulation validity. From this, a set of derived hypotheses were established: User to Model interactions may tolerate higher network latencies than Model to Model interactions. Grouping models within a cluster significantly reduces WAN network traffic, and Accurate data distribution is required to support scalability. KEY RESULTS FROM RESEARCH A number of important results were established as part of this effort. In particular, it was shown that 1. 
A distributed simulation system may be developed that supports distributed access by remote users while executing attached models in an efficient cluster computer environment. 2. A clustered approach greatly reduces the WAN Quality of Service (QoS) required. This follows since there is no large-scale WAN multicast usage, much lower WAN bandwidth, and simpler WAN latency requirements. 3. That the low latency between simulation hosts in a clustered environment allows significantly improved accuracy of data distribution via dynamic mapping of simulation data to a limited system resource (multicast groups within the cluster) over what is possible with the high latencies encountered in a fully distributed simulation. 4. Distributed users can be coupled to centralized models with the same level of effective latency provided by fully distributed simulation. OVERVIEW OF REPORT The report is organized into three main topics: cluster computer hardware research, remote controller algorithms, and DDM experimentation. These topics address the key functional areas required to demonstrate the viability of cluster-based simulation. For each topic, a technical overview and summary is presented; key results are summarized and a discussion of applicability of the techniques and methods. In addition, recommendations for future useful investigations are provided. 7 DRAFT Topic 1 – The Investigation of Cluster Computing Hardware GENERAL Supporting a large simulation-based training exercise requires access to a large base of computational power. Centralizing the bulk of the computationally intense models within a resource-efficient cluster-computing environment is one approach. A variety of new hardware and improved performance of existing hardware has been generated in the industry to increase the efficiency of clustered computing, albeit for a different target application. This topic dealt with the analysis of such cluster-oriented networking in terms of the network loading conditions and requirements imposed by military simulation. Hardware approaches analyzed included SCI (Scalable Coherent Interface: a shared memory API), SCRAMNet (also shared memory), Myrinet (point-topoint API), ATM (point to multi-point API) and Ethernet (multicast API, used as point to multi-point and multi-point to multi-point). Two particular concerns were the latency between simulation hosts (very high in a WAN-distributed simulation) and the point to multi-point nature of simulation state traffic. Specifically, could a cluster-based simulation provide significantly lower interhost latency, and how well would the atypical network traffic of a military simulation map to off-the-shelf industry networking hardware? Other concerns investigated were effective bandwidth available, hardware cost, and operational stability. FINDINGS AND RESULTS Across all platforms, sufficient ‘wire’ bandwidth was found to exist [ASTT-PP Briefing, Clustering Options, 1998]. As was also shown in STOW97 and earlier simulation systems, the bottleneck in distributed simulation was found to be the speed of a host in accessing the wire, not the amount of raw bandwidth across a network. Specifically, the speed at which a host could read and process a packet resulted in an upper limit of packets per second that could be read off of the network. This is due to the nature of most simulation traffic: small packets, sent very frequently and (usually) to more than one recipient. 
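The host-wire access argument can be made concrete with a small sketch. The code below is illustrative only (the function and variable names are not from this study); it simply counts the host-wire accesses needed to deliver one state update to several recipients with and without point to multi-point support, the same arithmetic worked through in the packet_A example later in this topic.

```python
# Illustrative sketch (not from the report): host-wire accesses needed to deliver
# one simulation state update from a producing host to `receivers` consuming hosts.
def wire_accesses(receivers, point_to_multipoint):
    sends = 1 if point_to_multipoint else receivers  # one multicast send vs. N unicast sends
    receives = receivers                             # each consumer still reads the packet once
    return sends + receives

# Example in the spirit of the packet_A case: host_A sends to hosts B, C, D, E, F.
print(wire_accesses(5, point_to_multipoint=True))   # 6 total host-wire accesses
print(wire_accesses(5, point_to_multipoint=False))  # 10 total host-wire accesses
```

The send-side count is what grows with cluster size under point to point hardware, which is why the loss of multicast matters more than raw bandwidth.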
With such traffic patterns and the high cost of processing a packet, the upper limit of network hardware bandwidth is extremely difficult to reach. Minimizing the number of network accesses by a host is a much more significant goal than maximizing the physical hardware bandwidth.

Latency measures across all platforms were found to be acceptable, especially when compared to the reference system (STOW97). STOW latencies were reported as an average of 60,000 microseconds between hosts (across STOW's WAN and LAN based network). Inter-host latencies across various cluster network options were found to range from approximately 1 to 100 microseconds, dependent on packet size and hardware type. Table 1 shows the latency for a packet of minimal size across the cluster hardware options. Specialized cluster hardware such as SCI and Myrinet provided the best latency, while optimized device drivers for generic hardware such as Ethernet and ATM provided sufficiently low latency.

Name                                          Type                        Latency         Bandwidth
ATM                                           Fast switched network       20 us           155 Mbit/s
SCRAMNet (Systran)                            Distributed Shared Memory   250 to 800 ns   16.7 MB/s
Myrinet (Myricom)                             Fast switched network       7.2 us          1.28 Gbit/s
U-Net                                         User-space device drivers   29 us           90 Mbit/s
High Performance Parallel Interface (HIPPI)   Fast switched network       160 ns          800 Mbit/s
Scalable Coherent Interface (SCI)             Distributed Shared Memory   2.7 us          1 Gbit/s

Table 1: Sample clustering techniques

Significant drawbacks were found in the specialized (i.e., extremely low latency) cluster hardware in the areas of cost, operational stability and point to multi-point support. All such network hardware is implemented in terms of point to point traffic (SCI, Myrinet) or broadcast (SCI, SCRAMNet). Given the point to multi-point nature of most simulation network traffic, multiple packets must be generated by either the host or the network card, each at a cost. This is a significant factor, especially for larger clusters, where correspondingly more packets must be generated. For example, packet_A is generated by host_A for hosts B, C, D, E, F. Using a network capable of point to multi-point traffic, one host-wire access is required to generate the packet, and five to receive it. Under point to point, five host-wire accesses are required to generate the packet, and five to receive it. Given the identification of host-wire access as the primary bottleneck, this is clearly a significant drawback to such hardware. Further, such hardware is generally quite high in cost (compared to standard Ethernet), and most options have not achieved the levels of operational stability of more standard networking gear such as Ethernet or ATM. Given the differing focus of industry cluster computing (point to point traffic supporting clustered database servers), it is unlikely that the proposed System Area Network standard will support simulation traffic and its heavy reliance on point to multi-point protocols. This limits the usefulness of such technology in cluster-based simulation.

For the purposes of fielding a cluster-based simulation in an operational environment, the best current choice for performance and cost is Fast (switched, 100 Megabit) Ethernet, using the commercially available hubs ('Edge Devices') enhanced under the STOW program, combined with low latency device drivers created by the U-Net program at each host. From a cost standpoint, this is a clear win.
The fielding cost per computer is extremely low (under one hundred dollars), and a considerable legacy of tools and expertise is available to keep Ethernet operational costs low and stability high. Costs to field other cluster network options range from hundreds to thousands of dollars per computer. Operational costs for non-Ethernet hardware options are also projected to be higher, as the stability of such options is lower than Fast Ethernet (with the exception of the very stable – and most expensive – SCRAMNet system). Less expertise is also available to tune the network: a key factor, as simulation traffic differs significantly in loading characteristics from the industry norm.

From a performance standpoint, Fast (switched) Ethernet hubs are either superior or essentially equal to other cluster options. The driving factor here is the hardware-supported access to thousands of multicast groups. First, multicast availability allows single send, multiple recipient traffic. As noted above, this increases scalability in the simulation hosts and – as a secondary feature – keeps bandwidth usage down. Second, the switched nature of the hubs prevents traffic from flowing to a host's Network Interface Card (NIC) unless it is specifically addressed to that host. This is a significant performance factor, as it includes addressing via multicast groups: i.e., a host must specifically join a multicast group before the data is sent. This avoids a serious problem encountered in earlier simulation systems, where the NICs were forced into a performance-limiting promiscuous mode to deal with thousands of multicast groups. Third, thousands of multicast groups allow a finer division of simulation state data across groups. This allows hosts better control over the type – and thus volume – of data being received (the channel-bundling problem). Finally, the latency of the above described system is sufficient to meet the needs of a cluster-based simulation. Referring again to Table 1, we see that U-Net device drivers – optimized by means of mapping device control functions into user space – provide latency similar to other options, especially when compared to the STOW97 reference system (60,000 microsecond latency).

While the best future option for clustering is likely to be ATM [1] (based on its low latency, high bandwidth and point to multi-point support), the current best option for an operational system is Fast Ethernet with U-Net optimized device drivers. The reduction to a 29 microsecond inter-host latency is sufficient to allow accurate, dynamic data distribution management. No significant difference is projected from the lower inter-host latency (e.g., 7 to 10 microseconds) possible via higher performance clustering cards such as Myrinet, especially given the offsetting loss of access to the efficient multicast protocol. Further, the CPU cost to the host of accessing the wire via the specialized U-Net drivers is also much lower than with the standard Ethernet drivers used in STOW97. This frees additional cycles for the models and increases the efficiency of the system.

In summary, under the cluster hardware analysis efforts we found that:

1. High bandwidth, low latency can be achieved within a cluster (4-100 usec inter-model latency; 4-100 usec CPU cost; and 150 Mb to 1 Gb bandwidth).
2. Cluster bandwidth easily supports model execution.
3. For STOW97, much higher networking costs precluded a scalable system (latency ~60,000 usec across the WAN and CPU cost 200 - 400 usec/packet (Sparc)).
4. There exists device driver and performance instability in extremely low latency LANs, and the new technology's primary uses differ significantly from the characteristics of distributed simulation.
5. There is a wide range of APIs (shared memory, point to point, etc.), which leads to degraded performance for the simulation use case and to software maintainability issues.
6. Add-on cards require specialized conditions and knowledge, resulting in added cost and added operational complexity.
7. There is limited to no support for multicast in the lowest latency cluster hardware options: a serious drawback due to the higher cost of sending the same message to many hosts (a typical case in simulation). This is not a focus area for industry and there is no near-term improvement projected.
8. There exist techniques for improved IP access. Specifically, there are results that show one can achieve ~20 usec latency via memory-mapped device drivers on standard Ethernet. Beowulf clusters, primarily using Linux with an IP-style backbone (switched 100 Mbs Ethernet, FDDI, etc.), have shown excellent results. Moreover, there is lower hardware cost, and the technology is easier to use.
9. WAN bandwidth & QoS requirements for large scale traffic (100,000 entities) can be much lower via clustering.
10. Ethernet performance is arguably comparable to specialized cluster hardware for the simulation use case. New types of (optimized) device drivers allow inter-host latencies in the 20 to 30 usec range for Ethernet within a cluster while maintaining support for multicast (heavily used in simulation traffic). Extremely low latency networks studied (SCI, Myrinet, etc.) bring (effective) latencies down to the 5 to 15 usec range, but at the cost of losing access to multicast.
11. Extremely low latency networks are still a little fragile and esoteric, and they are not currently suitable for operational fielding.
12. Multicast support is key to a scalable network, and the extremely low latency networks do not have it.
13. Given the better operational support, critical multicast support and essentially equivalent performance, Fast (switched) Ethernet with optimized device drivers is currently the best choice for a cluster-based simulation.

[1] An ATM switch is used as the basis for the JPSD cluster, with multicast groups to segment data between simulation hosts.

APPLICATION

Cluster hardware is continuing to advance as areas of industry begin the standardization process. Unfortunately, the current direction is to support only point-to-point traffic. Expanding the industry view to include point to multi-point would be a significant advantage to the simulation community; however – as has been found in the past with large scale multicast applications – the simulation market is probably not large enough to drive industry standards.

Expanding the use of optimized ATM as a LAN has significant advantages to the simulation community. Its use in the mainstream and its point to multi-point support are strong advantages. Because of the way in which ATM may be used within simulation (point to multi-point links, source based trees), it is likely that the existing industry-driven advances to ATM functionality and performance will be in line with the needs of the simulation community. Stability and price are the current drawbacks. Work should be continued to monitor the stability and performance of this option as a high-speed LAN. Cost is expected to drop if ATM continues its advancements in the commercial marketplace.
11 DRAFT Topic 2 – Distributed Access to Models via Remote Controllers GENERAL While executing models in a centralized computing environment holds great promise for increasing the efficiency of a simulation, the training systems under consideration have a distributed execution requirement: it is impractical to bring trainers, trainees and response cells all to a single point to run an exercise. This leads to the derived requirement of remote access to the centralized models and the virtual world they are populating. This facet of the program dealt with the concept of providing the same effective levels of fidelity, latency and response times with cluster-based simulation that current fully distributed simulation techniques provide. The concept of a remote controller agent was introduced. This agent resides within the cluster environment, consuming the subset of virtual world data that its remote controller requires. That data is transmitted via standard WAN point-to-point links to a remote controller site, where the WAN latency is then masked with predictive contracts. An extension of the dead reckoning concept from DIS, predictive contracts reduce the network bandwidth required to support a given level of accuracy for distributed data, while masking latency by means of abstracted models of the data’s behavior over time. The WAN traffic patterns generated by remote controllers are expected to differ significantly from those of fully distributed simulations. Given that a cluster-based simulation restricts the bulk of simulation network traffic to be within the cluster, WAN traffic is primarily restricted to interactions between remote controllers and their agents resident in the cluster. Findings (below) indicate that WAN requirements for a large exercise are in general substantially lower for cluster-based simulation than fully distributed simulation. Further, the complex WAN multicast schemes required to support large exercises such as STOW97 are eliminated entirely. Obviously remote controller agents are highly client-specific. Given the set of JSIMS client applications and their data requirements was still undergoing definition at the time of this study, STOW97 applications were used as the reference point. STOW97 traffic patterns were extrapolated out to JSIMS-size exercises to provide a rough order of magnitude WAN traffic estimate. FINDINGS AND RESULTS Using STOW technology as the baseline, approximately 40,000 WAN multicast groups and 75 Mbs of bandwidth would be required to support a JSIMS-sized exercise. This number was extrapolated from both STOW experimental and analytical results for exercises in the 6,000 to 50,000 entity range. 100,000 entities was used as a baseline for JSIMS. This level of WAN technology is far in excess of what current networks can supply. In particular, no known attempts are underway to increase the number of supportable multicast groups beyond the 3,000 achieved in STOW. To support a 12 DRAFT similarly sized exercise via cluster-based simulation, approximately 0.1 Mbs of WAN bandwidth and no WAN multicast groups would be required at all2. These levels are well within commercial networking technology. The much lower volume and complexity of WAN traffic projected for clusterbased simulation are attributable to the different allocation of distributed components to physical infrastructure. As differing links between components are now carried across the WAN, the resultant network traffic is completely different in nature and volume. 
One result of this differing allocation is the use of point-to-point communication links, replacing the heavy use of multicast links in fully distributed simulation. In addition, only the low frequency communications required to update user’s displays and capture their inputs are transmitted via the WAN. These points are elaborated below. In fully distributed simulation, multicast groups are used to couple producers of simulation state with the appropriate consumers. The simulation state must be divided into sufficiently small pieces to prevent any given simulation host from being flooded with irrelevant data (the channel bundling problem). This division requires a large number of multicast groups, and the number of groups required scales upwards with the size of the exercise. Producing a static map of simulation state to multicast groups is a time-consuming task that greatly restricted STOW97 exercise designers. Further, the high latency between hosts precluded the use of dynamic mappings which would have allowed a more efficient use of the available multicast groups and eased the burden on exercise designers. Another complicating factor is that multiple recipients are the norm for multicast groups, and the set of recipients changes dynamically. The opposite of all these factors is the case for the WAN component of cluster-based simulation. WAN network links are static: data is transported from the cluster to a known set of remote controllers. Further, the multiple recipient problem in the WAN cloud does not exist. Controllers are linked to their agents via standard point to point communications: multicast is not used at all in the WAN. This greatly simplifies the WAN, lowers the cost of required equipment and bandwidth, and allows for easier exercise design by eliminating the need to tailor the exercise to multicast group availability. The levels of WAN network traffic generated by cluster-based simulation are substantially lower than the levels required to sustain an equivalently sized exercise with fully distributed technology. Analysis of fully distributed simulations such as STOW97 at the component level (not simply a network traffic level) shows that the majority of traffic is communication between Computer-Generated Force components (model to model). A much lower level of traffic is required to update the controller’s display (model to user). And an even lower level of traffic is generated from user input (user to model). A different allocation of functional components to physical infrastructure is at the heart of a cluster-based simulation. 2 Although hundreds to thousands of multicast groups could be used within a cluster, no WAN multicast groups are required all. See also, DDM within a cluster. 13 DRAFT Cluster-based simulation addresses a fundamental flaw in fully distributed simulation: that of co-locating the models with the distributed users. Co-locating the models in a resource-efficient cluster and linking users in via a remote controller mechanism results in lower WAN traffic with similar latency results to the end user as is possible with fully distributed simulation. In fully distributed simulation, each distributed user station consists of a display, an input device, a display controller, a set of local entities being modeled, and a view of the simulation’s shared state (the ‘ground truth’ data representing both local and remote entities). 
Predictive contracts are used to mask latency of and lower WAN transmissions required for shared state representing entities at remote sites. The display controller decides what the user will see, reads the appropriate information out of the simulation shared state, and transforms it into a visual display. In cluster-based simulation, a distributed user station consists of a display and an input device. Models and display controllers (remote controller agents) for all users are allocated to the central cluster. Shared state is maintained only in the cluster. The display controller still decides what the user will see, reading and transforming data out of the simulation’s shared state. Predictive contracts are now used to mask latency of and lower WAN transmissions required for display updates (communication between the display controller at the cluster and the display itself at the user station). The same functional components as in fully distributed simulation are used, but are simply mapped to different physical locations. By restricting model to model traffic within the cluster, only model to user and user to model traffic is carried by the WAN. The different levels of traffic between components is due to the behaviour and characteristics of the components. While numbers vary from system to system, CGF components in exercises such as STOW97 update their positions and behaviours on a very frequent basis (several times a second). Many of these updates require a network transmission. Screen updates occur on a much lower frequency: approximately once per second. Users primarily observe changes to the visual display, and occasionally enter a command (on the order of once per minute in peak loads). Latency is the final key factor in the use of remote controllers in cluster-based simulation. An equivalent level of latency is supportable via predictive contracts linking remote controllers to central models as is supportable via the dead reckoning algorithms used to link distributed models in STOW. Fidelity (as affected by inter-model latencies) is expected to be superior to STOW levels. This is due to models being co-located within a cluster. Cluster latencies – dependent on implementation – are generally under 100 microseconds, as compared to the 60,000 microsecond inter-model latencies supported by the STOW WAN. Latency between a model and its controller is increased as they are now separated by WAN distances; however much of this added latency has little impact on the actual observed latency. Updates from the model to the user are much less frequent, simply updating the display. The WAN jitter in these updates may be smoothed via predictive contracts. User to model updates are based on the user’s view of the model and are subject to human perception limitations. Using STOW97 latency numbers, the maximum timing loop for a remote controller is on the order of 120 milliseconds (model 14 DRAFT to screen update to user command). This falls well within the human perception range of approximately 200 milliseconds. In terms of latency, the advantage of cluster-based simulation is depicted through a comparison of Figures 3 and 4 below. Figure 3 depicts latency in a distributed simulation in which WAN latencies impact most components. This is contrasted in Figure 4 that depicts a clustered configuration. Here, WAN latencies impact only controller to controlled_entity links. It is noteworthy that no new components are added to the simulation system. 
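As a rough illustration of the predictive contract mechanism described above, the sketch below implements a dead-reckoning style contract for a position stream: the sending side transmits a new position and velocity over the WAN only when the receiver's extrapolation would drift past an error threshold, and the receiving side extrapolates between updates. The class name, the constant-velocity model and the threshold value are assumptions made for illustration, not the report's implementation.

```python
import math

# Minimal sketch of a dead-reckoning style predictive contract, assuming a
# constant-velocity model and a fixed error threshold (illustrative names only).
class PositionContract:
    def __init__(self, threshold_m=5.0):
        self.threshold_m = threshold_m
        self.sent_pos = None    # last position transmitted over the WAN
        self.sent_vel = None    # velocity transmitted with it
        self.sent_time = None

    def extrapolate(self, t):
        """Receiver side: predict the current position from the last update."""
        dt = t - self.sent_time
        return tuple(p + v * dt for p, v in zip(self.sent_pos, self.sent_vel))

    def maybe_send(self, t, true_pos, true_vel):
        """Sender side: transmit only when the prediction error exceeds the threshold."""
        if self.sent_pos is None:
            needs_update = True
        else:
            error = math.dist(self.extrapolate(t), true_pos)
            needs_update = error > self.threshold_m
        if needs_update:
            self.sent_pos, self.sent_vel, self.sent_time = tuple(true_pos), tuple(true_vel), t
        return needs_update
```

Because the display side keeps extrapolating between updates, WAN jitter is smoothed and the transmission rate falls to whatever the accuracy threshold allows, which is the bandwidth and latency-masking effect described above.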
Figure 3: Latency in Distributed Simulation
Figure 4: Latency in Distributed Simulation (Controller Agent)

In summary, the key results from the Distributed Access to Models via Remote Controllers effort are:

1. Net latency effects observable at the user level are similar between fully distributed and cluster-based simulation.
2. Inter-model latency is improved by the use of a cluster-based simulation.
3. Only traffic that is the least latency sensitive – i.e., traffic with human perception in the loop – is carried by the high-latency WAN.
4. WAN bandwidth requirements are much lower with cluster-based simulation.
5. Standard (point to point) links carry all WAN traffic, greatly simplifying cost and complexity of the WAN cloud required for cluster-based simulation.

APPLICATION

As the JSIMS remote controller stations, data requirements and accuracy/fidelity requirements become known, agents for each station should be constructed and predictive contracts tailored for that particular agent/client data path.

Topic 3 – Data Distribution Management Experimentation

GENERAL

In STOW97, IP multicast channels were used to segment the flows of data among hosts. It was determined that current IP multicast technology does not support enough multicast channels for straightforward, static segmentation schemes to be able to support large-scale exercises. Hence, dynamic, adaptive schemes must be developed to provide efficient use of the existing multicast channels. Furthermore, since channels will generally be multiplexed due to limited availability, algorithms must be developed that attempt to minimize the number of false hits. False hits are messages arriving at a host that contain data not requested by, or not needed by, the given host.

Data Distribution Management (DDM) considers methods and techniques for the efficient use of a LAN resource by utilizing single transmit, multiple recipient network technologies such as IP multicast and ATM point to multi-point. The DDM problem was broken down into two segments: addressing and routing. Addressing requires the system to determine what hosts, if any, require a given data packet. Routing requires the system to determine the lowest cost mechanism to get a given packet from its source to its destination(s).

Under this part of the ASTT Parallel Processing and Efficient Time Management Techniques research program, a number of experiments were performed to analyze algorithms that collect addressing information and produce efficient data routing schemes. We termed these experiments Global Addressing Knowledge (GAK) experiments. These consisted of running a given data set over the GAK algorithm under test to determine the efficiency of each algorithm for that data set. To allow comparisons of various infrastructure algorithms, data sets were fixed; i.e., a data set is a constant, named object that provides identical inputs to each experiment. Data sets consisted of simulation state changes (per host) and subscription / publication information (also referred to as interest data sets). Portions of both sub-problems were addressed. In particular, costs associated with finding out which hosts require a packet were investigated. It should be emphasized that only pure internal-DDM issues were considered for this part of the program.
Specifically, the GAK experiments did not address semantics of DDM, where many interesting open research problems remain. In terms of functional allocation within the HLA, these GAK algorithms would exist internal to the RTI, forming portions of an RTI's implementation of data distribution. These dynamic, adaptive GAK algorithms were evaluated strictly in the context of a low-latency clustered computing environment. Extrapolating these GAK experimental results to a high-latency WAN environment is not valid, as it invalidates design assumptions in the algorithms, i.e., low latency access to global, dynamic addressing information. Internal to the cluster, neither latency variances due to load nor other network artifacts were considered. Below we present a summary of the experiments. More detailed information can be found in the ASTT reports [Evans99, Performance of GAK Algorithms] and [Mellon99, Formalization of the Global Addressing Knowledge (GAK) and Literature Review].

Global Addressing Knowledge (GAK): Functional Description

Due to the nature of simulation shared state, a shared state update generated at one host is generally required at multiple destination hosts. Single transmit, multiple recipient network technologies such as IP multicast and ATM point to multi-point have been proposed as mitigating techniques for the large volume of network traffic and the CPU load per host of generating multiple copies of the same message. The cluster architecture as shown in Figure 2 above provides a decomposition of the data distribution management problem.

All GAK algorithms studied in the ASTT DDM effort were completely independent of the content of the data. That is, the research was application-independent, and was based on abstractions of data and resources known as tags and channels. Simulations produced and subscribed to data based on semantic tags. The modeling layer involved in an exercise must agree on the mechanism to associate semantic information with the tags. The tags, as far as the simulation infrastructure is concerned, are semantically neutral, and are treated simply as a set of buckets containing a particular volume of data. Thus tag semantics (e.g., sectors, entity types, ranges of data values, etc.) were not under analysis in ASTT.

It is the responsibility of the GAK component to map tags to communication channels (Figure 5). The goal of the tag to channel mapping is to prevent unwanted data from being received by a host, while simultaneously interconnecting all hosts according to their subscriptions within limited channel resources. A host that subscribes to a channel to receive tag X may receive other (unwanted) tags that are mapped to the same channel. This mapping is complicated by factors such as: a small number of channels compared to the number of tags typically used; the cost of computing the mapping; the dynamic nature of host subscription and publication data; and the latencies between hosts, which delay the communication of current subscription data and the dissemination of new channel maps. The tag-channel abstraction separates the addressing and routing problems nicely. Indeed, this abstraction bounds the research area of the ASTT Cluster Computing DDM effort.

Figure 5: GAK performs tag to channel mapping (tags 1 through n are assigned to channels 1 through M)

GAK Algorithms

GAK algorithms are roughly divided into two classes, fixed and feedback.
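Both classes operate on the same abstraction: a mapping of tags to channels, from which each host derives its channel subscriptions and against which false hits can be counted. The sketch below is a minimal, illustrative rendering of that abstraction; the function names and data structures are assumptions, not the experimental code.

```python
# Minimal, illustrative sketch of the tag/channel abstraction a GAK operates on.
def channel_subscriptions(tag_to_channel, wanted_tags):
    """A host must join every channel that carries at least one wanted tag."""
    return {tag_to_channel[t] for t in wanted_tags}

def count_false_hits(tag_to_channel, wanted_tags, traffic):
    """Count messages received on joined channels whose tags were not wanted.

    traffic maps tag -> number of messages sent with that tag.
    """
    joined = channel_subscriptions(tag_to_channel, wanted_tags)
    false_hits = 0
    for tag, count in traffic.items():
        if tag_to_channel[tag] in joined and tag not in wanted_tags:
            false_hits += count   # channel bundling delivers unwanted tags too
    return false_hits

# Example: three tags bundled onto two channels; this host wants only tag "A".
mapping = {"A": 1, "B": 1, "C": 2}
print(count_false_hits(mapping, {"A"}, {"A": 10, "B": 4, "C": 7}))  # 4 false hits (tag B)
```

Fixed and feedback GAKs differ only in how and when they recompute the tag_to_channel mapping, as described next.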
Fixed GAKs provide a mapping of tags to channels that can be pre-calculated and are based on data that exists before the simulation is executed. A number of mappings may be used during runtime by fixed GAKs to optimize channel loadings on a phase-by-phase basis. Feedback GAK algorithms track the changing status of host publication and subscription information over the course of the simulation’s execution and produce new tag to channel mappings aimed at reducing the current false hit count within the system. Other runtime data may also be used by a feedback GAK, including the current false hit count per host and per channel. Feedback GAKs require some form of fixed GAK mapping to begin from, then optimize based on current conditions. Fixed GAKs are expected to be extremely low cost to use, but will not make the best possible use of channel resources as their a priori mapping decisions can only be estimates. Also note that traffic data (or estimates of traffic data) may not be available a priori. This class of GAK algorithm examines the value of low GAK overhead against limited efficiency in channel utilization. Feedback GAKs are expected to incur runtime costs in tracking DDM information and distributing new maps, but be more efficient in channel utilization. The tradeoff between fixed and feedback GAKs is effectively captured by a high-level GAK MOE, which includes both GAK overhead and false hits. Feedback GAKs rely heavily on low latency access to DDM information from each host. This precludes their use (as designed) in a high latency (i.e. fully distributed) environment, although some limited use of feedback may be possible in non-realtime applications linked by a high latency WAN cloud. The impact of high latency on feedback GAKs was not part of this investigation. Static Mapping Algorithms Fixed GAK algorithms may operate either with only one phase, or with multiple phases and a new mapping per phase. The mappings are determined solely based analysis of data prior to simulation execution. Key input data for a fixed GAK are the number of tags, and traffic per tag. Traffic per tag may either be an estimate (ala STOW97), or measured from a previous or similar execution. Specific static mapping algorithms evaluated included 19 DRAFT Broadcast: This GAK uses one channel to which all hosts subscribe and publish. This should reflect the worst possible GAK algorithm, in that it will have the maximum number of false-hits with a resulting waste of bandwidth and receive computation. However, it will have very low GAK computation, and will have no dynamic channel subscription changes. This provides a lower bound on performance. Oracular: This GAK algorithm performs a very simple tag to channel mapping. Each tag receives its own channel. This violates the resource restrictions in the cluster, but provides an upper bound on performance. Round Robin: The Round Robin GAK places N/M tags in each of M channels, where N is the number of tags. No consideration is given to reducing false hits, or in any other way balancing the system resources. There is no overhead cost for this algorithm. Greedy: This GAK allocates the highest communication volume K-1 tags each to one channel and puts the remaining N-K+1 in the remaining channel (where there are K channels and N tags). This removes all false positives from the K-1 highest volume tags. Any other mapping would add one of these high volume tags to a lower volume tag. 
Any consumers that only wanted the lower volume tag would receive false hits.

Dynamic Mapping Algorithms

Dynamic GAK algorithms use feedback from current DDM data to adapt the mapping of tags to channels. Examples investigated included feedback procedures based on the false hit rate per channel, which adjust the mapping of tags in a channel when the false hit rate gets too high. Another approach was based on channel load, where the GAK uses traffic volume per channel as the feedback mechanism in re-allocating tags to channels. One consideration here was that false hit ratios would be too unstable to use as a balancing rule. Consequently, traffic levels per channel were leveled, allowing the network to filter as much data as possible. From level traffic, a low false hit count resulted. Specific dynamic algorithms implemented included:

Dumb Greedy GAK with Feedback: Updating the greedy mapping is trivial, meaning it can be done with very low latency on very current instrumentation. It is a matter of measuring the message rate per tag, then sorting and assigning the tags. We ended up calling this the "dumb greedy GAK". This GAK allocates each of the K-1 highest message volume tags to its own channel and puts the other N-K+1 tags in the remaining channel (where there are K channels and N tags).

Greedy Feedback GAK: This agent does not sort the tags by volume, but simply iterates over the list of tags, assigning them sequentially to channels, attempting to minimize the maximum volume across channels. This amounts to a sort of "greedy" bin packing algorithm.

Smart Greedy Feedback GAK: The smart greedy GAK works in a fashion similar to the greedy GAK, except that it sorts the tags first by message volume. This avoids the problem of zero-message tags being assigned to a single channel and creating contention as the tags become active.

Dynamic Producer Groups GAK: This GAK approaches the routing problem from a source-based perspective. It simply sorts hosts according to the number of messages produced, and then assigns them a subset of channel resources proportional to their share of the message production.

Dynamic Producer / Consumer Groups GAK: When re-mapping takes place, a matrix is created of producers and consumers, with entries in the matrix being a list of tags produced and consumed by the producer/consumer pair. Lists are sorted by length and assigned to channels in descending order.

Linear Programming (LP) GAK: The mapping problem can be posed as a linear system. Thus finding the optimal allocation is equivalent to finding the minimum of an objective function. The solution was implemented using standard Linear Programming techniques to decompose and optimize the resulting linear system.

Performance Metrics

GAK performance was measured using three key simulation infrastructure metrics:

False Hits: Since hosts should only receive messages of interest, this is the number of messages an individual host receives in which it has not expressed interest. Such false hits are the result of channel bundling, where multiple tags are assigned to the same channel. A 'good' GAK will bundle similar tags on the same channel to reduce the false hit count.

Overhead Traffic: Since system performance is limited by the amount of traffic added to the network burden, one must factor in the extra traffic a dynamic GAK adds to the network. All of the algorithms tested were instrumented to measure overhead traffic.
A 'good' GAK reduces the false hit count without adding too much network traffic.

Join/Leave Rates per Channel: This is the rate at which hosts join and leave channels as they subscribe to data traffic of interest.

GAK Natural Measure of Effectiveness (MOE)

The GAK infrastructure measure of effectiveness (MOE) applied for the experiments was the ratio of optimal wire accesses to actual wire accesses. This is a measure of GAK global (cluster) efficiency. To be more precise, the MOE used to compare GAK performance is the minimal consumer wire accesses divided by the actual wire accesses:

MOE = minimal consumer WA / actual WA

The measurement of minimal consumer wire accesses was made using the Round Robin GAK with the number of channels equal to the number of tags (giving the GAK essentially infinite resources and thus no chance of false hits). This minimal WA is precisely the number of messages that must be received by all hosts to get all the state data to which they have subscribed. Since the number of actual wire accesses equals the minimal consumer wire accesses plus the number of messages required for GAK agent overhead, another way to view the MOE is:

MOE = minimal consumer WA / (minimal consumer WA + GAK agent overhead)

Another aspect of the Natural GAK MOE is that it is fully measurable in experiments. Furthermore, it follows that 0 ≤ MOE ≤ 1, with MOE = 1 for the Oracular GAK.

Experimental Hypotheses

As indicated above, we took a hypothesis-based approach for evaluating the various techniques and methods. Two hypotheses were established that concerned comparisons of GAK performance as measured by the natural GAK MOE. We also studied two hypotheses concerning what we have called intrinsic scalability. We next summarize the hypotheses.

The first hypothesis, termed the Static MOE Hypothesis, is really a statement of the belief that without knowledge about simulation state, random assignment of tags to channels works better on average than any other scheme. Specifically:

Among the static GAKs, round robin performs the best.

A second fundamental hypothesis, termed the Feedback MOE Hypothesis, is really a restatement of the DDM clustering hypothesis:

Feedback (dynamic) GAKs outperform static GAKs.

The Feedback MOE Hypothesis examines the claim that agents with access to subscription/publication statistics can lower false hits without increasing overhead traffic to unacceptable levels. Note that since latency is not considered, it was not modeled in the experiments.

As indicated above, we also considered experiments concerning the issue of what we termed intrinsic scalability, which addresses the limit of how well any infrastructure could perform on a simulation's specific configuration. The DDM experimental scenarios were designed to test two hypotheses about the intrinsic scalability of two well-known DDM problems in distributed simulation. First, a Wide Area Sensor (WAS), such as a JSTARS or even a space-based sensor, presents a scalability problem that is fundamentally a bandwidth and host overload problem, not a DDM problem that can be solved by a particular choice of GAK algorithm. No matter what GAK mapping strategy is implemented, a WAS host will need to either subscribe to a large number of channels, or to channels with huge data rates, or both. We set out to test, via simulation, the WAS Invariance Hypothesis:
Experimental Hypotheses

As indicated above, we took a hypothesis-based approach to evaluating the various techniques and methods. Two hypotheses concerned comparisons of GAK performance as measured by the natural GAK MOE; two more concerned what we have called intrinsic scalability. We next summarize the hypotheses.

The first hypothesis, termed the Static MOE Hypothesis, is really a statement of the belief that without knowledge about simulation state, random assignment of tags to channels works better on average than any other scheme. Specifically:

Among the static GAKs, round robin performs the best.

A second fundamental hypothesis, termed the Feedback MOE Hypothesis, is really a restatement of the DDM clustering hypotheses:

Feedback (dynamic) GAKs outperform static GAKs.

The Feedback MOE Hypothesis examines the claim that agents with access to subscription/publication statistics can lower false hits without increasing overhead traffic to unacceptable levels. Note that latency is not a factor in these hypotheses and so was not modeled in the experiments.

As indicated above, we also considered experiments concerning what we termed intrinsic scalability, which addresses the limit of how well any infrastructure could perform on a simulation's specific configuration. The DDM experimental scenarios were designed to test two hypotheses about the intrinsic scalability of two well-known DDM problems in distributed simulation.

First, a Wide Area Sensor (WAS), such as a JSTARS or even a space-based sensor, presents a scalability problem that is fundamentally a bandwidth and host overload problem, not a DDM problem that can be solved by a particular choice of GAK algorithm. No matter what GAK mapping strategy is implemented, a WAS host will need to subscribe to a large number of channels, or to channels with huge data rates, or both. We set out to test, via simulation, the WAS Invariance Hypothesis:

The WAS problem is GAK-invariant.

For a more complete discussion of why we expected the performance of the various GAK algorithms, when properly normalized, to remain the same in WAS scenarios, see [Mellon99, DDM Experiment Plan].

Second, a fast-moving entity will force its simulation host to change channel subscriptions under normal federation assumptions (geospatial tag allocation, with the fast-mover overlaid on a background of slow-moving traffic of interest to the fast-mover's sensor models). However, this problem is actually equivalent to the WAS problem if the fast-mover's simulation host subscribes to channels that allow it to receive data about entities of interest within a window of future simulation time, calculated by extrapolating the fast-mover's velocity. Thus, when viewed from the proper consumer perspective, a fast-mover scenario should exhibit similar behavior on the GAK metrics, with the exception of the join/leave metric. We set out to test, via simulation, the Fast-Mover Invariance Hypothesis:

The fast-mover problem is GAK-invariant.
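To illustrate the consumer-side view just described, the following hypothetical sketch (our own names and grid parameters, assuming geospatial tags tied one-to-one to grid cells as in the experiments below) extrapolates a fast-mover's position over a short lookahead window to decide which cell tags to subscribe to:

    # Hypothetical sketch of the fast-mover subscription strategy: subscribe to the
    # grid-cell tags the sensor will cover over a window of future simulation time,
    # based on a straight-line extrapolation of the fast-mover's velocity.

    def cells_in_range(x, y, sensor_range, cell_w, cell_h, n_cols, n_rows):
        """Grid cells (col, row) whose centers fall within sensor range of (x, y)."""
        cells = set()
        for col in range(n_cols):
            for row in range(n_rows):
                cx, cy = (col + 0.5) * cell_w, (row + 0.5) * cell_h
                if (cx - x) ** 2 + (cy - y) ** 2 <= sensor_range ** 2:
                    cells.add((col, row))
        return cells

    def fast_mover_subscriptions(x, y, vx, vy, sensor_range,
                                 lookahead_ticks, cell_w, cell_h, n_cols, n_rows):
        """Union of cell tags covered now and at each extrapolated future position."""
        tags = set()
        for t in range(lookahead_ticks + 1):
            tags |= cells_in_range(x + vx * t, y + vy * t, sensor_range,
                                   cell_w, cell_h, n_cols, n_rows)
        return tags

    # Example using the 'airborne' parameters of the experiments (speed 500, range 500)
    # on a 60 x 50 grid of 16.67 x 20 unit cells, with a 2-tick lookahead.
    print(len(fast_mover_subscriptions(0, 500, 500, 0, 500, 2, 16.67, 20, 60, 50)))

Whatever the GAK mapping, these tags must be delivered to the fast-mover's host, which is why the problem reduces to the same bandwidth/host-overload issue as the WAS case.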
Generation of Data Sets

Initial DDM experiments were run using a high-level movement simulation that generated plausible movement patterns for entities. In addition to the simulation, a script tool was developed that makes the mapping of simulated entities to hosts explicitly programmable. This allowed scenarios to be developed that attempted to mimic different styles of exercise implementation and to more thoroughly stress the algorithms being tested. Figure 6 shows the baseline scenario that was used throughout the DDM experiments, with minor modifications as described below.

Figure 6: Baseline Scenario

This baseline scenario consisted of a regular grid of N rows by N columns, with a master/slave cluster of entities in each row. In the first round of experiments, the entity clusters started at the leftmost square of each row, moved at a uniform speed across the row to the rightmost square, and then returned to the leftmost square. This pattern was repeated until the simulation terminated. After initial experimental results were obtained, the clusters in each row were subdivided, with one sub-cluster starting at the leftmost square of each row and one sub-cluster starting at the rightmost square. This made the interaction patterns, and hence the resulting channel subscriptions, less uniform, since interactions decreased when the sub-clusters were out of each other's sensor range. These East-West clusters were meant to emulate ground force movement and interactions. Sensor ranges were adjusted so that they would primarily remain within a sector boundary.

Experimental Scenarios

The baseline scenario was run with three different entity/host mappings, dubbed the "Optimistic", "Realistic" and "Pessimistic" scenarios.

The optimistic scenario allocated geospatial clusters of entities uniformly across the set of simulation host resources. In other words, all the entities in the first cluster were assigned to one host, all the entities in the second cluster were assigned to a second host, and so on. When all the available hosts had been assigned a cluster, the assignment wrapped around to the starting point, with the next cluster assigned to the first host, and so forth. In a sense, the optimistic scenario assumes maximal scenario knowledge on the part of the exercise planners, with interacting entities all on the same host, within reasonable bounds. Recall that there was some interaction among clusters across gridlines.

The so-called realistic scenario was an attempt to emulate some of the major decisions made in exercise planning. Each cluster in a row was divided in half, with one half assigned to a host and the other half assigned to a second host. Assignment of the remaining clusters, each divided in half, proceeded in round-robin fashion, as in the optimistic scenario. Network bandwidth requirements were of course higher in the realistic scenario, as were subscriptions, since sub-clusters of entities interacted across hosts. Underlying the realistic scenario was an attempt to emulate, in a very coarse fashion, the way exercises are laid out, with regions assigned to sets of simulation hosts and blue and red units typically simulated on separate hosts within a regional set.

The third scenario was called the pessimistic scenario, since it assumed no scenario knowledge in implementing the simulation. In the pessimistic scenario, entities within every cluster were assigned uniformly across the set of simulation hosts. Accordingly, network interaction was maximized. While the pessimistic scenario stressed the GAK algorithms in some ways, the results are somewhat artificial, since no simulation is ever laid out this way. Furthermore, the random entity assignment means that every host subscribes to all channels once the number of clusters exceeds the number of hosts, ensuring virtually no false hits in a reasonably sized pessimistic scenario.
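As an illustration only (the functions and host-pairing choice below are ours, not the experiment scripting tool), the three entity-to-host mappings just described might be scripted roughly as follows:

    # Illustrative sketch of the optimistic / realistic / pessimistic mappings.
    # 'clusters' is a list of entity-ID lists, one per geospatial cluster;
    # each function returns {entity_id: host_index}.
    import random

    def optimistic(clusters, num_hosts):
        """Whole clusters assigned round-robin: interacting entities share a host."""
        return {e: i % num_hosts for i, cluster in enumerate(clusters) for e in cluster}

    def realistic(clusters, num_hosts):
        """Each cluster split in half across a pair of hosts, pairs chosen round-robin."""
        mapping = {}
        for i, cluster in enumerate(clusters):
            half = len(cluster) // 2
            h1, h2 = (2 * i) % num_hosts, (2 * i + 1) % num_hosts
            for e in cluster[:half]:
                mapping[e] = h1
            for e in cluster[half:]:
                mapping[e] = h2
        return mapping

    def pessimistic(clusters, num_hosts):
        """No scenario knowledge: entities scattered uniformly across all hosts."""
        return {e: random.randrange(num_hosts)
                for cluster in clusters for e in cluster}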
Each of the three baseline scenarios (optimistic, realistic, pessimistic) was further refined into a pair of scenarios, one with a pair of fast-moving entities carrying a Wide Area Sensor (WAS) and one without a WAS. The fast-moving WAS entities were intended to be airborne platforms, such as a UAV or a JSTARS. In sum, six different scenarios were run for each GAK algorithm: Optimistic/WAS, Optimistic/No WAS, Realistic/WAS, Realistic/No WAS, Pessimistic/WAS, Pessimistic/No WAS.

The experiments were run with entity counts starting at 100 and increasing to 3000. All of the algorithms were run with 3000 entities, with the exception of the LP GAK; unfortunately, severe memory management problems in the LP package chosen for the DDM experiments prevented us from running the LP GAK with more than 100 entities. The number of tags was equal to the number of entities, increasing to a maximum of 3000 as the final experiments were run. For each simulation run, the number of channels was varied from 1 up to the number of tags, producing a large number of data points. Considerable work was put into engineering these scale increases. Although the ultimate numbers of entities, tags and channels did not exceed 3000, a modest scale by distributed simulation standards, it is our firm belief that no qualitatively new algorithmic behavior would be uncovered by making the numbers higher. The only possible exception is the LP GAK, for which simulation scale was severely limited by memory leaks in the LP package. Exploring the ratio between tags and channels is of much greater interest, with particular focus on the case in which the number of tags is far greater than the number of channels. This is the case that forces multiple tags to be bundled on a single channel, and it is the heart of the false hit problem encountered by the large fully distributed simulations to date. An example of results is given in Figure 7 below.

For this case, experiments were run out to 3000 channels (matching the maximum number of tags used) in order to validate the GAK algorithms, which should produce near-perfect behavior once sufficient resources are available. This was in fact observed in the sample results. Here, grids of 60 by 50 cells were used, with each cell measuring 16.67 units by 20 units; the idea was to make the grid cells smaller, thereby increasing the number of tags. In all scenarios, the "ground" units moved at a constant speed of 50 units per simulation tick and had a sensor range of 50 units. The "airborne" units moved at 500 units per tick and had a sensor range of 500 units. In all scenarios, the tags were associated one-to-one with individual grid cells.

Some remarks: GAKs that assign tags to channels based on data type (or tag value) approach perfection at 3000 channels for the 3000-tag case. GAKs that use a smaller number of channels (Producer and ProducerConsumer) essentially top out far below that. This is due to their style of tag assignment, the matching of producers to consumers: once sufficient channels exist to link all producers and consumers, the algorithm reaches its theoretical ceiling.

Further examination of Figure 7 (the 5-tick chart) shows interesting results in the lower left section of the graph, where few channels exist relative to the many tags in use. This is typical of simulation systems to date, and, as expected, the dynamic GAKs outperform the static (round-robin) GAKs. Also as expected, comparing the three charts (remap every 1, 5 and 10 ticks) shows performance differences based on the remap frequency used within the GAK.

Under the first scenario charted, new maps are calculated and distributed every 'tick' of the time-stepped sample simulation. Performance is minimal, as the dynamic GAKs spend considerable cycles (and network accesses) to collect data, then produce and distribute a new mapping; further, the data sample set used is small. Under the 5-tick scenario, performance of the dynamic GAKs is superior. This is due to the less frequent communication and the larger sample set; while the sample set contains slightly old data, it is sufficient to produce a good mapping of tags to channels. Under the 10-tick scenario, we observe a dropoff in dynamic GAK performance. This is due to the increasing age of the sample-set data used in the new mapping and is an expected factor. The performance dropoff from the 5-tick case to the 10-tick case is not large, indicating some stability in the re-map frequency for a GAK.

Figure 7: Examples of GAK Experimental Results. Three charts for the Optimistic scenario (3000 entities, 3000 tags, 2 WAS) at remap frequencies of every tick, every 5 ticks and every 10 ticks; MOE is plotted against the number of channels for the RoundRobin, Greedy, Greedy (Row Major), DumbGreedy, SmartGreedy, Producer and ProducerConsumer GAKs.
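The remap-frequency trade-off discussed above can be read as a simple control loop. The sketch below is illustrative only (the agent, message-counting and distribution mechanisms are placeholders, not the experiment code): too small an interval yields fresh but sparse samples plus frequent overhead traffic, while too large an interval yields larger but staler samples.

    # Illustrative remap loop for a feedback GAK: accumulate per-tag statistics for
    # 'remap_every' ticks, then compute and distribute a new tag-to-channel map.
    from collections import Counter

    def run_feedback_gak(ticks, remap_every, tag_traffic_for_tick, compute_map, distribute):
        counts = Counter()
        for tick in range(1, ticks + 1):
            counts.update(tag_traffic_for_tick(tick))   # per-tag message counts this tick
            if tick % remap_every == 0:
                mapping = compute_map(counts)            # e.g. smart_greedy_map(counts, K)
                distribute(mapping)                      # adds GAK overhead traffic
                counts.clear()                           # start a fresh sample window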
Findings and Results

A number of important conclusions were obtained from the experimental runs. We summarize the key results below.

1. For systems with poor intrinsic scalability, a broadcast scheme may well be the best option, as dynamic algorithms are likely to thrash while looking for an optimal solution that does not exist.
2. The False Hits metric, while the most important of the DDM infrastructure metrics, cannot be considered in isolation. Indeed, for some test cases the overhead traffic generated by a GAK algorithm negated any reduction in the false hit count.
3. Round Robin outperforms the other static algorithms, since random allocation will often perform adequately if a large enough number of tags exists and the traffic per tag is also randomly distributed.
4. Dynamic algorithms outperform static algorithms in a low latency environment.
5. The Dumb Greedy GAK with feedback showed comparatively good performance on the join/leave metric, counter to our intuition.
6. Simple "heuristic" dynamic algorithms perform at least as well as more "functional" algorithms on false hits across a range of scenarios, and better on MOE (here "heuristic" means variations on greedy; "functional" means source-based schemes such as producer groups or producer/consumer groups).
7. The complex GAKs (LP and producer/consumer groups) performed as well as the simple heuristic (greedy) GAKs on the false hits metric.
8. The choice of GAK does not improve performance on scenarios with wide area sensors or with fast-moving entities.
9. The computational complexity of the LP GAK makes it impractical for current implementations.
10. Overall, a low level of effort in dynamic mapping of tags to channels produces performance superior to static allocation schemes. However, schemes that 'try too hard' tend to generate more overhead traffic than they save in false hits.
11. Static or minimal-effort dynamic GAKs perform adequately as the number of channels approaches the number of tags.
12. Dynamic adaptive GAKs are the best performers for resource-constrained systems, i.e., when the number of tags is much larger than the number of channels.
13. Dynamic GAKs also provide flexibility to the exercise designer: no knowledge of tag traffic rates or host data consumption patterns is required in advance of execution. Efficient free play is thus also supported.

Application

Based on the conclusions of these experiments, general-use cluster-based simulations will be best served by a relatively simple dynamic GAK, such as the Smart Greedy GAK. More complex dynamic schemes, such as the ProducerConsumer GAK, perform better when the tag-to-channel ratio is large (i.e., the channel resource is scarce), but tend to taper off at a lower level when larger numbers of channels are available. Care must also be taken within the simulation to ensure some level of intrinsic scalability: cases such as Wide Area Sensors were found to negatively affect performance for all GAKs, as the WAS case has poor intrinsic scalability. Extensions to a GAK algorithm via application-level knowledge (such as details about entity movement) were also found to increase performance. For some cases, better performance at the GAK level will result from exploiting such a priori knowledge.

Extensions to Future Work

As listed above, this research produced a number of very important results. In this section, we provide a set of recommendations for applying this effort and for useful extensions that build on it:
1. For the purpose of fielding a cluster-based simulation, the best current choice for performance and cost is Fast (switched) Ethernet, using the expanded hubs ('Edge Devices') developed under STOW97.
2. The use of optimized ATM as a LAN has significant advantages. Work should be continued to increase the stability and performance of this option for use in simulation.
3. As JSIMS remote controller stations and their data, accuracy and fidelity requirements become known, an agent should be constructed for each station and predictive contracts tailored to that particular agent/client data path.
4. Within a cluster, a dynamic DDM routing algorithm is recommended for the best overall performance and ease of use when fielding a cluster-based simulation. Additional research on hierarchical addressing schemes that take advantage of military organization is recommended as well.

REFERENCES

F. Adelstein and M. Singhal. Real-Time Causal Message Ordering in Multimedia Systems. In Proceedings of the 15th International Conference on Distributed Computing Systems. 1995. New York: IEEE.
S. Aggarwal and B. Kelly, Hierarchical Structuring for Distributed Interactive Simulation. Proceedings of the 13th DIS Workshop, IST-CR-95-02, 9/95, p. 125.
J. Allen, Maintaining Knowledge about Temporal Intervals. Communications of the ACM, 1983. 26(11): p. 832-843.
A. Evans and L. Mellon, Performance Survey of Global Addressing Knowledge (GAK) Algorithms. DARPA ASTT Program Deliverable (in progress).
S. Bachinsky, et al., RTI 2.0 Architecture. Proceedings of the 1998 Spring SIW Workshop.
D. F. Bacon and S. C. Goldstein, “Hardware-Assisted Replay of Multiprocessor Programs,” 1991.
R. L. Bagrodia, K. M. Chandy, and J. Misra, “A Message-Based Approach to Discrete-Event Simulation,” IEEE Transactions on Software Engineering, vol. SE-13, June 1987.
K. Birman, A. Schiper, and P. Stephenson, Lightweight Causal and Atomic Group Multicast. ACM Transactions on Computer Systems, 1991. 9(3): p. 272-314.
K. Birman and T. Joseph, Reliable Communication in the Presence of Failures. ACM Transactions on Computer Systems, 1987. 5(1): p. 47-76.
N. Boden, et al., Myrinet: A Gigabit Per Second Local Area Network. IEEE Micro, 1995. 15(1): p. 29-36.
E. A. Brewer and W. E. Weihl, “Developing Parallel Applications Using High-Performance Simulation.”
B. Bruegge, “A Portable Platform for Distributed Event Environments,” 1991.
J. Calvin and D. Van Hook, “AGENTS: An Architectural Construct to Support Distributed Simulation,” Proceedings of the 11th DIS Workshop, IST-CR-94-02, 9/94, p. 357.
J. Calvin et al., “Data Subscription,” Proceedings of the 13th DIS Workshop, IST-CR-95-02, 9/95, p. 807.
J. Calvin et al., “Data Subscription in Support of Multicast Group Allocation,” Proceedings of the 13th DIS Workshop, IST-CR-95-02, 9/95, p. 593.
J. Calvin et al., “STOW Real-Time Information Transfer and Networking System Architecture,” Proceedings of the 12th DIS Workshop, IST-CR-95-01.1, 3/95, p. 343.
K.M. Chandy and R. Sherman, The Conditional Event Approach to Distributed Simulation, in Proceedings of the SCS Multiconference on Distributed Simulation, B. Unger and R.M. Fujimoto, Editors. 1989, Society for Computer Simulation. p. 93-99.
K. M. Chandy and J. Misra, “A Nontrivial Example of Concurrent Processing: Distributed Simulation,” 1978.
B. Clay, “Multicast or port usage to provide hierarchical control,” Proceedings of the 8th DIS Workshop, IST-CR-93-10.1, 3/93, p. A-3.
P. Dickens, P. Heidelberger, and D. M. Nicol, “Parallel Direct Execution Simulation of Message-Passing Parallel Programs,” IEEE Transactions on Parallel and Distributed Systems, vol. 7, October 1996.
C. Diehl and C. Jard, Interval Approximations and Message Causality in Distributed Systems.
K. Doris, “Issues Related to Multicast Groups,” Proceedings of the 8th DIS Workshop, IST-CR-93-10.2, 3/93, p. 279.
D. Dubois and H. Prade, Processing Fuzzy Temporal Knowledge. IEEE Transactions on Systems, Man, and Cybernetics, 1989. 19(4): p. 729-744.
C. Fidge, Timestamps in Message-Passing Systems That Preserve the Partial Order. In The 11th Australian Computer Science Conference. 1988.
R.M. Fujimoto, Parallel and Distributed Simulation Systems. 1999: Wiley Interscience.
R.M. Fujimoto, Performance Measurements of Distributed Simulation Strategies. Transactions of the Society for Computer Simulation, 1989. 6(2): p. 89-132.
Specification, Version 1.3. 1998: Washington D.C.
R.M. Fujimoto and P. Hoare, HLA RTI Performance in High Speed LAN Environments, in Proceedings of the Fall Simulation Interoperability Workshop. 1998: Orlando, FL.
R. M. Fujimoto, “Parallel Discrete Event Simulation,” Communications of the ACM, vol. 33, October 1990.
R. M. Fujimoto, “Zero Lookahead and Repeatability in the High Level Architecture.”
R.M. Fujimoto, Performance of Time Warp Under Synthetic Workloads, in Proceedings of the SCS Multiconference on Distributed Simulation. 1990. p. 23-28.
B. Gajkowski, et al., STOW97 Distributed Exercise Manager: Lessons Learned. Proceedings of the 1998 SIW Workshop.
B. Groselj and C. Tropper, The Time of Next Event Algorithm, in Proceedings of the SCS Multiconference on Distributed Simulation. 1988, Society for Computer Simulation. p. 25-29.
M. Johnson and S. Myers, “Allocation of Multicast Message Addresses for Distributed Interactive Simulation,” Proceedings of the 6th DIS Workshop, IST-CR-92-2, 3/92, p. 109.
R. Kerr and C. Dobosz, “Reduction of PDU Filtering Time Via Multiple UDP Ports,” Proceedings of the 13th DIS Workshop, IST-CR-95-02, 9/95, p. 343.
L. Lamport, Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 1978. 21(7): p. 558-565.
F. W. Lanchester, Aircraft in Warfare, the Dawn of the Fourth Arm. Tiptree, Constable and Co. Ltd, 1916.
T. J. LeBlanc and J. M. Mellor-Crummey, “Debugging Parallel Programs with Instant Replay,” IEEE Transactions on Computers, vol. C-36, pp. 471-481, 1987.
M. Macedonia et al., “Exploiting Reality with Multicast Groups: A Network Architecture for Large Scale Virtual Environments,” Proceedings of the 11th DIS Workshop, IST-CR-94-02, 9/94, p. 503.
F. Mattern, Efficient Algorithms for Distributed Snapshots and Global Virtual Time Approximation. Journal of Parallel and Distributed Computing, 1993. 18(4): p. 423-434.
F. Mattern, Virtual Time and Global States of Distributed Systems, in The International Workshop on Parallel and Distributed Algorithms. 1989.
T. McLean, L. Mark, M. Loper, and D. Rosenbaum, “Relating the High Level Architecture to Temporal Database Concepts,” Proceedings of the 1998 Winter Simulation Conference, Washington DC, Dec 12, 1998.
S. Meldal, S. Sankar, and J. Vera, Exploiting Locality in Maintaining Potential Causality. ACM Symposium on Principles of Distributed Computing, 1991: p. 231-239.
L. Mellon, Cluster Computing in Large Scale Simulation. Proceedings of the 1998 Fall SIW Workshop.
L. Mellon, DDM Experimentation Plan, ASTT program deliverable, 1999.
L. Mellon, Formalization of the Global Addressing Knowledge (GAK) and Literature Review, ASTT program deliverable, 1999.
L. Mellon, Hierarchical Filtering in the STOW Distributed Simulation System. Proceedings of the 1996 DIS Workshop.
D. Milgram, “Strategies for Scaling DIS Exercises Using ATM Networks,” Proceedings of the 12th DIS Workshop, IST-CR-95-01.1, 3/95, p. 31.
D.C. Miller and J.A. Thorpe, SIMNET: The Advent of Simulator Networking. Proceedings of the IEEE, 1995.
D.M. Nicol and P. Heidelberger, Parallel Execution for Sequential Simulators. ACM Transactions on Modeling and Computer Simulation, 1996. 6(3): p. 210-242.
D. M. Nicol, Noncommittal Barrier Synchronization. Parallel Computing, vol. 21, 1995.
D. M. Nicol, “Performance Bounds on Parallel Self-Initiating Discrete-Event Simulations,” ACM Transactions on Modeling and Computer Simulation, vol. 1, 1991.
D. M. Nicol, “The cost of conservative synchronization in parallel discrete-event simulations,” Journal of the ACM, vol. 40.
R. H. B. Netzer and B. P. Miller, “Optimal tracing and replay for debugging message-passing parallel programs,” presented at Supercomputing ‘92, 1992.
S. Pakin, et al., Fast Message (FM) 2.0 Users Documentation. 1997, Department of Computer Science, University of Illinois: Urbana, IL.
J. Porras, J. Ikonen, and J. Harju, Applying a Modified Chandy-Misra Algorithm to the Distributed Simulation of a Cellular Network, in Proceedings of the 12th Workshop on Parallel and Distributed Simulation. 1998, IEEE Computer Society Press. p. 188-195.
E. Powell et al., “Joint Precision Strike Demonstration (JPSD) Simulation Architecture,” Proceedings of the 14th DIS Workshop, IST-CR-96-02, 3/96.
E. Powell, The Use of Multicast and Interest Management in DIS and HLA Applications. Proceedings of the 15th DIS Workshop.
J. Pullen and E. White, “Dual-Mode Multicast for DIS,” Proceedings of the 12th DIS Workshop, IST-CR-95-01.1, 3/95, p. 505.
J. Pullen and E. White, “Analysis of Dual-Mode Multicast for Large-Scale DIS Exercises,” Proceedings of the 13th DIS Workshop, IST-CR-95-02, 9/95.
J. Pullen and E. White, “Simulation of Dual-Mode Multicast Using Real-World Data,” Proceedings of the 14th DIS Workshop, IST-CR-96-02, 3/96.
S. Rak and D. Van Hook, “Evaluation of Grid-Based Relevance Filtering for Multicast Group Assignment,” Proceedings of the 14th DIS Workshop, IST-CR-96-02, 3/96.
M. Raynal and M. Singhal, Logical Time: Capturing Causality in Distributed Systems. IEEE Computer, 1996. 29(2): p. 49-56.
M. Raynal, A. Schiper, and S. Toueg, Causal Ordering Abstraction and a Simple Way to Implement It. Information Processing Letters, 1991.
S. K. Reinhardt, M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, and D. A. Wood, “The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers,” ACM Sigmetrics, 1993.
A. Schiper, J. Eggli, and A. Sandoz, A New Algorithm to Implement Causal Ordering, in WDAG: International Workshop on Distributed Algorithms. 1989: Springer-Verlag.
R. Schwarz and F. Mattern, Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail. Distributed Computing, 1994.
K. Shen and S. Gregory, “Instant Replay Debugging of Concurrent Logic Programs.”
R. Sherman and B. Butler, “Segmenting the Battlefield,” Proceedings of the 7th DIS Workshop, IST-CR-92-17.1, 9/92.
M. Singhal and A. Kshemkalyani, An Efficient Implementation of Vector Clocks. Information Processing Letters, 1992.
J. Smith et al., “Prototype Multicast IP Implementation in ModSAF,” Proceedings of the 12th DIS Workshop, IST-CR-95-01.1, 3/95.
L.M. Sokol and B.K. Stucky, MTW: Experimental Results for a Constrained Optimistic Scheduling Paradigm, in Proceedings of the SCS Multiconference on Distributed Simulation. 1990.
S. Swaine and M. Stapf, “Large DIS Exercises - 100 Entities Out Of 100,000,” Proceedings of the 16th I/ITSEC Conference, 11/94.
D. Van Hook et al., “Performance of STOW RITN Application Control Techniques,” Proceedings of the 14th DIS Workshop, IST-CR-96-02, 3/96.
D. Van Hook et al., “Scalability Tools, Techniques, and the DIS Architecture,” Proceedings of the 15th I/ITSEC Conference, 11/93.
D. Van Hook et al., “An Approach to DIS Scalability,” Proceedings of the 11th DIS Workshop, IST-CR-94-02, 9/94.
D. Van Hook et al., “Approaches to Relevance Filtering,” Proceedings of the 11th DIS Workshop, IST-CR-94-02, 9/94.
J. Calvin and R. Weatherly, An Introduction to the High Level Architecture Run Time Infrastructure, in The 14th Workshop on Standards for the Interoperability of Distributed Simulations. 1996. Orlando, Florida: UCF/Institute for Simulation and Training.
A.L. Wilson and R.M. Weatherly, The Aggregate Level Simulation Protocol: An Evolving System, in Proceedings of the 1994 Winter Simulation Conference. 1994.
R. Yavatkar, MCP: A Protocol for Coordination and Temporal Synchronization in Multimedia Collaborative Applications, in The 12th International Conference on Distributed Computing Systems. 1992: IEEE.
Minutes of the Communications Architecture and Security Subgroup, Proceedings of the 9th DIS Workshop, 9/93, pp. 298-300, 359-366.
IEEE Std 1278.1-1995, IEEE Standard for Distributed Interactive Simulation - Application Protocols. 1995, New York, NY: Institute of Electrical and Electronics Engineers, Inc.
GLOSSARY AND ACRONYMS

AFAP – As fast as possible
AICE – Agile Information Control Environment (DARPA program)
AOI – Area Of Interest (geographic interest management)
API – Application Programmer's Interface
ASTT – Advanced Simulation Technology Thrust
AT – Approximate Time
ATC – Approximate Time Causal
ATM – Asynchronous Transfer Mode
BADD – Battlefield Awareness and Data Dissemination (DARPA program)
CDI – Common Data Infrastructure (STOW software component)
CLTOut – Conditional lower bound on the L-time of future outgoing messages an LP may generate
C4I – Command, Control, Communications, Computers and Intelligence
COEA – Cost and Operational Effectiveness Analysis
CPU – Central Processing Unit (computer architecture)
DARPA – Defense Advanced Research Projects Agency
DDM – Data Distribution Management
DIS – Distributed Interactive Simulation
DMSO – Defense Modeling and Simulation Office
DTO – Data Transmission Optimization (DDM-specific)
DMA – Direct Memory Access
ELT – Earliest Long Time
ESPDU – Entity State Protocol Data Unit (DIS)
FDK – Federal Developers Kit
GAK – Global Addressing Knowledge
GVT – Global Virtual Time
HLA – High Level Architecture
LBLT – Lower Bound Long Time
IEEE – Institute of Electrical and Electronics Engineers
I/O – Input/Output
IP – Internetworking Protocol
JPSD – Joint Precision Strike Demonstration
JSIMS – Joint Simulation System
JSTARS – Joint Surveillance Target Attack Radar System
LAN – Local Area Network
LAPSE – Large Application Parallel Simulation Environment
LET – Latest Estimated Time
LTOut – Lower bound on the L-time of future outgoing messages an LP may generate
MC – Multicast
MCP – Multi-flow Conversation Protocol
MMF – Military Modeling Framework (JSIMS infrastructure software component)
MODSAF – Modular Semi-Automated Forces
MOE – Measure of Effectiveness
MPEG – Moving Picture Experts Group
MTSI – Min-Time Interval Size
NP – Nondeterministic Polynomial (measure of computational complexity)
OEM – Original Equipment Manufacturer
PDU – Protocol Data Unit (DIS)
QoS – Quality of Service
RAM – Random Access Memory (computer architecture)
RITN – Real-Time Information Transfer and Networking (DARPA program)
RRTI – Repeatable Run-Time Infrastructure
RTC – Run-Time Component (STOW interest-management gateways)
RTI – Run-Time Infrastructure
SI – Simulation Infrastructure (STOW software component)
SIMNET – Simulation Network
SMP – Symmetric Multi-Processor
TM – Time Management
TTL – Time To Live
SCI – Scalable Coherent Interface
STOW – Synthetic Theater of War
TCP – Transmission Control Protocol
UDP – User Datagram Protocol
WAN – Wide Area Network
WWT – Wisconsin Wind Tunnel