
INDIANA UNIVERSITY

ParalleX Execution Model

version 3.1

Thomas Sterling

12/17/2012

ParalleX is an execution model being developed to meet the objectives of the DOE Exascale Program. ParalleX is an evolving model that is intended to guide the co-design of the extreme-scale architecture, programming models, and operating and runtime system software. It is driven by DOE mission-critical problems. ParalleX is being employed by the XPRESS Project under the ASCR X-stack Program and by the ASCR Modeling Execution Models Project as well. This early technical report offers a discussion of the ParalleX model and its primary foundation concepts. It represents a work in progress and will be continuously expanded, refined, clarified, and otherwise improved to guide the evolution of the Exascale system architecture and software.

Table of Contents

1. Introduction
1.1 ParalleX Execution Model
1.2 Execution Model Concept and Tool
1.3 Codesign through ParalleX
1.4 Report Organization
2. Overview
2.1 Hardware Operation Properties
2.2 Active Global Address Space
2.3 Split-phase transactions
2.4 Message-driven computation
2.5 Multi-threaded operation
2.6 Local (lightweight) control objects
3. Fundamental Elements
3.1 Synchronous Domains
3.2 Active Global Address Space
4. Parallel Processes
4.1 Process Parallelism
4.2 First Class Object
4.3 Comprising Elements
4.4 Accessing Process Data
4.5 Instantiation
5. Computation Abstract Complexes
5.1 Complexes are Advanced Threads
5.2 Local Objects
5.3 First Class Objects
5.4 Variable Granularity
5.5 Internal Structure
5.6 Complex Lifecycle
6. Local Control Objects
6.1 Goals of the LCO Construct
6.2 LCO Concept
6.3 Generic LCO
7. Parcels
7.1 Introduction
7.2 Destinations
7.3 Actions
7.3.1 Atomic memory operation
7.4 Data
7.5 Continuation
7.6 Application Programming Interface
7.8 Binary parcel format
Parcel structure
8. Percolation
9. Fault Management
9.1 Basic Principles of Resiliency
9.2 Micro-checkpointing
9.3 Layered Fault Detection
10. Protection
11. Self-Aware Execution
11.1 What does It Mean to be “self-aware”
11.2 Dynamic Adaptive
11.3 Introspection and Goal-Oriented
11.4 Low-Level Autonomic
11.5 Medium-Level Global Balance
11.6 High-Level Intelligent Perspective
12. Energy Management

1. Introduction

An innovative execution model is required to address the challenges impeding progress to effective Exascale computing before the end of this decade and the accelerating problems demanding strong scaling even today. The strategic challenges include time and energy efficiency, scalability, programmability, and reliability. Designs for Exascale systems that are based on evolutionary extensions of current conventional methods are likely to fail in meeting these challenges, at least for some crucial application problem domains. An execution model provides a holistic definition of a class of systems and provides a conceptual framework for the codesign, operation, and interoperability of the many hardware and software layers which in combination make up a complete parallel computing system. ParalleX is an innovative experimental execution model under development to address these challenges and enable extreme scale computation. ParalleX serves as the guiding execution model for the DOE ASCR X-stack Program XPRESS Project and the DOE ASCR Modeling Execution Models Project.

This technical report, “ParalleX Execution Model, version 3.1”, provides a description of the ParalleX model in its current instantiation. It presents the overall strategy and describes the primary conceptual components of ParalleX. Its content is offered to foster collaboration on the development of a new execution model, to provide guidance in the development of the stack of new system software compliant with the needs of ParalleX, and to support the codesign of the hardware and software system layers including core and system hardware architecture, system software of operating systems and runtime systems, and programming models, languages, and tools. It also informs the development of new parallel applications and algorithms to take advantage of these systems. Unlike previous execution models, ParalleX facilitates not only basic application performance through scalability, but other critical factors of practical operation including reliability, protection, and power consumption, while exploiting new self-aware control methods.

1.1 ParalleX Execution Model

1.1.1 Goals and Objectives

The goal of ParalleX is to replace the Communicating Sequential Processes parallel execution model 1) to enable future practical systems capable of performance levels beyond an Exaflops, 2) to accelerate scaling-constrained applications through improved methods of strong scaling, and 3) to facilitate efficient processing of dynamic directed graph-based applications that exhibit memory-intensive computation and heavy system-wide communication and control. The intent is to dramatically improve scalability, temporal and energy efficiency, reliability, and programmability.


The objectives of ParalleX include exposing and exploiting a high degree of currently untapped parallelism, hiding latency, significantly decreasing overhead, and diminishing the impact of logical and physical resource access contention, in order to provide significant improvements in the efficient use of both physical resources and energy. An additional objective is to exploit runtime methods to contribute to the efficiency challenge above and to significantly improve programmability by reducing programmer responsibility for explicit resource management and task scheduling. Another objective is to enable new solutions to the problems of fault tolerance and protection. A final objective is to devise a framework for applications emphasizing meta-data processing, such as dynamic directed graphs.

1.1.2 Strategy

The strategy of ParalleX is based on the idea that HPC phase changes are both reflected in and facilitated by a corresponding change in execution model, which provides a conceptual framework or paradigm that permits codesign of the system layers and application workloads within the scope of enabling technology properties. ParalleX was first and foremost intended to shift system operation from static resource management and task scheduling of coarse grain parallelism to dynamic allocation of tasks to resources at the level of medium grain parallelism through lightweight user multithreading. This strategy was combined with message-driven computation to achieve split-phase transaction execution and intrinsic system-wide latency hiding. The effect sought is to eliminate resource blocking even when task elements are logically blocked.

To expose more parallelism, ParalleX eliminates global barriers and instead employs a rich set of local synchronization constructs for dynamic adaptive control in the presence of widely varying task granularity. To maintain control locality, continuations are defined to migrate throughout the system, as opposed to being tied to fixed program counters, stack pointers, and the occasional global barrier. ParalleX is defined to support a global address space with a hierarchical logical name space. ParalleX also needs to provide means for fault tolerance, protection, and runtime energy use management.

A philosophy of simplicity of design of highly repeated components is assumed, achieving complexity of operation as an emergent property of dynamic behavior at low development cost. Policies are left undefined for flexibility of implementation and portability across different systems, systems of the same type but varied scale, and systems of different generations. Such diversity of policy is supported through a set of specified invariants of operation that guarantee compatibility of ParalleX-compliant systems.

1.1.3 Major Components

ParalleX comprises a set of constructs, each of which may be realized in many ways, from pure software to pure hardware as well as a mix of the two. To introduce the ParalleX model, the small set of primary components required to achieve the goals and objectives through the strategy described is presented here.


1.1.3.1 Compute Complexes

The actual work is performed by small first-class (named) executing objects that combine code with private state, or ParalleX Variables. Compute Complexes have limited lifetimes, and each exists only within a local physical resource called a “Synchronous Domain”. Flow control within a complex is a combination of static dataflow or functional execution, using single-assignment semantics with the ParalleX Variables, and global mutable data including synchronization objects (discussed below). Complexes may be preempted, suspended, and resumed for dynamic resource usage.

1.1.3.2 ParalleX Processes

ParalleX Processes are contexts that contain one or more executing complexes, their code objects, ParalleX Variables, and relatively near global state. They maintain a mapping of their state across one or more synchronous domains, as well as referenceable operating-quality state and control actors. They may also include child processes.

1.1.3.3 Local Control Objects

Advanced synchronization objects are used to manage asynchrony of operation, expose parallelism, protect global mutable variables, control eager versus lazy evaluation, support migration of continuations, handle anonymous producer-consumer computation, and much more. An LCO generally incorporates a semantic structure that manages incident events, updates its control state, compares that state against a predicate, and under appropriate conditions triggers new actions such as a compute complex. While LCOs can serve as primitive conventional semaphore and mutex synchronization constructs, they can be much more powerful, such as dataflow templates and futures. They are local in that any one such object must exist in only one synchronous domain at a time (although they may migrate between them). This differs from distributed control operations, which may be spread across multiple synchronous domains and are formed through a collection of local control objects.

1.1.3.4 Parcels

A parcel is a class of active messages provided to enable message-driven computation and the moving of work to the data, when appropriate, rather than always moving the data to the work. A parcel can cause an action to occur remotely (in a different synchronous domain), applied to a named logical or physical entity such as a program variable. Parcels specify the destination object to which they are to be applied, the action they are to perform, additional operand values required to complete the action, and the continuation defining what is to happen upon completion of the specified action. Among such actions, block data moves can be performed by parcels.
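To make the four parcel fields concrete, the following is a minimal C++ sketch of a parcel as a plain data structure, with a toy local delivery routine standing in for actual routing. All names here (Parcel, GlobalAddress, deliver) are illustrative assumptions, not a normative ParalleX or XPI definition.

```cpp
// Minimal sketch of a parcel as a plain data structure (illustrative only).
// A parcel names a destination object, an action to apply to it, operand
// data, and a continuation that says what happens with the result.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

using GlobalAddress = std::uint64_t;                // AGAS name of the target object

struct Parcel {
    GlobalAddress destination;                      // object the action is applied to
    std::string   action;                           // name of the method/operation
    std::vector<std::int64_t> operands;             // values needed to complete the action
    std::function<void(std::int64_t)> continuation; // what happens upon completion
};

// A toy "send" that applies the action locally; a real system would route the
// parcel to the synchronous domain owning `destination`.
void deliver(const Parcel& p) {
    if (p.action == "add") {
        std::int64_t sum = 0;
        for (auto v : p.operands) sum += v;
        p.continuation(sum);                        // migrate control to the continuation
    }
}

int main() {
    Parcel p{/*destination=*/42, "add", {1, 2, 3},
             [](std::int64_t r) { std::cout << "result: " << r << "\n"; }};
    deliver(p);
}
```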

1.1.3.5 Percolation

A variation of parcels, percolation offers the means to move work to a physical resource such as a GPU, hiding latency and offloading overhead for costly execution units. Percolation moves all information needed to perform a complex, possibly highly parallel, task to a separate execution component, and returns its results to main memory.


1.1.3.6 Micro-checkpointing

The model incorporates a means of locally managing copies of designated variable values until the follow-on results for which they are precedent have been validated. This provides a framework from which fault tolerance policies can be developed and included in a given application execution. These methods scale and can employ multiple methods of fault detection, isolation, correction, and recovery.

1.2 Execution Model Concept and Tool

An execution model (or model of computation) is a paradigm that provides the governing principles to guide the co-design and cooperative operation of the vertical layers comprising a parallel computing system. Execution models are devised to respond to changes in enabling technologies and the needs of new classes of applications, to exploit the opportunities afforded by such new technologies while addressing the challenges they impose on effective exploitation.

Throughout the history of high performance computing (HPC), there have been at least five mainstream scalable execution models, such as the vector model, with communicating sequential processes (CSP) dominating the last two decades of MPPs and commodity clusters. In the inchoate Petaflops era powered by multicore and GPU technologies, a new model of computation is required to permit the development and application of future Exaflops parallel systems.

An execution model incorporates key attributes required to organize the system elements and guide their cooperation in fulfilling the application tasks. These are:

Holistic system representation – provides a framework to describe and consider an entire system rather than the set of separate system layers,

Strategies – the approach and fundamental ideas that constitute the paradigm embodied by the execution model,

Semantics – the set of referenced objects, the action classes performed on the objects, the means of representing parallelism and supporting synchronization, and name space management,

Policies – a set of invariants that have to be satisfied but for which the implementation is unspecified, providing flexibility,

Resource Control – the practical management of system resources for fault tolerance, security protection, and power management,

Abstract to Physical – the methods and means for resource management, data distribution, and task scheduling,

Commonality – support for portability across similar systems of different scale, across different contemporary systems, and across systems of successive generations.


1.3 Codesign through ParalleX

A goal of ParalleX is to enable codesign of system layers and of system design with application design. ParalleX as an execution model allows the use of the conceptual approach of the “decision chain”, which addresses the question: how does each layer of a system contribute to the determination of when and where an operation is to be performed? ParalleX may, at least ideally, simplify the problem of codesign from O(n^2) to O(n) by allowing each layer to be designed with respect to the execution model and its neighboring layers rather than with every other layer in the system simultaneously; with n layers, pairwise codesign implies on the order of n(n-1)/2 design interactions, whereas designing each layer against a common execution model implies only n. The following discussion suggests a possible breakdown of the relationship of every layer in the system design in supporting the functionality and capability for each of the key ParalleX components.

1.3.1 Complexes Codesign

1.3.1.1 Programming model

Modularity, encapsulation, and composability of executable work

1.3.1.2 Compilation

Complex modules for encapsulation of local work. Aggregation of very lightweight complexes to balance granularity with overhead.

1.3.1.3 Runtime System

Complex scheduling manager with preemption. Depleted threads treated as local control objects.

1.3.1.4 Operating System

Provides basic thread execution resources to the stream of runtime threads.

1.3.1.5 System Architecture

First class object support. Relies on complexes for intra-domain latency hiding.

1.3.1.6 Core Architecture

ISA extensions. Rapid creation and context switching. Priority scheduling. Execution with ILP. Preemption of user complexes.

1.3.2 ParalleX Processes Codesign

1.3.2.1 Programming model

Primary modularity, encapsulation, and composability of program organization. Coarse grain parallelism. Strict and eager evaluation.

1.3.2.2 Compilation

Inheritance for processes as objects. Stores all components within processes. Hierarchical name space.


1.3.2.3 Runtime System

Create and terminate processes

1.3.2.4 Operating System

Serves as a protection barrier via capabilities-based addressing. Allocates physical resources (synchronous domains) to processes. I/O from “main”.

1.3.2.5 System Architecture

NIC support for communication within and between processes. Mapping support.

1.3.2.6 Core Architecture

ISA extensions

1.3.3 Local Control Objects Codesign

1.3.3.1 Programming model

Rich semantics of fine grain synchronization (e.g., dataflow, futures); Hidden at high level; Exposed at low level (e.g., XPI).

1.3.3.2 Compilation

Distributed continuations; Eager versus lazy evaluation; Out of order asynchronous control.

1.3.3.3 Runtime System

Management of depleted threads. Carries out basic primitive DMA operations on LCOs. Updates Local Control Objects, including depleted threads.

1.3.3.4 Operating System

Locate.

1.3.3.5 System Architecture

NICs directly process updates from parcels; Atomic updates.

1.3.3.6 Core Architecture

ISA extensions; Rapid update and consequent action instantiation

1.3.4 Parcels Codesign

1.3.4.1 Programming model

Medium of global access, control, and action at a distance; Hidden at high level; Exposed at low level (e.g. XPI).

1.3.4.2 Compilation

Formulates work requirements in a specified data structure for conversion to parcels.


1.3.4.3 Runtime System

Conversion of the data structure to a parcel; Conversion of a parcel to a Computational Abstract Complex; Carries out basic primitive DMA operations; Updates Local Control Objects, including depleted threads.

1.3.4.4 Operating System

Manages all system-wide parcel transport, delivery, and buffering.

1.3.4.5 System Architecture

Transport layer; physical buffers; lots of smarts in the NICs.

1.3.4.6 Core Architecture

ISA extensions; Rapid creation

1.4 Report Organization

The “Report on the ParalleX Execution Model” is presented to describe and explain the ParalleX model of computation. It is intended to inform those who wish to acquire a professional understanding of the model concepts, computational scientists who will apply ParalleX-related system software and programming environments and tools, and computer scientists who will implement ParalleX system hardware or software. This report is organized simply, to provide easy access to and understanding of the concepts and elements of the ParalleX execution model.

The next section, Section 2, provides a brief overview of the model to give a complete high-level picture to which the more detailed focused discussions can be related. With this understanding, Section 3 establishes the fundamental elements at the lowest level of abstraction of any ParalleX-based system. These currently include the synchronous domains upon which a compute complex is performed and the active global address space (AGAS) upon which the hierarchical programmer name space is mapped and that maps to the system physical address space. Section 4 begins the description of the principal elements and the element principles. The first, given in this section, is the ParalleX Process, which provides the hierarchical name space and contexts for data, computation, and mapping as well as protection barriers. Section 5 describes Compute Complexes, which serve the role of the threads of physical systems in performing the actual actions of computation. Section 6 describes the semantics and abstract mechanisms for lightweight but sophisticated synchronization to manage asynchrony, support distributed continuations, permit anonymous producer-consumer computation, and provide means of controlling the rate of work generation (lazy versus eager evaluation). These are the class of Local Control Objects that include but are not limited to dataflow representations of precedence constraints and futures constructs. Parcels are described in Section 7 to explain the form, function, and very low-level API of message-driven computation. This is followed by a special case of parcels, Percolation, in Section 8, for heterogeneous computing and optimized use of precious resources. At Exascale, new methods to achieve reliability may be required, and Section 9 describes the principles of Fault Management through the concept of micro-checkpointing. The final three sections deal with the advanced UHPC requirements of Protection, Self-Aware Execution, and Energy Management, respectively.


2. Overview

The ParalleX model of computation is very different from the conventional Communicating Sequential Processes model that dominates scalable MPPs and commodity clusters today at the beginning of the Petaflops era. ParalleX incorporates a global shared name space rather than the CSP distributed or fragmented memory. ParalleX program-wide parallelism is at three levels: coarse grain parallel processes that may each span multiple, even shared, nodes; medium grain computational abstract complexes (or just “complexes”), which are a kind of thread; and fine grain (operation-level) dataflow parallelism. All are ephemeral, rather than the static MPI processes. Communication is message-driven rather than CSP message-passing. Synchronization is lightweight with sophisticated semantics, rather than heavyweight global barriers or ghost cell copying, which has the same effect. ParalleX also deals with key issues that are not addressed by the CSP model (although there is research in some of these areas) such as reliability, security, energy efficiency, and self-aware adaptivity.

2.1 Hardware Operation Properties

ParalleX incorporates a model of the underlying hardware. The system hardware is an ensemble of interconnected “synchronous domains”. Each synchronous domain 1) executes complexes, 2) ephemerally stores data of distinct forms, 3) performs operations in bounded time, 4) accesses hardware addresses for thread control, scratch-pad fast memory, and low-level counter and actuator bits/flags, 5) performs global virtual to physical address translation, 6) sends and receives parcels, 7) incorporates hardware fault detection, 8) provides reconfiguration flags, 9) provides energy-sensitive control, 10) connects to external I/O for mass storage, and 11) guarantees compound atomic sequences of operations on local data.

2.2 Active Global Address Space

The ParalleX model incorporates a global address space. Any executing complex (such as a thread) can access any object within the domain of the parallel application with the caveat that it must have appropriate access privileges. The model does not assume that global addresses are cache coherent; all loads and stores will deal directly with the site of the target object. All global addresses within a Synchronous Domain are assumed to be cache coherent for those processor cores that incorporate transparent caches.

The Active Global Address Space used by ParalleX differs from research PGAS models. Partitioned Global Address Space models are passive in their means of address translation. Copy semantics, distributed compound operations, and affinity relationships are some of the global functionality supported by the AGAS capability.


2.3 Split-phase transactions

ParalleX is devised to enable a new modality of execution and relationship between the logical and physical system. Split-phase transactions are tasks that are performed in a sequence of segments, each of which is performed on the computing subsystem closest to or containing the memory state upon which its actions are to be performed. This, in turn, makes possible a key form of operation: the work queue model. This method does work only on local state, and through the availability of a stream of such work packets achieves no blocking or idle time resulting from remote access. Thus the effect of latency hiding is achieved, and energy, as well as time, is reduced for the computation.
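The following toy C++ sketch illustrates the split-phase, work-queue idea under simplifying assumptions (two "domains" simulated in one process; the Domain type and queue discipline are invented for illustration): a remote read is decomposed into a segment that runs where the data lives and a continuation that runs back at the requester, so no worker ever blocks on remote latency.

```cpp
// Toy illustration of split-phase transactions over per-domain work queues.
// Each "domain" owns a queue of tasks that touch only that domain's state.
// A remote read is split into two segments: one enqueued at the domain that
// owns the data, and a continuation enqueued back at the requester.
#include <deque>
#include <functional>
#include <iostream>

struct Domain {
    std::deque<std::function<void()>> queue;    // local work queue
    int local_value = 0;                        // state owned by this domain
};

int main() {
    Domain a, b;
    b.local_value = 7;

    // Phase 1 runs at b, where the data lives: read b's state purely locally,
    // then hand the value back to a as a new task (the continuation).
    a.queue.push_back([&] {
        b.queue.push_back([&] {
            int v = b.local_value;              // purely local access at b
            a.queue.push_back([&, v] {          // phase 2 runs back at a
                std::cout << "a received " << v << "\n";
            });
        });
    });

    // Drain both queues round-robin; workers never idle on remote latency
    // because they always pick up the next locally runnable packet.
    while (!a.queue.empty() || !b.queue.empty()) {
        if (!a.queue.empty()) { auto t = std::move(a.queue.front()); a.queue.pop_front(); t(); }
        if (!b.queue.empty()) { auto t = std::move(b.queue.front()); b.queue.pop_front(); t(); }
    }
}
```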

2.4 Message-driven computation

Parcels are a form of active message that permits the instigation of remote work and thus enables message-driven computation. Parcels are a form of compound message that carry data, target address, action, and continuation information. Parcels permit the migration of continuations to create a new dynamic global control space that can adapt at runtime for locality management through self-aware low-level mechanisms and high-level situational awareness. In support of heterogeneous computing and optimal use of precious resources through latency hiding and overhead offloading, parcels support a medium- to coarse-grained “percolation” methodology (see Section 8).

2.5 Multi-threaded operation

ParalleX is a multithreaded model supporting three levels of parallelism, with complexes (threads) serving as the intermediate level. Complexes are first class objects and can be both referenced and manipulated by user applications. Complexes are ephemeral in that they exhibit finite lifetimes. While logically they can migrate between physical domains, this is not preferred, as doing so could incur egregious overhead. However, self-consistency of the model demands this capability. Complexes incorporate a static-dataflow fine grain parallel control model to provide a simple means of representing operation-level parallelism. This is a more abstract representation that permits easy and diverse transformation to employ different classes of processor cores. The dataflow is on transient values, while global mutable side effects still require synchronization or control protection.

2.6 Local (lightweight) control objects

A key element of the ParalleX model is its dependence on a set of sophisticated synchronization constructs referred to as “local control objects” or LCOs. These are lightweight objects, all of the state of which exists in a single contiguous logical block and physical memory bank so that it may be acquired, modified, and restored atomically at low temporal and energy cost. While LCOs can perform the traditional synchronization primitives such as semaphores and mutexes, their innovation is in bringing such powerful constructs as dataflow and futures for the first time to general usage. For example, an entire static dataflow program could be represented entirely with a series of LCOs. In certain cases, this might even be a useful way to organize a computation (though it is not recommended for general use).

The diagram below provides a graphic of the ParalleX execution model. The shaded boxes represent synchronous domains, while the larger bounded boxes represent logical parallel processes, which are capable of employing more than one system node at a time and may even share one or more nodes between processes. The complexes are represented by the wavy black lines, while the directed green arcs imply message-driven computation using parcels. Other features of ParalleX are also displayed in this picture.


3. Fundamental Elements

Independent of the implementation details of a system supporting the ParalleX execution model, every system must incorporate a set of specific fundamental elements. Here they are partitioned into two general classes: synchronous domains and the active global address space.

3.1 Synchronous Domains

A “synchronous domain” is a contiguous physical domain that guarantees compound atomic operations on local state. It manages intra-locality latencies and exposes diverse temporal locality attributes. The set of synchronous domains comprises the functional capability of the entire system and divides the world into the synchronous and the asynchronous. Thus a system comprises a set of mutually exclusive, collectively exhaustive domains. Synchronous domains, although hardware, are themselves first class objects with the same attributes as other objects, modulated by the type “synchronous-domain”. Such domains need not be identical and, as long as specific inter-domain protocols and invariants are satisfied, can support heterogeneous system architectures. However, all synchronous domains must exhibit specific inalienable properties.

3.2 Active Global Address Space

The active global address space (AGAS) of a system provides a unified reference space on a distributed physical system. AGAS assumes no coherence between localities. It enables the ability to move virtual named elements in physical space without the need to change those names, as would be necessary in PGAS systems. Examples include: user variables, synchronization variables and objects, parcel sets (but not parcels!), threads as first-class objects, and processes, which are also first class objects specifying a broad task. AGAS defines a distributed environment spanning multiple localities and need not be contiguous.

Active Global Address Space maps globally unique addresses to local virtual addresses. Global addresses are immutable. They allow objects to be moved to other localities and are a precondition for efficient dynamic flow control and load balancing.
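A minimal sketch of this address-translation idea follows, assuming a centralized table for simplicity (a real AGAS is itself distributed); the names Gid, Mapping, and Agas are illustrative only. The key point is that moving an object updates the mapping, not the global id, so references held anywhere in the system stay valid.

```cpp
// Sketch of AGAS-style translation: an immutable, globally unique id maps to
// a (locality, local virtual address) pair. Relocation rebinds the mapping
// while the id -- and every reference to it -- stays unchanged.
#include <cstdint>
#include <iostream>
#include <unordered_map>

using Gid = std::uint64_t;                      // immutable global id

struct Mapping {
    int            locality;                    // synchronous domain owning the object
    std::uintptr_t local_address;               // virtual address within that locality
};

class Agas {
    std::unordered_map<Gid, Mapping> table_;
public:
    void bind(Gid g, Mapping m) { table_[g] = m; }
    void move(Gid g, int new_loc, std::uintptr_t new_addr) {
        table_[g] = {new_loc, new_addr};        // id unchanged; only placement changes
    }
    Mapping resolve(Gid g) const { return table_.at(g); }
};

int main() {
    Agas agas;
    agas.bind(/*g=*/1001, {/*locality=*/0, 0x1000});
    agas.move(1001, /*new_loc=*/3, 0x2000);     // e.g., load balancing or fault recovery
    std::cout << "gid 1001 now at locality " << agas.resolve(1001).locality << "\n";
}
```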


4. Parallel Processes

The concept of the “process” in ParalleX is extended beyond that of either sequential execution or communicating sequential processes. While the notion of process suggests action (as do “function” or “subroutine”), it has the further responsibility of context; that is, it is the logical container of program state. It is in this aspect of operation that the process is employed in ParalleX. Furthermore, referring to “parallel processes” in ParalleX designates the presence of parallelism within the context of a given process, as well as the coarse grained parallelism achieved through concurrency of multiple processes of an executing user job. ParalleX processes provide a hierarchical name space within the framework of the active global address space and support multiple means of internal state access from external sources. A process also incorporates capabilities-based access rights for protection and security.

4.1 Process Parallelism

The ParalleX process is the main abstraction encapsulating parallel execution. It does this at three levels of physical parallel resources. It can distribute its state and multiple actions across multiple synchronous domains (nodes). Within any one synchronous domain, a parallel process may employ multiple physical threads or cores to support multiple concurrent ParalleX user threads within the process. And within a physical thread, a user thread of a ParalleX process may employ its instruction level parallelism (ILP) or dataflow parallelism.

More than one ParalleX process can share a synchronous domain. The relationship between the abstract process parallelism and the hardware physical parallelism may change over time, either because the object of parallelism is ephemeral (a finite lifetime) or because an existing executable object (e.g., process or complex) shifts the physical resources allocated to it for load balancing or fault tolerance.

4.2 First Class Object

A ParalleX process is a first class object. It is identified by a name in the same name space as application variables (e.g., x or y, or i or j) and can be treated by a set of associated operations in the same instruction stream as conventional operation primitives. A process may be manipulated by other processes: created, terminated, suspended, moved, queried with regard to status or properties, etc. It may be applied to an argument object. A process is defined by its corresponding procedure. A procedure is a representation of the state to be created and manipulated and the actions to be applied to that state, as well as a definition of other attributes. A procedure is also a first class object that may itself be manipulated and mutated.

4.3 Comprising Elements

The procedure defines the active processes that are its instances. It contains a number of elements of different type classes. These elements include action entities, including child processes, computation abstract complexes (e.g., threads), and primitive operations. Other elements are data, either individual scalar variables or structures of such variables such as sets, vectors, arrays, and graphs. A process includes a map of its elements to its physical resources. As changes of resource allocation or element assignment occur, the content of the resource map is changed to reflect this. As discussed below, a process includes its access list for capabilities-based addressing for protection. The process also contains observable parameters including measures of power, reliable operation, utilization, and other factors.

4.4 Accessing Process Data

The threads of a process can access any data within their resident process. Depending on the relative position of the resident process and a second target process in the process hierarchy, access to data contained by the target process is possible, but the means will vary.

4.4.1 Tree Namespace Hierarchy of Processes

A process may create child processes by instantiating procedures. These child processes may in turn create their own child processes. The result is that a user ParalleX computation is organized as a tree of processes with the root process, “main”, the ancestor of all other descendent processes. This tree of processes defines the hierarchical name space within the global address space. An object in an application is determined by its ordering within its resident process and the sequence of process descent from “main” to the resident process. This provides a unique and dynamic naming of all program objects.
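The following illustrative C++ sketch renders this hierarchical naming as a path computed by walking the chain of descent from "main"; the path syntax and the Process type are a hypothetical rendering, not a ParalleX specification.

```cpp
// Illustrative sketch of the hierarchical process name space: an object is
// named by the chain of process descent from "main" plus its ordering within
// its resident process, rendered here as a path string.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Process {
    std::string name;
    Process*    parent = nullptr;
    std::vector<std::unique_ptr<Process>> children;

    Process* spawn(const std::string& child_name) {
        children.push_back(std::make_unique<Process>());
        children.back()->name   = child_name;
        children.back()->parent = this;
        return children.back().get();
    }
    // Unique, dynamic name: sequence of descent from "main" to this process.
    std::string path() const {
        return parent ? parent->path() + "/" + name : name;
    }
};

int main() {
    Process main_proc{"main"};
    Process* solver = main_proc.spawn("solver");
    Process* mesh   = solver->spawn("mesh");
    std::cout << mesh->path() << "/cell[42]\n";   // main/solver/mesh/cell[42]
}
```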

4.4.2 Direct Access

A thread can directly access any data contained by any child process of its resident process or that child's descendant processes. Direct access refers to the ability to read or write the value content of a memory location without intervening software such as a method. Except for the hierarchical naming, it is identical to accessing data in the original resident process of the thread.

4.4.3 Indirect Access

A thread seeking access to the state of a process that is a sibling or cousin of its resident process must do so through the use of intermediary methods. Direct access is not possible, and the abstractions of the access methods guarantee isolation of underlying structures designed separately while providing correct semantic operation. This also permits a transparent layer of protection through access rights provided by capability-based addressing.

4.4.4 An open issue:

One relationship has not been resolved: that of access by a descendent process thread to ancestor process data. One approach is to give such threads direct access to the data directly above them. Another is to limit such access to indirect means. This question is currently being explored.


4.5 Instantiation

A process is ephemeral, except for the process “main”, which exists throughout the lifetime of the application program execution. Therefore, a process is created as an instance of a called procedure, potentially at any time during the application execution, with a set of operand values. ParalleX employs higher-order procedures, permitting procedures to be passed as operands of calls to other procedures. ParalleX also permits the use of both the conventional strict calling sequence and an eager mechanism. There is also a lazy evaluation mode. The strict calling sequence will not instantiate a parallel procedure until all of the procedure arguments are available; that is, they have been determined and are local to the site of the calling thread. Alternatively, an eager mode of procedure instantiation will create a new process when any of the argument values are available and will initiate those parts of the process computation requiring only the available operands while deferring other internal computation until their required operands also become available. It is also valid to pass a future variable as an operand to a procedure. A future represents the equivalent of an IOU for a value yet to be derived. It permits a form of calculation that is both eager and lazy. How? The called procedure may proceed without the actual operand value, holding instead the future for eager evaluation. But the future permits its own calculation to begin only when the using process is instantiated, thus permitting a form of lazy evaluation.
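The three calling modes can be illustrated with standard C++ futures standing in for ParalleX futures; this is an imperfect analogy, since ParalleX futures are LCOs in a global address space, but it captures the strict/eager/lazy distinction.

```cpp
// Strict vs. eager vs. lazy operand evaluation, sketched with std::future
// (illustrative stand-in; the function names are invented for the example).
#include <future>
#include <iostream>

int expensive_operand() { return 21; }
int consumer(int x)     { return 2 * x; }

int main() {
    // Strict: operand fully evaluated and local before the "procedure" runs.
    std::cout << consumer(expensive_operand()) << "\n";

    // Eager: the operand's computation starts concurrently; the consumer
    // proceeds up to the point where the value is needed, then synchronizes.
    std::future<int> eager = std::async(std::launch::async, expensive_operand);
    std::cout << consumer(eager.get()) << "\n";

    // Lazy: the future is an IOU; the operand's computation begins only when
    // the consuming code actually demands the value.
    std::future<int> lazy = std::async(std::launch::deferred, expensive_operand);
    std::cout << consumer(lazy.get()) << "\n";   // evaluation triggered here
}
```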

[Figure: a ParalleX process and its comprising elements, showing operands (strict and eager), data, methods, codes, child processes, computation abstract complexes (CACs), resource configuration, and I/O, rooted at “main”.]

5. Computation Abstract Complexes

5.1 Complexes are Advanced Threads

Almost all computation is performed through the mechanism of the Computation Abstract Complex (also referred to as a “CAC” or simply a “complex”). For all intents and purposes, the complex can be thought of as a thread in the conventional lexicon. The reason this is not exactly correct is that a complex permits, but does not demand, certain forms of internal control flow that are not ordinarily employed by threads. However, a complex in its abstraction can be targeted to a physical thread of conventional processors. The reason that complexes do not constitute all the computation is that for some very lightweight functions, such as remote basic compound atomic operations, the overhead of thread instantiation, even with hardware support, would significantly exceed that of the useful work itself, thus requiring alternative mechanisms for achieving the same result.

5.2 Local Objects

Complexes provide the functionality of their resident parallel process. An important distinction between complexes and processes is that complexes are local while processes are potentially distributed. The control state of a complex is entirely incorporated within its resident physical synchronous domain, as is its register state and that of its stack frame if used. Therefore, intra-complex operation can anticipate that the majority of its operands will be locally available and will therefore exhibit bounded latency.

5.3 First Class Objects

Complexes are first class objects. They are individually identified by an address within the global address space and are referenced uniquely in the hierarchical name space of the application. They not only perform actions on data but can allow actions to be performed on them as a whole. This is critical for the development of runtime systems, schedulers, prioritizers, and dynamic load balancing.

State within a complex that would ordinarily be hidden or private is also accessible externally. The abstraction of registers within a complex is an example. Under certain conditions, registers may be accessed from outside their complex. This involves not only the internal data value state of a register but also the control state as well. The abstraction of a complex assumes single-assignment semantics for registers. This implies a write-once, read-many-times side-effect model. It is motivated by the advantage of working without anti-dependencies, so that both compilation and hardware runtime control have easy flexibility in both extracting fine grain parallelism and allocating physical register resources. Single-assignment methods permit dataflow semantics for exposing the potential parallelism, even if only on a sequential-issue processor core, for determining issue order, pipeline latency mitigation, and instruction level parallelism.

5.4 Variable Granularity

ParalleX complexes permit encapsulation of local parallelism in a form of variable granularity that can either improve available concurrency through lighter weight complexes or mitigate software overheads through heavier weight complexes. Aggregation at the level of compilation can be used to construct coarser grain complexes from the specification of fine grain complexes, transparent to the user or DSL (Domain Specific Language) layer, adapting granularity to system operational properties.

The opportunity to exploit the near-fine grain parallelism possible with ParalleX complexes will ultimately demand low overhead mechanisms of the system, including both hardware and software. For example, reasonable scaling requires that the useful work of a complex exceed in critical time length the overhead time costs needed to manage the complex. Recent experiments with the experimental HPX runtime system show a full lifetime overhead of a complex on the order of 1 microsecond, with additional overheads incurred for the likely multiple context switches. Experiments by other user-multithreaded software runtime systems suggest that fine tuning and optimization may be able to reduce this by a factor of four or more. Further improvements will demand specific hardware functionality to reduce the overheads further.

5.5 Internal Structure

A complex is a partially ordered set of actions to be performed on first class objects, from simple scalar variables, through structures of data, to other complexes and processes. The ordering is determined by the establishment of precedence constraints. That is: which operations must be completed, or variable values produced and provided (locally), prior to and sufficient for a described action to be conducted. This is equivalent to the dataflow graph of the computation.

Every complex has a set of locally named data elements that, in conformance with conventional practice, are referred to here as “registers” but which are used in a way different from conventional processor cores. ParalleX registers are single-assignment objects that may be written once and read many times. This is equivalent to a functional computing model. As an abstraction, it implies that the number of registers of a ParalleX complex is variable. The names themselves are computable, i.e., they may be indexed to create higher order register structures.

The motivation for this, as previously mentioned, is to convey the fine grain parallelism of the computation represented by a given complex through a single simple but powerful model (dataflow). These registers are used for all local and private intermediate calculation results.

One form of a complex can use only this technique, thus satisfying the requirements to operate purely functionally, i.e., value oriented. Like a process, a complex upon instantiation can be applied to a set of operand values. If the complex requires no additional input data, such as loads from global mutable data, then it can operate functionally. The results of a complex may be conveyed in four ways: 1) the result value may be stored in a register of the parent thread, 2) the result may be conveyed, either directly or indirectly with a parcel, to the site of instantiation of a new thread, 3) the result may be deposited with a local control object, or 4) the result may be stored at the site of a global mutable object. The first two methods extend the functional execution style. The third can also be used in a functional style when the LCO is of the dataflow type.
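A minimal sketch of such a single-assignment register as a write-once cell follows; the class name and error handling are illustrative assumptions. Rejecting a second write is exactly what removes anti-dependencies, so reads can be reordered freely against other reads.

```cpp
// Write-once "register" sketch: a single-assignment cell that rejects a
// second write and permits unlimited reads (illustrative only).
#include <iostream>
#include <optional>
#include <stdexcept>

template <typename T>
class SingleAssign {
    std::optional<T> value_;
public:
    void write(T v) {
        if (value_) throw std::logic_error("register already assigned");
        value_ = std::move(v);
    }
    const T& read() const {
        if (!value_) throw std::logic_error("register not yet assigned");
        return *value_;                         // read-many is always safe
    }
};

int main() {
    SingleAssign<int> r;
    r.write(5);
    std::cout << r.read() + r.read() << "\n";   // reads commute; no hazards
}
```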

Complexes engage several classes of operations of which more than one operation may be included. These major classes are briefly defined in the following subsections.

5.5.1 Primitive scalar

The largest set of operations are the familiar scalar operations performed on the contents of complex registers, resulting in the production of one or more result values loaded into the complex registers.

5.5.2 Compound atomic specified operations

Short sets of primitive operations are performed on tightly bound data likely to be stored in the same memory bank. This class of operations may provide very efficient compound operations, potentially not requiring software locks if proper hardware support is incorporated within the system.
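As an illustration of a lock-free compound operation of this general kind, the following sketch uses a hardware-supported compare-and-swap retry loop to perform a saturating add on a single word. It is a shared-memory analogy, not a ParalleX-defined primitive.

```cpp
// Compound atomic update without a software lock: a saturating add built
// from a compare-and-swap retry loop (illustrative analogy).
#include <atomic>
#include <iostream>

int saturating_add(std::atomic<int>& cell, int delta, int cap) {
    int old = cell.load();
    int desired;
    do {
        desired = old + delta > cap ? cap : old + delta;
    } while (!cell.compare_exchange_weak(old, desired));  // retry on contention
    return desired;
}

int main() {
    std::atomic<int> counter{95};
    std::cout << saturating_add(counter, 10, 100) << "\n";  // prints 100
}
```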

5.5.3 Complex and process calls

Both computational abstract complexes and processes are created through calls from complexes themselves. While there are additional means by which these important activities can be instantiated, calls by complexes are by far the most common.

5.5.4 Parcel management

A small set of parcel create and manage semantic constructs is included in the model to elicit work at a distance or to move blocks of data. Parcels require a destination, an action, possibly some operand values, and an indication of what may happen afterwards. This information is provided by a data structure that is applied to the parcel-set object. There, transparent to the user program, it is transformed into a parcel and encapsulated in a data transport packet. Other useful instructions solicit such data structures from the parcel set that have arrived within the local subset of the parcel set but have not been delivered. Such parcels can be acquired by requesting properties such as a parcel from a particular source, a particular kind of parcel, or a parcel with a specific priority.

5.6 Complex Lifecycle

A complex is ephemeral, being created at some point in the lifetime of its resident process, performing its assigned function, and terminating at or before the conclusion of its host process. The lifecycle represents the sequence of operational states through which the complex transitions. It is characterized by a state diagram such as the one below. Upon instantiation of a complex, its beginning state is initialized, and when this is completed it is taken over by the scheduling methods of the system to be issued for processing. If resources are immediately available, the complex issues its operations for execution. If there are no resources available, the complex, which is logically capable of continuing, is transitioned to the pending state, joining other complexes waiting for access to physical resources. When the complex runs out of operations ready to be performed but expects to be able to proceed in a modest time span, it transitions to the waiting state. Upon completion of a precedent operation, the complex may be returned to the pending or even the issued state. Other states provide different modes of suspension, such as blocked or depleted, permitting different system responses. Upon completion of its allocated functional responsibilities, the complex is retired, first cleaning up its footprint in memory and then being eliminated from the system.
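The lifecycle states named above can be encoded schematically as follows; the enumeration and the sample path through it are illustrative, since the model does not prescribe a concrete representation.

```cpp
// Schematic encoding of the complex lifecycle states described in the text
// (illustrative; not a normative ParalleX definition).
#include <iostream>

enum class ComplexState {
    Initialized,   // instantiation complete, handed to the scheduler
    Issued,        // operations executing on available resources
    Pending,       // logically runnable, waiting for physical resources
    Waiting,       // out of ready operations, expecting progress soon
    Blocked,       // suspended on a longer-term dependency
    Depleted,      // suspended; manageable as a local control object
    Retired        // footprint cleaned up, removed from the system
};

const char* name(ComplexState s) {
    switch (s) {
        case ComplexState::Initialized: return "initialized";
        case ComplexState::Issued:      return "issued";
        case ComplexState::Pending:     return "pending";
        case ComplexState::Waiting:     return "waiting";
        case ComplexState::Blocked:     return "blocked";
        case ComplexState::Depleted:    return "depleted";
        case ComplexState::Retired:     return "retired";
    }
    return "?";
}

int main() {
    // One plausible path through the lifecycle described in the text.
    for (auto s : {ComplexState::Initialized, ComplexState::Pending,
                   ComplexState::Issued, ComplexState::Waiting,
                   ComplexState::Issued, ComplexState::Retired})
        std::cout << name(s) << "\n";
}
```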


6. Local Control Objects

Local Control Objects (LCOs) provide semantic constructs for globally distributed synchronization. They complement the internal synchronization state of the active threads, together comprising the parallel control state of the executing program. An LCO is a finite state machine (FSM) whose local control and data state evolve in response to the incidence of program events. The LCO establishes a predicate that defines the criteria, in terms of the control state, under which it will invoke an action that may affect any part of the program. An LCO is local in the sense that its entire being exists within a single locality.

6.1 Goals of the LCO Construct

The Local Control Object is a seminal construct of the ParalleX model of computation, taking many forms but in essentially all cases serving the need for dynamic global synchronization and the migration of continuations. The principal goal of the LCO is to enable the exploitation of parallelism in a diversity of forms and granularities for extreme scalability, such as exploiting the implicit parallelism of meta-data in dynamic directed graphs. A number of important attributes of the ParalleX model include the following:

Exposure of Parallelism – greatly expand the kinds of parallelism that can be represented for scalability

Eliminate global barriers – provide (near) fine grain synchronization of small parts of flow control to mitigate effects of variable thread lengths within fork-join boundaries

Migration of continuations – enable transfer of flow control across the distributed computation and system to put it near the data on which it operates

Facilitate Global Control Operations – serve to implement distributed global atomic operations

Eager-lazy evaluation – facilitate scheduling policies that balance eager and lazy methods for time-domain load balancing

Dynamic Adaptive Resource Management – enable global resource management at runtime for both space and time domain load balancing and critical path execution optimization including mitigating effects of contention (e.g., shared resource access conflicts)

6.2 LCO Concept

A Local Control Object is a small object-oriented structure and associated set of constructs that implement a finite state machine (FSM) for flexible synchronization and control. As an object, an LCO comprises data and control state as well as a set of related methods that operate on both. An LCO is a first class object, which means it exists within the user variable name space and may be addressed and accessed by any thread within its context hierarchy. An LCO's protected state is accessed by calls to its designated methods, which manage all incident event accesses, assimilate input data as required, modify control state, determine satisfiability of a predicate constraint, and, if it is satisfied, initiate the purpose action of the LCO.

The LCO is a family of synchronization functions potentially representing many classes of synchronization constructs, each with many possible variations and multiple instances. The LCO is sufficiently general that it can subsume the functionality of conventional synchronization primitives such as spinlocks, mutexes, semaphores, and global barriers. However, due to the LCO's rich structural substrate, it can represent powerful synchronization and control functionality not widely employed, such as dataflow and futures, among others, which open up enormous opportunities for a rich diversity of distributed control and operation.

The LCO supports lightweight synchronization: the synchronizing of a few, rather than essentially all, activated threads of action. It provides for management of global control flow in the presence of asynchrony: the non-deterministic partial ordering of events due to variable latencies, contention, and scheduling policies of a global system beyond the control of user code. Lightweight synchronization, within the practical limits of overhead costs, may remove the over-constrained scheduling imposed by conventional techniques such as global barriers and critical sections.

Here is a list of some of the possible synchronization classes , many well known, all of which may be implemented within the LCO framework:

Mutexes

Semaphores

Events

Full-empty bits

Dataflow

Futures

Depleted threads (suspended)

But LCOs as a family may be customized to very specific usages unique to a given application code, while still maintaining the properties of all LCOs. Such customization may be achieved through refinement of the generic LCO form with intrinsically inherited form and functions (methods).

The concept of the event is a general way of capturing the asynchronous incidence of a communication with the LCO, representing the completion of some other computation or a requirement for organized action. The organization usually implies some coordination with one or more other action stages achieved, each an event in its own right. From the perspective of ParalleX, an event is manifest as the incidence (arrival at the LCO) of a parcel from anywhere in the system or an access request by a thread in the same locality. It may also be a system state from the runtime or operating system (OS) or even the hardware architecture (usually the OS serves as an intermediary between the user code and the hardware architecture). All LCOs react to incident events by modifying their control state and possibly their data state.

The basic operation of an LCO is to perform a consequent action upon the satisfaction of a predicate criterion specified in terms of the LCO control state. Upon each incidence of an event, the control state, once updated by the internal FSM, is checked against the LCO predicate.

Only when the predicate is true (the criterion is satisfied) is the LCO consequent action performed. Upon completion of its action, an LCO may be terminated (ephemeral) or be reset to continue operation towards the next in a sequence of synchronization functions (persistent).
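
The per-event behavior just described might be rendered as the following sketch, reusing the hypothetical lco_t from the sketch in Section 6.2; lco_reset and lco_terminate are likewise placeholders, not part of any specified interface.

/* Hypothetical handling of one incident event: assimilate it, test
   the predicate, and fire the consequent action only when satisfied. */
void lco_reset(lco_t *lco);       /* placeholder: prepare next cycle */
void lco_terminate(lco_t *lco);   /* placeholder: reclaim the object */

void lco_incident(lco_t *lco, const lco_event_t *ev)
{
    lco->control_state = lco->assimilate(lco, ev); /* FSM update     */
    if (lco->predicate(lco)) {                     /* criterion met? */
        lco->action(lco);                          /* consequent act */
        if (lco->persistent)
            lco_reset(lco);      /* persistent: next synchronization */
        else
            lco_terminate(lco);  /* ephemeral: done                  */
    }
}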

6.3 Generic LCO

The generic LCO is a set of general structural and operational properties shared by all LCO objects, from which specialized predefined LCO classes or user-defined custom LCOs are derived. Stated differently, such classes inherit a common set of attributes from the generic LCO.

The generic LCO is an embodiment of the concepts described previously. In summary: an LCO is a persistent or ephemeral first class object that instantiates new threads or performs other actions in response to a set of directed incident events according to pre-established criteria codified as the LCO predicate in terms of the LCO control state.

The family of ParalleX LCOs is partitioned into a set of classes, each of which may include a set of types. The Generic LCO represents the entire family of ParalleX LCOs. LCO classes include dataflow, futures, depleted threads, producer-consumer, and others. A type for any one of these classes is a special case with specific parameters, predicate, control finite state machine, etc.

The Generic LCO is reactive. It is quiescent except when it is the destination object of a requesting agent such as a thread, a parcel, or some special mechanism. Such a request is referred to as an incident event. When quiescent, the Generic LCO consumes no execution resources and is embodied as a data structure in memory identified by a global address.

The Generic LCO is described in terms of its basic elements and the basic operations performed on and by them. The following subsections describe the Generic LCO in each of these terms.

6.3.1 Basic Elements

The Generic LCO consists of data and control state buffers, control finite state machine, predicate definition, and multiple methods defining input processing and output actions. These basic elements are inherited by all LCO types and can take diverse forms. Here each basic element is introduced.

6.3.2 Input Event Queue

An LCO includes a means of temporarily holding an incident event until the destination LCO is available to assimilate that event. This ensures atomic processing of any preceding events for deterministic operation. The event queue will hold a number of such events.

6.3.3 Data State

The data state of an LCO represents the aggregation of input events sufficient to provide the necessary consequence actions upon satisfaction of the predicate criterion. The data buffer is updated upon the incidence of an event. The contents of the buffer may be comprehensive copies of the content of the incoming accesses or as limited as a count of such events. Other filters applied to the input events may assert mappings that provide and store some intermediate data representation. In a small subset of LCO types, the input data may be the null set, with all event sequence effects limited to the LCO control state.

Event data state may be either fixed in size, as specified by the LCO type, or of variable length determined at runtime. An example of a fixed-size data state buffer is the Dataflow LCO, with one entry for each of a set number of operands. An example of a variable-length data state buffer is that of a Future LCO, which stores an indeterminate number of access requests.
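
As a concrete illustration of the fixed-size case, a three-operand Dataflow LCO might keep one slot per operand with an arrival count as its control state; the names below are hypothetical.

/* Hypothetical data state of a 3-operand Dataflow LCO: a fixed
   buffer with one entry per expected operand, plus an arrival count. */
#define DF_ARITY 3

typedef struct dataflow_state {
    double operands[DF_ARITY]; /* fixed-size data state buffer     */
    int    arrived;            /* control state: operands received */
} dataflow_state_t;

/* Predicate: satisfied only when every operand slot has been filled. */
int dataflow_ready(const dataflow_state_t *s)
{
    return s->arrived == DF_ARITY;
}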

6.3.4 Control State

The control state of the LCO reflects the history of incident events and preceding actions of the LCO. It is the state of the finite state machine and is modified upon incidence of input events or initiation or completion of consequence actions. The control state is employed by the predicate criterion to determine when the LCO can perform its specified consequent action.

6.3.5 Predicate Criterion

The predicate criterion determines when the LCO can perform its specified consequent action. It is an encoding of a logical relation based on the control state of the LCO. When the predicate is evaluated to true, the LCO will initiate its specified consequent action.

6.3.6 Methods

Every LCO has associated with it a set of methods, some of which are shared among all LCOs and others which are adjusted or entirely custom to the specific functional requirements of the particular LCO.

The inherited generic methods are the low level functionality required of all LCOs. These include the basic create and terminate, event queue, action control, parcel constructors, garbage collection, and heap management among others.

The event assimilation methods manage the specific input processes of acquired incident events, including the initial input data processing, state buffer management, and resulting control state update. The control method manages the internal finite state machine, the predicate criterion and test with respect to the control state, and the initiation of the consequent action. The thread method is optional and is used when the consequent action involves the creation of a new thread, either directly or through the mediation of a parcel for remote thread instantiation.

[Figure: structure of the Generic LCO. Incident events enter the event buffer; the event assimilation method (one of the inherited generic methods) updates the control state; the control method evaluates the predicate and, when satisfied, may create a new complex (thread).]

7. Parcels

7.1 Introduction

7.1.1 Objectives

The Parcel is a component of the ParalleX execution model that communicates data, asserts action at a distance, and distributes flow control through the migration of continuations. Parcels bridge the gap of asynchrony between synchronous domains while maintaining symmetry of semantics between local and global objects. Parcels enable message-driven computation and may be classed as a form of "active messages". Other important forms of message-driven computation predating active messages include dataflow tokens, the J-machine support for remote method instantiation, and, at the coarse-grained end, variations of Unix remote procedure calls, among others. This enables work to be moved to the data as well as performing the more common get action of bringing data to the work. A parcel can cause actions to occur remotely and asynchronously, among which are the instantiations of computational complexes (a generalized form of threads) at different system nodes or synchronous domains. Optimizing the protocol of Parcels is important for efficient scalable parallel computation by minimizing communication overhead and latency effects, avoiding communication contention, and exploiting increased parallelism.

This section describes Parcels and provides their Specification as part of the broader ParalleX Specification. It is a working document and will experience frequent and substantial changes, refinements, and clarifications to reflect knowledge gained through research.

7.1.2 Overview

A Parcel is a semantic construct intended to serve as an abstraction of an asynchronous message between two distant but integrated resources, both logical and physical, in a parallel computer system. It is distinguished from conventional send/receive data transfers in that it can target as a destination any first class object, and can specify an action to be performed and, upon completion, the data and control side-effects to assert. The Parcel has four major components or fields that provide a flexible operator for asynchronous execution control. These are: destination field, action field, data payload field, and continuation field. Parcels are expected to be manifest in a few different sizes, with short parcels optimized for minimal latency and long data movement parcels optimized for data throughput.

7.1.3 Organization of this Section

This subsection briefly introduces the semantic construct, the Parcel, as a critical component of the ParalleX execution model, and provides an overview of its function. Section 7.2 describes the Destination field of the Parcel, which identifies the target object to which the parcel action is to be applied. Section 7.3 presents the set of possible Actions that can be represented in a Parcel. Section 7.4 discusses the Data field of the parcel and its many types, forms, and roles. Section 7.5 presents the Continuation field that determines the follow-on task performed upon completion of the parcel-specified action at the destination object.

Section 7.6 presents some ideas on a basic programming interface that would invoke the fundamental functionality represented by the specified parcels. Finally, Section 7.7 explores possible formats to derive sizing factors and constraints.

7.1.4 Open Issues

As this document represents continued progress in research on message-driven computation, it will track issues considered important that remain unresolved at any point in time.

7.2 Destinations

7.2.1 Destination classes

The Parcel destination field uniquely identifies any ParalleX object, such as a process, LCO, computational complex, synchronous domain, or hardware entity represented in the AGAS address space. Such a first class object is the logical target of the action conveyed by the designated parcel. The parcel destination field provides the context for the parcel action execution; the action execution being analogous to, but not necessarily implemented by, a method invocation applied to the destination object. The actual arguments of the method are data components of the parcel including the complete destination descriptor and the contents of the parcel data field.

All first class objects have a unique virtual address (name) assigned by the AGAS. This global virtual address space is flat. It is not possible to derive ancestor or descendant information in the global object hierarchy based solely on the object's virtual address. The virtual addresses are integers of fixed size in a given ParalleX implementation and cannot be explicitly decomposed into meaningful components.

Using first class objects as parcel destinations is sufficient to implement and perform arbitrary operations on ParalleX objects. However, this may result in excessively large type descriptors required to describe and access their immediate subcomponents. While AGAS permits assignment of names to arbitrary entities represented in the global address space, doing so for many of them would frequently result in exhaustion of the address space and cause additional inefficiencies. Therefore, parcels can access immediate subcomponents of the first-class objects; this is accomplished by adding an optional target descriptor to the parcel data field (presence of this information is indicated by a flag in the parcel descriptor). The supported subcomponent targets include:

 process data members,

 complex registers,

 functional blocks of hardware entities.

The target field is a number uniquely identifying a subcomponent or member of a specific object type. The component enumeration must be consistent across all synchronous domains in the system.

7.2.2 Open Issues

The specific binary length of the destination field is as yet unspecified but is further discussed in Section 7.7.

7.3 Actions

The Parcel Action field defines the type of operation to be performed upon parcel arrival at the synchronous domain containing the instance of the parcel destination object. This operation typically involves processing of the contents stored in the parcel data field; therefore the format of the latter is strictly determined by the action type. The operation may be performed by any execution resource physically available within the domain, in particular intelligent NICs or custom function accelerators, as long as they possess the necessary capabilities to decode the data payload (when present), transfer the data between parcel buffer and system memory (when required), and support the computation to be performed. Exceptions to this rule include percolation and parcels targeting explicit physical destinations (such as hardware registers).

Parcel actions are not explicitly typed; instead, the type of operation is derived from the types of action arguments. The type information is strongly associated with the carried payload (operands, objects) and encoded in the parcel data field.

The action field is subdivided into two subfields: action class and action subtype. The first describes the major categories of operations, while the second specifies the actual operation to be performed or its additional parameters. Execution resources advertise their capabilities to the root parcel dispatcher in a synchronous domain indicating which types of actions each of them can handle; supporting the whole action class implies that all its action subtypes are supported.

Certain actions are launched with an explicit intent to retrieve the contents of remote memories or other computational state. Note that these actions do not specify how or where these results should be transmitted, or even what should happen to the result; this part of parcel-driven transaction is governed entirely by continuation associated with the parcel. The result produced by such actions, represented as a return value from a function implementing the action, is passed as one of the arguments to the continuation.

Parcels support the following action classes:

7.3.1 Atomic memory operation

This action class applies to the globally accessible portion of a synchronous domain's memory. It represents the smallest parcel format and should therefore promote latency-optimized implementations. Its argument is at most one scalar value, up to 64 bits in size, stored in the parcel data field.

Supported subtypes:

Simple scalar read: read a 1-, 2-, 4-, or 8-byte scalar from the memory location identified by the destination field.

Simple scalar write: write a 1-, 2-, 4-, or 8-byte scalar to the memory location identified by the destination field.

Atomic test-and-set: operand width is derived from type information stored in the data field; the action completes when the set operation succeeds.

Atomic fetch-and-add: uses the datum supplied in the data field as the addend and its type to determine the size of the operation in bits; the return value is equal to the incremented value stored in memory as a result of the operation.

Atomic compare-and-swap: uses the first argument supplied in the data field as the comparison value and the second argument as the intended memory value upon successful comparison; the operand's type determines the size of the operation in bits. Compare-and-swap returns a Boolean value indicating whether the operation was successful.
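
For reference, the local effect of the fetch-and-add and compare-and-swap subtypes at the destination corresponds to the familiar C11 atomics sketched below (shown for the 64-bit case); this is an analogy for the intended semantics, not a parcel implementation.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Fetch-and-add: per the convention above, the return value is the
   incremented value now stored in memory. */
uint64_t amo_fetch_and_add(_Atomic uint64_t *loc, uint64_t addend)
{
    return atomic_fetch_add(loc, addend) + addend;
}

/* Compare-and-swap: returns a Boolean indicating success. */
bool amo_compare_and_swap(_Atomic uint64_t *loc, uint64_t expected,
                          uint64_t desired)
{
    return atomic_compare_exchange_strong(loc, &expected, desired);
}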

7.3.2 Compound atomic memory operation

Compound atomic memory operation consists of a sequence of one or more AMOs. The effects of this action are memory updates or retrievals (or both) carried out as if all component AMOs were executed simultaneously and in uninterrupted fashion. If any of the component AMOs fails, the entire action fails with no memory updates. The actual order of execution of the individual AMOs is undetermined and implementation specific.

The effects of compound AMOs are explicitly restricted to the memory or state of a single synchronous domain. The entire operation can be executed to completion as a result of a single incident parcel. This stands in direct contrast to another class of compound atomic operations called Distributed Control Objects (DCOs), which are not directly implemented through parcels and operate atomically on object sets with arbitrary distributions.

7.3.3 Block data movement

The block move action performs bandwidth-optimized large volume data transfers. The raw bytes representing object contents are stored in the parcel data field, accompanied where needed by related type information. Note that some runtime system implementations may be sensitive to the way copies of data are handled (for example, due to implicit reference counting); such environments must specify an additional flag to prevent the hardware layer from executing the action. This flag is bitwise OR-ed with the action subtype.

This specification does not explicitly name the entities or layers responsible for the serialization of objects or their members. The implementers should have the flexibility of choosing the best performing implementation for a specific system and application profile. It is expected that while these tasks could be carried out at multiple levels of the parcel stack, the software layer by default exercises the most control over serialization formats. However, it is possible that certain data movement operations, specifically those including POD (plain old data), could be accomplished using relatively unsophisticated hardware and therefore improve the overall efficiency.

Supported subtypes:

Opaque block put: writes a block of bytes of specified size into the location pointed to by the destination.

Opaque block get: reads a block of bytes of specified size from the location identified by the destination; returns the size and contents of the memory block.

Object data member put: update the contents of a data member within an object using supplied serialized contents accompanied by type information; the destination identifies the containing object and the accessed data member.

Object data member get: obtain a copy of a data member using supplied type information in the data field; the parcel destination identifies the target object along with the accessed data member.

Object restore: recreate a ParalleX object's state as a child of the process given as the parcel destination; the data field contains serialized object data. The object may be a process, an inactive LCO, or a depleted thread. The serialized state may include the object's AGAS name, in which case the association continues to be maintained after the object is recreated on the other end; otherwise a new name is obtained from AGAS and assigned to the restored object.

Object migrate: force migration of the object given as the parcel target to the destination process or synchronous domain specified in the data field. This action results in an object put parcel with the destination field set to the target domain or new parent process, and serialized object state and name in the data field. Name associations referring to the previous location of the object will have to be updated. If the object is relocated to an explicitly identified synchronous domain, it becomes a child of the main process in the same application. This action operates on the same object kinds as the object put parcel.

Object clone: replicate the internal state of an object specified in the parcel destination, assigning it a new AGAS name. If a new parent object or host domain is specified in the data field, transfer it there as described for the object migrate subtype; otherwise make the clone a sibling of the destination. The supported object types are the same as for object put.

7.3.4 Remote Thread Instantiation (methods)

The RTI action creates and initiates the execution of a thread at the synchronous domain hosting the object specified in the parcel destination field. The parcel data field contains an optional sequence of thread attributes and their values defining the characteristics of the created thread, and a portable description of the method to invoke along with its marshaled argument values. This helps decouple the software domain, which may provide arbitrarily complex descriptions of methods as well as their input arguments, from the processing at lower layers of the parcel stack. For example, parcel hardware may still be used to accelerate the thread creation and/or scheduling operations (even if only by allocating the memory for thread state or performing early thread register initialization) whenever possible. The created thread executes a predefined function that accepts the parcel destination and method description as its arguments.

The RTI class has no subtypes.
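
The predefined entry function mentioned above might take a form like the following; method_desc_t and rti_entry are hypothetical placeholders, and dest_t stands in for the destination type used in the API sketch of Section 7.6.3.

#include <stddef.h>
#include <stdint.h>

typedef uint64_t dest_t;    /* assumed here to be an AGAS name */

/* Hypothetical portable method description delivered in the parcel
   data field: a method identifier plus marshaled argument values. */
typedef struct method_desc {
    uint32_t    method_id;  /* portable identifier of the method */
    const void *args;       /* marshaled argument values         */
    size_t      args_size;  /* length of the marshaled arguments */
} method_desc_t;

/* Entry point of the thread created by an RTI parcel: unmarshal the
   arguments and invoke the named method on the destination object. */
void rti_entry(dest_t destination, const method_desc_t *method)
{
    /* runtime-specific dispatch elided in this sketch */
    (void)destination;
    (void)method;
}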

7.3.5 Thread state manipulation

This class of parcel performs operations on thread state. The parcel destination field must contain a thread name.

Supported subtypes:

Thread attribute get: obtain values for a number of thread attributes named in the data field; the retrieved name-value pairs are returned.

Thread attribute set: update the values of thread attributes using name-value pairs supplied in the data field.

Modify thread status: attempt to resume, suspend, or deplete a thread. Returns the status of the operation.

Thread register write: a special case of the LCO fire method (discussed later) permitting low level update of a thread register. The destination must be a suspended thread. The data field contains a typed scalar value, which is implicitly converted to the thread register type. Note that this action is semantically different from object data member update (7.3.3), because it explicitly causes rescheduling of the modified thread.

7.3.6 LCO access

The LCO access class permits control of the LCO state using predefined interface methods (consult the XPI Specification document for details). Since the operation semantics and format of the argument list are fixed for each fundamental LCO type, this opens the possibility of low-overhead hardware-assisted LCO manipulation.

Supported subtypes:

LCO register: enters object names specified in the data field on the notification list associated with the LCO identified in the parcel destination. The data field may contain an optional typed value used as an argument to the register method.

LCO unregister: removes object names stored in the parcel data field from the LCO notification list.

LCO wait: waits for the destination LCO to fire, with a timeout value specified in the data field. When that occurs, a typed value optionally produced by the LCO may be returned as the action's result.

LCO fire: supply a trigger event to the target LCO, associated with a typed value as an input argument.

7.3.7 Execute code

Parcels may carry explicitly executable objects, which serve two purposes. First, they implement percolation, which attempts to harness non-standard execution resources, such as accelerators, coprocessors, or dedicated functional units. The second use case involves execution of simplified micro-programs directly by parcel handler logic. The latter may have uses for system bootstrap, introspection, and fault handling complementary to hardware-level register access.

o Type 1 (self-contained) percolation: the data field supplies an execution object to the percolation target identified in the parcel destination. The executable object must conform to a prearranged format understood by the percolation target (e.g., ELF).

o Parcel handler microcode: execute an in-lined code segment expressed in a simplified ISA interpreted by all parcel processing hardware in the system. The code along with associated data is stored in the parcel data field.

7.3.8 Hardware register access

This class of actions enables direct access to hardware registers. It applies to any device or unit with a physically addressable memory space or register file disjoint from the synchronous domain's main address space. The parcel destination identifies the target hardware unit, while the sequence of physical addresses and bitwise arguments is stored in the data field. The data access operations are carried out strictly in the order specified.

Supported subtypes:

o Register read: perform the read sequence using addresses and bit size information from the data field. Return the sequence of tuples containing address, size in bits, and value for each access.

o Register write: perform a sequence of bitwise write operations using addresses, sizes, and values stored in the parcel data field. Return the operation status.

o Type 2 (low level) percolation. (This will be expanded upon.)

7.3.9 I/O

<TBD>

7.3.10 Parcel tunnel

Parcel tunneling provides an encapsulation mechanism for parcels. The primary application of this protocol is to enable easy handoff of parcel content to functions operating on parcel arguments, such as optimized AGAS forwarding. The destination object must export a predefined method that accepts the metadata of the wrapper parcel and wrapped parcel as the arguments.

There are no explicitly specified subtypes; however, implementations are free to use the subtype field to define application-specific behavior.

7.3.11 Open Issues

Among the important innovative claims of the ParalleX model are the action classes of its Parcels. New advances are being made in this area and, while much has been accomplished, key specifics of refinement are still required. Some of these are briefly mentioned here.

a. With regard to atomic compare-and-swap, would it be better to let it retry until it succeeds?

b. Double word atomics may be useful, if they can be implemented correctly.

c. Would a more generic fetch-and-op function be useful instead of limiting to addition?

d. Should atomic fetch-and-add operate only on a single byte?

e. AMOs require substantial refinement in their semantics, including detailed functionality.

f. Percolation should either be defined here or included elsewhere in the ParalleX specification.

g. I/O requires a system-wide strategy including support from Parcels.

7.4 Data

The parcel data field has to accommodate many kinds of payloads and therefore has the most flexible format of all parcel components. The possible variants include:

 data ranging from as little as one byte to as large as multiple memory pages,

 both anonymous (opaque) and fully typed datasets,

 bitwise values representing hardware register contents, scalars, structures, arrays, as well as serialized ParalleX objects,

 additional metadata required to execute parcel action (types, addresses, sizes, etc.),

 purely passive data and executable code streams, and mixes of both,

 other (embedded) parcels,

 optional specifiers.

The data field has a variable length that is specified in the parcel descriptor. In addition to raw payload data, in many cases it is necessary to provide sufficiently descriptive metadata (such as type information) for unambiguous decoding of the field contents; this is illustrated in more detail in Section 7.7. The field format is strictly determined by the parcel action class and its subtype. Below are enumerated the possible formats of the parcel data field along with lists of related actions.

For readability, and to prevent an overly restrictive format specification, the count and type information have not been listed explicitly for data values, identifiers, sizes, and addresses; it is mentioned only when necessary to identify a required component that is not associated with a value in the data field.

7.4.1 Null

The data field does not contain any information. Its length in the parcel descriptor must be zero.

Used by: object_clone (variant), LCO_wait (variant), LCO_fire (variant)

7.4.2 Scalar type

Format: scalar_type

Used by: simple_scalar_read, test_and_set

7.4.3 Scalar operand

Format: scalar_value

Used by: simple_scalar_write, fetch_and_add, object_migrate, object_clone (variant), modify_thread_status, thread_register_write, LCO_wait (variant), LCO_fire (variant)

7.4.4 Scalar pair

Format : scalar_value, scalar_value

Used by: compare_and_swap

7.4.5 Untyped memory size

Fixed-length integer number representing size in bytes; the length must be sufficient to express maximal permitted payload.

Format: size

Used by: block_get

7.4.6 Untyped memory block

Format: byte_0, …, byte_(size-1)

Used by: block_put, type1_percolation, NIC_microcode, parcel_tunnel

In addition, if the action field contains flags preventing processing by lower layers of the parcel stack, the following requests degenerate to this data format:

Note: the size is known from the parcel descriptor.

7.4.7 Compound type descriptor

The compound type descriptor expresses the structural type information describing an arbitrary ParalleX object or its data member. This information must decompose (in leaf types) to scalar types, such as integers, floating-point numbers, and AGAS names. The details are presented in the binary parcel specification.

Format: compound_type

Used by: data_member_get

7.4.8 Serialized ParalleX object

Format: compound_type, value_0, …, value_(n-1)

Used by: data_member_put, object_restore

7.4.9 Thread attribute get

A thread attribute identifier uniquely identifies a specific attribute of a thread. The number of attributes to be retrieved is derived from the data field length.

Format: attr_id_0, …, attr_id_(n-1)

Used by: thread_attribute_get

7.4.10 Thread attribute set

Format: attr_id_0, attr_value_0, …, attr_id_(n-1), attr_value_(n-1)

Used by: thread_attribute_set

The attributes may require arguments of different types and/or sizes. The value size may be uniquely determined either by its attribute identifier or described through an embedded type specification mechanism.

7.4.11 Method

Format: attr_id_0, attr_value_0, …, attr_id_(n-1), attr_value_(n-1), method_type, arg_value_0, …, arg_value_(k-1)

Used by: RTI

The size entry contains the count of thread attributes. The method_type describes the type of the individual arguments as well as the computed output type. There may be more scalar values following the type descriptor than function input arguments, as some of them may have to be used to construct compound input objects.

7.4.12 Name list

Format: name_0, …, name_(n-1)

Used by: LCO_register (variant), LCO_unregister

7.4.13 LCO register request

Format: name_0, …, name_(n-1), scalar_value

Used by: LCO_register (variant)

7.4.14 Hardware read request

Format: addr_size, address_0, …, address_(n-1)

Used by: register_read

The request parameters are a sequence of hardware addresses of the same size, specified in the first parameter. The addresses must be padded to an integral number of bytes if necessary.

7.4.15 Hardware state

Format: addr_size, data_width, address_0, value_0, …, address_(n-1), value_(n-1)

Used by: register_write

The meaning of the size parameter is as for 7.4.14. Both address and value fields must be padded to occupy an integral number of bytes.

7.4.16 Open Issues

Additional issues will be identified upon further discussion.

7.5 Continuation

7.5.1 Purpose of Parcel Continuations

Parcels operate in a space of distributed control state for a parallel application. Included in that state are the program counters of the associated complexes (threads), in a way similar to many other environments. However, in addition there are independent objects of control, the LCOs, which although passive embody much of the control state of an executing ParalleX program. It is recognized that at the nano-cycle level these alone are insufficient to fully specify a parallel program's exact (instantaneous) state. However, ParalleX establishes execution invariants that will be satisfied independent of the internal path followed to satisfy them. Therefore, they serve as a control abstraction that can provide a comprehensive control description. This reflects the overall means of managing the distributed, dynamic, and migrating parallel control state. The Parcel field dedicated to the management of the control state is the 'continuation field'. This field indicates what is to occur after the Parcel action is terminated.

When the action specified by a parcel is completed, what happens next? In some cases, the answer is completely determined by the operation specified by the Parcel Action, such as a method that is called. But often such routines are general functions that do not recognize the specific context of the call or incorporate the precise nature of the continued execution at that point in the program, and cannot determine the consequent response. The continuation field of a Parcel provides this information.

7.5.2 Elements of a Continuation Field

The Continuation field achieves three key functions: 1) it returns a value or pointer resulting from the action to a required location; 2) it modifies the control state of the system in some manner to reflect that the parcel action is completed; and 3) it may perform a conditional that will determine follow-on action. This last function includes response to error conditions.

7.5.3 Operation Modes

The change of data state resulting from the action of a Parcel may occur in a number of ways, as designated by the Continuation field. Perhaps most typical of conventional practices, and certainly supported (although not preferred) by the use of Parcels and continuations, is returning a value to the initiating thread. This modifies the state of either a register of a thread or one or more of its entries in the thread's possible stack frame. If the returned value is complex or compound, a reference pointer to a newly created data structure can instead be stored within the immediate thread state. A second form of result placement is a change to the global state of the program (within a ParalleX process), by either modifying the value or values associated with an existing data structure or adding a new datum and changing the meta-data of a structure to incorporate it. A special case, but of sufficient importance to distinguish it, is the insertion of a result into the designated element of an existing Local Control Object. A sophisticated variation of this is the instantiation of a new LCO, adding the result to it, and linking the LCO into a list (or tree) of distributed LCOs as part of a recursive chain or stack. The result of an action may be sent to an I/O server process. Finally, a result may be used to alter the condition of physical components such as switches, flags, fuses (in the case of reconfiguration), or a visual display (e.g., an LED).

Almost no such action (there are exceptions) can be performed without reflecting the completion of the action in the distributed control state of the ParalleX program. The specifics of the control state to update depend, in part, on the placement of the result data as described above. For example, if state such as a register of a thread receives the Parcel action's result, then the thread control state must be updated as well. Likewise, if the result is inserted in an existing Local Control Object, the control state of the LCO must be modified to reflect this change in its state and to serve, within the total control state, as the differentiable indication of the action completion. More conventionally, a semaphore, mutex, or barrier in the old style may be changed if sufficient to reflect the condition of action completion on future activities. The creation of an LCO and its addition to some global name space or meta-structure also adds to the control space of the execution, thereby indicating the action termination. An alternative is the immediate succession and creation of new applied actions, including but not limited to Parcels that convey the same action just completed and apply it to successive data, such as the vertices of a graph to which the current (just processed) vertex is linked. This is a form of continuation migration, moving the control state (as well as its associated work) through the metadata of a compound structure.

The third aspect of the effect of the Continuation field of the Parcel is related to conditional determination of follow-on action. If this subfield is not null (empty), then it can designate consequential successive action determined by specified conditions. One predicate is the error condition: did the action (and operands) specified by the Parcel complete correctly without error? If an error did occur, an error handler routine is designated to be performed. A second criterion is an end-condition. A sequence of successive actions may be spawned by predecessor Parcels. An example would be the traversal of a linked list. When an end-condition is satisfied, a different continuation response is generated, resulting in the termination of a sequence of continuation migration. The last form of follow-on is the normal mode. This is the expected follow-on action of a parcel after its specified action is completed. This can be multiple actions, and these too may be conditioned. This is particularly useful for supporting data-directed execution and graph traversal operations.

7.5.4 Continuation Field Structure

A possible format of the Parcel continuation component incorporates three fields: (<next>, <final>, <error>), where <next> defines the follow-on action if there is to be one, <final> defines the global side-effects and control state change, and <error> specifies an error handler to activate.
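
Purely as an illustration, and assuming (this is not specified by the model) that each subfield reduces to a global AGAS name, a continuation for one step of a linked-list traversal might be populated as follows; the struct mirrors the parcel_continuation declaration in Section 7.6.3, and all three names are hypothetical.

#include <stdint.h>

/* Assumed encodings: each continuation subfield is an AGAS name;
   a null <next> would terminate a chain of continuation migrations. */
typedef uint64_t cont_next_t, cont_final_t, cont_error_t;

struct parcel_continuation {
    cont_next_t  next;  /* follow-on action (0 ends the chain)  */
    cont_final_t final; /* side effect, e.g. an LCO to fire     */
    cont_error_t error; /* error handler to activate on failure */
};

/* One traversal step: re-apply the action to the successor vertex,
   fire a completion LCO, and name an error handler. */
struct parcel_continuation step = {
    .next  = 0x1000u,   /* hypothetical name of successor action */
    .final = 0x2000u,   /* hypothetical name of completion LCO   */
    .error = 0x3000u    /* hypothetical name of error handler    */
};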

7.5.5 Open Issues

This is the cutting edge of Parcel development with much more to be accomplished.

7.6 Application Programming Interface

7.6.1 Introduction

This section describes a notional API providing access to the parcel world from the remainder of the runtime system and, in some cases, application space. Since many details of the parcel world, such as the binary parcel format or the internals of parcel handling, should remain hidden from the user, only select functionality is explicitly exported. This helps minimize the chances of data corruption through uncontrolled interference with critical internal data structures and achieves independence from low level implementations of the parcel layer. The interface described below allows the user to submit and retrieve the information necessary to create and decompose parcels (i.e., safely cross the parcel world boundary), send and receive parcels using their handles, and search the available parcels based on their attributes.

7.6.2 Interface functions

The interface functions below are presented in C language syntax. Each of the functions returns an error code describing the status of the operation. In addition, each of their arguments is described as IN, OUT, or INOUT, depending on whether it is used as an input, output, or bidirectional (both input and output) parameter, respectively.

7.6.2.1 Submit parcel description to parcel world.

error_t parcel_put(parcel_contents_t *contp, parcel_handle_t *handp)

Arguments:

 contp (IN): pointer to the structure describing parcel contents (described in 7.6.3);

 handp (OUT): pointer to parcel handle.

7.6.2.2 Initiate parcel send operation.

error_t parcel_send(parcel_handle_t hand)

Arguments:

 hand (IN): parcel handle.

7.6.2.3 Retrieve parcel description from parcel world.

error_t parcel_get(parcel_handle_t hand, parcel_contents_t *contp)

Arguments:

 hand (IN): parcel handle;

 contp (OUT): pointer to the parcel description structure to be filled out by the function.

7.6.2.4 Receive the first available parcel.

error_t parcel_recv(parcel_handle_t *handp)

Arguments:

 handp (OUT): pointer to parcel handle to be filled out by the function.

7.6.2.5 Receive parcels matching certain attributes.

error_t parcel_recv_matched(int acount, parcel_attr_t *attrp, int *hcount, parcel_handle_t *handp)

Arguments:

 acount (IN): number of attributes to match, stored in the attrp array;

 attrp (IN): attribute array;

 hcount (INOUT): on input, contains the maximal number of handles that can be stored in the handp array. The function outputs the actual number of handles that have been stored in the handle array, which may not exceed the provided maximal size value;

 handp (OUT): parcel handle array.

7.6.3 Data structures

The parcel layer uses the following data structure to describe the primary components of the parcel in user space:

struct parcel_contents
{
    dest_t destination;
    action_t action;

    struct parcel_data
    {
        int count;    /* number of elements in ptr and size arrays */
        void *ptr;    /* array of pointers to contiguous data segments */
        int *size;    /* array of contiguous data sizes */
    } data;

    struct parcel_continuation
    {
        cont_next_t next;
        cont_final_t final;
        cont_error_t error;
    } continuation;
};

typedef struct parcel_contents parcel_contents_t;
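
A notional round trip through this interface follows, using only the calls declared above. The assumption that a zero error_t denotes success is ours (error codes are an open issue in 7.6.4), and the destination and action encodings are elided as implementation specific.

/* Notional sender: describe a parcel, inject it into the parcel
   world, then initiate the send using the returned handle. */
void send_example(dest_t dst, action_t act)
{
    parcel_contents_t out = { 0 };
    parcel_handle_t   h;

    out.destination = dst;  /* AGAS name of the target object   */
    out.action      = act;  /* encoded action class and subtype */
    /* data and continuation left empty in this sketch          */

    if (parcel_put(&out, &h) == 0)  /* assume 0 denotes success */
        parcel_send(h);
}

/* Notional receiver: take the first available parcel and decompose
   it back into its user-space description. */
void recv_example(void)
{
    parcel_handle_t   h;
    parcel_contents_t in;

    if (parcel_recv(&h) == 0)
        parcel_get(h, &in);
}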

7.6.4 Open issues

 development of precise semantics for API calls in multithreaded environment,

 definition, handling, and reuse of type information and other metadata,

 specification of other interface functions,

 declarations of other data structures used,

 enumeration of possible error codes and other useful constants.

7.7 Binary parcel format

This section introduces a minimal usable binary representation of parcel contents. While the individual aspects of this specification may be modified and extended by particular implementations to match the characteristics of the underlying system, this description primarily aims to present a possible encoding of parcel components as well as minimal bounds on parcel field sizes, and to summarize the relationship between parcels and other layers of the message stack.

7.7.1 Encapsulation

The parcel specification does not define the physical, data link, or network layer. While some properties of the parcel protocol may overlap certain functions associated with the transport layer in the OSI model (error handling, for example), implementations are still free to be built on top of protocols commonly associated with the transport layer, such as TCP or UDP. Within synchronous domains, parcels may take advantage of even lower level protocols, such as PCI-Express, HyperTransport, or QPI. In any case, the actual parcel contents are embedded into packets of the underlying protocol as payload (see Figure 1), with possible fragmentation and reassembly performed transparently to the parcel layer.

Figure 1 Encapsulation of parcel contents

Parcel content has no direct impact on the contents of packet header (excepting metadata related to payload size). While the physical address/routing information is derived from the parcel destination field, the mapping of this information is performed by the runtime system and AGAS services.

7.7.2 Parcel structure

Figure 2 Low level parcel structure

The main components of the binary parcel correspond to the logical fields discussed in Sections 7.2 through 7.5. In addition, binary parcels contain a metadata field called the Parcel Descriptor, providing other information related to optional contents and expediting parcel processing. These fields, along with an overview of the portable type representation used in the data and continuation components, are discussed in more detail below.

Each of the parcel fields is designed to decompose into meaningful elements aligned on octet boundary. While this may be less space efficient than bit-packed contents, it simplifies the extraction of parcel data from messages, enabling simpler processing hardware.

The data items contained in parcels may be expressed using different endianness. The components closer to the start of the parcel use a predefined "network" byte order, thus removing the necessity of byte order conversions that would otherwise complicate the first stages of parcel decoding. Large volumes of typed data are typically presented in the sender's byte order, hence avoiding any conversions when communicating among homogeneous synchronous domains and deferring the conversion to the destination synchronous domain once the presence of the recipient object is confirmed.

7.7.4.1 Destination

Parcel destination is a required, fixed size field containing the virtual address of the target object assigned by AGAS. All participating synchronous domains in a particular system implementation must use destinations of the same size. The minimal destination width is 64 bits, with possible variants including, but not limited to, 80, 96 and 128 bits. The byte order of the destination value conforms to the network order. Transparent bridging of multiple self-contained systems (i.e., without explicitly established customized parcel conversion services) is not considered at this time.
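
For example, a 64-bit destination in network byte order might be extracted from the head of a received parcel as below; be64toh is a glibc/BSD helper rather than ISO C, so its availability is an assumption of this sketch.

#include <endian.h>  /* be64toh: glibc/BSD extension, not ISO C */
#include <stdint.h>
#include <string.h>

/* Read the 64-bit AGAS destination from the start of a parcel buffer;
   the field is specified to use network (big-endian) byte order. */
uint64_t read_destination(const uint8_t *parcel)
{
    uint64_t be;
    memcpy(&be, parcel, sizeof be); /* avoid unaligned access */
    return be64toh(be);             /* convert to host order  */
}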

7.7.4.2 Parcel descriptor

Figure 3 Minimal parcel descriptor

The required parcel descriptor field consists of three components: flags, data size specifier (unsigned integer), and continuation size specifier (unsigned integer). The minimal version, depicted in Figure 3, consumes only three octets, but limits the sizes of data and continuation fields to at most 255 octets each. If the size of either of these fields is zero, the corresponding field is absent from the parcel. The meaning of the flags is as follows:

[S]mall parcel format: if set, parcel descriptor uses format shown in Figure 3;

[L]ittle endian: if set, all typed data items stored in data and continuation fields are in little endian format, or big endian otherwise;

[T]arget present: if set, the first element of data field is the destination subcomponent number (absent otherwise);

[R]eserved for future expansion (nominally set to zero);

[P]riority: encodes parcel priority level;

[E]ncoding level for type descriptors that is intended to provide hints to hardware level parcel dispatchers; e.g., 0: scalar types only, 1: inlined compound types, 2: referenced types (see type descriptor section), 3: unknown.
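
The flag layout within the first octet is given by Figure 3, which is not reproduced here, so the bit positions in the sketch below are assumptions; it illustrates only that the minimal descriptor is one flag octet followed by single-octet data and continuation sizes.

#include <stdint.h>

/* Decode a minimal (small-format) parcel descriptor; the flag bit
   positions within d[0] are assumed for illustration only. */
typedef struct min_descriptor {
    int     small;         /* [S] small parcel format flag        */
    int     little_endian; /* [L] endianness of typed payload     */
    int     target;        /* [T] target subcomponent present     */
    uint8_t data_size;     /* data field length, <= 255 octets    */
    uint8_t cont_size;     /* continuation length, <= 255 octets  */
} min_descriptor_t;

min_descriptor_t decode_min_descriptor(const uint8_t d[3])
{
    min_descriptor_t r;
    r.small         = (d[0] >> 7) & 1; /* assumed bit position */
    r.little_endian = (d[0] >> 6) & 1; /* assumed bit position */
    r.target        = (d[0] >> 5) & 1; /* assumed bit position */
    r.data_size     = d[1];
    r.cont_size     = d[2];
    return r;
}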

To accommodate larger parcel payloads, an alternative layout shown in Figure 4 is used. The S flag for that format must be cleared. The sizes of data and continuation fields are unsigned integers of width (in bits) as presented in Figure 4 and expressed in network byte order.

Figure 4 Large parcel descriptor

7.7.4.3 Action

Figure 5 Action descriptor

The action descriptor consists of two octets; the first identifies the action class along with the main action flags, and the second the action subtype and its related flags. The supported main action flags include:

[I]nhibit low-level processing: prevent the parcel action from being executed by a low-level resource, forcing it to be delegated to the software layer of the primary execution resource. The target execution resource is the same as the one that manages the custodian process for the synchronous domain.

The subtype flags depend on both action class and action subtype and will be disclosed later.

7.7.4.5 Type descriptors

Type descriptors fulfill a dual function in parcels. They permit portable representation of "plain old data" (POD) types, thus enabling unambiguous parsing and interpretation of parcel data payloads. Their secondary function is specification of the distribution of delivered payloads in the memory of the target synchronous domain. The latter is necessary due to the presence of holes in data layout caused both by different alignment rules on various architectures and by the need to redistribute datasets in such a way that the new layout is not contiguous (e.g., strided arrays, random location updates). The type descriptors must fulfill the following conditions:

 portability between memory spaces of synchronous domains that are supported by standard-compliant C compiler(s);

 ability to express nested structures and arrays in terms of scalar storage units recognized by the C compiler;

 full definition of compound types without reliance on any external context or information other than compiler ABI specification;

 support for type parameterization;

 concise format, especially when expressing repetition;

 both description elements and carried data aligned on octet boundary;

 octet as the preferred size of the “building block” of type description to avoid most endianness issues;

 simplicity of expressing scalar or other less complex types to enable well controlled processing in hardware.

Note that support for conversions between incongruent types or types that specify different storage sizes (bit widths) is not required. Such automatic conversions may be the source of difficult-to-trace errors and therefore should be under explicit software control. In many object-oriented environments, usage of the POD approach is not sufficient to correctly migrate objects, as their methods (in particular constructors and destructors) often perform more sophisticated operations on data members than simple initialization and may also produce side effects affecting other objects. However, whenever the application/runtime software determines that the POD-style approach is safe, the "inhibit" flag in the action field may be cleared to enable potentially more efficient data manipulation by dedicated hardware units.

The three subsections below describe the format of the two basic type descriptors used to construct more complex type definitions, along with the rules governing type derivation.

7.7.4.6 Scalar type descriptor

Figure 6 Scalar type descriptor format

The scalar type descriptor fits into a single octet and stores three flags along with a type identifier. The flags are interpreted in the following way:

[S]calar descriptor marker: must be set for scalar type descriptor;

[P]resence marker: signals the presence of an immediate argument or populated type slot (see rules);

[E]nd marker: set if the octet is the last element of derived type definition sequence or is the last octet of type definition before packed data sequence.

The type identifier, encoded in five bits, represents one of the following:

[n]one (void),

[i8] 8-bit signed integer,

[i16] 16-bit signed integer,

[i32] 32-bit signed integer,

[i64] 64-bit signed integer,

[u8] 8-bit unsigned integer,

[u16] 16-bit unsigned integer,

[u32] 32-bit unsigned integer,

[u64] 64-bit unsigned integer,

[f16] 16-bit binary floating point number (half precision),

[f32] 32-bit binary floating point number (single precision),

[f64] 64-bit binary floating point number (double precision),

[f128] 128-bit binary floating point number (quad precision),

[d32] 32-bit decimal floating point number,

[d64] 64-bit decimal floating point number,

[d128] 128-bit decimal floating point number,

[p] pointer,

[a] global virtual address.

Both binary and decimal floating point formats listed above are defined by IEEE754-2008 as basic floating point formats. The full conformance of the transmitted data values to the standard is not strictly necessary as long as the alignment rules for these types are preserved.
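
A scalar descriptor octet might be composed as below. Only the field widths (three flags plus a five-bit identifier) are specified above; the bit positions and the numeric identifier values (assumed here to follow the listing order) are illustrative assumptions.

#include <stdint.h>

/* Assumed layout: [S] in bit 7, [P] in bit 6, [E] in bit 5, and the
   five-bit type identifier in bits 4..0. Identifier values assume
   the order of the list above (n=0, i8=1, i16=2, ...). */
enum scalar_tid { TID_NONE = 0, TID_I8 = 1, TID_I16 = 2, TID_I32 = 3,
                  TID_I64 = 4, TID_U8 = 5 /* ... through [a] */ };

uint8_t scalar_descriptor(int present, int end, uint8_t type_id)
{
    return (uint8_t)(0x80                  /* [S] scalar marker */
                   | (present ? 0x40 : 0)  /* [P] presence      */
                   | (end     ? 0x20 : 0)  /* [E] end marker    */
                   | (type_id & 0x1Fu));   /* 5-bit identifier  */
}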

7.7.4.7 Type operator descriptor

Figure 7 Type operator format

The type operator is used together with scalar descriptors to construct arbitrarily complex, possibly parametric, compound type definitions. The contents of the type operator consist of four flags and an operator code. The flags encode the following information:

[S]calar type descriptor: must be cleared;

[D]efine: begin definition of a derived type;

[E]nd definition marker: set to delimit the definition of reference or compound type;

[R]eplicate: if set, the values defining numeric parameters and scalar content of all subtypes associated with the operator will be read only once from the parcel data field and repeated as necessary to construct the full data layout.

The operator code assumes one of the following values:

[n]one: a no-op;

[v]ector: construct a vector using an element count (which is a typed scalar value) and single element type;

[s]truct: wrap a sequence of member types in a structure;

[u]nion: define union of component types;

[m]ethod: represents function type signature; the first type (mandatory) is the output type, followed by types of input parameters;

[r]eference: substitute the definition of custom compound type identified by a 1-byte tag;

[p]arameter: signifies type parameter and can be used in place of any type descriptor.

7.7.4.8 Type specification rules

The type specifiers are closely associated with data values either supplied in parcel data and continuation fields, or retrieved from memory in the destination synchronous domain. The construction and parsing of type definitions are governed by the following rules:

1. Type specifiers carried by parcels describe the layout of accessed data structures in memory of the destination synchronous domain. The layout must be compliant with the applicable compiler ABI (primarily for C/C++ programs) for the domain's main processing resource.

2. The type specifier is defined with respect to the base address of the parcel target component within the destination object or, if absent, the base of the destination object.

3. The type specifier need not provide any type information for data layout in memory extending past the accessed byte at the highest offset relative to the base address of the destination.

4. For read access, the data field of a parcel contains a sequence of type specifiers followed by the exact amount of data to define all outstanding parameters of parameterized types (if such are present in the type specifier).

5. For writes, the data field contains a sequence of type specifiers followed by the exact amount of data to satisfy both all parameters of parameterized types present in type specifiers and all individual scalar values to be written to memory.

6. The last type specifier preceding numeric parameters and/or data values must be an explicit scalar terminator (a scalar type descriptor containing the [E] flag and not paired with any referenced or compound type definition). The void scalar type may always be used for that purpose and does not impact type specification.

7. The ordering of each type or type-with-data component is as follows:

   a. Optional sequence of referenced type definitions. These definitions do not directly drive the access to data, but may be incorporated in the actual type specifier.

   b. Sequence of type specifiers, where each type specifier is immediately followed by type parameters for parameterized types if such are present, forming proper type closures; the process of forming the type closures may need to be repeated recursively until each type specification is fully defined.

   c. Sequence of either optional numeric type parameters (read access) or optional numeric type parameters interleaved with scalar data values (write access). The order of values corresponds strictly to the depth-first traversal of the type hierarchy starting at the root of the type specifier. The values themselves are byte-packed (no padding and no alignment other than byte boundary).

8. The accessed data items correspond to the leaf scalar descriptors in the fully expanded type specification with the [P] (present) flag set. No other memory locations may be accessed and no other data payload may be present following the type specifier.

9. Compound type definition is started by a type operator with operator code corresponding to [v], [s], [u], or [m], followed by the required number of member data type specifiers and, whenever required, numeric parameter descriptors, and terminated by the matching (after considering nested subtypes) scalar or operator descriptor with the [E] flag set.

10. Referenced type definition is started by a type operator encoding one of [v], [s], [u], or [m] with the define [D] flag set, followed by the required number of member type specifiers and numeric parameter descriptors, and terminated by the matching descriptor with the [E] flag set. The terminator must be immediately followed by a byte-sized tag, which later can be used to retrieve the full definition of the type. The tags should be unique; a type definition that uses the same tag as one of the earlier definitions will overwrite the previous mapping.

11. Referenced types may be accessed in the type specifier sequence section by using the reference [r] operator followed by a byte-wide tag corresponding to the required type definition. Such references may substitute any of the type specifiers or be used as actual type parameters.

12. The vector type descriptor [v] requires two type components: the first is a scalar integer type to specify the type of the numeric parameter expressing the number of vector elements, and the second (of arbitrary type) determines the type of vector elements. The vector also requires a numeric parameter identifying the number of vector elements.

13. The structure type descriptor [s] has an arbitrary number of type components corresponding to data member types. Any combination of the component types (or their subtypes) may have the presence [P] flag set.

14. The union type descriptor [u] requires an arbitrary number of type components corresponding to data member types. At most one of these types may have the presence flag set, or contain subtypes that have this flag set.

15. The method type descriptor [m] requires at least one type component for the method result followed by an arbitrary number of type components identifying input arguments to the method.

16. Each type component is either a scalar type descriptor, an inlined compound type descriptor, a referenced type, or a deferred type (descriptor [p]). The latter may only be used in a referenced type definition and has to be substituted by an actual type parameter when the defined type is referenced in the type specifier sequence.

17. If a type operator descriptor has the replication flag [R] set, all its type parameters, related numeric parameters, and scalar data values (if any) must be specified only once, regardless of how many times the type is referenced later. Since scalar types don't support the [R] flag, their replication may be achieved through type parameters with the [R] flag set bound to an actual parameter that is a scalar type descriptor.

7.7.4.8 Examples

The following examples aim to clarify the parcel type encoding. The notation uses capitalized characters for each flag set in a descriptor octet, followed by a lowercase letter identifying the type operator, or a lowercase letter optionally followed by a number signifying the bitwidth for scalar descriptors. To avoid dealing with the endianness of data values, they are expressed verbatim and carry a subscript signifying their length in bytes. Both descriptors and values are separated by commas.

Example 1: 32-bit integer of value 42 (encoded in 5 bytes)

SPEi32, 42₄

Example 2: payload targeting a structure containing 3 unsigned 4-byte integers and two double-precision floating-point numbers, in which the second integer must be set to 5 and the second FP number to -13 (encoded in 19 bytes)

s, Su32, SPu32, Su32, Sf64, SPEf64, Ev, 5₄, -13.0₈

Example 3: 4-kbyte page with all bits of the contents set to 1 (encoded in 7 bytes)

v, SPu16, ERp, SPEu8, 4096₂, 255₁

Example 4: access the 25th and the 27th through 31st elements of an array of 64-bit integers (number 42 is the tag identifying the referenced type created at the beginning of the sequence); encoded in 16 bytes (type specifier only)

Dv, p, Ep, 42, r, 42, SPu8, Si64, SPi64, Si64, r, 42, SPu8, SPEi64, 24₁, 5₁
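To make the descriptor-plus-payload structure concrete, the following C++ sketch serializes Example 1. The flag masks and the type code are hypothetical placeholders (the actual bit assignments within a descriptor octet are defined elsewhere in this specification); only the overall shape of the encoding, one descriptor octet followed by a byte-packed value, is the point.

#include <cstdint>
#include <vector>

// Hypothetical bit assignments for a descriptor octet; the real layout
// is given by the descriptor format in this specification.
constexpr std::uint8_t FLAG_S   = 0x80; // scalar descriptor
constexpr std::uint8_t FLAG_P   = 0x40; // [P]: data value present
constexpr std::uint8_t FLAG_E   = 0x20; // [E]: ends the enclosing sequence
constexpr std::uint8_t TYPE_I32 = 0x05; // placeholder code for "i32"

// Encode Example 1: the 32-bit integer 42, i.e. SPEi32, 42 (5 bytes).
std::vector<std::uint8_t> encode_example1() {
    std::vector<std::uint8_t> out;
    out.push_back(FLAG_S | FLAG_P | FLAG_E | TYPE_I32); // SPEi32
    std::uint32_t value = 42;
    // Values are byte-packed with no padding; byte order is governed by
    // the [L] flag in the parcel descriptor (little-endian assumed here).
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<std::uint8_t>(value >> (8 * i)));
    return out; // 1 descriptor octet + 4 value bytes = 5 bytes
}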

7.7.5 Data

The organization of the data field assumes one of four generalized layouts, outlined in Figure 8. The data field may start with an optional target specifier, which has a fixed width across the entire system in a given implementation. The endianness of the target field, as well as of the remainder of the data field, is controlled by the [L] flag in the parcel descriptor.


Figure 8: Data field formats

The (a) layout is used to retrieve data items from the destination synchronous domain's memory space. Note that the type does not have to be a single specifier, but may be a sequence of specifiers. Layouts (b) and (c) are equivalent; (b) uses a single terminated sequence, whereas (c) explicitly breaks the type descriptors and their associated values into shorter sequences that may be easier to handle by the lower-level parcel stack layers. Finally, layout (d) describes untyped sequences of bytes, which might accompany parcels with the action flag [I] set.
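As a rough illustration of the four layouts, the C++ views below mirror the description above; the type and field names are hypothetical stand-ins for the formats of Figure 8, not normative definitions.

#include <cstdint>
#include <vector>

// Illustrative views of the four data field layouts; an optional
// fixed-width target specifier may precede any of them.
struct LayoutA {                       // (a) read: no values follow
    std::vector<std::uint8_t> types;   // terminated type specifier sequence
};
struct LayoutB {                       // (b) write: one terminated sequence
    std::vector<std::uint8_t> types;   // type specifiers, then ...
    std::vector<std::uint8_t> values;  // ... byte-packed parameters/values
};
struct LayoutC {                       // (c) equivalent to (b), but split
    std::vector<LayoutB> chunks;       // into shorter sequences for the
};                                     // lower parcel stack layers
struct LayoutD {                       // (d) untyped byte sequence, e.g.
    std::vector<std::uint8_t> bytes;   // for parcels with action flag [I]
};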

7.7.6 Continuation

<TBD>


8. Percolation

Percolation is a special case of the use of parcels. Percolation moves work to physical resources rather than to logical data. It is important because it provides a highly efficient way to make effective use of expensive (precious) resources. It also provides one way to effectively employ heterogeneous architectures incorporating such component classes as GPU accelerators. As its underlying support mechanism, it uses parcels to provide operand values and define the work to be performed on the precious computing component. Parcels are also used to transfer results to the rest of the computation. Percolation hides the latency of data movement for the precious resource. It also offloads critical-path overhead from the precious resources onto small, low-cost ancillary resources. Percolation permits a stream of work to be fed to precious resources so that, assuming adequate parallelism, they have continuous work to perform, thereby achieving high utilization and efficiency.

Percolation as a component of the ParalleX model supports a multi-stage, coarse-grained pipeline of activities. It assumes a multi-banked storage medium tightly coupled to the target precious computing resource, from which that computing element can acquire task descriptions and working-set data, and into which it can deposit final task results. The precious resource principally works out of the multi-banked storage medium, reading deposited instruction streams, operating on deposited operand data and the intermediate data produced throughout the computation, and depositing the final results. Logically, the precious computing element can work on one buffered task at a time or multitask among the resident tasks.

A separate low-cost task server component manages the transfer of information into and out of the precious resource's buffer storage. When a task is ready to be performed, all of the necessary data is moved to a selected precious resource buffer by a server. This includes all of the code to be executed, the initial argument values upon which the task is to be performed, the buffer area for depositing result values, and final completion information such as what synchronization signal to return to indicate completion of the task.

The task server component is also responsible for cleaning up the task at its completion. This involves a triggering signal (which is non-blocking) from the precious resource to the task server, the transfer of the result values to be written into global data structures, and the release of the dedicated memory resources within the buffer storage for the next task.
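The task-server lifecycle described above might be organized as in the following sketch; the Buffer interface, the descriptor fields, and the function names are assumptions for illustration, not a normative ParalleX interface.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical view of one bank of the multi-banked buffer storage
// (stubbed out here; a real system would target device memory).
struct Buffer {
    void deposit(const std::vector<std::uint8_t>& code,
                 const std::vector<std::uint8_t>& args,
                 std::size_t result_bytes) {}
    std::vector<std::uint8_t> results() const { return {}; }
    void release() {}
};
void write_results_to_global(const std::vector<std::uint8_t>&) {}

// Everything the precious resource needs to run out of its buffer.
struct TaskDescriptor {
    std::vector<std::uint8_t> code;       // instruction stream to deposit
    std::vector<std::uint8_t> arguments;  // initial operand data
    std::size_t result_bytes;             // result area to reserve
    std::function<void()> completion;     // synchronization signal to fire
};

class TaskServer {                        // low-cost ancillary component
public:
    void submit(TaskDescriptor t) { pending_.push(std::move(t)); }

    // Stage the next task into a free bank so the precious resource
    // never stalls on external latency.
    void stage_next(Buffer& bank) {
        TaskDescriptor t = std::move(pending_.front());
        pending_.pop();
        bank.deposit(t.code, t.arguments, t.result_bytes);
        in_flight_.push(std::move(t));
    }

    // Invoked by the non-blocking completion signal: commit results to
    // global data structures, free the bank, and signal completion.
    void retire(Buffer& bank) {
        TaskDescriptor t = std::move(in_flight_.front());
        in_flight_.pop();
        write_results_to_global(bank.results());
        bank.release();
        t.completion();
    }

private:
    std::queue<TaskDescriptor> pending_, in_flight_;
};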

Ideally, percolation treats the precious resource as a separate component that works exclusively out of its buffer storage. With sufficient coarse-grained parallelism, the precious computing resource can operate optimally without incurring the overheads of external communication or the latencies of remote access. However, the model requires that it be able to perform global data accesses, either read or write, when necessary and when it is not possible to determine these data requirements a priori.


9. Fault Management

While a subject of some controversy, there is sufficient reason for concern that current practices related to the reliability of computing systems will not scale to systems capable of Exaflops performance, and that new strategies and novel implementations of them will be required. A model of computation providing the conceptual scaffolding of a future computing system should embody foundational principles for the codesign of system hardware and software toward dynamic, resilient operation. ParalleX will serve as a research execution model to explore means of achieving fault tolerance in future highly scalable systems.

9.1 Basic Principles of Resiliency

To preclude operational failure, a system that is robust in the presence of faults must be capable of a set of interrelated actions that in combination yield continued computing even when individual hardware or software components fail permanently (hard failures) or intermittently (soft failures). ParalleX incorporates such a set of failure-related response modes. It does not assert a single design solution. Rather, it establishes a framework of invariances that, if satisfied by any combination of designs at multiple layers within the system, will in synthesis yield resiliency as a global emergent property. These are: detection, isolation, diagnosis, repair, recovery, and restart.

Fault detection requires that an error at any stage in the computation be recognized as such.

Isolation prevents an error, once detected, from propagating through further computation, creating additional erroneous program state and, in a parallel computer, doing so at more than one location.

Diagnosis identifies the source of the error both in terms of the operation or operations that were performed incorrectly and the system components, hardware and/or software, that through faulty operation produced the errors. Repair is an automatic methodology that transforms a system that is broken into a system that is functional. Recovery returns the computation to an earlier state prior to the point of failure so that any subsequent computation will also be correct (until the next failure). And restart permits a once-stopped parallel computation to begin again from an intermediate state such that the vast majority of successful calculations already performed need not be redone. This is critical to both time and energy savings.

Scalability past an as-yet-undetermined level of concurrency requires that fault tolerance be conducted in parallel, just as healthy or correct computations are under normal operation. This means that two or more errors occurring at essentially the same time in different parts of the parallel computation can be treated separately in space but simultaneously in time. It also means that other non-affected parts of the computation that have not been corrupted by the errors may proceed independently of the failed portions, at least until precedence constraints are violated.

The time it takes to fix an error must be significantly shorter than the time between successive failures. Yet any corrective method applied must be robust in the presence of cascading errors. That is, while a system is in the process of self-correction, additional errors may be allowed to occur and the system can still return to a valid state. Not only should this be possible where such a multiplicity of errors is manifest in disjoint (time or space) regimes of the computation, but corrective actions should be feasible even when a fault is incident within the corrective measures taking place in response to the previous error.

A policy of correct behavior in the presence of faults is adopted by the ParalleX model. That is, any computation assumes imperfections in the set of operations continuously being performed and will detect and respond to errors, allowing a computation to proceed to deliver a correct solution. This does not require that the rate of progress towards the end goal be sustained; in fact, ParalleX assumes that only as an upper bound. Rather, ParalleX anticipates and exploits a policy of graceful degradation, if not initially then over time. Graceful degradation allows elements of a system experiencing hard faults to be isolated and eliminated from the working system while retaining correct, if somewhat reduced-rate, future functionality. In the extreme, this suggests that a system may operate in a zero-maintenance context and benefit from design and packaging techniques that assume means of maintenance will not be required.

A methodology that demands that all faults be immediately detected is too rigid, limiting cost-effective solutions and possibly proving ultimately infeasible except for special cases. Therefore, ParalleX anticipates multilevel detection of faults, where different levels of the system can detect certain failure modes and these are employed in an overlapping manner, such that the failure to detect an error at one level may be corrected somewhat later at another level. A key requirement for the success of such a layered approach is that the frequency and temporal cost of each detection technique be commensurate with the overheads associated with each layer. This suggests that the higher the layer involved, the coarser the granularity of the recovery domain.

The role of the programmer in fault tolerance is controversial. Some methods directly involve the programmer in writing applications that are fault tolerant, where all fault responsiveness is consciously provided through explicit programming style by the application programmer. Other methods, such as triply redundant system architectures, isolate the programmer from failures of the system, making recovery transparent to the user. Lastly, other systems permit failure and expect the application to be restarted from the beginning. This is typical of transactional systems such as cell phone connections (dropped calls) and public search engines (e.g., Google), where the user will resubmit a search request. ParalleX adopts a different policy, treating the programmer as just one layer, usually the top layer, of the multi-layer system, possibly providing additional information that can guide and augment the embedded fault-tolerance machinery incorporated lower in the system.

Redundancy as conventionally employed in non-stop systems involves the availability of backup components to take over when live components fail and need to be replaced. Redundancy may also imply multiple instances of the same computation with active comparison and validation. Multiply redundant systems have been used in such crucial applications as banking and avionics, among others. The space shuttle carries five processors all performing the same workload and frequently comparing results to detect errors and retain operational capability. ParalleX does not preclude the use of redundant systems or computations, but neither does it demand them. The class of systems such as X-Caliber serving as ParalleX manifestations are likely to have millions (perhaps hundreds of millions) of identical units, which serve as a kind of redundancy in form, permitting reduced-grade capability but equivalent functionality. This is the basis for graceful degradation.

The use of redundancy to sustain constant capability is a chimera. If the resources are available in a truly parallel system, they can be used to achieve higher performance, only then to have overall system performance degrade as failures occur. Using them for detection through repeated execution of the same computations and comparison of results is profligately wasteful in both cost and energy. System structures should instead be devised with a multiplicity of means for retaining functionality and flexibility of structure for reconfiguration in the presence of failed components.

9.2 Micro-checkpointing

ParalleX supports a policy of micro-checkpointing to realize the goals, requirements, strategies, and principles discussed in the previous section. Micro-checkpointing is a framework for retaining critical-path values, beyond the point at which they would ordinarily be eliminated, until the downstream computation results that depend on them are determined to be correct. While this reflects conventional checkpointing in some ways, there are major differences. Micro-checkpointing is local both in place and in time: only one or a few values are retained within an integrity domain, and these only for a brief period of time. An integrity domain is the ensemble of input and output values and the space of intervening computations between them. Conventional checkpointing captures all of the program values at a particular stage in the total computation and stores that data to disk. It is unlikely that such methods will scale to Exaflops-scale systems.

When the resulting values of an integrity domain have been determined to have been produced without error, the initial values of that domain, which had been kept in main memory, are released.

The method by which this is achieved is referred to as a compute-validate-commit (CVC) cycle. The compute phase performs the required calculations between the input and resulting values of an integrity domain. The resulting values are validated by verifying that none of the possible detection points involved in the calculation discovered an error. Then the commit phase is performed, which carries out any side-effects and garbage-collects the original input data. An important variation of the CVC cycle is the overlapped cycle, in which the input values are still being validated while the follow-on compute phase is being performed. The result is that almost no critical-path time is lost and little additional computational resource or energy is consumed.
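A minimal sketch of one CVC cycle over a single integrity domain follows; the value type and the detection hook are assumed to be supplied by lower system layers, and the recovery action shown (recomputation from the retained inputs) is one possible policy, not the only one.

#include <functional>
#include <vector>

using Values = std::vector<double>;

// One compute-validate-commit cycle. 'inputs' is the micro-checkpoint:
// it is retained until the downstream results are validated.
void cvc_cycle(Values& inputs,
               const std::function<Values(const Values&)>& compute,
               const std::function<bool()>& error_detected, // detection hook
               const std::function<void(const Values&)>& commit)
{
    Values results;
    do {
        results = compute(inputs);  // compute phase
    } while (error_detected());     // validate phase: on error, recover
                                    // by recomputing from retained inputs
    commit(results);                // commit phase: perform side-effects
    inputs.clear();                 // garbage-collect the inputs; the
                                    // micro-checkpoint is now released
}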

Micro-checkpointing supports many of the elements of fault management, including detection, isolation, diagnosis, recovery, and restart. It requires additional capabilities for repair, and while it uses fault-detection mechanisms, it does not provide them itself. But it does provide the core strategy and policies that establish ParalleX as a resilient model of computation.


9.3 Layered Fault Detection

Fault detection is a challenging problem. While some tasks, such as moving or storing data, can be managed by ECC- or CRC-like techniques, other aspects of computing are less tractable. Low-level fault detection will be an important part of the X-Caliber system design, with a fault model used to fully analyze coverage. But ParalleX takes a more aggressive strategy by supporting and encouraging multiple, overlapping levels of fault detection. The CVC-based micro-checkpointing framework allows compilers, runtime systems, and operating systems to augment hardware detection, although at a coarser granularity. But the ParalleX model also accepts contributions from users or their library surrogates. Two major contributions come from this highest level. The first is specifying what to do in response to unrecoverable faults for specific integrity domains. The second is adding tests, provided by libraries and DSL programming tools, that perform high-level checks of functionality. This can also allow functional programming methods to be used to define the boundaries of integrity domains to help automate the micro-checkpointing of an application.
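One way to picture the overlapping layers is as an ordered set of detectors, each with its own cost and recovery granularity; the layer names and the registry below are illustrative assumptions, not a ParalleX interface.

#include <functional>
#include <vector>

// Illustrative layered-detection registry: cheap, fine-grain checks run
// low in the stack; costlier, coarse-grain checks (library-, DSL-, or
// user-supplied) overlap them higher up.
enum class Layer { Hardware, Runtime, OperatingSystem, Library, User };

struct Detector {
    Layer layer;                  // where the check lives
    std::function<bool()> check;  // returns true when a fault is seen
};

// A fault missed at one layer may still be caught, somewhat later and
// at coarser recovery granularity, by a higher layer.
bool any_fault(const std::vector<Detector>& detectors) {
    for (const Detector& d : detectors)
        if (d.check())
            return true;
    return false;
}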


10. Protection

Security of resource accessibility and data integrity is essential for future computing in commercial, industrial, and defense-related mission-critical applications. Security demands that systems be accessed only by authorized individuals, institutions, and agents. Security anticipates, detects, and defends against malicious threats and attacks from external perpetrators via channels between a controlled target system and the uncontrolled external world. But security strategies and measures must also protect against internal attacks by malicious agents that have gained access through institutional means. Thus security must be robust for any given application even without full assurance of security at the system level. ParalleX incorporates means of protection upon which security policies can be implemented. It provides inter- and intra-process mechanisms that allow integrity of data and operation as well as sustained performance (e.g., against denial-of-service attacks). Security strategies are diverse and constantly evolving. No single fixed strategy can anticipate all future forms of aggression, and responsive measures must be able to evolve and adapt over time and context in response. ParalleX does not presume to predict all possible modes of attack nor to provide fault-proof operation. Instead, ParalleX treats security methods as policy and provides the necessary underlying mechanisms from which such security systems may be implemented. Therefore, the ParalleX model of computation defines a specific but flexible protection framework that, if incorporated in all ParalleX-compliant system architectures, can permit consistency and portability of diverse security strategies. With the ParalleX protection mechanisms, security rings can be established within the system for layered protection.

ParalleX protection is based on capabilities-based addressing and the management of access rights. Capabilities-based addressing imposes a permissions requirement on all accesses to logical or physical resources. It also provides the necessary rights and protocols for sharing such access rights with other logical resources.

The boundary of protection within the ParalleX model is the ParalleX Process. Within a process all accesses by a complex (thread) to a local first class object are considered safe. Otherwise, an access attempt by a complex (e.g., thread) from one process to any object in another ParalleX process requires access privileges to that object. There are two layers of protection imposed by the ParalleX model. The first relates to the target data. As part of the data typing, the kind of accesses and their sources are specified. For example, some data within a process may be considered private to that process. Therefore, it cannot be accessed by any complex of another process. Some data may be read-only or write-once data such that write accesses are prohibited from external processes. This extends to the use of methods which may in turn copy or modify process state. The second layer of protection relates directly to the source process. At the time of instantiation, a process is assigned specific external access privileges.
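The two layers of protection just described might combine as in the sketch below; the rights encoding and field names are hypothetical, since ParalleX specifies the capability mechanism, not a particular representation.

#include <cstdint>

// Hypothetical rights bits conveyed by a capability.
enum Rights : std::uint32_t { READ = 1, WRITE = 2, EXECUTE = 4 };

struct Capability {             // held by the source process
    std::uint64_t object_id;
    std::uint32_t rights;       // externally assigned at instantiation
};

struct ObjectPolicy {           // layer 1: the target data's typing
    bool private_to_process;    // e.g. data private to its process
    std::uint32_t allowed;      // e.g. read-only data excludes WRITE
};

bool access_permitted(const Capability& cap, const ObjectPolicy& obj,
                      bool same_process, std::uint32_t requested)
{
    if (same_process)                              // intra-process access
        return true;                               // is considered safe
    if (obj.private_to_process)                    // layer 1: target data
        return false;                              // restrictions
    if ((obj.allowed & requested) != requested)
        return false;
    return (cap.rights & requested) == requested;  // layer 2: source
}                                                  // process privileges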


Parcels present a special problem. To ensure protection of data being transferred, and therefore exposed to interception, an encrypted capabilities field is incorporated within the parcel format to convey privileges for accessing remote process data. Note that not all accesses between two processes require parcels. Two or more processes can share the resources of a synchronous domain and therefore not require parcel communications.


11. Self-Aware Execution

11.1 What Does It Mean to Be "Self-Aware"?

Conventional systems employ a combination of open-loop control methods, driven by compiler and programmer specification, and a separate, simple operating system for multi-tasking and multi-programming that is decoupled from the contents of the application. Historically, with some incremental advances in scheduling policies, the overall open-loop control strategy has dominated; advanced decisions were derived almost exclusively through the person-in-the-loop. Past some ill-defined level of complexity in time, space, irregularity, and requirements, this approach cannot deliver either optimality or responsiveness to exigencies that threaten correctness of results in bounded time.

ParalleX is self-aware. The ParalleX execution model provides higher-order intelligence to govern the management of the execution of a user's parallel problem on extreme-scale computer systems. ParalleX enables a closed-loop control system, so decades of experience with control theory can be applied to the management of computing systems, and ParalleX embodies certain key properties of such classical control systems. Experience with complex biological systems teaches the importance of a multi-level hierarchy of control, from the autonomic responsiveness of the central nervous system, through the situational awareness of the reptilian complex, to the strategic planning of the neo-cortex. ParalleX embodies key properties of organic control, dividing reaction among a low level for minimum-time local response, a medium level for system-wide basic quick response, and a high level for long-term planning against strategic goals and recovery.

11.2 Dynamic Adaptive

Dynamic adaptive system operation benefits from runtime information to make on-the-fly decisions concerning resource management and task scheduling. This permits a mode of control not available through compile-time-only program control. Such capability demands four kinds of support that must be shared at multiple system layers. The first is situational awareness: a means of detecting, measuring, and understanding the condition of a computation in its many forms and parameters. The second is an objective function that establishes the policy of optimality: what needs to be changed, and what kind of change is recognized to represent an improvement. The third is a planning method that, given the current state of the computation and the goal state implied by the objective function, determines what changes need to be made in terms of resource allocation and task scheduling. Finally, the fourth kind of support for dynamic-adaptive execution is the means by which control of both logical and physical resources is achieved, such that adjustments to both can result in the sought improvements. The ParalleX model incorporates means to achieve all four of these interrelated capabilities.
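A single closed-loop step combining the four kinds of support might look like the following; the state fields, the scalar objective, and the actuation knobs are illustrative assumptions.

// Illustrative closed-loop control step over the four supports.
struct SystemState {            // 1. situational awareness
    double utilization;
    double energy_per_op;
    double progress;
};

struct Plan {                   // 3. planned adjustments
    int    complexes_to_migrate;
    double clock_scale;
};

SystemState measure() { return {0.5, 1.0, 0.1}; }   // stub measurements

double objective(const SystemState& s) {            // 2. policy of
    return s.progress / s.energy_per_op;             //    optimality
}

Plan make_plan(const SystemState& s, double /*score*/) {   // 3. planning
    // Placeholder policy: sprint when utilization is low.
    return {0, s.utilization < 0.5 ? 1.1 : 1.0};
}

void actuate(const Plan&) { /* 4. apply via runtime/OS control points */ }

void control_step() {
    SystemState s = measure();
    actuate(make_plan(s, objective(s)));
}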

11.3 Introspection and Goal-Oriented

ParalleX is introspective. It is aware of its detailed status at every moment. Status is conceived as an abstract state comprising measured operational properties compared to the required condition at the same time. The class of status monitored for introspection differs for each of the three levels of the control system. Low-level introspection monitors energy, failures, utilization, and the demand of the workload present with respect to available resources. The required low-level condition is the lowest possible energy per unit of work, minimum interruption due to pathological behavior, best use of resources with respect to cost (high availability of some low-cost components is necessary for high utilization of precious resources), and a high rate of completing offered work. Medium-level introspection monitors the system-wide behavior of the application on the system resources. It provides a global perspective for balance and optimality as well as progress-to-goal. High-level introspection monitors the success gradient of the computation, both towards the goal of the application and the goals of the system (they are different). This high-level introspective layer understands the totality of operation both system-wide and time-wide, drawing on unbounded knowledge that may apply through past experience and complex planning strategies.

All three layers of introspection are closed-loop to provide adaptive control of resources in support of demand. The ParalleX multi-level self-aware structure is goal-oriented, using a multidimensional space of parameters that comprise an objective function, one for each layer. The objective function for each layer guides decisions local in time towards the representative goal condition. The objective function at each layer is mutable; it may be adjusted over time to improve the system's convergence to optimal behavior. Essentially, one wants a critically damped system for the most rapid convergence both towards optimality and towards the end goal.

11.4 Low-Level Autonomic

The low-level layer of the ParalleX self-aware system requires rapid response and operates locally in space and time. For performance, it provides dynamic adaptive allocation of resources to workflow demand. Specifically, under normal behavior this layer moves pending user complexes to physical resources as warranted, based on the combination of introspective measurements and status analysis, progress toward the goal as guided by the objective function, and informed prioritization for time and energy. Such priorities are provided by the OS, the application, and the medium level of self-aware ParalleX. Under anomalous circumstances, the low-level introspective system detects the nature of the pathology, locally and quickly reacts to mitigate the problem, and conveys event information to the medium level.


11.5 Medium-Level Global Balance

The medium-level layer of the ParalleX self-aware system performs a global, system-wide operation that is still local in time, although at coarser-grain control time steps. The medium level is introspective across the entire system and workload. It monitors the progress of the entire application towards its goal, the cost of the application in energy, the system's quality of performance and correctness, the system configuration, and the incidence of security threats. For every challenge, it incorporates a global response. For failed devices, it works across the operating system and runtime system as well as the architecture to create a modified system with new resource assignments matched to the application requirements, restarting the application after correcting the consequences of errors. Such pathologies are conveyed to the high-level intelligence layer for long-term changes to the overall strategies of how best to employ the system.

11.6 High-Level Intelligent Perspective

The high-level ParalleX layer of the self-aware system provides a framework for determining long-term operational policy that guides the medium-level Global Balance layer and, indirectly, the low-level Autonomic layer. It performs three major functions for advanced self-awareness:

Planning – determining a priori how the system will fulfill its computing obligations within bounded time, space, and energy declarative requirements, and how it will respond to unanticipated exigencies, either from the system itself (such as in the presence of an error or failure) or from an external source such as an application user, a system administrator, or an incorporating system (e.g., an embedded real-time control system such as a UAV).

Learning – building experience from prior activities such as earlier runs of the same applications and post-mortem analysis. Learning is also critical for anticipating future failures, detecting software errors, and identifying components with poor energy usage. It can also model external demands, noting likely real-time requirements or frequently requested services.

Back-out – systems cannot be allowed to crash. When a system is put into an untenable state that would cause an ordinary system to stop, the Intelligence layer will catch this and back the system out of that wedged condition.


12. Energy Management

Energy is a resource. Its rate of use is power. Both impose bounds on efficient operation. An application (in most cases) has a base-level energy consumption budget: the full set of operations requires a minimum amount of energy. Additional energy will be required depending on data movement, speculative execution, runtime and operating system support mechanisms, and additional work for fault tolerance and security. A last contribution to energy consumption is the choice of processor core rate, modulated by variable clock rate, voltage, or other perturbations. This is of value for such techniques as "sprinting" for attacking Amdahl effects.

Trading off energy and power against performance requires advanced dynamic methods of applying work demand to physical resources. Energy, like time, is a resource that needs to be managed. The number of pJ/nsec is a measure of power, while ops/nsec is a measure of performance; the tradeoff is therefore performance versus power. But energy usage varies continuously as the computational workflow varies according to the program requirements and the balance of the machine. ParalleX incorporates a means of managing energy without imposing the policies to be employed.
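To make the units concrete (a simple dimensional check, not part of the model): 1 pJ/nsec = 10⁻¹² J / 10⁻⁹ s = 10⁻³ W = 1 mW, and dividing performance by power gives (ops/nsec) / (pJ/nsec) = ops/pJ, an energy efficiency whose reciprocal, pJ/op, is the energy cost of a unit of work.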

As part of the ParalleX structure, there is a set of interfaces that provide a source of operational state measurements and means of controlling key operational parameters. For each locality, ParalleX assumes it has access to monitoring mechanisms for:

 executing threads,

 local memory access,

 remote data movement, and

 ancillary control actions.

It also has measures of the energy requirements for each class of operation to provide estimates of power and energy consumption, as well as overall power usage at a coarser grain for correlation and calibration. For control purposes, ParalleX assumes the ability to modulate key operational parameters including:

 voltage rails,

 clock rate for processor cores,

 bit rate for communication channels,

 resource modules on or off (active/quiescent), and others.

With these, ParalleX includes interface protocols for policy procedures to be incorporated for optimizing power consumption according to user requirements.
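An abstract per-locality interface capturing the monitoring and control points enumerated above might look like the following sketch; the method names and units are assumptions, since ParalleX specifies the capabilities rather than an API.

#include <cstdint>

// Illustrative per-locality energy interface; names are placeholders.
class LocalityEnergyInterface {
public:
    virtual ~LocalityEnergyInterface() = default;

    // Monitoring: operational state measurements.
    virtual double executing_threads() const = 0;
    virtual double local_memory_accesses() const = 0;
    virtual double remote_data_movement() const = 0;
    virtual double ancillary_control_actions() const = 0;

    // Control: key operational parameters.
    virtual void set_voltage_rail(int rail, double volts) = 0;
    virtual void set_core_clock_rate(int core, double ghz) = 0;
    virtual void set_channel_bit_rate(int channel, double gbps) = 0;
    virtual void set_module_active(int module, bool active) = 0;
};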
