Introducing Ernst Kretzek
Acceleration & Image Quality
Institute for Data Processing and Electronics
KIT – University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
www.kit.edu

Who am I? Ernst Kretzek
• My roots
  • Born 1982 in Hermannstadt / Sibiu (Romania)
• My education
  • Technical secondary school (Pforzheim)
  • Dipl.-Ing. in Electrical Engineering and Information Technology at KIT
• My hobbies
  • Sport (soccer, volleyball, TT, climbing)

Professional Interests
• Research
  • Acceleration with GPUs and FPGAs
  • Image reconstruction with 3D USCT
  • Analyzing and improving image quality
• Technologies
  • 3D USCT
  • GPU and FPGA programming
  • High-performance computing
  • Electrical engineering, PCB design

Project
• Acceleration by a factor of 100
• Fast speed-of-sound and attenuation correction
• New modality: reflection characteristics
[Figure: SAFT reconstructions without vs. with speed-of-sound (SOS) and attenuation (ATT) correction; SAFT speed-up ~100; data volumes 5.4 TB, 236 GB and 25 GB; max. PSNR +32%]

The Future – new technologies or research trends?
• Further acceleration with parallel hardware: reconstruction with more than one server
• Use available hardware
  • InfiniBand
  • Maxwell GPUs?
• CUDA-aware MPI, or GPI-2 (Fraunhofer)?
  • open source
  • supports GPUs (RDMA)

GPI-2

History
• GPI-2 is the next generation of GPI with more features. GPI has been evolving since 2005: it was first known as FVM (Fraunhofer Virtual Machine) and in 2009 settled on the name GPI (Global Address Space Programming Interface). GPI has completely replaced MPI at the Fraunhofer ITWM, where all products and research are based on it. In 2011, the Fraunhofer ITWM and its partners, such as Fraunhofer SCAI, TUD, T-Systems SfR, DLR, KIT, FZJ, DWD and Scapos, initiated and launched the GASPI project to define a novel specification and API (GASPI, based on GPI) and to make this specification a reliable, scalable and universal tool for the HPC community. GPI-2 is the first open-source implementation of the GASPI standard.

Features
• High performance
• Flexible API
• Failure tolerance
• Memory segments to support heterogeneous systems (e.g. Intel Xeon Phi)
• Threaded model and thread-safe interface

Requirements
• Hardware: InfiniBand or RoCE devices
• Software: OFED software stack installation (in particular libibverbs)
• An ssh server running on the compute nodes (password-less access)

Concepts
• GPI-2 (and GASPI) provides a small set of distinguishing concepts, summarised in the following sections.

Segments
• Modern hardware typically involves a hierarchy of memories with respect to the bandwidth and latency of read and write accesses. Within that hierarchy are non-uniform memory access (NUMA) partitions, solid-state devices (SSDs), graphics processing unit (GPU) memory and many-integrated-core (MIC) memory. Memory segments map this variety of hardware layers onto the software layer. In the spirit of the PGAS approach, GASPI segments may be globally accessible from every thread of every GASPI process. Segments can also be used to leverage different memory models within a single application or even to run different applications. (A short allocation sketch follows after the Groups section below.)

Groups
• A group is a subset of all ranks, and a collective operation on a group is restricted to the ranks forming it. There is an initial group (GASPI_GROUP_ALL) of which all ranks are members. Forming a new group involves three steps: creation, addition of members and a commit; these operations must be performed by all ranks forming the group. Creation is done with gaspi_group_create. If this operation is successful, ranks can be added to the created group with gaspi_group_add. To be able to use the group, all added ranks must then commit to it via gaspi_group_commit, a collective operation among the ranks of the group. The sketch below illustrates this sequence.
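To make the three-step group setup concrete, here is a minimal sketch in C against the GASPI API (GASPI.h) that GPI-2 implements. The even-ranks grouping and the CHECK macro are illustrative choices of this sketch, not part of the API.

#include <GASPI.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative error-checking helper, not part of the GASPI API. */
#define CHECK(call)                                        \
  do {                                                     \
    if ((call) != GASPI_SUCCESS) {                         \
      fprintf(stderr, "GASPI call failed: %s\n", #call);   \
      exit(EXIT_FAILURE);                                  \
    }                                                      \
  } while (0)

int main(void)
{
  CHECK(gaspi_proc_init(GASPI_BLOCK));

  gaspi_rank_t rank, nranks;
  CHECK(gaspi_proc_rank(&rank));
  CHECK(gaspi_proc_num(&nranks));

  /* Example: a group containing only the even ranks. Every member
     must perform the same create / add / commit sequence. */
  if (rank % 2 == 0) {
    gaspi_group_t evens;
    CHECK(gaspi_group_create(&evens));
    for (gaspi_rank_t r = 0; r < nranks; r += 2)
      CHECK(gaspi_group_add(evens, r));
    /* Collective among the ranks forming the group. */
    CHECK(gaspi_group_commit(evens, GASPI_BLOCK));

    /* Collective operations are now restricted to this group. */
    CHECK(gaspi_barrier(evens, GASPI_BLOCK));
  }

  CHECK(gaspi_proc_term(GASPI_BLOCK));
  return 0;
}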
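The Segments concept from above can likewise be shown in a few lines: a hedged sketch that creates one segment on all ranks and obtains a local pointer into it. The segment id and size are arbitrary example values, and error handling is omitted for brevity.

#include <GASPI.h>

/* Sketch: create one 16 MiB segment, visible to all ranks, and
   obtain a local pointer into it. Real code should check every
   return value. */
void setup_segment(void)
{
  const gaspi_segment_id_t seg_id   = 0;        /* example id        */
  const gaspi_size_t       seg_size = 1 << 24;  /* 16 MiB, arbitrary */

  /* Collective across GASPI_GROUP_ALL: allocates the memory,
     registers it and makes it remotely accessible in one call. */
  gaspi_segment_create(seg_id, seg_size, GASPI_GROUP_ALL,
                       GASPI_BLOCK, GASPI_MEM_INITIALIZED);

  gaspi_pointer_t base = NULL;
  gaspi_segment_ptr(seg_id, &base);  /* local address of the segment */

  /* 'base' can be used like ordinary memory locally; remote ranks
     reach the same segment via one-sided reads and writes. */
  (void)base;
}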
One-sided communication
• One-sided asynchronous communication is the basic communication mechanism provided by GPI-2. It comes in two flavours: read and write operations (single or as lists) from and into allocated segments. Moreover, the write operations can be extended with notifications to enable remote-completion events that a remote rank can react to. One-sided operations are non-blocking and asynchronous, allowing the program to continue its execution alongside the data transfer. (A sketch of the notified-write pattern follows after the Timeouts section below.)
• The communication routines in GPI-2 are: gaspi_write, gaspi_write_list, gaspi_read, gaspi_read_list, gaspi_wait, gaspi_notify, gaspi_write_notify, gaspi_write_list_notify, gaspi_notify_waitsome, gaspi_notify_reset.

Queues
• Communication requests can be submitted to one of several queues. Queues improve scalability and can be used as channels for different types of requests: similar requests are queued together and synchronised together, but independently of the other queues (separation of concerns).

Global atomics
• GPI-2 provides atomic operations so that variables can be manipulated atomically. There are two basic atomic operations: fetch_and_add and compare_and_swap. The manipulated values can be used as globally shared variables and to synchronise processes or events.

Timeouts
• Failure-tolerant parallel programs require non-blocking communication calls, so GPI-2 provides a timeout mechanism for all potentially blocking procedures. Timeouts are specified in milliseconds. For instance, GASPI_BLOCK is a predefined timeout value that blocks the procedure call until completion; GASPI_TEST is another predefined value that blocks the procedure for the shortest time possible, i.e. the time in which the call processes an atomic portion of its work. (Both atomics and timeouts appear in the second sketch below.)
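A hedged sketch of the write-with-notification pattern described above, assuming a segment with id 0 has already been created on all ranks (as in the earlier segment sketch); the ranks, offsets, notification id and queue number are illustrative.

#include <GASPI.h>

/* Sketch: rank 0 pushes 'len' bytes from offset 0 of its segment into
   offset 0 of rank 1's segment and signals completion via notification
   id 0; rank 1 blocks on that notification (remote completion). */
void notified_exchange(gaspi_rank_t rank, gaspi_size_t len)
{
  const gaspi_segment_id_t seg = 0;  /* segment created beforehand */
  const gaspi_queue_id_t   q   = 0;  /* requests go to queue 0     */

  if (rank == 0) {
    /* Non-blocking one-sided write plus remote notification. */
    gaspi_write_notify(seg, 0,      /* local segment and offset  */
                       1,           /* target rank               */
                       seg, 0,      /* remote segment and offset */
                       len,
                       0, 1,        /* notification id and value */
                       q, GASPI_BLOCK);

    /* ... overlap computation here while the transfer proceeds ... */

    gaspi_wait(q, GASPI_BLOCK);     /* local completion of queue 0 */
  } else if (rank == 1) {
    gaspi_notification_id_t first;
    /* Wait for one notification, starting the search at id 0. */
    gaspi_notify_waitsome(seg, 0, 1, &first, GASPI_BLOCK);

    gaspi_notification_t value;
    gaspi_notify_reset(seg, first, &value);  /* atomically read + reset */
    /* The written data in the segment is now valid. */
  }
}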
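A small sketch combining the global atomics with the timeout mechanism: an atomic fetch-and-add on a counter owned by rank 0, followed by draining a queue with a finite timeout instead of GASPI_BLOCK. The 500 ms value and the counter location are arbitrary example choices.

#include <GASPI.h>

/* Sketch: atomically increment a global counter that lives at offset 0
   of segment 0 on rank 0, then drain a queue with a finite timeout so
   the caller can react to slow peers instead of blocking forever. */
void counter_and_timeout(gaspi_queue_id_t q)
{
  gaspi_atomic_value_t old;

  /* fetch_and_add returns the previous value in 'old'; such counters
     can serve as globally shared variables, e.g. for distributing
     work items among ranks. */
  gaspi_atomic_fetch_add(0,   /* segment id                */
                         0,   /* offset within the segment */
                         0,   /* rank owning the value     */
                         1,   /* value to add              */
                         &old, GASPI_BLOCK);

  /* A 500 ms timeout (arbitrary) instead of GASPI_BLOCK: gaspi_wait
     returns GASPI_TIMEOUT if queue q has not completed in time. */
  while (gaspi_wait(q, 500) == GASPI_TIMEOUT) {
    /* e.g. check peer health, log progress or eventually give up */
  }
}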
GASPI Project Activities – Summary
• Definition of the GASPI standard of a PGAS API; ensuring interoperability with MPI.
• Development of a high-performance library for one-sided and asynchronous communication based on the Fraunhofer PGAS API.
• Provision of a highly portable, open-source GASPI implementation.
• Adaptation and further development of the Vampir performance-analysis suite.
• Provision of efficient numerical libraries (core functions and higher-level solvers) for both sparse and dense linear systems.
• Porting of complex, industry-oriented applications.
• Evaluation, benchmarking and performance analysis.
• Information dissemination, formation of user groups, training and workshops.

Background and Motivation
• Parallel software is currently mainly based on the MPI standard, which was established in 1994 and has dominated application development ever since. Adapting parallel software to current hardware, dominated by higher core counts per CPU and heterogeneous systems, has exposed significant weaknesses of MPI that limit the scalability of applications on heterogeneous multi-core systems. As a result of this hardware development and of the goal of scaling to even higher CPU counts, programming models now face new demands regarding a flexible thread model, asynchronous communication and the management of storage subsystems with varying bandwidth and latency. This challenge to the software industry, also known as the "multicore challenge", stimulates the development of new programming models and programming languages and leads to new challenges for mathematical modelling, algorithms and their implementation in software.
• PGAS (Partitioned Global Address Space) programming models have been discussed as an alternative to MPI for some time. The PGAS approach offers the developer an abstract shared address space, which simplifies the programming task and at the same time facilitates data locality, thread-based programming and asynchronous communication. The goal of the GASPI project is to develop a suitable programming tool for the wider HPC community by defining a standard that gives the PGAS API of Fraunhofer ITWM a reliable basis for future developments. Furthermore, an implementation of the standard will be made available as a highly portable open-source library. The standard will also define interfaces for performance analysis, for which tools will be developed in the project. The libraries are evaluated through the parallel re-implementation of industrial applications, up to and including production status.