
Introducing Ernst Kretzek
Acceleration & Image Quality
Institute for Data Processing and Electronics
KIT – University of the State of Baden-Württemberg and
National Laboratory of the Helmholtz Association
www.kit.edu
Who am I?
Ernst Kretzek
• My roots
• Born 1982 in Hermannstadt / Sibiu (Romania)
• My education
• Technical secondary school (Pforzheim)
• Dipl.-Ing. in Electrical Engineering and Information Technology at KIT
• My hobbies
• Sports (soccer, volleyball, table tennis, climbing)
Professional Interests
• Research
• Acceleration with GPUs, FPGAs
• Image Reconstruction with 3D USCT
• Analysis and improvement of image quality
• Technologies
• 3D USCT
• GPU, FPGA Programming
• High Performance Computing
• Electrical Engineering
• PCB design
Project
• Acceleration by a factor of ~100
• Fast speed-of-sound and attenuation correction
• New modality: reflection characteristics
[Figure: SAFT reconstruction slices without vs. with speed-of-sound (SOS) and attenuation (ATT) correction; speed-up ~100; data volumes 5.4 TB / 236 GB / 25 GB; max. PSNR +32%]
The Future – new technologies or research trends?
• Further acceleration with parallel hardware
→ Reconstruction with more than one server
• Hardware to use
• InfiniBand
• Maxwell GPUs?
→ CUDA-aware MPI, or GPI-2 (Fraunhofer)?
• Open source
• GPU support (RDMA)
…
…
GPI-2
History
• GPI-2 is the next generation of GPI with more features. GPI has been evolving since 2005; it was first known as FVM (Fraunhofer Virtual Machine) and in 2009 settled on the name GPI (Global address space Programming Interface). GPI has completely replaced MPI at the Fraunhofer ITWM, where all products and research are based on GPI. In 2011, the Fraunhofer ITWM and its partners, such as Fraunhofer SCAI, TUD, T-Systems SfR, DLR, KIT, FZJ, DWD and Scapos, initiated and launched the GASPI project to define a novel specification and API (GASPI, based on GPI) and to make this novel specification a reliable, scalable and universal tool for the HPC community. GPI-2 is the first open source implementation.
Features
• High performance, flexible API, failure tolerance, memory segments to support heterogeneous systems (e.g. Intel Xeon Phi), threaded model and thread-safe interface
Requirements
• Hardware: InfiniBand or RoCE devices
• Software: OFED software stack installation (in particular libibverbs)
• An SSH server running on the compute nodes (passwordless login)
Concepts
• GPI-2 (and GASPI) provides several distinguishing concepts, described below.
Segments
• Modern hardware typically involves a hierarchy of memory with respect to the bandwidth and latency of read and write accesses. Within that hierarchy are non-uniform memory access (NUMA) partitions, solid state devices (SSDs), graphics processing unit (GPU) memory or many integrated cores (MIC) memory. The memory segments map this variety of hardware layers to the software layer. In the spirit of the PGAS approach, these GASPI segments may be globally accessible from every thread of every GASPI process. GASPI segments can also be used to leverage different memory models within a single application or even to run different applications.
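As an illustration, here is a minimal sketch in C against the GPI-2 API: it creates one segment registered with all ranks. The segment id (0), size (64 MiB) and allocation policy are illustrative choices, not prescribed by GPI-2.

#include <stdlib.h>
#include <GASPI.h>

/* Minimal sketch: create a 64 MiB segment, registered with every rank
 * of GASPI_GROUP_ALL so it is remotely accessible (PGAS-style).
 * Segment id and size are arbitrary choices for illustration. */
int main (void)
{
  gaspi_proc_init (GASPI_BLOCK);                 /* start GPI-2 */

  const gaspi_segment_id_t seg_id = 0;
  const gaspi_size_t seg_size = 1 << 26;         /* 64 MiB */

  if (gaspi_segment_create (seg_id, seg_size, GASPI_GROUP_ALL,
                            GASPI_BLOCK, GASPI_MEM_INITIALIZED)
      != GASPI_SUCCESS)
    exit (EXIT_FAILURE);

  gaspi_pointer_t ptr;
  gaspi_segment_ptr (seg_id, &ptr);              /* local view of segment */
  ((char *) ptr)[0] = 42;                        /* use it like normal memory */

  gaspi_proc_term (GASPI_BLOCK);                 /* shut down */
  return EXIT_SUCCESS;
}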
GPI-2
Groups
• A group is a subset of all ranks. The group members have common collective operations; a collective operation on a group is then restricted to the ranks forming that group. There is an initial group (GASPI_GROUP_ALL) of which all ranks are members.
Forming a group involves three steps: creation, addition and a commit. These operations must be performed by all ranks forming the group. The creation is performed using gaspi_group_create. If this operation is successful, ranks can be added to the created group using gaspi_group_add. To be able to use the created group, all ranks added to it must commit to the group. This is performed using gaspi_group_commit, a collective operation between the ranks in the group.
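A minimal sketch of those three steps in C, assuming the lower half of all ranks should form a group (the choice of ranks is illustrative only):

#include <GASPI.h>

/* Sketch: build a group of ranks 0 .. n/2-1. Every rank that is to be
 * a member must execute these calls. */
void build_lower_half_group (gaspi_group_t *group)
{
  gaspi_rank_t nranks;
  gaspi_proc_num (&nranks);

  gaspi_group_create (group);              /* step 1: creation */

  for (gaspi_rank_t r = 0; r < nranks / 2; r++)
    gaspi_group_add (*group, r);           /* step 2: add ranks */

  /* step 3: collective commit among the members of the group */
  gaspi_group_commit (*group, GASPI_BLOCK);
}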
One-sided communication
• One-sided asynchronous communication is the basic communication mechanism provided by GPI-2. The one-sided communication comes in two flavors: there are read and write operations (single or in a list) from and into allocated segments. Moreover, the write operations are extended with notifications to enable remote completion events which a remote rank can react on. One-sided operations are non-blocking and asynchronous, allowing the program to continue its execution along with the data transfer.
• The mechanisms for communication in GPI-2 are the following: gaspi_write, gaspi_write_list, gaspi_read, gaspi_read_list, gaspi_wait, gaspi_notify, gaspi_write_notify, gaspi_write_list_notify, gaspi_notify_waitsome, gaspi_notify_reset.
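As a sketch of the notified-write pattern: assuming both ranks have created segment 0 as above, rank 0 pushes 1 KiB to rank 1 and attaches a notification. Offsets, sizes, queue and notification ids are illustrative.

#include <GASPI.h>

/* Sketch: rank 0 writes 1 KiB from its segment 0 into rank 1's
 * segment 0 and attaches notification id 0; rank 1 waits for it. */
void notified_write_example (gaspi_rank_t my_rank)
{
  const gaspi_segment_id_t seg = 0;
  const gaspi_queue_id_t q = 0;

  if (my_rank == 0)
    {
      /* non-blocking: transfer and notification are posted to queue 0 */
      gaspi_write_notify (seg, 0,            /* local segment, offset  */
                          1,                 /* target rank            */
                          seg, 0,            /* remote segment, offset */
                          1024,              /* size in bytes          */
                          0, 1,              /* notification id, value */
                          q, GASPI_BLOCK);
      gaspi_wait (q, GASPI_BLOCK);           /* local completion       */
    }
  else if (my_rank == 1)
    {
      gaspi_notification_id_t got;
      gaspi_notification_t val;

      /* remote completion: block until notification 0 arrives */
      gaspi_notify_waitsome (seg, 0, 1, &got, GASPI_BLOCK);
      gaspi_notify_reset (seg, got, &val);   /* consume it atomically */
      /* the written data in segment 0 is now valid on this rank */
    }
}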
Queues
• There is the possibility to use different queues for communication requests, where each request can be submitted to one of the queues. These queues allow more scalability and can be used as channels for different types of requests, where similar types of requests are queued and then synchronised together but independently from the other ones (separation of concerns).
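For example, bulk data and small control messages can travel on separate queues, so that completing one channel does not stall the other. A sketch, where queue and segment ids are illustrative and segments 0 and 1 are assumed to exist:

#include <GASPI.h>

/* Sketch: bulk data on queue 0, small control words on queue 1,
 * so each channel can be synchronised independently. */
void two_queue_example (gaspi_rank_t target)
{
  const gaspi_queue_id_t Q_BULK = 0, Q_CTRL = 1;

  gaspi_write (0, 0, target, 0, 0, 1 << 20, Q_BULK, GASPI_BLOCK); /* 1 MiB */
  gaspi_write (1, 0, target, 1, 0, 8, Q_CTRL, GASPI_BLOCK);       /* 8 B   */

  gaspi_wait (Q_CTRL, GASPI_BLOCK);  /* flush only the control channel ... */
  gaspi_wait (Q_BULK, GASPI_BLOCK);  /* ... and the bulk channel separately */
}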
Global atomics
• GPI-2 provides atomic operations such that variables can be manipulated atomically. There are two basic atomic operations: fetch_and_add and compare_and_swap. The values can be used as global shared variables and to synchronise processes or events.
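As a sketch, a global ticket counter kept at offset 0 of rank 0's segment 0; the location is an illustrative choice (atomic values must be naturally aligned):

#include <GASPI.h>

/* Sketch: every rank can claim the next ticket from a shared counter
 * living at offset 0 of segment 0 on rank 0. */
gaspi_atomic_value_t next_ticket (void)
{
  gaspi_atomic_value_t old;

  /* fetch_and_add: atomically adds 1 on the remote side and
   * returns the previous value */
  gaspi_atomic_fetch_add (0, 0,   /* segment id, offset     */
                          0,      /* rank holding the value */
                          1,      /* value to add           */
                          &old, GASPI_BLOCK);
  return old;
}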
Timeouts
• Failure-tolerant parallel programs require non-blocking communication calls. GPI-2 provides a timeout mechanism for all potentially blocking procedures. Timeouts for procedures are specified in milliseconds. For instance, GASPI_BLOCK is a pre-defined timeout value which blocks the procedure call until completion. GASPI_TEST is another pre-defined timeout value which blocks the procedure for the shortest time possible, i.e. the time in which the procedure call processes an atomic portion of its work.
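A sketch of how a timeout turns a blocking wait into something a failure-tolerant program can react to (the 100 ms budget is an arbitrary choice):

#include <GASPI.h>

/* Sketch: wait at most 100 ms for queue 0 to drain. GASPI_TIMEOUT is
 * not an error; it only means "not finished yet". */
int try_flush_queue (void)
{
  gaspi_return_t ret = gaspi_wait (0, 100 /* ms */);

  if (ret == GASPI_TIMEOUT)
    return 0;   /* still pending: caller may retry or do other work */
  if (ret != GASPI_SUCCESS)
    return -1;  /* real failure: trigger recovery */
  return 1;     /* all requests on queue 0 have completed */
}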
GASPI
Project Activities – Summary
• Definition of the GASPI standard, a PGAS API, ensuring interoperability with MPI.
• Development of a high-performance library for one-sided and asynchronous communication based on the Fraunhofer PGAS-API.
• Provision of a highly portable and open source GASPI-implementation.
• Adaptation and further development of the Vampir Performance-Analysis-Suite.
• Provision of efficient numerical libraries (core functions and higher-level solvers) for both sparse and dense linear systems.
• Porting complex, industry-oriented applications.
• Evaluation, benchmarking and performance analysis.
• Information dissemination, formation of user groups, training and workshops.
Background and Motivation
• Parallel software is currently mainly based on the MPI standard, which was established in 1994 and has since dominated
applications development. The adaptation of parallel software for use on current hardware, which is mainly dominated by higher
core numbers per CPU and heterogeneous systems, has highlighted significant weaknesses of MPI, which preclude the scalability of
applications on heterogeneous multi-core systems. As a result of both the hardware development and the objective of achieving
scalability to even higher CPU numbers, we now see new demands on programming models in terms of a flexible thread model,
asynchronous communication, and the management of storage subsystems with varying bandwidth and latency. This challenge to
the software industry, also known as the "Multicore Challenge", stimulates the development of new programming models and
programming languages and leads to new challenges for mathematical modeling, algorithms and their implementation in software.
• PGAS (Partitioned Global Address Space) programming models have been discussed as an alternative to MPI for some time. The
PGAS approach offers the developer an abstract shared address space which simplifies the programming task and at the same time
facilitates data locality, thread-based programming and asynchronous communication. The goal of the GASPI project is to develop a
suitable programming tool for the wider HPC-Community by defining a standard with a reliable basis for future developments
through the PGAS-API of Fraunhofer ITWM. Furthermore, an implementation of the standard as a highly portable open source
library will be available. The standard will also define interfaces for performance analysis, for which tools will be developed in the
project. The evaluation of the libraries is done via the parallel re-implementation of industrial applications up to and including
production status.
Download