TTG Apptimizer, a CPU+GPU autotuning toolkit
A product of ttgLabs

Customer Problem

Hybrid systems that use GPUs intensively for parallel computation are becoming increasingly popular. However, to exploit the real computational power of a GPU, that is, to reach hundreds of GFLOPS on a single device and a 30- to 100-fold performance gain over a CPU, the application must be optimized carefully and tailored both to the architecture of the GPU it runs on and to the structure of the data it processes. Unfortunately, the architectures of graphics accelerators from different vendors, and even of different device generations from the same vendor, may differ dramatically, so each requires its own, mutually incompatible optimization methods.

Developing and optimizing applications for hybrid platforms involves several technical difficulties. One is choosing the right processor type for each part of the algorithm, which requires detailed a priori knowledge of the platform architecture. Another is fitting the so-called ‘magic constants’, such as block size or thread topology, to the architecture of the graphics accelerator. Although these constants can have a noticeable effect on application performance, there is no rule of thumb for choosing their values.

Traditional Approach

With the current programming paradigm and development tools, accounting for the particular architectural features of the computational cores is a tough task. Developing applications that use the full potential of hybrid platforms remains unacceptably time-consuming. Developers have to learn an entirely new programming discipline, which significantly increases time to market.
While several code analyzers on the market make it much easier to find application bottlenecks, in practice the developer usually has to rely on trial and error, or simply guess suitable values for the aforementioned ‘magic constants’, to arrive at efficiently optimized code. Even then, keeping application performance at a comparable level after the hardware or the data structure changes usually means redoing this work from the beginning. That is why it is much easier to write separate code for each type of processing unit than to build highly optimized ‘one-size-fits-all’ software. As a result, developers have to choose between creating a single universal version of an application whose performance is two to three times lower than what is achievable, and spending additional resources to develop and support several versions of the application for various GPUs.

Our Solution - TTG Apptimizer

In contrast to traditional solutions, we offer an entirely new, dynamic approach to the software optimization problem. Its key idea is software autotuning, or dynamic optimization: the application tailors itself to the particular hardware platform and data structure at runtime. This approach is implemented in the TTG Apptimizer toolkit, which contains a library of C++ templates and a set of mechanisms for application autotuning. Its key components are ‘smart’ optimization algorithms that take into account various behavioral models of hybrid software and optimize several dynamic parameters transparently to the application. TTG Apptimizer makes it possible to use all available processing units of a hybrid system simultaneously and balances the load between them. The toolkit efficiently solves the most tedious problems of ‘hybrid coding’ that developers usually encounter.
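To make the autotuning idea concrete, here is a minimal sketch of runtime tuning of a single ‘magic constant’. It is not the TTG Apptimizer API. The names `blocked_sum`, `time_once`, `autotune` and `tune_block_size`, the candidate values, and the CPU-only stand-in kernel are all assumptions chosen for illustration. The sketch simply times each candidate block size on the actual data and keeps the fastest one, which is the simplest possible form of dynamic optimization.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in kernel: sums a vector in chunks of `block` elements.
// `block` plays the role of a 'magic constant' such as a GPU block size.
double blocked_sum(const std::vector<double>& data, std::size_t block) {
    if (block == 0) block = 1;  // guard against an endless loop
    double total = 0.0;
    for (std::size_t start = 0; start < data.size(); start += block) {
        const std::size_t end = std::min(start + block, data.size());
        double partial = 0.0;
        for (std::size_t i = start; i < end; ++i) partial += data[i];
        total += partial;
    }
    return total;
}

// Wall-clock time of a single call, in seconds.
template <typename F>
double time_once(F&& run) {
    const auto t0 = std::chrono::steady_clock::now();
    run();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Returns the candidate with the lowest measured cost.
// `measure` maps a candidate to a cost (for example, elapsed seconds).
// Ties go to the earlier candidate; `candidates` must not be empty.
template <typename Measure>
std::size_t autotune(const std::vector<std::size_t>& candidates,
                     Measure&& measure) {
    assert(!candidates.empty());
    std::size_t best = candidates.front();
    double best_cost = std::numeric_limits<double>::infinity();
    for (std::size_t c : candidates) {
        const double cost = measure(c);
        if (cost < best_cost) {
            best_cost = cost;
            best = c;
        }
    }
    return best;
}

// Picks a block size for blocked_sum by timing each candidate on `data`.
// The candidate list is an arbitrary assumption for this sketch.
std::size_t tune_block_size(const std::vector<double>& data) {
    const std::vector<std::size_t> candidates{64, 256, 1024, 4096};
    volatile double sink = 0.0;  // keeps the timed call from being optimized away
    return autotune(candidates, [&](std::size_t block) {
        return time_once([&] { sink = blocked_sum(data, block); });
    });
}
```

A caller would run `tune_block_size(data)` once for a given data set and reuse the result for subsequent calls. Because `autotune` takes the measurement as a callable, the timing function could be replaced by a GPU timer or by an averaged measurement over several runs without changing the search itself. A production autotuner would also need to handle measurement noise and a much larger parameter space, which this sketch deliberately ignores.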
TTG Apptimizer directs all of the computer's processing power to computational tasks by distributing them efficiently between CPUs and GPUs and balancing the load between these two types of processing units. It performs several dynamic optimization procedures, which makes it possible to develop new applications for hybrid platforms and to port existing software to them, sometimes without significant recoding. TTG Apptimizer can be regarded as an extension of widely used parallel programming tools, so the cost of integrating it into the software development process is significantly lower. In essence, it runs on top of existing industrial solutions, making them easier to use and significantly lowering the skill level required of the customer's developers.

How It Works

The developer makes minimal modifications to the source code, simply embedding TTG Apptimizer primitives into the computing kernels. That is all. Even for very complicated code, this usually takes no more than a couple of days. At runtime, the optimizer module gathers information about the available GPUs and the data being processed. Each kernel is then automatically tuned for the current usage scenario, and the computations are distributed efficiently among all GPUs, providing performance close to the maximum achievable for this particular system and data structure.

Competitive Advantages

- Autotuning (the application optimizes itself at runtime)
- Universal solution (no platform lock-in; supports various GPUs, OSs, APIs and compilers)
- Shorter time to market for new customer applications
- Lower demands on the skills of the customer's developers

Potential Customers

TTG Apptimizer can be used in a broad range of HPC areas to solve computational problems that can be parallelized efficiently on hybrid platforms.
Its application areas include various disciplines of physics, chemistry and computational biology, drug design, geological prospecting and meteorology, ecology and natural disaster forecasting, automobile and aircraft design, semantic analysis and business intelligence. Potential customers are enterprises that actively use HPC applications: universities and other research organizations, design departments in various industries, oil and gas companies, pharmaceutical companies, data centers of meteorological agencies and of organizations involved in seismological data analysis and the simulation of global processes, and any other companies that work with computationally intensive applications.

Available Editions and Prices

The TTG Apptimizer toolkit is currently available in three editions: Lite (Trial), Workstation and Mini-Cluster. The Lite edition can be downloaded free of charge from ttgLabs.com. The Workstation edition supports from one to ten GPUs, with prices starting at 500 USD for use on 1 or 2 GPUs. The Mini-Cluster edition (under development) is intended for systems with at least three GPUs, with prices starting at 2290 USD. A detailed TTG Apptimizer price list is available on request.

Support

Basic customer support is to be provided by the reseller. Second-line support is provided by the vendor during working hours by e-mail, Skype or phone. Special support plans can also be discussed.

Contact

Pavel Ivanov, PhD
Co-founder and Deputy CEO, Business Development
p_ivanov@ttgLabs.com
+7 903 121 1420
ttgLabs.com