GPGPUs/DPAs
Dezső Sima
April 2011 (v1.0, last updated 04/15/2011)
© Dezső Sima 2011

1. Introduction (1)

Aim
Brief introduction and overview.

Contents
1. Introduction
2. Basics of the SIMT execution
3. Overview of GPGPUs
4. Overview of data parallel accelerators
5. References

1. Introduction

1. Introduction (2)
Representation of objects by triangles
(Figure labels: Vertex, Edge, Surface)
Vertices
• have three spatial coordinates,
• carry supplementary information necessary to render the object, such as
  • color
  • texture
  • reflectance properties
  • etc.

1. Introduction (3)
Main types of shaders in GPUs
• Vertex shaders: transform each vertex’s 3D position in the virtual space to the 2D coordinate at which it appears on the screen
• Pixel shaders (fragment shaders): calculate the color of the pixels
• Geometry shaders: can add or remove vertices from a mesh

1. Introduction (4)
DirectX version | Pixel SM | Vertex SM | Supporting OS
8.0 (11/2000) | 1.0, 1.1 | 1.0, 1.1 | Windows 2000
8.1 (10/2001) | 1.2, 1.3, 1.4 | 1.0, 1.1 | Windows XP / Windows Server 2003
9.0 (12/2002) | 2.0 | 2.0 |
9.0a (3/2003) | 2_A, 2_B | 2.x |
9.0c (8/2004) | 3.0 | 3.0 | Windows XP SP2
10.0 (11/2006) | 4.0 | 4.0 | Windows Vista
10.1 (2/2008) | 4.1 | 4.1 | Windows Vista SP1 / Windows Server 2008
11 (in development) | 5.0 | 5.0 |
Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX and MS’s OSs [18], [21]
DirectX: Microsoft’s API set for MM/3D

1. Introduction (3)
Convergence of important features of the vertex and pixel shader models
Subsequent shader models typically introduce a number of new or enhanced features.
The vertex and pixel shader models differ from one shader model to the next in their precision requirements, instruction sets and programming resources.

Shader model 2 [19]
• Different precision requirements
  Vertex shader: FP32 (coordinates)
  Pixel shader: FX24 (3 colors x 8)
• Different instructions
• Different resources (e.g. registers)

Shader model 3 [19]
• Unified precision requirements for both shaders (FP32), with the option to specify partial precision (FP16 or FP24) by adding a modifier to the shader code
• Different instructions
• Different resources (e.g. registers)

1. Introduction (3)
Shader model 4 (introduced with DirectX 10) [20]
• Unified precision requirements for both shaders (FP32), with the possibility to use new data formats
• Unified instruction set
• Unified resources (e.g. temporary and constant registers)

Shader architectures of GPUs prior to SM4
GPUs prior to SM4 (DirectX 10) have separate vertex and pixel units with different features.
Drawback of having separate units for vertex and pixel shading
• Inefficiency of the hardware implementation (vertex shaders and pixel shaders often have complementary load patterns [21]).

1. Introduction (5)
Unified shader model (introduced in SM 4.0 of DirectX 10.0)
Unified, programmable shader architecture
The same (programmable) processor can be used to implement all shaders:
• the vertex shader,
• the pixel shader and
• the geometry shader (new feature of SM 4).

1. Introduction (6)
Figure: Principle of the unified shader architecture [22]

1. Introduction (7)
Based on its FP32 computing capability and the large number of FP units available, the unified shader is a promising candidate for speeding up HPC.
GPUs with unified shader architectures are also termed GPGPUs (General Purpose GPUs) or cGPUs (computational GPUs).

1. Introduction (8)
Figure: Peak FP32/FP64 performance of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [43]

1. Introduction (9)
Figure: Evolution of the FP32 performance of GPGPUs [44]

1. Introduction (9)
Figure: Evolution of the bandwidth of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [43]

1. Introduction (10)
Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11]

1. Introduction (9)
Background slides to Introduction

1. Introduction
Figure: Peak SP FP performance of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [11]

1. Introduction
Figure: Bandwidth values of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [11]

2. Basics of the SIMT execution

2. Basics of the SIMT execution (1)
Main alternatives of data parallel execution

SIMD execution
• One-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors
• Needs an FX/FP SIMD extension of the ISA
• E.g. 2nd and 3rd generation superscalars

SIMT execution
• Two-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (matrices)
• Is massively multithreaded, and provides
  • data dependent flow control as well as
  • barrier synchronization
• Needs an FX/FP SIMT extension of the ISA and the API
• E.g. GPGPUs, data parallel accelerators

Figure: Main alternatives of data parallel execution

2. Basics of the SIMT execution (2)
Scalar, SIMD and SIMT execution
• Scalar execution. Domain of execution: single data elements
• SIMD execution. Domain of execution: elements of vectors
• SIMT execution. Domain of execution: elements of matrices (at the programming level)
Figure: Domains of execution in case of scalar, SIMD and SIMT execution
Remark
SIMT execution is also termed SPMD (Single_Program Multiple_Data) execution (Nvidia).

2. Basics of the SIMT execution (3)
Key components of the implementation of SIMT execution
• Data parallel execution
• Massive multithreading
• Data dependent flow control
• Barrier synchronization

2. Basics of the SIMT execution (4)
Data parallel execution
Performed by SIMT cores.
SIMT cores execute the same instruction stream on a number of ALUs (i.e. all ALUs of a SIMT core typically perform the same operation).
(Figure content: a SIMT core with one Fetch/Decode unit driving eight ALUs)
Figure: Basic layout of a SIMT core
SIMT cores are the basic building blocks of GPGPUs or data parallel accelerators.
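The lockstep behaviour of a SIMT core described above can be illustrated with a small software model. The Python sketch below is only an illustration under stated assumptions: the names `simt_core_execute`, `run_on_matrix` and `NUM_ALUS` are hypothetical, the value of eight ALUs is taken from the figure, and real hardware works on the elements in parallel rather than in a loop.

```python
# Toy model of a SIMT core: one fetched/decoded instruction is applied
# by every ALU to its own data element (lockstep execution).
# NUM_ALUS = 8 mirrors the figure; all names here are illustrative only.

NUM_ALUS = 8


def simt_core_execute(instruction, operands):
    """Apply one instruction to all operands, NUM_ALUS elements at a time."""
    results = []
    for start in range(0, len(operands), NUM_ALUS):
        group = operands[start:start + NUM_ALUS]  # one element per ALU
        # Every ALU performs the same operation on its own element.
        results.extend(instruction(x) for x in group)
    return results


def run_on_matrix(instruction, matrix):
    """Map a 2-D execution domain onto the core, one thread per element."""
    rows, cols = len(matrix), len(matrix[0])
    flat = [matrix[i][j] for i in range(rows) for j in range(cols)]
    out = simt_core_execute(instruction, flat)
    return [out[i * cols:(i + 1) * cols] for i in range(rows)]


if __name__ == "__main__":
    m = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    # A multiply-add of the form a*x + c applied to every matrix element.
    print(run_on_matrix(lambda x: 2.0 * x + 1.0, m))
    # prints [[3.0, 5.0, 7.0], [9.0, 11.0, 13.0]]
```

The model deliberately keeps a single instruction for all elements; per-element differences can only come from the data, which is the property the rest of this section builds on.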
During SIMT execution, 2-dimensional matrices are mapped to blocks of SIMT cores.

2. Basics of the SIMT execution (5)
Remark 1
Different manufacturers designate SIMT cores differently, such as
• streaming multiprocessor (Nvidia),
• superscalar shader processor (AMD),
• wide SIMD processor, CPU core (Intel).

2. Basics of the SIMT execution (6)
Each ALU is allocated a working register set (RF).
(Figure content: a Fetch/Decode unit driving eight ALUs, each with its own RF)
Figure: Main functional blocks of a SIMT core

2. Basics of the SIMT execution (7)
SIMT ALUs typically perform RRR operations, that is, ALUs take their operands from and write the calculated results to the register set (RF) allocated to them.
Figure: Principle of operation of the SIMD ALUs

2. Basics of the SIMT execution (8)
Remark 2
Actually, the register sets (RF) allocated to each ALU are given parts of a large enough register file.
Figure: Allocation of distinct parts of a large register set as workspaces of the ALUs

2. Basics of the SIMT execution (9)
Basic operation of recent SIMT ALUs
• They basically execute SP FP-MADD (single precision, i.e. 32-bit, Multiply-Add) instructions of the form a×b+c.
• They are pipelined, capable of starting a new operation every clock cycle (more precisely, every shader clock cycle). That is, without further enhancements their peak performance is 2 SP FP operations/cycle.
• They need a few clock cycles, e.g. 2 or 4 shader cycles, to present the results of the SP FMADD operations to the RF.

2. Basics of the SIMT execution (10)
Additional operations provided by SIMT ALUs
• FX operations and FX/FP conversions,
• DP FP operations,
• trigonometric functions (usually supported by special functional units).

2. Basics of the SIMT execution (11)
Massive multithreading
Aim of massive multithreading: to speed up computations by increasing the utilization of available computing resources in case of stalls (e.g. due to cache misses).
Principle
• Suspend stalled threads from execution and allocate ready-to-run threads for execution.
• When a large enough number of threads is available, long stalls can be hidden.

2. Basics of the SIMT execution (12)
Multithreading is implemented by creating and managing parallel executable threads for each data element of the execution domain.
(Figure label: Same instructions for all data elements)
Figure: Parallel executable threads for each element of the execution domain

2. Basics of the SIMT execution (13)
Multithreading is implemented effectively if thread switches, called context switches, do not cause cycle penalties.
This is achieved by
• providing separate contexts (register space) for each thread, and
• implementing a zero-cycle context switch mechanism.

2. Basics of the SIMT execution (14)
(Figure content: a SIMT core with a Fetch/Decode unit, ALUs and a register file (RF) that holds a separate context (CTX) for each thread; a context switch selects the actual context)
Figure: Providing separate thread contexts for each thread allocated for execution in a SIMT ALU

2. Basics of the SIMT execution (15)
Data dependent flow control
Implemented by SIMT branch processing.
In SIMT processing both paths of a branch are executed one after the other, such that for each path the prescribed operations are executed only on those data elements which fulfill the data condition given for that path (e.g. xi > 0).
Example

2. Basics of the SIMT execution (16)
Figure: Execution of branches [24]
The given condition is checked separately for each thread.

2. Basics of the SIMT execution (17)
First all ALUs meeting the condition execute the prescribed three operations, then all ALUs missing the condition execute the next two operations.
Figure: Execution of branches [24]

2. Basics of the SIMT execution (18)
Figure: Resuming instruction stream processing after executing a branch [24]

2. Basics of the SIMT execution (19)
Barrier synchronization
Makes all threads wait until all prior instructions have completed before the next instruction is executed.
Implemented e.g. in AMD’s Intermediate Language (IL) by the fence threads instruction [10].
Remark
In the R600 ISA this instruction is coded by setting the BARRIER field of the Control Flow (CF) instruction format [7].

2. Basics of the SIMT execution (20)
Principle of SIMT execution assuming serial kernel processing
(Figure labels: Host, Device)
Each kernel invocation executes all thread blocks (Block(i,j)) belonging to the related Grid.
Figure: Hierarchy of threads [25]
Remark
In the figure CUDA terminology is used.

2. Basics of the SIMT execution (21)
Remark
Parallel kernel processing is also possible, assuming advanced GPGPU devices (such as Nvidia’s Fermi or AMD’s HD 69xx GPGPUs) and appropriate software support.

3. Overview of GPGPUs

3. Overview of GPGPUs (1)
Basic implementation alternatives of the SIMT execution

GPGPUs
• Programmable GPUs with appropriate programming environments
• Have display outputs
• E.g. Nvidia’s 8800 and GTX lines, AMD’s HD 38xx and HD 48xx lines

Data parallel accelerators
• Dedicated units supporting data parallel execution with an appropriate programming environment
• No display outputs
• Have larger memories than GPGPUs
• E.g. Nvidia’s Tesla lines, AMD’s FireStream lines

Figure: Basic implementation alternatives of the SIMT execution

3. Overview of GPGPUs (2)
GPGPUs
Nvidia’s line: G80 (90 nm) -> shrink -> G92 (65 nm) -> enhanced arch. -> G200 (65 nm) -> shrink, enhanced arch. -> GF100 (Fermi) (40 nm)
AMD/ATI’s line: R600 (80 nm) -> shrink -> RV670 (55 nm) -> enhanced arch. -> RV770 (55 nm) -> shrink, enhanced arch. -> RV870 (40 nm) -> enhanced arch. -> Cayman (40 nm)
Figure: Overview of Nvidia’s and AMD/ATI’s GPGPU lines

3. Overview of GPGPUs (3)
Nvidia
Cores: G80 (11/06, 90 nm/681 mtrs); G92 (10/07, 65 nm/754 mtrs); GT200 (6/08, 65 nm/1400 mtrs)
Cards: 8800 GTS (96 ALUs, 320-bit); 8800 GTX (128 ALUs, 384-bit); 8800 GT (112 ALUs, 256-bit); GTX260 (192 ALUs, 448-bit); GTX280 (240 ALUs, 512-bit)
OpenCL: OpenCL Standard
CUDA: Version 1.0 (6/07); Version 1.1 (11/07); Version 2.0 (6/08); Version 2.1 (11/08)

AMD/ATI
Cores: R500 (11/05, Xbox); R600 (5/07, 80 nm/681 mtrs); RV670 (11/07, 55 nm/666 mtrs); RV770 (5/08, 55 nm/956 mtrs)
Cards: Xbox (48 ALUs); HD 2900XT (320 ALUs, 512-bit); HD 3850 (320 ALUs, 256-bit); HD 3870 (320 ALUs, 256-bit); HD 4850 (800 ALUs, 256-bit); HD 4870 (800 ALUs, 256-bit)
OpenCL: OpenCL Standard (12/08)
Brook+: Brook+ (SDK v.1.0) (11/07); Brook+ 1.2 (SDK v.1.2) (9/08); Brook+ 1.3 (SDK v.1.3) (12/08)
RapidMind: 3870 support (6/08)

Time axis: 2005 to 2008
Figure: Overview of GPGPUs and their basic software support (1)

3. Overview of GPGPUs (4)
Nvidia
Cores: GF100 (Fermi) (3/10, 40 nm/3000 mtrs); GF104 (Fermi) (07/10, 40 nm/1950 mtrs); GF110 (Fermi) (11/10, 40 nm/3000 mtrs)
Cards: GTX 470 (448 ALUs, 320-bit); GTX 480 (480 ALUs, 384-bit); GTX 460 (336 ALUs, 192/256-bit); GTX 580 (512 ALUs, 384-bit); GTX 560 Ti (1/11, 480 ALUs, 384-bit)
OpenCL: OpenCL 1.0, SDK 1.0 early release (6/09); OpenCL 1.0, SDK 1.0 (10/09); OpenCL 1.1, SDK 1.1 (6/10)
CUDA: Version 2.2 (5/09); Version 2.3 (6/09); Version 3.0 (3/10); Version 3.1 (6/10); Version 3.2 (1/11); Version 4.0 Beta (3/11)

AMD/ATI
Cores: RV870 (Cypress) (9/09, 40 nm/2100 mtrs); Barts Pro/XT (10/10, 40 nm/1700 mtrs); Cayman Pro/XT (12/10, 40 nm/2640 mtrs)
Cards: HD 5850/70 (1440/1600 ALUs, 256-bit); HD 6850/70 (960/1120 ALUs, 256-bit); HD 6950/70 (1408/1536 ALUs, 256-bit)
OpenCL: OpenCL 1.0 (SDK V.2.0) (11/09); OpenCL 1.0 (SDK V.2.01) (03/10); OpenCL 1.1 (SDK V.2.2) (08/10)
Brook+: Brook+ 1.4 (SDK V.1.4 Beta) (3/09)
RapidMind: Intel bought RapidMind (8/09)

Time axis: 2009 to 2011
Figure: Overview of GPGPUs and their basic software support (2)

3. Overview of GPGPUs (5)
Remarks on AMD-based graphics cards [45], [66]
Beginning with their Cypress-based HD 5xxx line and SDK v.2.0, AMD left Brook+ and started supporting OpenCL as the basis of their HLL programming language.
(Figure content: the AMD/ATI part of the 2009 to 2011 overview, repeated from the previous slide)
As a consequence AMD also changed
• the microarchitecture of their GPGPUs (by introducing Local and Global Data Share memories) and
• their terminology, by introducing Pre-OpenCL and OpenCL terminology, as discussed in Section 5.2.

3. Overview of GPGPUs (6)
Remarks on Fermi-based graphics cards [45], [66]
FP64 speed
• ½ of the FP32 speed for the Tesla 20-series
• 1/8 of the FP32 speed for the GeForce GTX 470/480/570/580 cards
• 1/12 of the FP32 speed for other GeForce GTX 4xx cards
ECC
Available only on the Tesla 20-series.
Number of DMA engines
The Tesla 20-series has 2 DMA engines (copy engines). GeForce cards have 1 DMA engine. This means that CUDA applications can overlap computation and communication on Tesla using bi-directional communication over PCI-e.
Memory size
Tesla 20 products have larger on-board memory (3 GB and 6 GB).

3. Overview of GPGPUs (7)
Positioning Nvidia’s discussed GPGPU cards in their entire product portfolio [82]

3. Overview of GPGPUs (8)
Nvidia’s compute capability concept
Nvidia manages the continuous evolution by
a) defining sets of capabilities and features designated as compute capability versions,
b) specifying which compute capability version is supported by their
  • programming environments, represented by their SDKs, and
  • GPGPU lines,
c) and specifying compatibility rules among them.

3. Overview of GPGPUs (9)
a) Defined sets of compute capability versions by Nvidia-1 [81]

3. Overview of GPGPUs (10)
a) Defined sets of compute capability versions by Nvidia-2 [81]

3. Overview of GPGPUs (11)
b1) Compute capability versions of the PTX ISAs generated by different releases of CUDA SDKs [50]
(Figure label: Fermi)

3. Overview of GPGPUs (12)
b2) Support of the compute capability versions by Nvidia’s GPGPU cards [81]
Capability | GPGPU cores | GPGPU devices
1.0 | G80 | GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870, FX4/5600, 360M
1.1 | G86, G84, G98, G96, G96b, G94, G94b, G92, G92b | GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT 120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
1.2 | GT218, GT216, GT215 | GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
1.3 | GT200, GT200b | GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX 3/4/5800
2.0 | GF100, GF110 | GTX 465, 470/80, Tesla C2050/70, S/M2050/70, Quadro 600, 4/5/6000, Plex7000, GTX570, GTX580
2.1 | GF108, GF106, GF104, GF114 | GT 420/30/40, GTS 450, GTX 460, 500M

3. Overview of GPGPUs (13)
c) Compatibility rules related to compute capability versions [50]
The basic rule is forward compatibility within the main versions (versions 1.x and 2.x), but not across main versions.
This is interpreted as follows: object files (called CUBIN files) compiled to a particular compute capability are supported on all devices having the same or a higher version number within the same main version.
E.g. object files compiled to compute capability 1.0 are supported on all 1.x devices but not on compute capability 2.0 (Fermi) devices.
For more details see [52].

3. Overview of GPGPUs (14)
 | 8800 GTS | 8800 GTX | 8800 GT | GTX 260 | GTX 280
Core | G80 | G80 | G92 | GT200 | GT200
Introduction | 11/06 | 11/06 | 10/07 | 6/08 | 6/08
IC technology | 90 nm | 90 nm | 65 nm | 65 nm | 65 nm
Nr. of transistors | 681 mtrs | 681 mtrs | 754 mtrs | 1400 mtrs | 1400 mtrs
Die area | 480 mm2 | 480 mm2 | 324 mm2 | 576 mm2 | 576 mm2
Core frequency | 500 MHz | 575 MHz | 600 MHz | 576 MHz | 602 MHz
Computation
No. of SMs (cores) | 12 | 16 | 14 | 24 | 30
No. of FP32 EUs | 96 | 128 | 112 | 192 | 240
Shader frequency | 1.2 GHz | 1.35 GHz | 1.512 GHz | 1.242 GHz | 1.296 GHz
No. FP32 operations/cycle | 2 (1) or 3
Peak FP32 performance | 230.4 GFLOPS | 345.6 GFLOPS (1) | 508 GFLOPS | 715 GFLOPS | 933 GFLOPS
Peak FP64 performance | - | - | - | 59.62 GFLOPS | 77.76 GFLOPS
Memory
Mem. transfer rate (eff) | 1600 Mb/s | 1800 Mb/s | 1800 Mb/s | 1998 Mb/s | 2214 Mb/s
Mem. interface | 320-bit | 384-bit | 256-bit | 448-bit | 512-bit
Mem. bandwidth | 64 GB/s | 86.4 GB/s | 57.6 GB/s | 111.9 GB/s | 141.7 GB/s
Mem. size | 320 MB | 768 MB | 512 MB | 896 MB | 1.0 GB
Mem. type | GDDR3 | GDDR3 | GDDR3 | GDDR3 | GDDR3
Mem. channel | 6*64-bit | 6*64-bit | 4*64-bit | 8*64-bit | 8*64-bit
System
Multi-GPU techn. | SLI | SLI | SLI | SLI | SLI
Interface | PCIe x16 | PCIe x16 | PCIe 2.0x16 | PCIe 2.0x16 | PCIe 2.0x16
MS DirectX | 10 | 10 | 10 | 10.1 subset | 10.1 subset
TDP | 146 W | 155 W | 105 W | 182 W | 236 W
(1): Nvidia also takes the FP32-capable Texture Processing Units into consideration and calculates with 3 FP32 operations/cycle.
Table: Main features of Nvidia’s GPGPUs-1

3. Overview of GPGPUs (15)
Remarks
In publications there are conflicting statements about whether or not the G80 makes use of dual issue (including a MAD and a MUL operation) within a period of four shader cycles.
Official specifications [22] declare the capability of dual issue, but other literature sources [64] and even a textbook co-authored by one of the chief developers of the G80 (D. Kirk [65]) deny it.
A clarification can be found in a blog [66], revealing that the higher figure given in Nvidia’s specifications includes calculations made both by the ALUs in the SMs and by the texture processing units (TPUs).
Nevertheless, the TPUs cannot be directly accessed by CUDA except for graphical tasks, such as texture filtering.
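The peak FP32 and memory bandwidth rows of the table above can be reproduced from its other rows. The Python sketch below assumes the customary relations (peak rate = number of EUs × shader frequency × operations per cycle; bandwidth = effective transfer rate × interface width / 8). The function names are hypothetical and the input figures are taken from the table.

```python
# Sketch: reproducing two derived rows of the table from its raw rows.
# Function names are illustrative; the formulas are the customary ones
# and are assumed here, not quoted from the slides.


def peak_gflops(num_eus, shader_ghz, ops_per_cycle):
    """Peak FP32 rate in GFLOPS = EUs x shader clock (GHz) x ops/cycle."""
    return num_eus * shader_ghz * ops_per_cycle


def mem_bandwidth_gbs(transfer_rate_mbps, interface_bits):
    """Bandwidth in GB/s = effective rate per pin (Mb/s) x bus width / 8 / 1000."""
    return transfer_rate_mbps * interface_bits / 8 / 1000


if __name__ == "__main__":
    # 8800 GTX: 128 EUs at 1.35 GHz
    print(round(peak_gflops(128, 1.35, 2), 1))     # 345.6 (MAD only)
    print(round(peak_gflops(128, 1.35, 3), 1))     # 518.4 (MAD + MUL)
    # GTX 280: 240 EUs at 1.296 GHz, 2214 Mb/s on a 512-bit interface
    print(round(peak_gflops(240, 1.296, 3), 1))    # 933.1
    print(round(mem_bandwidth_gbs(2214, 512), 1))  # 141.7
```

Comparing the two 8800 GTX results shows how much the choice between 2 and 3 operations per cycle changes the quoted peak figure.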
Accordingly, in our discussion focusing on numerical calculations it is fair to take only the MAD operations into account when specifying the peak numerical performance.

3. Overview of GPGPUs (16)
Structure of an SM of the G80 architecture
Texture Processing Units consisting of
• TA: Texture Address units
• TF: Texture Filter units
They are FP32 or FP16 capable [46].

3. Overview of GPGPUs (17)
 | GTX 470 | GTX 480 | GTX 460 | GTX 570 | GTX 580
Core | GF100 | GF100 | GF104 | GF110 | GF110
Introduction | 3/10 | 3/10 | 7/10 | 12/10 | 11/10
IC technology | 40 nm | 40 nm | 40 nm | 40 nm | 40 nm
Nr. of transistors | 3200 mtrs | 3200 mtrs | 1950 mtrs | 3000 mtrs | 3000 mtrs
Die area | 529 mm2 | 529 mm2 | 367 mm2 | 520 mm2 | 520 mm2
Core frequency | | | | 732 MHz | 772 MHz
Computation
No. of SMs (cores) | 14 | 15 | 7 | 15 | 16
No. of FP32 EUs | 448 | 480 | 336 | 480 | 512
Shader frequency | 1215 MHz | 1401 MHz | 1350 MHz | 1464 MHz | 1544 MHz
No. FP32 operations/cycle | 2 | 2 | 3 | 2 | 2
Peak FP32 performance | 1088 GFLOPS | 1345 GFLOPS | 907.2 GFLOPS | 1405 GFLOPS | 1581 GFLOPS
Peak FP64 performance | 136 GFLOPS | 168 GFLOPS | 75.6 GFLOPS | 175.6 GFLOPS | 197.6 GFLOPS
Memory
Mem. transfer rate (eff) | 3348 Mb/s | 3698 Mb/s | 3600 Mb/s | 3800 Mb/s | 4008 Mb/s
Mem. interface | 320-bit | 384-bit | 192/256-bit | 320-bit | 384-bit
Mem. bandwidth | 133.9 GB/s | 177.4 GB/s | 86.4/115.2 GB/s | 152 GB/s | 192.4 GB/s
Mem. size | 1.28 GB | 1.536 GB | 0.768/1.024 GB | 1.28 GB | 1.536/3.072 GB
Mem. type | GDDR5 | GDDR5 | GDDR5 | GDDR5 | GDDR5
Mem. channel | 5*64-bit | 6*64-bit | 3/4*64-bit | 5*64-bit | 6*64-bit
System
Multi-GPU techn. | SLI | SLI | SLI | SLI | SLI
Interface | PCIe 2.0*16 | PCIe 2.0*16 | PCIe 2.0*16 | PCIe 2.0*16 | PCIe 2.0*16
MS DirectX | 11 | 11 | 11 | 11 | 11
TDP | 215 W | 250 W | 150/160 W | 219 W | 244 W
Table: Main features of Nvidia’s GPGPUs-2

3. Overview of GPGPUs (18)
Remarks
1) The GDDR3 memory has a double clocked data transfer:
   effective memory transfer rate = 2 x memory frequency.
   The GDDR5 memory has a quad clocked data transfer:
   effective memory transfer rate = 4 x memory frequency.
2) Both the GDDR3 and GDDR5 memories are 32-bit devices.
Nevertheless, memory controllers of GPGPUs may be designed either to control a single 32-bit memory channel or dual memory channels, providing a 64-bit channel width. 3. Overview of GPGPUs (19) Examples for Nvidia cards Nvidia GeForce GTX 480 (GF 100 based) [47] 3. Overview of GPGPUs (20) Nvidia GeForce GTX 480 and 580 cards [77] GTX 480 (GF 100 based) GTX 580 (GF 110 based) 3. Overview of GPGPUs (21) A pair of GeForce GTX 480 cards [47] (GF100 based) 3. Overview of GPGPUs (22) HD 2900XT HD 3850 HD 3870 HD 4850 HD 4870 Core R600 R670 R670 RV770 (R700-based) RV770 (R700 based) Introduction 5/07 11/07 11/07 5/08 5/08 80 nm 55 nm 55 nm 55 nm 55 nm Nr. of transistors 700 mtrs 666 mtrs 666 mtrs 956 mtrs 956 mtrs Die are 408 mm2 192 mm2 192 mm2 260 mm2 260 mm2 Core frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz 320 320 320 800 800 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz 2 2 2 2 2 Peak FP32 performance 471.6 GFLOPS 429 GFLOPS 496 GFLOPS 1000 GFLOPS 1200 GFLOPS Peak FP64 performance – – – 200 GFLOPS 240 GFLOPS 1600 Mb/s 1660 Mb/s 2250 Mb/s 2000 Mb/s 3600 Mb/s (GDDR5) 512-bit 256-bit 256-bit 265-bit 265-bit 105.6 GB/s 53.1 GB/s 720 GB/s 64 GB/s 118 GB/s Mem. size 512 MB 256 MB 512 MB 512 MB 512 MB Mem. type GDDR3 GDDR3 GDDR4 GDDR3 GDDR3/GDDR5 Mem. channel 8*64-bit 8*32-bit 8*32-bit 4*64-bit 4*64-bit Mem. contr. Ring bus Ring bus Ring bus Crossbar Crossbar CrossFire X CrossFire X CrossFire X CrossFire X CrossFire X PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 10 10.1 10.1 10.1 10.1 150 W 75 W 105 W 110 W 150 W IC technology Computation No. of ALUs Shader frequency No. FP32 operations./cycle Memory Mem. transfer rate (eff) Mem. interface Mem. bandwidth System Multi. CPU techn. Interface MS Direct X TDP Max./Idle Table: Main features of AMD/ATIs GPGPUs-1 3. Overview of GPGPUs (23) Evergreen series HD 5850 HD 5870 HD 5970 Cypress PRO (RV870-based) Cypress XT (RV870-based) Hemlock XT (RV870-based) 9/09 9/09 11/09 40 nm 40 nm 40 nm Nr. 
of transistors 2154 mtrs 2154 mtrs 2*2154 mtrs Die are 334 mm2 334 mm2 2*334 mm2 Core frequency 725 MHz 850 MHz 725 MHz No. of SIMD cores / VLIW5 ALUs 18/16 20/16 2*20/16 No. of EUs 1440 1600 2*1600 725 MHz 850 MHz 725 MHz 2 2 2 Peak FP32 performance 2088 GFLOPS 2720 GFLOPS 4640 GFLOPS Peak FP64 performance 417.6 GFLOPS 544 GFLOPS 928 GFLOPS 4000 Mb/s 4800 Mb/s 4000 Mb/s 256-bit 256-bit 2*256-bit 128 GB/s 153.6 GB/s 2*128 GB/s Mem. size 1.0 GB 1.0/2.0 GB 2*(1.0/2.0) GB Mem. type GDDR5 GDDR5 GDDR5 Mem. channel 8*32-bit 8*32-bit 2*8*32-bit Multi. CPU techn. CrossFire X CrossFire X CrossFire X Interface PCIe 2.1*16 PCIe 2.1*16 PCIe 2.1*16 11 11 11 151/27 W 188/27 W 294/51 W Core Introduction IC technology Computation Shader frequency No. FP32 inst./cycle Memory Mem. transfer rate (eff) Mem. interface Mem. bandwidth System MS Direct X TDP Max./Idle Table: Main features of AMD/ATI’s GPGPUs-2 3. Overview of GPGPUs (24) Northerm Islands series HD 6850 HD 6870 Barts Pro Barts XT Introduction 10/10 10/10 IC technology 40 nm 40 nm Nr. of transistors 1700 mtrs 1700 mtrs Die are 255 mm2 255 mm2 Core frequency 775 MHz 900 MHz 12/16 14/16 960 1120 775 MHz 900 MHz 2 2 Peak FP32 performance 1488 GFLOPS 2016 GFLOPS Peak FP64 performance - - 4000 Mb/s 4200 Mb/s 256-bit 256-bit 128 GB/s 134.4 GB/s Mem. size 1 GB 1 GB Mem. type GDDR5 GDDR5 Mem. channel 8*32-bit 8*32-bit Multi. CPU techn. CrossFire X CrossFire X Interface PCIe 2.1*16 PCIe 2.1*16 11 11 127/19 W 151/19 W Core Computation No. of SIMD cores /VLIW5 ALUs No. of EUs Shader frequency No. FP32 inst./cycle Memory Mem. transfer rate (eff) Mem. interface Mem. bandwidth System MS Direct X TDP Max./Idle Table: Main features of AMD/ATI’s GPGPUs-3 3. 
Overview of GPGPUs (25) Northerm Islands series HD 6950 HD 6970 HD 6990 HD 6990 unlocked Core Cayman Pro Cayman XT Antilles Antilles Introduction 12/10 12/10 3/11 3/11 IC technology 40 nm 40 nm 40 nm 40 nm 2.64 billion 2.64 billion 2*2.64 billion 2*2.64 billion Die are 389 mm2 389 mm2 2*389 mm2 2*389 mm2 Core frequency 800 MHz 880 MHz 830 MHz 880 MHz No. of SIMD cores /VLIW4 ALUs 22/16 24/16 2*24/16 2*24/16 No. of EUs 1408 1536 2*1536 2*1536 800 MHz 880 MHz 830 MHz 880 MHz 4 4 4 4 Peak FP32 performance 2.25 TFLOPS 2.7 TFLOPS 5.1 TFLOPS 5.4 TFLOPS Peak FP64 performance 0.5625 TFLOPS 0.683 TFLOPS 1.275 TFLOPS 1.35 TFLOPS 5000 Mb/s 5500 Mb/s 5000 Mb/s 5000 Mb/s 256-bit 256-bit 256-bit 256-bit 160 GB/s 176 GB/s 2*160 GB/s 2*160 GB/s Mem. size 2 GB 2 GB 2*2 GB 2*2 GB Mem. type GDDR5 GDDR5 GDDR5 GDDR5 Mem. channel 8*32-bit 5*32-bit 2*8*32-bit 2*8*32-bit - - - - CrossFireX CrossFireX CrossFireX CrossFireX PCIe 2.1*16-bit PCIe 2.1*16-bit PCIe 2.1*16-bit PCIe 2.1*16-bit 11 11 11 11 200/20 W 250/20 W 350/37 W 415/37 W Nr. of transistors Computation Shader frequency No. FP32 inst./cycle / ALU Memory Mem. transfer rate (eff) Mem. interface Mem. bandwidth System ECC Multi. CPU techn. Interface MS Direct X TDP Max./Idle Table: Main features of AMD/ATIs GPGPUs-4 3. Overview of GPGPUs (26) Remark The Radeon HD 5xxx line of cards is designated also as the Evergreen series and the Radeon HD 6xxx line of cards is designated also as the Northern islands series. 3. Overview of GPGPUs (27) Examples for AMD cards HD 5870 (RV870 based) [41] 3. Overview of GPGPUs (28) HD 5970 (actually RV870 based) [80] ATI HD 5970: 2 x ATI HD 5870 with slightly reduced memory clock 3. Overview of GPGPUs (29) HD 5970 (actually RV870 based) [79] ATI HD 5970: 2 x ATI HD 5870 with slightly reduced memory clock 3. Overview of GPGPUs (30) AMD HD 6990 (actually Cayman based) [78] AMD HD 6990: 2 x ATI HD 6970 with slightly reduced memory and shader clock 3. 
Overview of GPGPUs (31) Price relations (as of 01/2011) Nvidia GTX 570 GTX 580 ~ 350 $ ~ 500 $ AMD HD 6970 HD 6990 (Dual 6970) ~ 400 $ ~ 700 $ 4. Overview of data parallel accelerators 4. Overview of data parallel accelerators (1) Data parallel accelerators Implementation alternatives of data parallel accelerators On card implementation On-die integration Recent implementations E.g. Emerging implementations GPU cards Intel’s Heavendahl Data-parallel accelerator cards AMD’s Torrenza integration technology Intel’s Sandy Bridge (2011) AMD’s Fusion (2008) integration technology 2010/2011 Trend Figure: Implementation alternatives of dedicated data parallel accelerators 4. Overview of data parallel accelerators (2) On-card accelerators Card implementations Single cards fitting into a free PCI Ex16 slot of the host computer. E.g. Nvidia Tesla C870 Nvidia Tesla C1060 Nvidia Tesla C2070 AMD FireStream 9170 AMD FireStream 9250 AMD FireStream 9370 Desktop implementations 1U server implementations Usually 4 cards Usually dual cards mounted into a 1U server rack, mounted into a box, connected two adapter cards connected to an that are inserted into adapter card that is inserted into a two free PCIEx16 slots of a server through two switches free PCI-E x16 slot of the and two cables. host PC through a cable. Nvidia Tesla D870 Nvidia Tesla S870 Nvidia Tesla S1070 Nvidia Tesla S2050/S2070 Figure: Implementation alternatives of on-card accelerators 4. Overview of data parallel accelerators (3) NVidia Tesla-1 G80-based GT200-based 6/08 6/07 Card C1060 C870 345.6 4 GB GDDR3 SP: 933 GFLOPS DP: 77.76 GFLOPS 1.5 GB GDDR3 SP: 345.6 GFLOPS DP: 6/07 Desktop D870 2*C870 incl. 3 GB GDDR3 SP: 691.2 GFLOPS DP: - IU Server 6/07 6/08 S870 S1070 4*C870 incl. 6 GB GDDR3 SP: 1382 GFLOPS DP: - CUDA 4*C1060 16 GB GDDR3 SP: 3732 GFLOPS DP: 311 GFLOPS 6/07 11/07 6/08 Version 1.0 Version 1.01 Version 2.0 2007 2008 Figure: Overview of Nvidia’s G80/G200-based Tesla family-1 4. 
Overview of data parallel accelerators (4) FB: Frame Buffer Figure: Main functional units of Nvidia’s Tesla C870 card [2] 4. Overview of data parallel accelerators (5) Figure: Nvida’s Tesla C870 and AMD’s FireStream 9170 cards [2], [3] 4. Overview of data parallel accelerators (6) Figure: Tesla D870 desktop implementation [4] 4. Overview of data parallel accelerators (7) Figure: Nvidia’s Tesla D870 desktop implementation [4] 4. Overview of data parallel accelerators (8) Figure: PCI-E x16 host adapter card of Nvidia’s Tesla D870 desktop [4] 4. Overview of data parallel accelerators (9) Figure: Concept of Nvidia’s Tesla S870 1U rack server [5] 4. Overview of data parallel accelerators (10) Figure: Internal layout of Nvidia’s Tesla S870 1U rack [6] 4. Overview of data parallel accelerators (11) Figure: Connection cable between Nvidia’s Tesla S870 1U rack and the adapter cards inserted into PCI-E x16 slots of the host server [6] 4. Overview of data parallel accelerators (12) NVidia Tesla-2 GF100 (Fermi)-based 11/09 Card C2050/C2070 3/6 GB GDDR5 SP: 1.03 TLOPS1 DP: 0.515 TFLOPS 08/10 04/10 Module M2050/M2070 M2070Q 3/6 GB GDDR5 SP: 1.03 TFLOPS1 DP: 0.515 TFLOPS 6 GB GDDR5 SP: 1.03 TFLOPS1 DP: 0.515 TFLOPS 11/09 IU Server S2050/S2070 4*C2050/C2070 12/24 GB GDDR31 SP: 4.1 TFLOPS DP: 8.2 TFLOPS 5/09 CUDA CUDA Version 2.2 6/09 Version 2.3 3/10 Version 3.0 6/10 Version 3.1 1/11 Version 3.2 6/10 OpenCL+ OpenCL 1.1 2009 2010 1: 2011 Without SF (Special Function) operations Figure: Overview of Nvidia’s GF100 (Fermi)-based Tesla family 4. Overview of data parallel accelerators (13) Fermi based Tesla devices Tesla C2050/C2070 Card [71] (11/2009) Single GPU Card 3/6 GB GDDR5 515 GFLOPS DP ECC Tesla S2050/S2070 1U [72] (11/2009) Four GPUs 12/16 GB GDDR5s 2060 GFLOPS DP ECC 4. Overview of data parallel accelerators (14) Tesla M2050/M2070/M2070Q Processor Module (Dual slot board with PCIe Gen. 
2 x16 interface) (04/2010) Figure: Tesla M2050/M2070/M2070Q Processor Module [74] Used in the Tianhe-1A Chinese supercomputer (10/2010) Remark The M2070Q is an upgrade of the M2070 providing higher memory clock (introduced 08/2010) 4. Overview of data parallel accelerators (15) Tianhe-1A (10/2010) [48] • Upgraded version of the Tianhe-1 (China) • 2.6 PetaFLOPS (fastest supercomputer in the World in 2010) • 14 336 Intel Xeon 5670 • 7 168 Nvidia Tesla M2050 4. Overview of data parallel accelerators (16) Specification data of the Tesla M2050/M2070/M2070Q modules [74] (448 ALUs) (448 ALUs) Remark The M2070Q is an upgrade of the M2070, providing higher memory clock (introduced 08/2010) 4. Overview of data parallel accelerators (17) Support of ECC • Fermi based Tesla devices introduced the support of ECC. • By contrast recently neither Nvidia’s straightforward GPGPU cards nor AMD’s GPGPU or DPA devices support ECC [76]. 4. Overview of data parallel accelerators (18) Tesla S2050/S2070 1U The S2050/S2070 differ only in the memory size, the S2050 includes 12 GB, the S2070 24 GB. GPU Specification Number of processor cores: 448 Processor core clock: 1.15 GHz Memory clock: 1.546 GHz Memory interface: 384 bit System Specification Four Fermi GPUs 12.0/24.0 GB of GDDR5, configured as 3.0/6.0 GB per GPU. When ECC is turned on, Figure: Block diagram and technical specifications of Tesla S2050/S2070 [75] available memory is ~10.5 GB Typical power consumption: 900 W 4. Overview of data parallel accelerators (19) AMD FireStream-1 RV670-based 6/08 11/07 Card RV770-based 9170 9170 2 GB GDDR3 FP32: 500 GLOPS FP64:~200 GLOPS Shipped 6/08 Stream Computing SDK 10/08 9250 9250 1 GB GDDR3 FP32: 1000 GLOPS FP64: ~300 GFLOPS Shipped 12/07 09/08 Version 1.0 Version 1.2 Brook+ ACM/AMD Core Math Library CAL (Computer Abstor Layer) Brook+ ACM/AMD Core Math Library CAL (Computer Abstor Layer) Rapid Mind 2007 2008 Figure: Overview of AMD/ATI’s FireStream family-1 4. 
Overview of data parallel accelerators (20)
AMD FireStream-2
Card, RV870-based: 9350/9370 (06/10, shipped 10/10), 2/4 GB GDDR5, FP32: 2016/2640 GFLOPS, FP64: 403/528 GFLOPS
Stream Computing SDK Version 1.4 (03/09): Brook+
Stream Computing SDK Version 2.01 (03/10): OpenCL 1.0
Stream Computing SDK Version 2.1 (05/10): OpenCL 1.0
Stream Computing SDK Version 2.2 (08/10): OpenCL 1.1
Stream Computing SDK Version 2.3 (12/10): OpenCL 1.1; in 01/11 Version 2.3 was renamed to APP
APP: Accelerated Parallel Processing
Figure: Overview of AMD/ATI’s FireStream family-2
4. Overview of data parallel accelerators (21)
Nvidia Tesla cards
Core type                   C870           C1060          C2050              C2070
Based on                    G80            GT200          T20 (GF100-based)  T20 (GF100-based)
Introduction                6/07           6/08           11/09              11/09
Core
Core frequency              600 MHz        602 MHz        575 MHz            575 MHz
ALU frequency               1350 MHz       1296 MHz       1150 MHz           1150 MHz
No. of SMs (cores)          16             30             14                 14
No. of ALUs                 128            240            448                448
Peak FP32 performance       345.6 GFLOPS   933 GFLOPS     1030.4 GFLOPS      1030.4 GFLOPS
Peak FP64 performance       -              77.76 GFLOPS   515.2 GFLOPS       515.2 GFLOPS
Memory
Mem. transfer rate (eff)    1600 Mb/s      1600 Mb/s      3000 Mb/s          3000 Mb/s
Mem. interface              384-bit        512-bit        384-bit            384-bit
Mem. bandwidth              76.8 GB/s      102 GB/s       144 GB/s           144 GB/s
Mem. size                   1.5 GB         4 GB           3 GB               6 GB
Mem. type                   GDDR3          GDDR3          GDDR5              GDDR5
System
ECC                         -              -              ECC                ECC
Interface                   PCIe x16       PCIe 2.0 x16   PCIe 2.0 x16       PCIe 2.0 x16
Power (max)                 171 W          200 W          238 W              247 W
Table: Main features of Nvidia’s data parallel accelerator cards (Tesla line) [73]
4. Overview of data parallel accelerators (22)
AMD FireStream cards
Core type                   9170           9250           9350           9370
Based on                    RV670          RV770          RV870          RV870
Introduction                11/07          6/08           10/10          10/10
Core
Core frequency              800 MHz        625 MHz        700 MHz        825 MHz
ALU frequency               800 MHz        625 MHz        700 MHz        825 MHz
No. of EUs                  320            800            1440           1600
Peak FP32 performance       512 GFLOPS     1 TFLOPS       2016 GFLOPS    2640 GFLOPS
Peak FP64 performance       ~200 GFLOPS    ~250 GFLOPS    403.2 GFLOPS   528 GFLOPS
Memory
Mem. transfer rate (eff)    1600 Mb/s      1986 Mb/s      4000 Mb/s      4600 Mb/s
Mem. interface              256-bit        256-bit        256-bit        256-bit
Mem. bandwidth              51.2 GB/s      63.5 GB/s      128 GB/s       147.2 GB/s
Mem. size                   2 GB           1 GB           2 GB           4 GB
Mem. type                   GDDR3          GDDR3          GDDR5          GDDR5
System
ECC                         -              -              -              -
Interface                   PCIe 2.0 x16   PCIe 2.0 x16   PCIe 2.0 x16   PCIe 2.0 x16
Power (max)                 150 W          150 W          150 W          225 W
Table: Main features of AMD/ATI’s data parallel accelerator cards (FireStream line) [67]
4. Overview of data parallel accelerators (23)
Price relations (as of 1/2011)
Nvidia Tesla: C2050 ~ 2000 $, C2070 ~ 4000 $, S2050 ~ 13 000 $, S2070 ~ 19 000 $
Nvidia GTX: GTX580 ~ 500 $
1. Introduction (8)
Background slides for intro to SIMT processing
1. Introduction (8)
Figure: Peak SP FP performance of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [11]
1. Introduction (9)
Figure: Bandwidth values of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [11]
5. References
5. References (1) (to all four sections)
[1]: Torricelli F., AMD in HPC, HPC07, 2007, http://www.altairhyperworks.co.uk/html/en-GB/keynote2/Torricelli_AMD.pdf
[2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia
[3]: AMD FireStream 9170, 2008, http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html
[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008, Nvidia, http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf
[5]: Tesla S870 GPU Computing System, Specification, Nvidia, March 13 2008, http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf
[6]: Torres G., Nvidia Tesla Technology, Nov. 2007, http://www.hardwaresecrets.com/article/495
[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD
[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU, ASPLOS 2006, June 2008
[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007, http://ati.amd.com/developer/techpapers.html
5. References (2)
[10]: Compute Abstraction Layer (CAL) Technology – Intermediate Language (IL), Version 2.0, AMD, Oct. 2008
[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, June 2008, Nvidia
[12]: Kirk D. & Hwu W.
W., ECE498AL Lectures 7: Threading Hardware in G80, 2007, University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture7-threading%20hardware.ppt
[13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008, http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf
[14]: Goto H., Nvidia G80, PC Watch, April 16 2007, http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm
[15]: Goto H., GeForce 8800GT (G92), PC Watch, Oct. 31 2007, http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf
[16]: Goto H., NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm
[17]: Shrout R., Nvidia GT200 Revealed – GeForce GTX 280 and GTX 260 Review, PC Perspective, June 16 2008, http://www.pcper.com/article.php?aid=577&type=expert&pid=3
5. References (3)
[18]: DirectX, Wikipedia, http://en.wikipedia.org/wiki/DirectX
[19]: Dietrich S., “Shader Model 3.0,” April 2004, Nvidia, http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf
[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html
[21]: Patidar S. & al., “Exploiting the Shader Model 4.0 Architecture,” Center for Visual Information Technology, IIIT Hyderabad, March 2007, http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf
[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html
[23]: Goto H., Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch, http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf
[24]: Fatahalian K., “From Shader Code to a Teraflop: How Shader Cores Work,” Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008
[25]: Kanter D., “NVIDIA’s GT200: Inside a Parallel Processor,” Real World Technologies, Sept.
8 2008, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242
[26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, Nov. 2007, Nvidia
5. References (4)
[27]: Seiler L. & al., “Larrabee: A Many-Core x86 Architecture for Visual Computing,” ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008
[28]: Kogo H., “Larrabee”, PC Watch, Oct. 17, 2008, http://pc.watch.impress.co.jp/docs/2008/1017/kaigai472.htm
[29]: Shrout R., IDF Fall 2007 Keynote, PC Perspective, Sept. 18, 2007, http://www.pcper.com/article.php?aid=453
[30]: Stokes J., “Larrabee: Intel’s biggest leap ahead since the Pentium Pro,” Ars Technica, Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabeeintels-biggest-leap-ahead-since-the-pentium-pro.html
[31]: Shimpi A. L. & Wilson D., “Intel's Larrabee Architecture Disclosure: A Calculated First Move,” Anandtech, Aug. 4. 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2
[32]: Hester P., “Multi-Core and Beyond: Evolving the x86 Architecture,” Hot Chips 19, Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf
[33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1, http://ati.amd.com/technology/streamcomputing/Stream_Computing_User_Guide.pdf
[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf., Aug. 2007, http://www.graphicshardware.org/previous/www_2007/presentations/doggett-radeon2900-gh07.pdf
5. References (5)
[35]: Mantor M., “AMD’s Radeon HD 2900,” Hot Chips 19, Aug.
2007, http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf
[36]: Houston M., “Anatomy of AMD’s TeraScale Graphics Engine,” SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf
[37]: Mantor M., “Entering the Golden Age of Heterogeneous Computing,” PEEP 2008, http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf
[38]: Kogo H., RV770 Overview, PC Watch, July 02 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai_09.pdf
[39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies, Sept 30 2009, http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT093009110932&mode=print
[40]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed, Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1
[41]: Wasson S., AMD's Radeon HD 5870 graphics processor, Tech Report, Sept 23 2009, http://techreport.com/articles.x/17618/1
[42]: Bell B., ATI Radeon HD 5870 Performance Preview, Firing Squad, Sept 22 2009, http://www.firingsquad.com/hardware/ati_radeon_hd_5870_performance_preview/default.asp
5.
References (6)
[43]: Nvidia CUDA C Programming Guide, Version 3.2, October 22 2010, http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf
[44]: Hwu W., Kirk D., Nvidia, Advanced Algorithmic Techniques for GPUs, Berkeley, January 24-25 2011, http://iccs.lbl.gov/assets/docs/2011-01-24/lecture1_computational_thinking_Berkeley_2011.pdf
[45]: Wasson S., Nvidia's GeForce GTX 580 graphics processor, Tech Report, Nov 9 2010, http://techreport.com/articles.x/19934/1
[46]: Shrout R., Nvidia GeForce 8800 GTX Review – DX10 and Unified Architecture, PC Perspective, Nov 8 2006, http://swfan.com/reviews/graphics-cards/nvidia-geforce-8800-gtx-review-dx10and-unified-architecture/g80-architecture
[47]: Wasson S., Nvidia's GeForce GTX 480 and 470 graphics processors, Tech Report, March 31 2010, http://techreport.com/articles.x/18682
[48]: Gangar K., Tianhe-1A from China is world’s fastest Supercomputer, Tech Ticker, Oct 28 2010, http://techtickerblog.com/2010/10/28/tianhe-1afrom-china-is-worlds-fastest-supercomputer/
[49]: Smalley T., ATI Radeon HD 5870 Architecture Analysis, Bit-tech, Sept 30 2009, http://www.bit-tech.net/hardware/graphics/2009/09/30/ati-radeon-hd-5870architecture-analysis/8
5.
References (7)
[50]: Nvidia Compute PTX: Parallel Thread Execution, ISA, Version 2.2, Oct 14 2010, http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/ptx_isa_2.2.pdf
[51]: Kanter D., Intel's Sandy Bridge Microarchitecture, Real World Technologies, Sept 25 2010, http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4
[52]: Nvidia CUDA™ Fermi™ Compatibility Guide for CUDA Applications, Version 1.0, February 2010, http://developer.download.nvidia.com/compute/cuda/3_0/docs/NVIDIA_FermiCompatibilityGuide.pdf
[53]: Hallock R., Dissecting Fermi, NVIDIA’s next generation GPU, Icrontic, Sept 30 2009, http://tech.icrontic.com/articles/nvidia_fermi_dissected/
[54]: Kirsch N., NVIDIA GF100 Fermi Architecture and Performance Preview, Legit Reviews, Jan 20 2010, http://www.legitreviews.com/article/1193/2/
[55]: Hoenig M., NVIDIA GeForce GTX 460 SE 1GB Review, Hardware Canucks, Nov 21 2010, http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/38178nvidia-geforce-gtx-460-se-1gb-review-2.html
[56]: Glaskowsky P. N., Nvidia’s Fermi: The First Complete GPU Computing Architecture, Sept 2009, http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf
[57]: Kirk D. & Hwu W. W., ECE498AL Lectures 4: CUDA Threads – Part 2, 2007-2009, University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/al/lectures/lecture4%20cuda%20threads%20part2%20spring%202009.ppt
5. References (8)
[58]: Nvidia’s Next Generation CUDA™ Compute Architecture: Fermi™, Version 1.1, 2009, http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[59]: Kirk D. & Hwu W.
W., ECE498AL Lectures 8: Threading Hardware in G80, 2007-2009, University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/al/lectures/lecture8-threading-hardware-spring-2009.ppt
[60]: Wong H., Papadopoulou M.M., Sadooghi-Alvandi M., Moshovos A., Demystifying GPU Microarchitecture through Microbenchmarking, University of Toronto, 2010, http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf
[61]: Pettersson J., Wainwright I., Radar Signal Processing with Graphics Processors (GPUs), SAAB Technologies, Jan 27 2010, http://www.hpcsweden.se/files/RadarSignalProcessingwithGraphicsProcessors.pdf
[62]: Smith R., NVIDIA’s GeForce GTX 460: The $200 King, AnandTech, July 11 2010, http://www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2
[63]: Angelini C., GeForce GTX 580 And GF110: The Way Nvidia Meant It To Be Played, Tom’s Hardware, Nov 9 2010, http://www.tomshardware.com/reviews/geforcegtx-580-gf110-geforce-gtx-480,2781.html
[64]: NVIDIA G80: Architecture and GPU Analysis, Beyond3D, Nov. 8 2006, http://www.beyond3d.com/content/reviews/1/11
[65]: Kirk D. & Hwu W., Programming Massively Parallel Processors, 2008, Chapter 3: CUDA Threads, http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf
5.
References (9)
[66]: NVIDIA Forums: General CUDA GPU Computing Discussion, 2008, http://forums.nvidia.com/index.php?showtopic=73056
[67]: Wikipedia: Comparison of AMD graphics processing units, 2011, http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units
[68]: Nvidia OpenCL Overview, 2009, http://gpgpu.org/wp/wp-content/uploads/2009/06/05-OpenCLIntroduction.pdf
[69]: Chester E., Nvidia GeForce GTX 460 1GB Fermi Review, Trusted Reviews, July 13 2010, http://www.trustedreviews.com/graphics/review/2010/07/13/Nvidia-GeForce-GTX-460-1GB-Fermi/p1
[70]: NVIDIA GF100 Architecture Details, Geeks3D, 2008-2010, http://www.geeks3d.com/20100118/nvidia-gf100-architecture-details/
[71]: Murad A., Nvidia Tesla C2050 and C2070 Cards, Science and Technology Zone, Nov. 17 2009, http://forum.xcitefun.net/nvidia-tesla-c2050-and-c2070-cards-t39578.html
[72]: New NVIDIA Tesla GPUs Reduce Cost Of Supercomputing By A Factor Of 10, Nvidia, Nov. 16 2009, http://www.nvidia.com/object/io_1258360868914.html
[73]: Nvidia Tesla, Wikipedia, http://en.wikipedia.org/wiki/Nvidia_Tesla
[74]: Tesla M2050 and Tesla M2070/M2070Q Dual-Slot Computing Processor Modules, Board Specification, v. 03, Nvidia, Aug. 2010, http://www.nvidia.asia/docs/IO/43395/BD-05238-001_v03.pdf
5. References (10)
[75]: Tesla 1U GPU Computing System, Product Specification, v. 04, Nvidia, June 2009, http://www.nvidia.com/docs/IO/43395/SP-04975-001-v04.pdf
[76]: Kanter D., The Case for ECC Memory in Nvidia’s Next GPU, Real World Technologies, Aug. 19 2009, http://www.realworldtech.com/page.cfm?ArticleID=RWT081909212132
[77]: Hoenig M., Nvidia GeForce 580 Review, HardwareCanucks, Nov.
8, 2010, http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/37789-nvidia-geforce-gtx-580-review-5.html
[78]: Angelini C., AMD Radeon HD 6990 4 GB Review, Tom’s Hardware, March 8, 2011, http://www.tomshardware.com/reviews/radeon-hd-6990-antilles-crossfire,2878.html
[79]: Tom’s Hardware Gallery, http://www.tomshardware.com/gallery/two-cypress-gpus,0101-2303697179-0-0-0-jpg-.html
[80]: Tom’s Hardware Gallery, http://www.tomshardware.com/gallery/Bare-Radeon-HD-5970,0101-2303497179-0-0-0-jpg-.html
[81]: CUDA, Wikipedia, http://en.wikipedia.org/wiki/CUDA
[82]: GeForce Graphics Processors, Nvidia, http://www.nvidia.com/object/geforce_family.html
[83]: Next Gen CUDA GPU Architecture, Code-Named “Fermi”, Press Presentation at Nvidia’s 2009 GPU Technology Conference (GTC), Sept. 30 2009, http://www.nvidia.com/object/gpu_tech_conf_press_room.html
5. References (10)
[84]: Tom’s Hardware Gallery, http://www.tomshardware.com/gallery/SM,0101-110801-0-14-15-1-jpg-.html
[85]: Butler M., Bulldozer, a new approach to multithreaded compute performance, Hot Chips 22, Aug. 24 2010, http://www.hotchips.org/index.php?page=hot-chips-22
[86]: Voicu A., NVIDIA Fermi GPU and Architecture Analysis, Beyond 3D, Oct. 23 2010, http://www.beyond3d.com/content/reviews/55/1
[87]: Chu M. M., GPU Computing: Past, Present and Future with ATI Stream Technology, AMD, March 9 2010, http://developer.amd.com/gpu_assets/GPU%20Computing%20-%20Past%20Present%20and%20Future%20with%20ATI%20Stream%20Technology.pdf
[88]: Smith R., AMD's Radeon HD 6970 & Radeon HD 6950: Paving The Future For AMD, AnandTech, Dec. 15 2010, http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950
[89]: Christian, AMD renames ATI Stream SDK, updates it with APU, OpenCL 1.1 support, Jan. 27 2011, http://www.tcmagazine.com/tcm/news/software/34765/amd-renames-ati-stream-sdk-updates-its-apu-opencl-11-support
[90]: User Guide: AMD Stream Computing, Revision 1.3.0, Dec.
2008, http://www.ele.uri.edu/courses/ele408/StreamGPU.pdf
[91]: Programming Guide: ATI Stream Computing Compute Abstraction Layer (CAL), Revision 2.01, AMD, March 2010, http://developer.amd.com/gpu_assets/ATI_Stream_SDK_CAL_Programming_Guide_v2.0.pdf
5. References (11)
[92]: Technical Overview: AMD Stream Computing, Revision 1.2.1, Oct. 2008, http://www.cct.lsu.edu/~scheinin/Parallel/StreamComputingOverview.pdf
[93]: AMD Accelerated Parallel Processing OpenCL Programming Guide, Revision 1.2, AMD, Jan. 2011, http://developer.amd.com/gpu/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
[94]: An Introduction to OpenCL, AMD, http://www.amd.com/us/products/technologies/stream-technology/opencl/pages/opencl-intro.aspx
[95]: Behr D., Introduction to OpenCL, PPAM 2009, Sept. 15 2009, http://gpgpu.org/wp/wp-content/uploads/2009/09/B1-OpenCL-Introduction.pdf
[96]: Gohara D.W. PhD, OpenCL Episode 2 – OpenCL Fundamentals, Aug. 26 2009, MacResearch, http://www.macresearch.org/files/opencl/Episode_2.pdf
[97]: Kanter D., AMD's Cayman GPU Architecture, Real World Technologies, Dec. 14 2010, http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=3
[98]: Hoenig M., AMD Radeon HD 6970 and HD 6950 Review, Hardware Canucks, Dec. 14 2010, http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/38899-amd-radeon-hd-6970-hd-6950-review-3.html
[99]: Reference Guide: AMD HD 6900 Series Instruction Set Architecture, Revision 1.0, Feb. 2011, http://developer.amd.com/gpu/AMDAPPSDK/assets/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf
[100]: Howes L., AMD and OpenCL, AMD Application Engineering, Dec. 2010, http://www.many-core.group.cam.ac.uk/ukgpucc2/talks/Howes.pdf
5. References (12)
[101]: ATI R700-Family Instruction Set Architecture Reference Guide, Revision 1.0a, AMD, Feb. 2011, http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf
[102]: Piazza T., Dr.
Jiang H., Microarchitecture Codename Sandy Bridge: Processor Graphics, Presentation ARCS002, IDF San Francisco, Sept. 2010
[103]: Bhaniramka P., Introduction to Compute Abstraction Layer (CAL), http://coachk.cs.ucf.edu/courses/CDA6938/AMD_course/M5%20%20Introduction%20to%20CAL.pdf
[104]: Villmow M., ATI Stream Computing, ATI Intermediate Language (IL), May 30 2008, http://developer.amd.com/gpu/amdappsdk/assets/ATI%20Stream%20Computing%20-%20ATI%20Intermediate%20Language.ppt#547,9
[105]: Reference Guide: AMD Accelerated Parallel Processing Technology, AMD Intermediate Language (IL), Revision 2.0e, March 2011, http://developer.amd.com/gpu/AMDAPPSDK/assets/AMD_Intermediate_Language_(IL)_Specification_v2.pdf
[106]: Hensley J., Hardware and Compute Abstraction Layers for Accelerated Computing Using Graphics Hardware and Conventional CPUs, AMD, 2007, http://www.ll.mit.edu/HPEC/agendas/proc07/Day3/10_Hensley_Abstract.pdf
[107]: Hensley J., Yang J., Compute Abstraction Layer, AMD, Feb. 1 2008, http://coachk.cs.ucf.edu/courses/CDA6938/s08/UCF-2008-02-01a.pdf
[108]: AMD Accelerated Parallel Processing (APP) SDK, AMD Developer Central, http://developer.amd.com/gpu/amdappsdk/pages/default.aspx
5. References (13)
[109]: OpenCL™ and the AMD APP SDK v2.4, AMD Developer Central, April 6 2011, http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-AMD-APPSDK.aspx
[110]: Stone J., An Introduction to OpenCL, U. of Illinois at Urbana-Champaign, Dec. 2009, http://www.ks.uiuc.edu/Research/gpu/gpucomputing.net
[111]: Introduction to OpenCL Programming, AMD, No. 137-41768-10, Rev. A, May 2010, http://developer.amd.com/zones/OpenCLZone/courses/Documents/Introduction_to_OpenCL_Programming%20Training_Guide%20(201005).pdf
[112]: Evergreen Family Instruction Set Architecture, Instructions and Microcode Reference Guide, AMD, Febr.
2011, http://developer.amd.com/gpu/amdappsdk/assets/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf
[113]: Intel 810 Chipset: Intel 82810/82810-DC100 Graphics and Memory Controller Hub (GMCH) Datasheet, June 1999, ftp://download.intel.com/design/chipsets/datashts/29065602.pdf
[114]: Huynh A.T., AMD Announces "Fusion" CPU/GPU Program, Daily Tech, Oct. 25 2006, http://www.dailytech.com/article.aspx?newsid=4696
[115]: Grim B., AMD Fusion Family of APUs, Dec. 7 2010, http://www.mytechnology.eu/wpcontent/uploads/2011/01/AMD-Fusion-Press-Tour_EMEA.pdf
[116]: Newell D., AMD Financial Analyst Day, Nov. 9 2010, http://www.rumorpedia.net/wp-content/uploads/2010/11/rumorpedia02.jpg
[117]: De Maesschalck T., AMD starts shipping Ontario and Zacate CPUs, DarkVision Hardware, Nov. 10 2010, http://www.dvhardware.net/article46449.html
5. References (14)
[118]: AMD Accelerated Parallel Processing (APP) SDK (formerly ATI Stream) with OpenCL™ 1.1 Support
[119]: Burgess B., “Bobcat” AMD’s New Low Power x86 Core Architecture, Aug. 24 2010, http://www.hotchips.org/uploads/archive22/HC22.24.730-Burgess-AMDBobcat-x86.pdf
[120]: AMD Ontario APU pictures, Xtreme Systems, Sept. 3 2010, http://www.xtremesystems.org/forums/showthread.php?t=258499
[121]: Stokes J., AMD reveals Fusion CPU+GPU, to challenge Intel in laptops, Feb. 8 2010, http://arstechnica.com/business/news/2010/02/amd-revealsfusion-cpugpu-to-challege-intel-in-laptops.ars
[122]: AMD Unveils Future of Computing at Annual Financial Analyst Day, CDRinfo, Nov. 10 2010, http://www.cdrinfo.com/sections/news/Details.aspx?NewsId=28748
[123]: Shimpi A. L., The Intel Core i3 530 Review - Great for Overclockers & Gamers, AnandTech, Jan. 22 2010, http://www.anandtech.com/show/2921
[124]: Hagedoorn H., Mohammad S., Barling I. R., Core i5 2500K and Core i7 2600K review, Jan.
3 2011, http://www.guru3d.com/article/core-i5-2500k-and-core-i7-2600k-review/2
[125]: Wikipedia: Intel GMA, 2011, http://en.wikipedia.org/wiki/Intel_GMA
[126]: Shimpi A. L., The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core i3-2100 Tested, AnandTech, Jan. 3 2011, http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i72600k-i5-2500k-core-i3-2100-tested/11
5. References (15)
[127]: Marques T., AMD Ontario, Zacate Die Sizes - Take 2, Sept. 14 2010, http://www.siliconmadness.com/2010/09/amd-ontario-zacate-die-sizestake-2.html
[128]: De Vries H., AMD Bulldozer, 8 core processor, Nov. 24 2010, http://chip-architect.com/
[129]: Intel® 845G/845GL/845GV Chipset Datasheet: Intel® 82845G/82845GL/82845GV Graphics and Memory Controller Hub (GMCH), May 2002, http://www.intel.com/design/chipsets/datashts/290746.htm
[130]: Huynh A. T., Final AMD "Stars" Models Unveiled, Daily Tech, May 4 2007, http://www.dailytech.com/Final+AMD+Stars+Models+Unveiled+/article7157.htm
[131]: AMD Fusion, Wikipedia, http://en.wikipedia.org/wiki/AMD_Fusion
[132]: Nita S., AMD Llano APU to Get Dual-GPU Technology Similar to Hybrid CrossFire, Softpedia, Jan. 21 2011, http://news.softpedia.com/news/AMD-Llano-APU-toGet-Dual-GPU-Technology-Similar-to-Hybrid-CrossFire-179740.shtml
[133]: Jotwani R., Sundaram S., Kosonocky S., Schaefer A., Andrade V. F., Novak A., Naffziger S., An x86-64 Core in 32 nm SOI CMOS, IEEE Xplore, Jan. 2011, http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5624589
[134]: Karmehed A., The graphical performance of the AMD A series APUs, Nordic Hardware, March 16 2011, http://www.nordichardware.com/news/69-cpu-chipset/42650-the-graphicalperformance-of-the-amd-a-series-apus.html
5. References (16)
[135]: Butler M., “Bulldozer” A new approach to multithreaded compute performance, Aug.
24 2010, http://www.hotchips.org/uploads/archive22/HC22.24.720-Butler-AMD-Bulldozer.pdf
[136]: “Bulldozer” and “Bobcat” AMD’s Latest x86 Core Innovations, HotChips22, http://www.slideshare.net/AMDUnprocessed/amd-hot-chips-bulldozer-bobcat-presentation-5041615
[137]: Altavilla D., Intel Arrandale Core i5 and Core i3 Mobile Unveiled, Hot Hardware, Jan. 04 2010, http://hothardware.com/Reviews/Intel-Arrandale-Core-i5-and-Core-i3-Mobile-Unveiled/
[138]: Dodeja A., Intel Arrandale, High Performance for the Masses, Hot Hardware, Review of the IDF San Francisco, Sept. 2009, http://akshaydodeja.com/intel-arrandale-high-performance-for-the-mass
[139]: Shimpi A., Intel Arrandale: 32nm for Notebooks, Core i5 540M Reviewed, AnandTech, Jan. 4 2010, http://www.anandtech.com/show/2902
[140]: Chiappeta M., Intel Clarkdale Core i5 Desktop Processor Debuts, Hot Hardware, Jan. 03 2010, http://hothardware.com/Articles/Intel-Clarkdale-Core-i5-Desktop-Processor-Debuts/
[141]: Thomas S. L., Desktop Platform Design Overview for Intel Microarchitecture (Nehalem) Based Platform, Presentation ARCS001, IDF 2009
[142]: Kahn O., Valentine B., Microarchitecture Codename Sandy Bridge: New Processor Innovations, Presentation ARCS001, IDF San Francisco, Sept. 2010
5. References (17)
[143]: Valich T., Intel's "Anti AMD Fusion" Sandy Bridge CPU tapes out, July 5 2009, http://www.brightsideofnews.com/news/2009/7/5/intels-anti-amd-fusion-sandybridge-cpu-tapes-out.aspx
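Background note to the Tesla and FireStream tables of Section 4: the peak-performance and memory-bandwidth entries follow from simple arithmetic on the ALU count, ALU clock, memory transfer rate and interface width. The sketch below is a minimal illustration in Python, not part of the original slides. The function names are hypothetical, and two assumptions are made: each ALU completes one multiply-add (counted as 2 FLOPs) per cycle, and the effective memory transfer rate is a per-pin rate in Mbit/s.

```python
def peak_gflops(num_alus, alu_clock_mhz, flops_per_cycle=2):
    """Peak throughput in GFLOPS.

    Assumption: every ALU retires `flops_per_cycle` floating-point
    operations per clock (2 = one multiply-add per cycle).
    """
    return num_alus * alu_clock_mhz * flops_per_cycle / 1000.0


def mem_bandwidth_gbs(rate_mbit_per_pin, bus_width_bits):
    """Memory bandwidth in GB/s.

    Assumption: the effective transfer rate is given per data pin in
    Mbit/s; multiply by the bus width and convert bits to bytes.
    """
    return rate_mbit_per_pin * bus_width_bits / 8 / 1000.0


if __name__ == "__main__":
    # Tesla C2050: 448 ALUs at 1150 MHz, 3000 Mbit/s per pin, 384-bit bus
    print("C2050 FP32 GFLOPS:", peak_gflops(448, 1150))
    print("C2050 GB/s:", mem_bandwidth_gbs(3000, 384))
    # FireStream 9370: 1600 EUs at 825 MHz, 4600 Mbit/s per pin, 256-bit bus
    print("9370 FP32 GFLOPS:", peak_gflops(1600, 825))
    print("9370 GB/s:", mem_bandwidth_gbs(4600, 256))
```

With these inputs the functions reproduce the table values for the C2050 (1030.4 GFLOPS, 144 GB/s) and the 9370 (2640 GFLOPS, 147.2 GB/s). The 933 GFLOPS listed for the C1060 corresponds to `flops_per_cycle=3` (240 ALUs at 1296 MHz), which assumes a multiply-add plus an additional multiply per cycle.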