as of 1/2011

GPGPUs/DPAs
Dezső Sima
April 2011
(v1.0, Last updated 04/15/2011)
© Dezső Sima 2011
1. Introduction (1)
Aim
Brief introduction and overview.
Contents
1. Introduction
2. Basics of the SIMT execution
3. Overview of GPGPUs
4. Overview of data parallel accelerators
5. References
1. Introduction
1. Introduction (2)
Representation of objects by triangles
Vertex
Edge
Surface
Vertices
• have three spatial coordinates
• have supplementary information necessary to render the object, such as
  • color
  • texture
  • reflectance properties
  • etc.
1. Introduction (3)
Main types of shaders in GPUs
Shaders
• Vertex shaders: transform each vertex’s 3D position in the virtual space to the 2D coordinate at which it appears on the screen
• Pixel shaders (fragment shaders): calculate the color of the pixels
• Geometry shaders: can add or remove vertices from a mesh
1. Introduction (4)
DirectX version | Pixel SM | Vertex SM | Supporting OS
8.0 (11/2000) | 1.0, 1.1 | 1.0, 1.1 | Windows 2000
8.1 (10/2001) | 1.2, 1.3, 1.4 | 1.0, 1.1 | Windows XP / Windows Server 2003
9.0 (12/2002) | 2.0 | 2.0 |
9.0a (3/2003) | 2_A, 2_B | 2.x |
9.0c (8/2004) | 3.0 | 3.0 | Windows XP SP2
10.0 (11/2006) | 4.0 | 4.0 | Windows Vista
10.1 (2/2008) | 4.1 | 4.1 | Windows Vista SP1 / Windows Server 2008
11 (in development) | 5.0 | 5.0 |
Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX
and MS’s OSs [18], [21]
DirectX: Microsoft’s API set for MM/3D
1. Introduction (3)
Convergence of important features of the vertex and pixel shader models
Subsequent shader models typically introduce a number of new or enhanced features.
Differences between the vertex and pixel shader models in subsequent shader models
concerning precision requirements, instruction sets and programming resources.
Shader model 2 [19]
• Different precision requirements
Vertex shader: FP32 (coordinates)
Pixel shader: FX24 (3 colors x 8)
• Different instructions
• Different resources (e.g. registers)
Shader model 3 [19]
• Unified precision requirements for both shaders (FP32)
with the option to specify partial precision (FP16 or FP24)
by adding a modifier to the shader code
• Different instructions
• Different resources (e.g. registers)
1. Introduction (3)
Shader model 4 (introduced with DirectX10) [20]
• Unified precision requirements for both shaders (FP32)
with the possibility to use new data formats.
• Unified instruction set
• Unified resources (e.g. temporary and constant registers)
Shader architectures of GPUs prior to SM4
GPUs prior to SM4 (DirectX 10):
have separate vertex and pixel units with different features.
Drawback of having separate units for vertex and pixel shading
• Inefficiency of the hardware implementation
• (Vertex shaders and pixel shaders often have complementary load patterns [21]).
1. Introduction (5)
Unified shader model (introduced in the SM 4.0 of DirectX 10.0)
Unified, programmable shader architecture
The same (programmable) processor can be used to implement all shaders:
• the vertex shader,
• the pixel shader and
• the geometry shader (new feature of SM 4).
1. Introduction (6)
Figure: Principle of the unified shader architecture [22]
1. Introduction (7)
Based on its FP32 computing capability and the large number of FP units available,
the unified shader is a prospective candidate for speeding up HPC.
GPUs with unified shader architectures are also termed
• GPGPUs (General Purpose GPUs) or
• cGPUs (computational GPUs).
1. Introduction (8)
Peak FP32/FP64 performance of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [43]
1. Introduction (9)
Evolution of the FP-32 performance of GPGPUs [44]
1. Introduction (9)
Evolution of the bandwidth of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [43]
1. Introduction (10)
Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11]
1. Introduction (9)
Background slides to Introduction
1. Introduction
Figure: Peak SP FP performance of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [11]
1. Introduction
Figure: Bandwidth values of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [11]
2. Basics of the SIMT execution
2. Basics of the SIMT execution (1)
Main alternatives of data parallel execution
Data parallel execution
SIMD execution
• One dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors
• Needs an FX/FP SIMD extension of the ISA
• E.g. 2nd and 3rd generation superscalars
SIMT execution
• Two dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (matrices)
• Is massively multithreaded, and provides data dependent flow control as well as barrier synchronization
• Needs an FX/FP SIMT extension of the ISA and the API
• E.g. GPGPUs, data parallel accelerators
Figure: Main alternatives of data parallel execution
2. Basics of the SIMT execution (2)
Scalar, SIMD and SIMT execution
• Scalar execution. Domain of execution: single data elements
• SIMD execution. Domain of execution: elements of vectors
• SIMT execution. Domain of execution: elements of matrices (at the programming level)
Figure: Domains of execution in case of scalar, SIMD and SIMT execution
Remark
SIMT execution is also termed SPMD (Single_Program Multiple_Data) execution (Nvidia).
2. Basics of the SIMT execution (3)
Key components of the implementation of SIMT execution
• Data parallel execution
• Massive multithreading
• Data dependent flow control
• Barrier synchronization
2. Basics of the SIMT execution (4)
Data parallel execution
Performed by SIMT cores
SIMT cores execute the same instruction stream on a number of ALUs
(i.e. all ALUs of a SIMT core typically perform the same operation).
[Figure content: a SIMT core with one Fetch/Decode unit driving eight ALUs]
Figure: Basic layout of a SIMT core
SIMT cores are the basic building blocks of GPGPU or data parallel accelerators.
During SIMT execution 2-dimensional matrices will be mapped to blocks of SIMT cores.
2. Basics of the SIMT execution (5)
Remark 1
Different manufacturers designate SIMT cores differently, such as
• streaming multiprocessor (Nvidia),
• superscalar shader processor (AMD),
• wide SIMD processor, CPU core (Intel).
2. Basics of the SIMT execution (6)
Each ALU is allocated a working register set (RF)
[Figure content: one Fetch/Decode unit, eight ALUs, and one register set (RF) per ALU]
Figure: Main functional blocks of a SIMT core
2. Basics of the SIMT execution (7)
SIMT ALUs typically perform RRR operations, that is,
ALUs take their operands from and write the calculated results to the register set
(RF) allocated to them.
[Figure content: an ALU reading from and writing to its RF]
Figure: Principle of operation of the SIMD ALUs
2. Basics of the SIMT execution (8)
Remark 2
Actually, the register sets (RF) allocated to the individual ALUs are distinct parts of a
sufficiently large register file.
[Figure content: partitions of one register file (RF) assigned to the ALUs]
Figure: Allocation of distinct parts of a large register set as workspaces of the ALUs
2. Basics of the SIMT execution (9)
Basic operation of recent SIMT ALUs
• basically execute SP FP-MADD (single precision, i.e. 32-bit, Multiply-Add) instructions of the form a×b+c,
• are pipelined, capable of starting a new operation every clock cycle
(more precisely, every shader clock cycle); that is, without further enhancements
their peak performance is 2 SP FP operations/cycle,
• need a few clock cycles, e.g. 2 or 4 shader cycles,
to present the results of the SP FMADD operations to the RF.
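The peak rate stated above translates into a card-level figure by simple arithmetic: number of ALUs times shader clock times operations per cycle. A minimal Python sketch, assuming one MADD counts as 2 operations; the function name `peak_gflops` is illustrative only, and the sample values are those of the 8800 GTX as listed in the overview tables of this deck.

```python
def peak_gflops(num_alus: int, shader_clock_ghz: float, ops_per_cycle: int = 2) -> float:
    """Peak throughput in GFLOPS = ALUs x shader clock (GHz) x operations per cycle."""
    return num_alus * shader_clock_ghz * ops_per_cycle


# 8800 GTX: 128 ALUs at a 1.35 GHz shader clock, 2 operations (one MADD) per cycle
gtx_8800 = peak_gflops(128, 1.35)
print(f"8800 GTX peak FP32: {gtx_8800:.1f} GFLOPS")
```

The same function reproduces other entries when the operations-per-cycle value used in the respective table row is passed in.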
2. Basics of the SIMT execution (10)
Additional operations provided by SIMT ALUs
• FX operations and FX/FP conversions,
• DP FP operations,
• trigonometric functions (usually supported by special functional units).
2. Basics of the SIMT execution (11)
Massive multithreading
Aim of massive multithreading
to speed up computations by increasing the utilization of available computing resources
in case of stalls (e.g. due to cache misses).
Principle
• Suspend stalled threads and allocate ready-to-run threads for execution instead.
• When a large enough number of threads is available, long stalls can be hidden.
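A rough way to see why "a large enough number of threads" matters: while one thread is stalled, the other threads must supply enough work to cover the stall. The sketch below uses a deliberately simplified model (identical threads, a fixed stall length, cost-free thread switches); the function name and the sample cycle counts are assumptions for illustration, not measured values.

```python
import math


def threads_to_hide_stall(stall_cycles: int, busy_cycles: int) -> int:
    """Minimum number of threads that keeps the ALUs busy in the simplified model.

    Each thread computes for `busy_cycles`, then stalls for `stall_cycles`.
    One thread is needed for the work itself, plus enough additional threads
    to fill the stall period with useful work.
    """
    return 1 + math.ceil(stall_cycles / busy_cycles)


# Hypothetical numbers: 20 cycles of work between memory stalls of 400 cycles
print(threads_to_hide_stall(400, 20))
```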
2. Basics of the SIMT execution (12)
Multithreading is implemented by
creating and managing parallel executable threads for each data element of the
execution domain.
Same instructions
for all data elements
Figure: Parallel executable threads for each element of the execution domain
2. Basics of the SIMT execution (13)
Effective implementation of multithreading
requires that thread switches, called context switches, do not cause cycle penalties.
Achieved by
• providing separate contexts (register space) for each thread, and
• implementing a zero-cycle context switch mechanism.
2. Basics of the SIMT execution (14)
[Figure content: a SIMT core with a Fetch/Decode unit, the ALUs and the register file (RF); the RF holds a separate context (CTX) per thread for each ALU, one of them being the actual context, and a context switch selects another CTX]
Figure: Providing separate thread contexts for each thread allocated for execution in a SIMT ALU
2. Basics of the SIMT execution (15)
Data dependent flow control
Implemented by SIMT branch processing
In SIMT processing both paths of a branch are executed one after the other, such that
for each path the prescribed operations are executed only on those data elements which
fulfill the data condition given for that path (e.g. xi > 0).
Example
2. Basics of the SIMT execution (16)
Figure: Execution of branches [24]
The given condition will be checked separately for each thread
2. Basics of the SIMT execution (17)
First all ALUs meeting the condition execute the prescribed three operations,
then all ALUs missing the condition execute the next two operations.
Figure: Execution of branches [24]
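The two-pass branch execution described above can be mimicked in a few lines of Python. This is only a sketch of the principle: the list positions stand in for ALU lanes, the mask for the per-thread condition, and the operations on each path (doubling, negating) are made-up placeholders rather than the operations shown in the figure.

```python
def simt_branch(xs):
    """Execute both paths of `if x > 0` one after the other under a lane mask."""
    mask = [x > 0 for x in xs]   # condition checked separately for each thread
    out = list(xs)

    # Pass 1: only the lanes meeting the condition are active
    for lane, active in enumerate(mask):
        if active:
            out[lane] = xs[lane] * 2

    # Pass 2: only the lanes missing the condition are active
    for lane, active in enumerate(mask):
        if not active:
            out[lane] = -xs[lane]

    # After both passes all lanes resume the common instruction stream
    return out


print(simt_branch([3, -1, 0, 5]))
```

Note that every lane spends time in both passes, which is why divergent branches reduce the utilization of the ALUs.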
2. Basics of the SIMT execution (18)
Figure: Resuming instruction stream processing after executing a branch [24]
2. Basics of the SIMT execution (19)
Barrier synchronization
Makes all threads wait until all prior instructions have completed before the next instruction is executed.
Implemented e.g. in AMD’s Intermediate Language (IL) by the fence threads instruction [10].
Remark
In the R600 ISA this instruction is coded by setting the BARRIER field of the Control Flow
(CF) instruction format [7].
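The effect of a barrier can be illustrated with ordinary host threads. The sketch below uses Python's standard `threading.Barrier` as a stand-in; it is not AMD IL or R600 code, and the function name `neighbour_sums` and the two-phase computation are assumptions chosen for illustration. Without the barrier a thread could read its neighbour's slot before the neighbour has written it.

```python
import threading


def neighbour_sums(values):
    """Phase 1: each thread squares its own element.
    Barrier: wait until every thread has finished phase 1.
    Phase 2: each thread adds its neighbour's squared element to its own."""
    n = len(values)
    squared = [0] * n
    result = [0] * n
    barrier = threading.Barrier(n)

    def worker(i):
        squared[i] = values[i] ** 2                     # phase 1
        barrier.wait()                                  # all prior writes are done
        result[i] = squared[i] + squared[(i + 1) % n]   # phase 2

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


print(neighbour_sums([1, 2, 3, 4]))
```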
2. Basics of the SIMT execution (20)
Principle of SIMT execution assuming serial kernel processing
Host / Device
Each kernel invocation executes all thread blocks (Block(i,j)) belonging to the related Grid.
Remark
In the Figure CUDA terminology is used.
Figure: Hierarchy of threads [25]
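The grid/block/thread hierarchy can be emulated serially to show how each thread finds its own matrix element. The sketch below follows the usual CUDA indexing rule (global index = block index × block size + thread index within the block); the Python function names and the tiny 4×4 example are assumptions for illustration, and a real device would run the threads in parallel rather than in nested loops.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Map (block, thread-in-block) coordinates to matrix coordinates (x, y)."""
    bx, by = block_idx
    dx, dy = block_dim
    tx, ty = thread_idx
    return bx * dx + tx, by * dy + ty


def launch(grid_dim, block_dim, kernel):
    """Serially emulate one kernel invocation over all blocks of the grid."""
    for by in range(grid_dim[1]):
        for bx in range(grid_dim[0]):
            for ty in range(block_dim[1]):
                for tx in range(block_dim[0]):
                    kernel((bx, by), block_dim, (tx, ty))


# A 4x4 matrix processed by a 2x2 grid of 2x2 thread blocks
matrix = [[row * 4 + col for col in range(4)] for row in range(4)]
doubled = [[0] * 4 for _ in range(4)]


def double_kernel(block_idx, block_dim, thread_idx):
    x, y = global_index(block_idx, block_dim, thread_idx)
    doubled[y][x] = 2 * matrix[y][x]   # same instructions for all data elements


launch((2, 2), (2, 2), double_kernel)
print(doubled)
```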
2. Basics of the SIMT execution (21)
Remark
Parallel kernel processing is also possible on advanced GPGPU devices
(such as Nvidia’s Fermi or AMD’s HD 69xx GPGPUs) with appropriate software support.
3. Overview of GPGPUs
3. Overview of GPGPUs (1)
Basic implementation alternatives of the SIMT execution
GPGPUs
• Programmable GPUs with appropriate programming environments
• Have display outputs
• E.g. Nvidia’s 8800 and GTX lines, AMD’s HD 38xx, HD48xx lines
Data parallel accelerators
• Dedicated units supporting data parallel execution with appropriate programming environment
• No display outputs
• Have larger memories than GPGPUs
• E.g. Nvidia’s Tesla lines, AMD’s FireStream lines
Figure: Basic implementation alternatives of the SIMT execution
3. Overview of GPGPUs (2)
GPGPUs
Nvidia’s line: G80 (90 nm) → shrink → G92 (65 nm) → enhanced arch. → G200 (65 nm) → shrink, enhanced arch. → GF100 (Fermi) (40 nm)
AMD/ATI’s line: R600 (80 nm) → shrink → RV670 (55 nm) → enhanced arch. → RV770 (55 nm) → shrink, enhanced arch. → RV870 (40 nm) → enhanced arch. → Cayman (40 nm)
Figure: Overview of Nvidia’s and AMD/ATI’s GPGPU lines
3. Overview of GPGPUs (3)
NVidia
Cores
• G80 (11/06): 90 nm/681 mtrs
• G92 (10/07): 65 nm/754 mtrs
• GT200 (6/08): 65 nm/1400 mtrs
Cards
• 8800 GTS: 96 ALUs, 320-bit
• 8800 GTX: 128 ALUs, 384-bit
• 8800 GT: 112 ALUs, 256-bit
• GTX260: 192 ALUs, 448-bit
• GTX280: 240 ALUs, 512-bit
CUDA
• Version 1.0 (6/07), Version 1.1 (11/07), Version 2.0 (6/08), Version 2.1 (11/08)
OpenCL
• OpenCL standard (12/08)
AMD/ATI
Cores
• R500 (11/05): (Xbox), 48 ALUs
• R600 (5/07): 80 nm/681 mtrs
• R670 (11/07): 55 nm/666 mtrs
• RV770 (5/08): 55 nm/956 mtrs
Cards
• HD 2900XT: 320 ALUs, 512-bit
• HD 3850: 320 ALUs, 256-bit
• HD 3870: 320 ALUs, 256-bit
• HD 4850: 800 ALUs, 256-bit
• HD 4870: 800 ALUs, 256-bit
Brook+
• Brook+ (SDK v.1.0, 11/07), Brook+ 1.2 (SDK v.1.2, 9/08), Brook+ 1.3 (SDK v.1.3, 12/08)
RapidMind
• 3870 support (6/08)
OpenCL
• OpenCL standard (12/08)
Time axis: 2005 to 2008
Figure: Overview of GPGPUs and their basic software support (1)
3. Overview of GPGPUs (4)
NVidia
Cores
• GF100 (Fermi) (3/10): 40 nm/3000 mtrs
• GF104 (Fermi) (07/10): 40 nm/1950 mtrs
• GF110 (Fermi) (11/10): 40 nm/3000 mtrs
Cards
• GTX 470: 448 ALUs, 320-bit
• GTX 480: 480 ALUs, 384-bit
• GTX 460: 336 ALUs, 192/256-bit
• GTX 580: 512 ALUs, 384-bit
• GTX 560 Ti (1/11): 480 ALUs, 384-bit
OpenCL
• OpenCL 1.0, SDK 1.0 Early release (6/09); OpenCL 1.0, SDK 1.0 (10/09); OpenCL 1.1, SDK 1.1 (6/10)
CUDA
• Version 2.2 (5/09), Version 2.3 (6/09), Version 3.0 (3/10), Version 3.1 (6/10), Version 3.2 (1/11), Version 4.0 Beta (3/11)
AMD/ATI
Cores
• RV870 (Cypress) (9/09): 40 nm/2100 mtrs
• Barts Pro/XT (10/10): 40 nm/1700 mtrs
• Cayman Pro/XT (12/10): 40 nm/2640 mtrs
Cards
• HD 5850/70: 1440/1600 ALUs, 256-bit
• HD 6850/70: 960/1120 ALUs, 256-bit
• HD 6950/70: 1408/1536 ALUs, 256-bit
OpenCL
• OpenCL 1.0 (SDK V.2.0, 11/09); OpenCL 1.0 (SDK V.2.01, 03/10); OpenCL 1.1 (SDK V.2.2, 08/10)
Brook+
• Brook+ 1.4 (SDK V.1.4 Beta, 3/09)
RapidMind
• Intel bought RapidMind (8/09)
Time axis: 2009 to 2011
Figure: Overview of GPGPUs and their basic software support (2)
3. Overview of GPGPUs (5)
Remarks on AMD-based graphics cards [45], [66]
Beginning with their Cypress-based HD 5xxx line and SDK v.2.0, AMD abandoned Brook+
and started supporting OpenCL as their basic HLL programming language.
As a consequence AMD also changed
• the microarchitecture of their GPGPUs (by introducing Local and Global Data Share
memories) and
• their terminology, by distinguishing Pre-OpenCL and OpenCL terminology, as discussed
in Section 5.2.
3. Overview of GPGPUs (6)
Remarks on Fermi-based graphics cards [45], [66]
FP64 speed
• ½ of the FP32 speed for the Tesla 20-series
• 1/8 of the FP32 speed for the GeForce GTX 470/480/570/580 cards
• 1/12 of the FP32 speed for other GeForce GTX 4xx cards
ECC
Available only on the Tesla 20-series.
Number of DMA engines
The Tesla 20-series has 2 DMA engines (copy engines), GeForce cards have 1 DMA engine.
This means that CUDA applications can overlap computation and communication on Tesla
using bi-directional communication over PCIe.
Memory size
Tesla 20 products have larger on-board memory (3 GB and 6 GB).
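The FP64 ratios listed above can be applied directly to the peak FP32 figures quoted elsewhere in this deck. A small Python sketch; the dictionary keys and the function name `peak_fp64_gflops` are illustrative labels, not Nvidia terminology, and the FP32 inputs are the values from the overview tables.

```python
# FP64 throughput as a fraction of FP32 throughput, per the remarks above
FP64_RATIO = {
    "tesla_20": 1 / 2,           # Tesla 20-series
    "geforce_high_end": 1 / 8,   # GeForce GTX 470/480/570/580
    "geforce_other": 1 / 12,     # other GeForce GTX 4xx cards
}


def peak_fp64_gflops(peak_fp32_gflops: float, family: str) -> float:
    """Derive the peak FP64 rate from the peak FP32 rate and the family ratio."""
    return peak_fp32_gflops * FP64_RATIO[family]


print(peak_fp64_gflops(1030.4, "tesla_20"))        # Tesla C2050/C2070
print(peak_fp64_gflops(1345, "geforce_high_end"))  # GTX 480
print(peak_fp64_gflops(907.2, "geforce_other"))    # GTX 460
```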
3. Overview of GPGPUs (7)
Positioning Nvidia’s discussed GPGPU cards in their entire product portfolio [82]
3. Overview of GPGPUs (8)
Nvidia’s compute capability concept
Nvidia manages the continuous evolution by
a) defining sets of capabilities and features designated as compute capability versions,
b) specifying which compute capability version is supported by their
• programming environments, represented by their SDKs, and
• GPGPU lines,
c) and specifying compatibility rules among them.
3. Overview of GPGPUs (9)
a) Defined sets of compute capability versions by Nvidia-1 [81]
3. Overview of GPGPUs (10)
a) Defined sets of compute capability versions by Nvidia-2 [81]
3. Overview of GPGPUs (11)
b1) Compute capability versions of the PTX ISAs generated by different releases of
CUDA SDKs [50]
Fermi
3. Overview of GPGPUs (12)
b2) Support of the compute capability versions by Nvidia’s GPGPU cards [81]
Capability | GPGPU cores | GPGPU devices
1.0 | G80 | GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870, FX4/5600, 360M
1.1 | G86, G84, G98, G96, G96b, G94, G94b, G92, G92b | GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT 120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
1.2 | GT218, GT216, GT215 | GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
1.3 | GT200, GT200b | GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX 3/4/5800
2.0 | GF100, GF110 | GTX 465, 470/80, Tesla C2050/70, S/M2050/70, Quadro 600, 4/5/6000, Plex7000, GTX570, GTX580
2.1 | GF108, GF106, GF104, GF114 | GT 420/30/40, GTS 450, GTX 460, 500M
3. Overview of GPGPUs (13)
c) Compatibility rules related to compute capability versions [50]
The basic rule is forward compatibility within the main versions (versions 1.x and 2.x),
but not across main versions.
This is interpreted as follows
Object files (called CUBIN files) compiled to a particular compute capability, are supported
on all devices having the same or higher version number within the same main version.
E.g. object files compiled to the compute capability 1.0 are supported on all 1.x devices
but not supported on compute capability 2.0 (Fermi) devices.
For more details see [52].
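The compatibility rule stated above reduces to a two-part check: same main version, and a device minor version that is not lower than the one the object file was compiled for. A minimal Python sketch of exactly that rule; the function name `cubin_runs_on` and the tuple representation are assumptions for illustration and do not correspond to any CUDA API.

```python
def cubin_runs_on(compiled_for, device):
    """Return True if a CUBIN compiled for `compiled_for` is supported on `device`.

    Both arguments are (major, minor) compute capability tuples, e.g. (1, 3).
    """
    c_major, c_minor = compiled_for
    d_major, d_minor = device
    return c_major == d_major and d_minor >= c_minor


print(cubin_runs_on((1, 0), (1, 3)))  # 1.0 object file on a 1.3 device
print(cubin_runs_on((1, 0), (2, 0)))  # 1.0 object file on a Fermi device
```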
3. Overview of GPGPUs (14)
 | 8800 GTS | 8800 GTX | 8800 GT | GTX 260 | GTX 280
Core | G80 | G80 | G92 | GT200 | GT200
Introduction | 11/06 | 11/06 | 10/07 | 6/08 | 6/08
IC technology | 90 nm | 90 nm | 65 nm | 65 nm | 65 nm
Nr. of transistors | 681 mtrs | 681 mtrs | 754 mtrs | 1400 mtrs | 1400 mtrs
Die area | 480 mm2 | 480 mm2 | 324 mm2 | 576 mm2 | 576 mm2
Core frequency | 500 MHz | 575 MHz | 600 MHz | 576 MHz | 602 MHz
Computation
No. of SMs (cores) | 12 | 16 | 14 | 24 | 30
No. of FP32 EUs | 96 | 128 | 112 | 192 | 240
Shader frequency | 1.2 GHz | 1.35 GHz | 1.512 GHz | 1.242 GHz | 1.296 GHz
No. FP32 operations/cycle | 2 | 2 (1) | 3 | 3 | 3
Peak FP32 performance | 230.4 GFLOPS | 345.6 (1) GFLOPS | 508 GFLOPS | 715 GFLOPS | 933 GFLOPS
Peak FP64 performance | – | – | – | 59.62 GFLOPS | 77.76 GFLOPS
Memory
Mem. transfer rate (eff) | 1600 Mb/s | 1800 Mb/s | 1800 Mb/s | 1998 Mb/s | 2214 Mb/s
Mem. interface | 320-bit | 384-bit | 256-bit | 448-bit | 512-bit
Mem. bandwidth | 64 GB/s | 86.4 GB/s | 57.6 GB/s | 111.9 GB/s | 141.7 GB/s
Mem. size | 320 MB | 768 MB | 512 MB | 896 MB | 1.0 GB
Mem. type | GDDR3 | GDDR3 | GDDR3 | GDDR3 | GDDR3
Mem. channel | 5*64-bit | 6*64-bit | 4*64-bit | 7*64-bit | 8*64-bit
System
Multi. CPU techn. | SLI | SLI | SLI | SLI | SLI
Interface | PCIe x16 | PCIe x16 | PCIe 2.0x16 | PCIe 2.0x16 | PCIe 2.0x16
MS Direct X | 10 | 10 | 10 | 10.1 subset | 10.1 subset
TDP | 146 W | 155 W | 105 W | 182 W | 236 W
(1): Nvidia takes the FP32 capable Texture Processing Units also into consideration and calculates with 3 FP32 operations/cycle
Table: Main features of Nvidia’s GPGPUs-1
3. Overview of GPGPUs (15)
Remarks
In publications there are conflicting statements about whether or not the G80 makes use
of dual issue (a MAD and a MUL operation) within a period of four shader cycles.
Official specifications [22] declare the capability of dual issue, but other literature sources [64]
and even a textbook co-authored by one of the chief developers of the G80 (D. Kirk [65])
deny it.
A clarification can be found in a blog [66], which reveals that the higher figure given in Nvidia’s
specifications includes calculations made both by the ALUs in the SMs and by the texture
processing units (TPUs).
Nevertheless, the TPUs cannot be directly accessed by CUDA except for graphical tasks,
such as texture filtering.
Accordingly, in our discussion focusing on numerical calculations it is fair to take only
the MAD operations into account for specifying the peak numerical performance.
3. Overview of GPGPUs (16)
Structure of an SM of the G80 architecture
Texture processing Units
consisting of
• TA: Texture Address units
• TF: Texture Filter Units
They are FP32 or FP16 capable [46]
3. Overview of GPGPUs (17)
 | GTX 470 | GTX 480 | GTX 460 | GTX 570 | GTX 580
Core | GF100 | GF100 | GF104 | GF110 | GF110
Introduction | 3/10 | 3/10 | 7/10 | 12/10 | 11/10
IC technology | 40 nm | 40 nm | 40 nm | 40 nm | 40 nm
Nr. of transistors | 3200 mtrs | 3200 mtrs | 1950 mtrs | 3000 mtrs | 3000 mtrs
Die area | 529 mm2 | 529 mm2 | 367 mm2 | 520 mm2 | 520 mm2
Core frequency | | | | 732 MHz | 772 MHz
Computation
No. of SMs (cores) | 14 | 15 | 7 | 15 | 16
No. of FP32 EUs | 448 | 480 | 336 | 480 | 512
Shader frequency | 1215 MHz | 1401 MHz | 1350 MHz | 1464 MHz | 1544 MHz
No. FP32 operations/cycle | 2 | 2 | 3 | 2 | 2
Peak FP32 performance | 1088 GFLOPS | 1345 GFLOPS | 907.2 GFLOPS | 1405 GFLOPS | 1581 GFLOPS
Peak FP64 performance | 136 GFLOPS | 168 GFLOPS | 75.6 GFLOPS | 175.6 GFLOPS | 197.6 GFLOPS
Memory
Mem. transfer rate (eff) | 3348 Mb/s | 3698 Mb/s | 3600 Mb/s | 3800 Mb/s | 4008 Mb/s
Mem. interface | 320-bit | 384-bit | 192/256-bit | 320-bit | 384-bit
Mem. bandwidth | 133.9 GB/s | 177.4 GB/s | 86.4/115.2 GB/s | 152 GB/s | 192.4 GB/s
Mem. size | 1.28 GB | 1.536 GB | 0.768/1.024 GB | 1.28 GB | 1.536/3.072 GB
Mem. type | GDDR5 | GDDR5 | GDDR5 | GDDR5 | GDDR5
Mem. channel | 5*64-bit | 6*64-bit | 3/4*64-bit | 5*64-bit | 6*64-bit
System
Multi. CPU techn. | SLI | SLI | SLI | SLI | SLI
Interface | PCIe 2.0*16 | PCIe 2.0*16 | PCIe 2.0*16 | PCIe 2.0*16 | PCIe 2.0*16
MS Direct X | 11 | 11 | 11 | 11 | 11
TDP | 215 W | 250 W | 150/160 W | 219 W | 244 W
Table: Main features of Nvidia’s GPGPUs-2
3. Overview of GPGPUs (18)
Remarks
1) The GDDR3 memory has a double clocked data transfer
Effective memory transfer rate = 2 x memory frequency
The GDDR5 memory has a quad clocked data transfer
Effective memory transfer rate = 4 x memory frequency
2) Both the GDDR3 and GDDR5 memories are 32-bit devices.
Nevertheless, memory controllers of GPGPUs may be designed either to control a single
32-bit memory channel or dual memory channels, providing a 64-bit channel width.
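The two remarks above are enough to reproduce the bandwidth rows of the tables: the effective transfer rate follows from the memory clock and the memory type, and the bandwidth follows from the transfer rate and the interface width. A Python sketch under these assumptions; the function names are illustrative, and the 900 MHz and 1002 MHz memory clocks are derived here by dividing the tabulated transfer rates by 2 and 4, they are not quoted from the source.

```python
# Data transfers per memory clock, per remark 1)
TRANSFERS_PER_CLOCK = {"GDDR3": 2, "GDDR5": 4}


def effective_rate_mbps(memory_clock_mhz: float, mem_type: str) -> float:
    """Effective transfer rate per data pin in Mb/s."""
    return memory_clock_mhz * TRANSFERS_PER_CLOCK[mem_type]


def bandwidth_gb_s(rate_mbps: float, interface_bits: int) -> float:
    """Memory bandwidth in GB/s = rate per pin x interface width / 8 bits per byte."""
    return rate_mbps * interface_bits / 8 / 1000


# 8800 GTX: GDDR3, 384-bit interface, 1800 Mb/s effective rate
rate_8800 = effective_rate_mbps(900, "GDDR3")
print(rate_8800, bandwidth_gb_s(rate_8800, 384))

# GTX 580: GDDR5, 384-bit interface, 4008 Mb/s effective rate
rate_580 = effective_rate_mbps(1002, "GDDR5")
print(rate_580, bandwidth_gb_s(rate_580, 384))
```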
3. Overview of GPGPUs (19)
Examples for Nvidia cards
Nvidia GeForce GTX 480 (GF 100 based) [47]
3. Overview of GPGPUs (20)
Nvidia GeForce GTX 480 and 580 cards [77]
GTX 480
(GF 100 based)
GTX 580
(GF 110 based)
3. Overview of GPGPUs (21)
A pair of GeForce GTX 480 cards [47]
(GF100 based)
3. Overview of GPGPUs (22)
 | HD 2900XT | HD 3850 | HD 3870 | HD 4850 | HD 4870
Core | R600 | R670 | R670 | RV770 (R700-based) | RV770 (R700-based)
Introduction | 5/07 | 11/07 | 11/07 | 5/08 | 5/08
IC technology | 80 nm | 55 nm | 55 nm | 55 nm | 55 nm
Nr. of transistors | 700 mtrs | 666 mtrs | 666 mtrs | 956 mtrs | 956 mtrs
Die area | 408 mm2 | 192 mm2 | 192 mm2 | 260 mm2 | 260 mm2
Core frequency | 740 MHz | 670 MHz | 775 MHz | 625 MHz | 750 MHz
Computation
No. of ALUs | 320 | 320 | 320 | 800 | 800
Shader frequency | 740 MHz | 670 MHz | 775 MHz | 625 MHz | 750 MHz
No. FP32 operations/cycle | 2 | 2 | 2 | 2 | 2
Peak FP32 performance | 471.6 GFLOPS | 429 GFLOPS | 496 GFLOPS | 1000 GFLOPS | 1200 GFLOPS
Peak FP64 performance | – | – | – | 200 GFLOPS | 240 GFLOPS
Memory
Mem. transfer rate (eff) | 1600 Mb/s | 1660 Mb/s | 2250 Mb/s | 2000 Mb/s | 3600 Mb/s (GDDR5)
Mem. interface | 512-bit | 256-bit | 256-bit | 256-bit | 256-bit
Mem. bandwidth | 105.6 GB/s | 53.1 GB/s | 72.0 GB/s | 64 GB/s | 118 GB/s
Mem. size | 512 MB | 256 MB | 512 MB | 512 MB | 512 MB
Mem. type | GDDR3 | GDDR3 | GDDR4 | GDDR3 | GDDR3/GDDR5
Mem. channel | 8*64-bit | 8*32-bit | 8*32-bit | 4*64-bit | 4*64-bit
Mem. contr. | Ring bus | Ring bus | Ring bus | Crossbar | Crossbar
System
Multi. CPU techn. | CrossFire X | CrossFire X | CrossFire X | CrossFire X | CrossFire X
Interface | PCIe x16 | PCIe 2.0x16 | PCIe 2.0x16 | PCIe 2.0x16 | PCIe 2.0x16
MS Direct X | 10 | 10.1 | 10.1 | 10.1 | 10.1
TDP (max.) | 150 W | 75 W | 105 W | 110 W | 150 W
Table: Main features of AMD/ATIs GPGPUs-1
3. Overview of GPGPUs (23)
Evergreen series
 | HD 5850 | HD 5870 | HD 5970
Core | Cypress PRO (RV870-based) | Cypress XT (RV870-based) | Hemlock XT (RV870-based)
Introduction | 9/09 | 9/09 | 11/09
IC technology | 40 nm | 40 nm | 40 nm
Nr. of transistors | 2154 mtrs | 2154 mtrs | 2*2154 mtrs
Die area | 334 mm2 | 334 mm2 | 2*334 mm2
Core frequency | 725 MHz | 850 MHz | 725 MHz
Computation
No. of SIMD cores / VLIW5 ALUs | 18/16 | 20/16 | 2*20/16
No. of EUs | 1440 | 1600 | 2*1600
Shader frequency | 725 MHz | 850 MHz | 725 MHz
No. FP32 inst./cycle | 2 | 2 | 2
Peak FP32 performance | 2088 GFLOPS | 2720 GFLOPS | 4640 GFLOPS
Peak FP64 performance | 417.6 GFLOPS | 544 GFLOPS | 928 GFLOPS
Memory
Mem. transfer rate (eff) | 4000 Mb/s | 4800 Mb/s | 4000 Mb/s
Mem. interface | 256-bit | 256-bit | 2*256-bit
Mem. bandwidth | 128 GB/s | 153.6 GB/s | 2*128 GB/s
Mem. size | 1.0 GB | 1.0/2.0 GB | 2*(1.0/2.0) GB
Mem. type | GDDR5 | GDDR5 | GDDR5
Mem. channel | 8*32-bit | 8*32-bit | 2*8*32-bit
System
Multi. CPU techn. | CrossFire X | CrossFire X | CrossFire X
Interface | PCIe 2.1*16 | PCIe 2.1*16 | PCIe 2.1*16
MS Direct X | 11 | 11 | 11
TDP Max./Idle | 151/27 W | 188/27 W | 294/51 W
Table: Main features of AMD/ATI’s GPGPUs-2
3. Overview of GPGPUs (24)
Northern Islands series
 | HD 6850 | HD 6870
Core | Barts Pro | Barts XT
Introduction | 10/10 | 10/10
IC technology | 40 nm | 40 nm
Nr. of transistors | 1700 mtrs | 1700 mtrs
Die area | 255 mm2 | 255 mm2
Core frequency | 775 MHz | 900 MHz
Computation
No. of SIMD cores / VLIW5 ALUs | 12/16 | 14/16
No. of EUs | 960 | 1120
Shader frequency | 775 MHz | 900 MHz
No. FP32 inst./cycle | 2 | 2
Peak FP32 performance | 1488 GFLOPS | 2016 GFLOPS
Peak FP64 performance | - | -
Memory
Mem. transfer rate (eff) | 4000 Mb/s | 4200 Mb/s
Mem. interface | 256-bit | 256-bit
Mem. bandwidth | 128 GB/s | 134.4 GB/s
Mem. size | 1 GB | 1 GB
Mem. type | GDDR5 | GDDR5
Mem. channel | 8*32-bit | 8*32-bit
System
Multi. CPU techn. | CrossFire X | CrossFire X
Interface | PCIe 2.1*16 | PCIe 2.1*16
MS Direct X | 11 | 11
TDP Max./Idle | 127/19 W | 151/19 W
Table: Main features of AMD/ATI’s GPGPUs-3
3. Overview of GPGPUs (25)
Northern Islands series
 | HD 6950 | HD 6970 | HD 6990 | HD 6990 unlocked
Core | Cayman Pro | Cayman XT | Antilles | Antilles
Introduction | 12/10 | 12/10 | 3/11 | 3/11
IC technology | 40 nm | 40 nm | 40 nm | 40 nm
Nr. of transistors | 2.64 billion | 2.64 billion | 2*2.64 billion | 2*2.64 billion
Die area | 389 mm2 | 389 mm2 | 2*389 mm2 | 2*389 mm2
Core frequency | 800 MHz | 880 MHz | 830 MHz | 880 MHz
Computation
No. of SIMD cores / VLIW4 ALUs | 22/16 | 24/16 | 2*24/16 | 2*24/16
No. of EUs | 1408 | 1536 | 2*1536 | 2*1536
Shader frequency | 800 MHz | 880 MHz | 830 MHz | 880 MHz
No. FP32 inst./cycle / ALU | 4 | 4 | 4 | 4
Peak FP32 performance | 2.25 TFLOPS | 2.7 TFLOPS | 5.1 TFLOPS | 5.4 TFLOPS
Peak FP64 performance | 0.5625 TFLOPS | 0.683 TFLOPS | 1.275 TFLOPS | 1.35 TFLOPS
Memory
Mem. transfer rate (eff) | 5000 Mb/s | 5500 Mb/s | 5000 Mb/s | 5000 Mb/s
Mem. interface | 256-bit | 256-bit | 256-bit | 256-bit
Mem. bandwidth | 160 GB/s | 176 GB/s | 2*160 GB/s | 2*160 GB/s
Mem. size | 2 GB | 2 GB | 2*2 GB | 2*2 GB
Mem. type | GDDR5 | GDDR5 | GDDR5 | GDDR5
Mem. channel | 8*32-bit | 8*32-bit | 2*8*32-bit | 2*8*32-bit
System
ECC | - | - | - | -
Multi. CPU techn. | CrossFireX | CrossFireX | CrossFireX | CrossFireX
Interface | PCIe 2.1*16 | PCIe 2.1*16 | PCIe 2.1*16 | PCIe 2.1*16
MS Direct X | 11 | 11 | 11 | 11
TDP Max./Idle | 200/20 W | 250/20 W | 350/37 W | 415/37 W
Table: Main features of AMD/ATIs GPGPUs-4
3. Overview of GPGPUs (26)
Remark
The Radeon HD 5xxx line of cards is designated also as the Evergreen series and
the Radeon HD 6xxx line of cards is designated also as the Northern Islands series.
3. Overview of GPGPUs (27)
Examples for AMD cards
HD 5870 (RV870 based) [41]
3. Overview of GPGPUs (28)
HD 5970 (actually RV870 based) [80]
ATI HD 5970: 2 x ATI HD 5870 with slightly reduced memory clock
3. Overview of GPGPUs (29)
HD 5970 (actually RV870 based) [79]
ATI HD 5970: 2 x ATI HD 5870 with slightly reduced memory clock
3. Overview of GPGPUs (30)
AMD HD 6990 (actually Cayman based) [78]
AMD HD 6990: 2 x ATI HD 6970 with slightly reduced memory and shader clock
3. Overview of GPGPUs (31)
Price relations (as of 01/2011)
Nvidia
• GTX 570: ~ 350 $
• GTX 580: ~ 500 $
AMD
• HD 6970: ~ 400 $
• HD 6990 (Dual 6970): ~ 700 $
4. Overview of data parallel accelerators
4. Overview of data parallel accelerators (1)
Data parallel accelerators
Implementation alternatives of data parallel accelerators
On card
implementation
On-die
integration
Recent
implementations
E.g.
Emerging
implementations
GPU cards
Intel’s Heavendahl
Data-parallel
accelerator cards
AMD’s Torrenza
integration technology
Intel’s Sandy Bridge (2011)
AMD’s Fusion (2008)
integration technology
2010/2011
Trend
Figure: Implementation alternatives of dedicated data parallel accelerators
4. Overview of data parallel accelerators (2)
On-card accelerators
Card implementations
• Single cards fitting into a free PCI-E x16 slot of the host computer.
• E.g. Nvidia Tesla C870, Nvidia Tesla C1060, Nvidia Tesla C2070, AMD FireStream 9170, AMD FireStream 9250, AMD FireStream 9370
Desktop implementations
• Usually dual cards mounted into a box, connected to an adapter card that is inserted into a free PCI-E x16 slot of the host PC through a cable.
• E.g. Nvidia Tesla D870
1U server implementations
• Usually 4 cards mounted into a 1U server rack, connected to two adapter cards that are inserted into two free PCI-E x16 slots of a server through two switches and two cables.
• E.g. Nvidia Tesla S870, Nvidia Tesla S1070, Nvidia Tesla S2050/S2070
Figure: Implementation alternatives of on-card accelerators
4. Overview of data parallel accelerators (3)
NVidia Tesla-1
G80-based
• Card: C870 (6/07), 1.5 GB GDDR3, SP: 345.6 GFLOPS, DP: -
• Desktop: D870 (6/07), 2*C870 incl., 3 GB GDDR3, SP: 691.2 GFLOPS, DP: -
• 1U Server: S870 (6/07), 4*C870 incl., 6 GB GDDR3, SP: 1382 GFLOPS, DP: -
GT200-based
• Card: C1060 (6/08), 4 GB GDDR3, SP: 933 GFLOPS, DP: 77.76 GFLOPS
• 1U Server: S1070 (6/08), 4*C1060, 16 GB GDDR3, SP: 3732 GFLOPS, DP: 311 GFLOPS
CUDA
• Version 1.0 (6/07), Version 1.1 (11/07), Version 2.0 (6/08)
Time axis: 2007 to 2008
Figure: Overview of Nvidia’s G80/G200-based Tesla family-1
4. Overview of data parallel accelerators (4)
FB: Frame Buffer
Figure: Main functional units of Nvidia’s Tesla C870 card [2]
4. Overview of data parallel accelerators (5)
Figure: Nvida’s Tesla C870 and
AMD’s FireStream 9170 cards [2], [3]
4. Overview of data parallel accelerators (6)
Figure: Tesla D870 desktop implementation [4]
4. Overview of data parallel accelerators (7)
Figure: Nvidia’s Tesla D870 desktop implementation [4]
4. Overview of data parallel accelerators (8)
Figure: PCI-E x16 host adapter card of Nvidia’s Tesla D870 desktop [4]
4. Overview of data parallel accelerators (9)
Figure: Concept of Nvidia’s Tesla S870 1U rack server [5]
4. Overview of data parallel accelerators (10)
Figure: Internal layout of Nvidia’s Tesla S870 1U rack [6]
4. Overview of data parallel accelerators (11)
Figure: Connection cable between Nvidia’s Tesla S870 1U rack and the adapter cards
inserted into PCI-E x16 slots of the host server [6]
4. Overview of data parallel accelerators (12)
NVidia Tesla-2
GF100 (Fermi)-based
• Card: C2050/C2070 (11/09), 3/6 GB GDDR5, SP: 1.03 TFLOPS (1), DP: 0.515 TFLOPS
• Module: M2050/M2070 (04/10), 3/6 GB GDDR5, SP: 1.03 TFLOPS (1), DP: 0.515 TFLOPS
• Module: M2070Q (08/10), 6 GB GDDR5, SP: 1.03 TFLOPS (1), DP: 0.515 TFLOPS
• 1U Server: S2050/S2070 (11/09), 4*C2050/C2070, 12/24 GB GDDR5, SP: 4.1 TFLOPS (1), DP: 2.06 TFLOPS
CUDA
• Version 2.2 (5/09), Version 2.3 (6/09), Version 3.0 (3/10), Version 3.1 (6/10), Version 3.2 (1/11)
OpenCL
• OpenCL 1.1 (6/10)
Time axis: 2009 to 2011
(1): Without SF (Special Function) operations
Figure: Overview of Nvidia’s GF100 (Fermi)-based Tesla family
4. Overview of data parallel accelerators (13)
Fermi based Tesla devices
Tesla C2050/C2070 Card [71]
(11/2009)
Single GPU Card
3/6 GB GDDR5
515 GFLOPS DP
ECC
Tesla S2050/S2070 1U [72]
(11/2009)
Four GPUs
12/24 GB GDDR5
2060 GFLOPS DP
ECC
4. Overview of data parallel accelerators (14)
Tesla M2050/M2070/M2070Q Processor Module
(Dual slot board with PCIe Gen. 2 x16 interface)
(04/2010)
Figure: Tesla M2050/M2070/M2070Q Processor Module [74]
Used in the Tianhe-1A Chinese supercomputer (10/2010)
Remark
The M2070Q is an upgrade of the M2070 providing higher memory clock (introduced 08/2010)
4. Overview of data parallel accelerators (15)
Tianhe-1A (10/2010) [48]
• Upgraded version of the Tianhe-1 (China)
• 2.6 PetaFLOPS (fastest supercomputer in the World in 2010)
• 14 336 Intel Xeon 5670
• 7 168 Nvidia Tesla M2050
4. Overview of data parallel accelerators (16)
Specification data of the Tesla M2050/M2070/M2070Q modules [74]
(448 ALUs)
(448 ALUs)
Remark
The M2070Q is an upgrade of the M2070, providing higher memory clock (introduced 08/2010)
4. Overview of data parallel accelerators (17)
Support of ECC
• Fermi based Tesla devices introduced the support of ECC.
• By contrast, at present neither Nvidia’s plain GPGPU cards nor AMD’s GPGPU or
DPA devices support ECC [76].
4. Overview of data parallel accelerators (18)
Tesla S2050/S2070 1U
The S2050/S2070 differ only in the memory size, the S2050 includes 12 GB, the S2070 24 GB.
GPU Specification
• Number of processor cores: 448
• Processor core clock: 1.15 GHz
• Memory clock: 1.546 GHz
• Memory interface: 384 bit
System Specification
• Four Fermi GPUs
• 12.0/24.0 GB of GDDR5, configured as 3.0/6.0 GB per GPU
• When ECC is turned on, available memory is ~10.5 GB
• Typical power consumption: 900 W
Figure: Block diagram and technical specifications of Tesla S2050/S2070 [75]
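The ~10.5 GB figure quoted for ECC operation can be reproduced with a short calculation. The sketch below is illustrative only: it assumes that ECC reserves one eighth of the GDDR5 capacity, which is consistent with the 12 GB to ~10.5 GB figure above but is not stated on the slide. The function and constant names are made up for this example.

```python
# Sketch: usable memory of a Tesla S2050/S2070 system with ECC enabled.
# ASSUMPTION (not stated in the source): ECC reserves 1/8 of the GDDR5
# capacity; this matches the quoted ~10.5 GB out of 12 GB.
ECC_RESERVED_FRACTION = 1 / 8

GPUS_PER_SYSTEM = 4  # "Four Fermi GPUs" in the system specification


def usable_memory_gb(total_gb: float, ecc_on: bool) -> float:
    """Return the memory available to applications, in GB."""
    if not ecc_on:
        return total_gb
    return total_gb * (1 - ECC_RESERVED_FRACTION)


if __name__ == "__main__":
    for name, per_gpu_gb in (("S2050", 3.0), ("S2070", 6.0)):
        total = GPUS_PER_SYSTEM * per_gpu_gb
        print(name, "total:", total, "GB, with ECC:",
              usable_memory_gb(total, ecc_on=True), "GB")
```

Under the same assumption the 24 GB S2070 would expose about 21 GB with ECC enabled; the slide itself gives only the S2050 value.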
4. Overview of data parallel accelerators (19)
AMD FireStream-1
Cards
• 9170 (RV670-based, 11/07; shipped 6/08): 2 GB GDDR3, FP32: 500 GFLOPS, FP64: ~200 GFLOPS
• 9250 (RV770-based, 6/08; shipped 10/08): 1 GB GDDR3, FP32: 1000 GFLOPS, FP64: ~300 GFLOPS
Stream Computing SDK
• Version 1.0 (12/07): Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer)
• Version 1.2 (09/08): Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer)
Rapid Mind
Figure: Overview of AMD/ATI’s FireStream family-1
4. Overview of data parallel accelerators (20)
AMD FireStream-2
Cards
• 9350/9370 (RV870-based, 06/10; shipped 10/10): 2/4 GB GDDR5, FP32: 2016/2640 GFLOPS, FP64: 403/528 GFLOPS
Stream Computing SDK
• Version 1.4 (03/09): Brook+
• Version 2.01 (03/10): OpenCL 1.0
• Version 2.1 (05/10): OpenCL 1.0
• Version 2.2 (08/10): OpenCL 1.1
• Version 2.3 (12/10): OpenCL 1.1; renamed to APP in 01/11
APP: Accelerated Parallel Processing
Figure: Overview of AMD/ATI’s FireStream family-2
4. Overview of data parallel accelerators (21)
Nvidia Tesla cards
Core type                | C870         | C1060        | C2050 / C2070
Based on                 | G80          | GT200        | T20 (GF100-based)
Introduction             | 6/07         | 6/08         | 11/09
Core
Core frequency           | 600 MHz      | 602 MHz      | 575 MHz
ALU frequency            | 1350 MHz     | 1296 MHz     | 1150 MHz
No. of SMs (cores)       | 16           | 30           | 14
No. of ALUs              | 128          | 240          | 448
Peak FP32 performance    | 345.6 GFLOPS | 933 GFLOPS   | 1030.4 GFLOPS
Peak FP64 performance    | -            | 77.76 GFLOPS | 515.2 GFLOPS
Memory
Mem. transfer rate (eff) | 1600 Mb/s    | 1600 Mb/s    | 3000 Mb/s
Mem. interface           | 384-bit      | 512-bit      | 384-bit
Mem. bandwidth           | 76.8 GB/s    | 102 GB/s     | 144 GB/s
Mem. size                | 1.5 GB       | 4 GB         | 3 GB / 6 GB
Mem. type                | GDDR3        | GDDR3        | GDDR5
System
ECC                      | -            | -            | ECC
Interface                | PCIe x16     | PCIe 2.0 x16 | PCIe 2.0 x16
Power (max)              | 171 W        | 200 W        | 238 W / 247 W
Table: Main features of Nvidia’s data parallel accelerator cards (Tesla line) [73]
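The peak performance and bandwidth rows of the Tesla table follow from the ALU count, the ALU clock and the memory parameters. The sketch below shows the arithmetic. Its assumptions are not taken from the slide: 2 FP32 operations per ALU per cycle for the C870 and C2050/C2070, 3 for the C1060, 1 FP64 operation per ALU per cycle for the C2050/C2070, and a per-pin memory transfer rate in Mb/s. The function names are illustrative.

```python
# Sketch: deriving peak GFLOPS and memory bandwidth for the Tesla cards.
# ASSUMPTIONS (not stated in the table): operations per ALU per cycle as
# listed below, and the effective transfer rate read as Mb/s per pin.


def peak_gflops(alus: int, alu_clock_mhz: float, flops_per_cycle: int) -> float:
    """Peak rate = number of ALUs x ALU clock x operations per cycle."""
    return alus * alu_clock_mhz * flops_per_cycle / 1000.0


def bandwidth_gb_s(rate_mb_s_per_pin: float, bus_width_bits: int) -> float:
    """Bandwidth = per-pin transfer rate x interface width / 8 bits per byte."""
    return rate_mb_s_per_pin * bus_width_bits / 8 / 1000.0


# name: (ALUs, ALU clock in MHz, assumed FP32 ops/cycle, Mb/s per pin, bus bits)
TESLA_CARDS = {
    "C870": (128, 1350, 2, 1600, 384),
    "C1060": (240, 1296, 3, 1600, 512),
    "C2050/C2070": (448, 1150, 2, 3000, 384),
}

if __name__ == "__main__":
    for name, (alus, clock, ops, rate, bus) in TESLA_CARDS.items():
        print(name,
              "FP32:", peak_gflops(alus, clock, ops), "GFLOPS,",
              "bandwidth:", bandwidth_gb_s(rate, bus), "GB/s")
    # Assumed FP64 rate of the Fermi cards: half the FP32 rate.
    print("C2050/C2070 FP64:", peak_gflops(448, 1150, 1), "GFLOPS")
```

With these assumptions the computed values agree with the table entries (345.6, about 933 and 1030.4 GFLOPS for FP32; 515.2 GFLOPS for Fermi FP64).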
4. Overview of data parallel accelerators (22)
AMD FireStream cards
Core type                | 9170         | 9250         | 9350         | 9370
Based on                 | RV670        | RV770        | RV870        | RV870
Introduction             | 11/07        | 6/08         | 10/10        | 10/10
Core
Core frequency           | 800 MHz      | 625 MHz      | 700 MHz      | 825 MHz
ALU frequency            | 800 MHz      | 625 MHz      | 700 MHz      | 825 MHz
No. of EUs               | 320          | 800          | 1440         | 1600
Peak FP32 performance    | 512 GFLOPS   | 1 TFLOPS     | 2016 GFLOPS  | 2640 GFLOPS
Peak FP64 performance    | ~200 GFLOPS  | ~250 GFLOPS  | 403.2 GFLOPS | 528 GFLOPS
Memory
Mem. transfer rate (eff) | 1600 Mb/s    | 1986 Mb/s    | 4000 Mb/s    | 4600 Mb/s
Mem. interface           | 256-bit      | 256-bit      | 256-bit      | 256-bit
Mem. bandwidth           | 51.2 GB/s    | 63.5 GB/s    | 128 GB/s     | 147.2 GB/s
Mem. size                | 2 GB         | 1 GB         | 2 GB         | 4 GB
Mem. type                | GDDR3        | GDDR3        | GDDR5        | GDDR5
System
ECC                      | -            | -            | -            | -
Interface                | PCIe 2.0 x16 | PCIe 2.0 x16 | PCIe 2.0 x16 | PCIe 2.0 x16
Power (max)              | 150 W        | 150 W        | 150 W        | 225 W
Table: Main features of AMD/ATI’s data parallel accelerator cards (FireStream line) [67]
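The FP32 and bandwidth rows of the FireStream table can be cross-checked against the EU counts, the core frequencies and the 256-bit memory interface. The following sketch does this for all four cards. It assumes 2 FP32 operations per EU per cycle and a per-pin transfer rate in Mb/s; neither is stated in the table, and all names are illustrative.

```python
# Sketch: consistency check of the FireStream table rows.
# ASSUMPTIONS (not stated in the table): 2 FP32 operations per EU per
# cycle, EUs clocked at the core frequency, transfer rate in Mb/s per pin.
FLOPS_PER_EU_PER_CYCLE = 2
BUS_WIDTH_BITS = 256

# name: (EUs, core clock MHz, quoted FP32 GFLOPS, Mb/s per pin, quoted GB/s)
FIRESTREAM_CARDS = {
    "9170": (320, 800, 512.0, 1600, 51.2),
    "9250": (800, 625, 1000.0, 1986, 63.5),
    "9350": (1440, 700, 2016.0, 4000, 128.0),
    "9370": (1600, 825, 2640.0, 4600, 147.2),
}


def derived(eus: int, clock_mhz: float, rate_mb_s: float) -> tuple[float, float]:
    """Return (peak FP32 GFLOPS, memory bandwidth in GB/s)."""
    fp32 = eus * clock_mhz * FLOPS_PER_EU_PER_CYCLE / 1000.0
    bandwidth = rate_mb_s * BUS_WIDTH_BITS / 8 / 1000.0
    return fp32, bandwidth


def check(tolerance: float = 0.01) -> dict[str, bool]:
    """Flag each card whose quoted values lie within `tolerance` (relative)."""
    result = {}
    for name, (eus, clock, fp32_q, rate, bw_q) in FIRESTREAM_CARDS.items():
        fp32, bw = derived(eus, clock, rate)
        result[name] = (abs(fp32 - fp32_q) <= tolerance * fp32_q
                        and abs(bw - bw_q) <= tolerance * bw_q)
    return result


if __name__ == "__main__":
    print(check())
```

For the two RV870-based cards the quoted FP64 figures equal one fifth of the FP32 figures (2016/5 = 403.2, 2640/5 = 528); the older cards are quoted only approximately, so they are left out of the check.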
4. Overview of data parallel accelerators (23)
Price relations (as of 1/2011)
Nvidia Tesla
• C2050: ~2000 $
• C2070: ~4000 $
• S2050: ~13 000 $
• S2070: ~19 000 $
Nvidia GTX
• GTX580: ~500 $
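One way to read these prices is as cost per unit of peak double-precision performance. The sketch below combines the approximate prices above with the peak DP figures quoted earlier in this section (515 GFLOPS for the C2050/C2070, 2060 GFLOPS for the S2050/S2070). It is a rough illustration, not a benchmark, and the names are made up for this example.

```python
# Sketch: approximate price per peak DP GFLOPS for the Tesla devices.
# Prices are the rounded 1/2011 values listed above; DP peaks are the
# figures quoted for the Fermi-based Tesla devices in this section.
TESLA_PRICES = {
    # name: (approximate price in USD, peak DP GFLOPS)
    "C2050": (2000, 515),
    "C2070": (4000, 515),
    "S2050": (13000, 2060),
    "S2070": (19000, 2060),
}


def usd_per_dp_gflops(price_usd: float, dp_gflops: float) -> float:
    """Return the price paid per GFLOPS of peak DP performance."""
    return price_usd / dp_gflops


if __name__ == "__main__":
    for name, (price, gflops) in TESLA_PRICES.items():
        print(f"{name}: {usd_per_dp_gflops(price, gflops):.2f} $/GFLOPS")
```

Since the C2050 and C2070 have the same peak rate and differ in memory size (3 GB vs 6 GB), the higher C2070 price reflects the larger memory rather than additional compute.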
1. Introduction (8)
Background slides for intro to SIMT processing
1. Introduction (8)
Figure: Peak SP FP performance of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [11]
1. Introduction (9)
Figure: Bandwidth values of Nvidia’s GPUs vs Intel’s P4 and Core2 processors [11]
5. References
5. References (1)
5. References (to all four sections)
[1]: Torricelli F., AMD in HPC, HPC07, 2007
http://www.altairhyperworks.co.uk/html/en-GB/keynote2/Torricelli_AMD.pdf
[2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia
[3]: AMD FireStream 9170, 2008
http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html
[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008,
Nvidia,
http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf
[5]: Tesla S870 GPU Computing System, Specification, Nvidia, March 13 2008,
http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf
[6]: Torres G., Nvidia Tesla Technology, Nov. 2007,
http://www.hardwaresecrets.com/article/495
[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD
[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU,
ASPLOS 2006, June 2008
[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007
http://ati.amd.com/developer/techpapers.html
5. References (2)
[10]: Compute Abstraction Layer (CAL) Technology – Intermediate Language (IL),
Version 2.0, AMD, Oct. 2008
[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,
June 2008, Nvidia
[12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007,
University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/
lectures/lecture7-threading%20hardware.ppt
[13]: Goto H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008,
http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf
[14]: Goto H., Nvidia G80, PC Watch, April 16 2007,
http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm
[15]: Goto H., GeForce 8800GT (G92), PC Watch, Oct. 31 2007,
http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf
[16]: Goto H., NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008,
http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm
[17]: Shrout R., Nvidia GT200 Revealed – GeForce GTX 280 and GTX 260 Review,
PC Perspective, June 16 2008,
http://www.pcper.com/article.php?aid=577&type=expert&pid=3
5. References (3)
[18]: http://en.wikipedia.org/wiki/DirectX
[19]: Dietrich S., “Shader Model 3.0,” April 2004, Nvidia,
http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf
[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006,
Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html
[21]: Patidar S. & al., “Exploiting the Shader Model 4.0 Architecture,” Center for
Visual Information Technology, IIIT Hyderabad, March 2007,
http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf
[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia,
http://www.nvidia.com/page/8800_tech_briefs.html
[23]: Goto H., Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch,
http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf
[24]: Fatahalian K., “From Shader Code to a Teraflop: How Shader Cores Work,”
Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008
[25]: Kanter D., “NVIDIA’s GT200: Inside a Parallel Processor,” Real World Technologies,
Sept. 8 2008, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242
[26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide,
Version 1.1, Nov. 2007, Nvidia
5. References (4)
[27]: Seiler L. & al., “Larrabee: A Many-Core x86 Architecture for Visual Computing,”
ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008
[28]: Goto H., “Larrabee”, PC Watch, Oct. 17, 2008,
http://pc.watch.impress.co.jp/docs/2008/1017/kaigai472.htm
[29]: Shrout R., IDF Fall 2007 Keynote, PC Perspective, Sept. 18, 2007,
http://www.pcper.com/article.php?aid=453
[30]: Stokes J., “Larrabee: Intel’s biggest leap ahead since the Pentium Pro,” Ars Technica,
Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabeeintels-biggest-leap-ahead-since-the-pentium-pro.html
[31]: Shimpi A. L. & Wilson D., “Intel's Larrabee Architecture Disclosure: A Calculated
First Move,” AnandTech, Aug. 4 2008,
http://www.anandtech.com/showdoc.aspx?i=3367&p=2
[32]: Hester P., “Multi-Core and Beyond: Evolving the x86 Architecture,” Hot Chips 19,
Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf
[33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1
http://ati.amd.com/technology/streamcomputing/
Stream_Computing_User_Guide.pdf
[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf. Aug. 2007,
http://www.graphicshardware.org/previous/www_2007/presentations/
doggett-radeon2900-gh07.pdf
5. References (5)
[35]: Mantor M., “AMD’s Radeon HD 2900,” Hot Chips 19, Aug. 2007,
http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf
[36]: Houston M., “Anatomy of AMD’s TeraScale Graphics Engine,” SIGGRAPH 2008,
http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf
[37]: Mantor M., “Entering the Golden Age of Heterogeneous Computing,” PEEP 2008,
http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf
[38]: Goto H., RV770 Overview, PC Watch, July 02 2008,
http://pc.watch.impress.co.jp/docs/2008/0702/kaigai_09.pdf
[39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies, Sept 30 2009,
http://www.realworldtech.com/includes/templates/articles.cfm?
ArticleID=RWT093009110932&mode=print
[40]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed,
Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1
[41]: Wasson S., AMD's Radeon HD 5870 graphics processor,
Tech Report, Sept 23 2009, http://techreport.com/articles.x/17618/1
[42]: Bell B., ATI Radeon HD 5870 Performance Preview,
Firing Squad, Sept 22 2009, http://www.firingsquad.com/hardware/
ati_radeon_hd_5870_performance_preview/default.asp
5. References (6)
[43]: Nvidia CUDA C Programming Guide, Version 3.2, October 22 2010
http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/
CUDA_C_Programming_Guide.pdf
[44]: Hwu W., Kirk D., Nvidia, Advanced Algorithmic Techniques for GPUs, Berkeley,
January 24-25 2011
http://iccs.lbl.gov/assets/docs/2011-01-24/lecture1_computational_thinking_
Berkeley_2011.pdf
[45]: Wasson S., Nvidia's GeForce GTX 580 graphics processor
Tech Report, Nov 9 2010, http://techreport.com/articles.x/19934/1
[46]: Shrout R., Nvidia GeForce 8800 GTX Review – DX10 and Unified Architecture,
PC Perspective, Nov 8 2006
http://swfan.com/reviews/graphics-cards/nvidia-geforce-8800-gtx-review-dx10and-unified-architecture/g80-architecture
[47]: Wasson S., Nvidia's GeForce GTX 480 and 470 graphics processors
Tech Report, March 31 2010, http://techreport.com/articles.x/18682
[48]: Gangar K., Tianhe-1A from China is world’s fastest Supercomputer
Tech Ticker, Oct 28 2010, http://techtickerblog.com/2010/10/28/tianhe-1afrom-china-is-worlds-fastest-supercomputer/
[49]: Smalley T., ATI Radeon HD 5870 Architecture Analysis, Bit-tech, Sept 30 2009,
http://www.bit-tech.net/hardware/graphics/2009/09/30/ati-radeon-hd-5870architecture-analysis/8
5. References (7)
[50]: Nvidia Compute PTX: Parallel Thread Execution, ISA, Version 2.2, Oct 14 2010,
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/
ptx_isa_2.2.pdf
[51]: Kanter D., Intel's Sandy Bridge Microarchitecture, Real World Technologies,
Sept 25 2010
http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4
[52]: Nvidia CUDA™ Fermi™ Compatibility Guide for CUDA Applications, Version 1.0,
February 2010, http://developer.download.nvidia.com/compute/cuda/3_0/
docs/NVIDIA_FermiCompatibilityGuide.pdf
[53]: Hallock R., Dissecting Fermi, NVIDIA’s next generation GPU, Icrontic, Sept 30 2009,
http://tech.icrontic.com/articles/nvidia_fermi_dissected/
[54]: Kirsch N., NVIDIA GF100 Fermi Architecture and Performance Preview,
Legit Reviews, Jan 20 2010, http://www.legitreviews.com/article/1193/2/
[55]: Hoenig M., NVIDIA GeForce GTX 460 SE 1GB Review, Hardware Canucks, Nov 21 2010,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/38178nvidia-geforce-gtx-460-se-1gb-review-2.html
[56]: Glaskowsky P. N., Nvidia’s Fermi: The First Complete GPU Computing Architecture
Sept 2009, http://www.nvidia.com/content/PDF/fermi_white_papers/
P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf
[57]: Kirk D. & Hwu W. W., ECE498AL Lectures 4: CUDA Threads – Part 2, 2007-2009,
University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/
al/lectures/lecture4%20cuda%20threads%20part2%20spring%202009.ppt
5. References (8)
[58]: Nvidia’s Next Generation CUDA™ Compute Architecture: Fermi™, Version 1.1, 2009
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_
Architecture_Whitepaper.pdf
[59]: Kirk D. & Hwu W. W., ECE498AL Lectures 8: Threading Hardware in G80, 2007-2009,
University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/
al/lectures/lecture8-threading-hardware-spring-2009.ppt
[60]: Wong H., Papadopoulou M.M., Sadooghi-Alvandi M., Moshovos A., Demystifying GPU
Microarchitecture through Microbenchmarking, University of Toronto, 2010,
http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf
[61]: Pettersson J., Wainwright I., Radar Signal Processing with Graphics Processors
(GPUs), SAAB Technologies, Jan 27 2010,
http://www.hpcsweden.se/files/RadarSignalProcessingwithGraphicsProcessors.pdf
[62]: Smith R., NVIDIA’s GeForce GTX 460: The $200 King, AnandTech, July 11 2010,
http://www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2
[63]: Angelini C., GeForce GTX 580 And GF110: The Way Nvidia Meant It To Be Played,
Tom’s Hardware, Nov 9 2010, http://www.tomshardware.com/reviews/geforcegtx-580-gf110-geforce-gtx-480,2781.html
[64]: NVIDIA G80: Architecture and GPU Analysis, Beyond3D, Nov. 8 2006,
http://www.beyond3d.com/content/reviews/1/11
[65]: Kirk D. & Hwu W., Programming Massively Parallel Processors, 2008
Chapter 3: CUDA Threads, http://courses.engr.illinois.edu/ece498/al/textbook/
Chapter3-CudaThreadingModel.pdf
5. References (9)
[66]: NVIDIA Forums: General CUDA GPU Computing Discussion, 2008
http://forums.nvidia.com/index.php?showtopic=73056
[67]: Wikipedia: Comparison of AMD graphics processing units, 2011
http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units
[68]: Nvidia OpenCL Overview, 2009
http://gpgpu.org/wp/wp-content/uploads/2009/06/05-OpenCLIntroduction.pdf
[69]: Chester E., Nvidia GeForce GTX 460 1GB Fermi Review, Trusted Reviews,
July 13 2010, http://www.trustedreviews.com/graphics/review/2010/07/13/
Nvidia-GeForce-GTX-460-1GB-Fermi/p1
[70]: NVIDIA GF100 Architecture Details, Geeks3D, 2008-2010,
http://www.geeks3d.com/20100118/nvidia-gf100-architecture-details/
[71]: Murad A., Nvidia Tesla C2050 and C2070 Cards, Science and Technology Zone,
Nov. 17 2009,
http://forum.xcitefun.net/nvidia-tesla-c2050-and-c2070-cards-t39578.html
[72]: New NVIDIA Tesla GPUs Reduce Cost Of Supercomputing By A Factor Of 10,
Nvidia, Nov. 16 2009
http://www.nvidia.com/object/io_1258360868914.html
[73]: Nvidia Tesla, Wikipedia, http://en.wikipedia.org/wiki/Nvidia_Tesla
[74]: Tesla M2050 and Tesla M2070/M2070Q Dual-Slot Computing Processor Modules,
Board Specification, v. 03, Nvidia, Aug. 2010,
http://www.nvidia.asia/docs/IO/43395/BD-05238-001_v03.pdf
5. References (10)
[75]: Tesla 1U GPU Computing System, Product Specification, v. 04, Nvidia, June 2009,
http://www.nvidia.com/docs/IO/43395/SP-04975-001-v04.pdf
[76]: Kanter D., The Case for ECC Memory in Nvidia’s Next GPU, Real World Technologies,
Aug. 19 2009,
http://www.realworldtech.com/page.cfm?ArticleID=RWT081909212132
[77]: Hoenig M., Nvidia GeForce 580 Review, HardwareCanucks, Nov. 8, 2010,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/
37789-nvidia-geforce-gtx-580-review-5.html
[78]: Angelini C., AMD Radeon HD 6990 4 GB Review, Tom’s Hardware, March 8, 2011,
http://www.tomshardware.com/reviews/radeon-hd-6990-antilles-crossfire,2878.html
[79]: Tom’s Hardware Gallery,
http://www.tomshardware.com/gallery/two-cypress-gpus,0101-2303697179-0-0-0-jpg-.html
[80]: Tom’s Hardware Gallery,
http://www.tomshardware.com/gallery/Bare-Radeon-HD-5970,0101-2303497179-0-0-0-jpg-.html
[81]: CUDA, Wikipedia, http://en.wikipedia.org/wiki/CUDA
[82]: GeForce Graphics Processors, Nvidia, http://www.nvidia.com/object/geforce_family.html
[83]: Next Gen CUDA GPU Architecture, Code-Named “Fermi”, Press Presentation at
Nvidia’s 2009 GPU Technology Conference, (GTC), Sept. 30 2009,
http://www.nvidia.com/object/gpu_tech_conf_press_room.html
5. References (10)
[84]: Tom’s Hardware Gallery,
http://www.tomshardware.com/gallery/SM,0101-110801-0-14-15-1-jpg-.html
[85]: Butler, M., Bulldozer, a new approach to multithreaded compute performance,
Hot Chips 22, Aug. 24 2010
http://www.hotchips.org/index.php?page=hot-chips-22
[86]: Voicu A., NVIDIA Fermi GPU and Architecture Analysis, Beyond 3D, 23rd Oct 2010,
http://www.beyond3d.com/content/reviews/55/1
[87]: Chu M. M., GPU Computing: Past, Present and Future with ATI Stream Technology,
AMD, March 9 2010,
http://developer.amd.com/gpu_assets/GPU%20Computing%20-%20Past%20
Present%20and%20Future%20with%20ATI%20Stream%20Technology.pdf
[88]: Smith R., AMD's Radeon HD 6970 & Radeon HD 6950: Paving The Future For AMD,
AnandTech, Dec. 15 2010,
http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950
[89]: Christian, AMD renames ATI Stream SDK, updates its with APU, OpenCL 1.1 support,
Jan. 27 2011, http://www.tcmagazine.com/tcm/news/software/34765/
amd-renames-ati-stream-sdk-updates-its-apu-opencl-11-support
[90]: User Guide: AMD Stream Computing, Revision 1.3.0, Dec. 2008,
http://www.ele.uri.edu/courses/ele408/StreamGPU.pdf
[91]: Programming Guide: ATI Stream Computing Compute Abstraction Layer (CAL),
Revision 2.01, AMD, March 2010, http://developer.amd.com/gpu_assets/ATI_Stream_
SDK_CAL_Programming_Guide_v2.0.pdf
5. References (11)
[92]: Technical Overview: AMD Stream Computing, Revision 1.2.1, Oct. 2008,
http://www.cct.lsu.edu/~scheinin/Parallel/StreamComputingOverview.pdf
[93]: AMD Accelerated Parallel Processing OpenCL Programming Guide, Revision 1.2,
AMD, Jan. 2011, http://developer.amd.com/gpu/amdappsdk/assets/
AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
[94]: An Introduction to OpenCL, AMD, http://www.amd.com/us/products/technologies/
stream-technology/opencl/pages/opencl-intro.aspx
[95]: Behr D., Introduction to OpenCL PPAM 2009, Sept. 15 2009,
http://gpgpu.org/wp/wp-content/uploads/2009/09/B1-OpenCL-Introduction.pdf
[96]: Gohara D.W. PhD, OpenCL Episode 2 – OpenCL Fundamentals, Aug. 26 2009,
MacResearch, http://www.macresearch.org/files/opencl/Episode_2.pdf
[97]: Kanter D., AMD's Cayman GPU Architecture, Real World Technologies, Dec. 14 2010,
http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=3
[98]: Hoenig M., AMD Radeon HD 6970 and HD 6950 Review, Hardware Canucks,
Dec. 14 2010, http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/
38899-amd-radeon-hd-6970-hd-6950-review-3.html
[99]: Reference Guide: AMD HD 6900 Series Instruction Set Architecture, Revision 1.0,
Febr. 2011, http://developer.amd.com/gpu/AMDAPPSDK/assets/
AMD_HD_6900_Series_Instruction_Set_Architecture.pdf
[100]: Howes L., AMD and OpenCL, AMD Application Engineering, Dec. 2010,
http://www.many-core.group.cam.ac.uk/ukgpucc2/talks/Howes.pdf
5. References (12)
[101]: ATI R700-Family Instruction Set Architecture Reference Guide, Revision 1.0a,
AMD, Febr. 2011, http://developer.amd.com/gpu_assets/R700-Family_Instruction_
Set_Architecture.pdf
[102]: Piazza T., Dr. Jiang H., Microarchitecture Codename Sandy Bridge: Processor
Graphics, Presentation ARCS002, IDF San Francisco, Sept. 2010
[103]: Bhaniramka P., Introduction to Compute Abstraction Layer (CAL),
http://coachk.cs.ucf.edu/courses/CDA6938/AMD_course/M5%20%20Introduction%20to%20CAL.pdf
[104]: Villmow M., ATI Stream Computing, ATI Intermediate Language (IL),
May 30 2008, http://developer.amd.com/gpu/amdappsdk/assets/ATI%20Stream
%20Computing%20-%20ATI%20Intermediate%20Language.ppt#547,9
[105]: Reference Guide: AMD Accelerated Parallel Processing Technology,
AMD Intermediate Language (IL), Revision 2.0e, March 2011,
http://developer.amd.com/gpu/AMDAPPSDK/assets/AMD_Intermediate_Language
_(IL)_Specification_v2.pdf
[106]: Hensley J., Hardware and Compute Abstraction Layers for Accelerated Computing
Using Graphics Hardware and Conventional CPUs, AMD, 2007,
http://www.ll.mit.edu/HPEC/agendas/proc07/Day3/10_Hensley_Abstract.pdf
[107]: Hensley J., Yang J., Compute Abstraction Layer, AMD, Febr. 1 2008,
http://coachk.cs.ucf.edu/courses/CDA6938/s08/UCF-2008-02-01a.pdf
[108]: AMD Accelerated Parallel Processing (APP) SDK, AMD Developer Central,
http://developer.amd.com/gpu/amdappsdk/pages/default.aspx
5. References (13)
[109]: OpenCL™ and the AMD APP SDK v2.4, AMD Developer Central, April 6 2011,
http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-AMD-APPSDK.aspx
[110]: Stone J., An Introduction to OpenCL, U. of Illinois at Urbana-Champaign, Dec. 2009,
http://www.ks.uiuc.edu/Research/gpu/gpucomputing.net
[111]: Introduction to OpenCL Programming, AMD, No. 137-41768-10, Rev. A, May 2010,
http://developer.amd.com/zones/OpenCLZone/courses/Documents/Introduction_
to_OpenCL_Programming%20Training_Guide%20(201005).pdf
[112]: Evergreen Family Instruction Set Architecture, Instructions and Microcode Reference
Guide, AMD, Febr. 2011, http://developer.amd.com/gpu/amdappsdk/assets/
AMD_Evergreen-Family_Instruction_Set_Architecture.pdf
[113]: Intel 810 Chipset: Intel 82810/82810-DC100 Graphics and Memory Controller Hub
(GMCH) Datasheet, June 1999
ftp://download.intel.com/design/chipsets/datashts/29065602.pdf
[114]: Huynh A.T., AMD Announces "Fusion" CPU/GPU Program, Daily Tech, Oct. 25 2006,
http://www.dailytech.com/article.aspx?newsid=4696
[115]: Grim B., AMD Fusion Family of APUs, Dec. 7 2010, http://www.mytechnology.eu/wpcontent/uploads/2011/01/AMD-Fusion-Press-Tour_EMEA.pdf
[116]: Newell D., AMD Financial Analyst Day, Nov. 9 2010,
http://www.rumorpedia.net/wp-content/uploads/2010/11/rumorpedia02.jpg
[117]: De Maesschalck T., AMD starts shipping Ontario and Zacate CPUs, DarkVision
Hardware, Nov. 10 2010, http://www.dvhardware.net/article46449.html
5. References (14)
[118]: AMD Accelerated Parallel Processing (APP) SDK (formerly ATI Stream) with
OpenCL™ 1.1 Support
[119]: Burgess B., “Bobcat” AMD’s New Low Power x86 Core Architecture, Aug. 24 2010,
http://www.hotchips.org/uploads/archive22/HC22.24.730-Burgess-AMDBobcat-x86.pdf
[120]: AMD Ontario APU pictures, Xtreme Systems, Sept. 3 2010,
http://www.xtremesystems.org/forums/showthread.php?t=258499
[121]: Stokes J., AMD reveals Fusion CPU+GPU, to challenge Intel in laptops,
Febr. 8 2010, http://arstechnica.com/business/news/2010/02/amd-revealsfusion-cpugpu-to-challege-intel-in-laptops.ars
[122]: AMD Unveils Future of Computing at Annual Financial Analyst Day, CDRinfo,
Nov. 10 2010, http://www.cdrinfo.com/sections/news/Details.aspx?NewsId=28748
[123]: Shimpi A. L., The Intel Core i3 530 Review - Great for Overclockers & Gamers,
AnandTech, Jan. 22 2010, http://www.anandtech.com/show/2921
[124]: Hagedoorn H., Mohammad S., Barling I. R., Core i5 2500K and Core i7 2600K review,
Jan. 3 2011,
http://www.guru3d.com/article/core-i5-2500k-and-core-i7-2600k-review/2
[125]: Wikipedia: Intel GMA, 2011, http://en.wikipedia.org/wiki/Intel_GMA
[126]: Shimpi A. L., The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and
Core i3-2100 Tested, AnandTech, Jan. 3 2011,
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i72600k-i5-2500k-core-i3-2100-tested/11
5. References (15)
[127]: Marques T., AMD Ontario, Zacate Die Sizes - Take 2 , Sept. 14 2010,
http://www.siliconmadness.com/2010/09/amd-ontario-zacate-die-sizestake-2.html
[128]: De Vries H., AMD Bulldozer, 8 core processor, Nov. 24 2010,
http://chip-architect.com/
[129]: Intel® 845G/845GL/845GV Chipset Datasheet: Intel® 82845G/82845GL/82845GV
Graphics and Memory Controller Hub (GMCH), May 2002
http://www.intel.com/design/chipsets/datashts/290746.htm
[130]: Huynh A. T., Final AMD "Stars" Models Unveiled, Daily Tech, May 4 2007,
http://www.dailytech.com/Final+AMD+Stars+Models+Unveiled+/article7157.htm
[131]: AMD Fusion, Wikipedia, http://en.wikipedia.org/wiki/AMD_Fusion
[132]: Nita S., AMD Llano APU to Get Dual-GPU Technology Similar to Hybrid CrossFire,
Softpedia, Jan. 21 2011, http://news.softpedia.com/news/AMD-Llano-APU-toGet-Dual-GPU-Technology-Similar-to-Hybrid-CrossFire-179740.shtml
[133]: Jotwani R., Sundaram S., Kosonocky S., Schaefer A., Andrade V. F., Novak A.,
Naffziger S., An x86-64 Core in 32 nm SOI CMOS, IEEE Xplore, Jan. 2011,
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5624589
[134]: Karmehed A., The graphical performance of the AMD A series APUs, Nordic
Hardware, March 16 2011,
http://www.nordichardware.com/news/69-cpu-chipset/42650-the-graphicalperformance-of-the-amd-a-series-apus.html
5. References (16)
[135]: Butler M., “Bulldozer” A new approach to multithreaded compute performance,
Aug. 24 2010, http://www.hotchips.org/uploads/archive22/HC22.24.720-Butler
-AMD-Bulldozer.pdf
[136]: “Bulldozer” and “Bobcat” AMD’s Latest x86 Core Innovations, HotChips22,
http://www.slideshare.net/AMDUnprocessed/amd-hot-chips-bulldozer-bobcat
-presentation-5041615
[137]: Altavilla D., Intel Arrandale Core i5 and Core i3 Mobile Unveiled, Hot Hardware,
Jan. 04 2010,
http://hothardware.com/Reviews/Intel-Arrandale-Core-i5-and-Core-i3-Mobile-Unveiled/
[138]: Dodeja A., Intel Arrandale, High Performance for the Masses, Hot Hardware,
Review of the IDF San Francisco, Sept. 2009,
http://akshaydodeja.com/intel-arrandale-high-performance-for-the-mass
[139]: Shimpi A., Intel Arrandale: 32nm for Notebooks, Core i5 540M Reviewed,
AnandTech, Jan. 4 2010,
http://www.anandtech.com/show/2902
[140]: Chiappeta M., Intel Clarkdale Core i5 Desktop Processor Debuts, Hot Hardware,
Jan. 03 2010,
http://hothardware.com/Articles/Intel-Clarkdale-Core-i5-Desktop-Processor-Debuts/
[141]: Thomas S. L., Desktop Platform Design Overview for Intel Microarchitecture (Nehalem)
Based Platform, Presentation ARCS001, IDF 2009
[142]: Kahn O., Valentine B., Microarchitecture Codename Sandy Bridge: New Processor
Innovations, Presentation ARCS001, IDF San Francisco Sept. 2010
5. References (17)
[143]: Valich T., Intel's "Anti AMD Fusion" Sandy Bridge CPU tapes out, July 5 2009,
http://www.brightsideofnews.com/news/2009/7/5/intels-anti-amd-fusion-sandybridge-cpu-tapes-out.aspx