Parallel Implementation of Irregular Terrain Model on Nvidia GPU

Multicores, Multiprocessors, and Clusters

Computer Architecture
• Applications suggest how to improve technology and provide revenue to fund development.
• Improved technologies make new applications possible.
• Cost of software development makes compatibility a major force in the market.
Crossroads
First microprocessor: Intel 4004, 1971
• 4-bit accumulator architecture
• 8 μm pMOS
• 2,300 transistors
• 3 × 4 mm² die
• 750 kHz clock
• 8-16 cycles/instruction
Hardware
• Team from IBM building PC prototypes in 1979
• Motorola 68000 chosen initially, but the 68000 was late
• 8088 is the 8-bit bus version of the 8086 => allows a cheaper system
• Estimated sales: 250,000
• 100,000,000s sold
[Personal Computing Ad, 11/81]
Crossroads
DYSEAC, the first mobile computer!
• 900 vacuum tubes
• Memory of 512 words of 45 bits each
• Carried in two tractor trailers, 12 tons + 8 tons
• Built for the US Army Signal Corps
End of Uniprocessors
Intel cancelled its high-performance uniprocessor project and joined IBM and Sun in moving to multiple processors.
Trends
• Shrinking of transistor sizes: 250 nm (1997) → 130 nm (2002) → 65 nm (2007) → 32 nm (2010) → 28 nm (2011, AMD GPU, Xilinx FPGA) → 22 nm (2011, Intel Ivy Bridge, die shrink of the Sandy Bridge architecture)
• Transistor density increases by 35% per year and die size increases by 10-20% per year… more cores!
Trends
Growth trends, 2004-2010:
• Transistors: 1.43× / year
• Cores: 1.2-1.4×
• Performance: 1.15×
• Frequency: 1.05×
• Power: 1.04×
Source: Micron University Symp.
Crossroads
Textbook editions: 1996 (when I took this class!), 2002, 2009, 2011.
Changes across these editions: ILP reduced to one chapter; shift to multicore; reduced emphasis on ILP and introduction of thread-level parallelism; Request-, Data-, Thread-, and Instruction-Level parallelism; introduction of GPUs, cloud computing, smart phones, and tablets!
Introduction
• Goal: connecting multiple computers
to get higher performance
– Multiprocessors
– Scalability, availability, power efficiency
• Job-level (process-level) parallelism
– High throughput for independent jobs
• Parallel processing program
– Single program run on multiple processors
• Multicore microprocessors
– Chips with multiple processors (cores)
Parallel Programming
• Parallel software is the problem
• Need to get significant performance
improvement
– Otherwise, just use a faster uniprocessor,
since it’s easier!
• Difficulties
– Partitioning
– Coordination
– Communications overhead
Parallel Programming
• MPI, OpenMP, and Stream Processing
are methods of distributing workloads
on computers.
• Key: mapping the program's architecture onto the target hardware architecture
Shared Memory
• SMP: shared memory multiprocessor
– Small number of cores
– Share single memory with uniform memory latency (symmetric)
• SGI Altix UV 1000 (ARDC, December 2011)
  – 58 nodes, 928 cores
  – up to 2,560 cores, with architectural support for up to 327,680
  – support for up to 16 TB of global shared memory
  – Programming: parallel OpenMP or threaded
Example: Sum Reduction
• Sum 100,000 numbers on a 100-processor UMA machine
  – Each processor has ID: 0 ≤ Pn ≤ 99
  – Partition: 1000 numbers per processor
  – Initial summation on each processor:
    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];
• Now need to add these partial sums
  – Reduction: divide and conquer
  – Half the processors add pairs, then a quarter, …
  – Need to synchronize between reduction steps
Example: Sum Reduction
half = 100;
repeat
  synch();
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
    /* Conditional sum needed when half is odd;
       Processor 0 gets the missing element */
  half = half/2; /* dividing line on who sums */
  if (Pn < half)
    sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
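The same tree reduction can be written as a runnable C/OpenMP sketch, with barriers standing in for synch(). This is an illustrative translation of the pseudocode above (not part of the original slides), assuming 100 threads and partial sums already computed:

#include <omp.h>

#define P 100              /* number of processors/threads assumed by the example */
double sum[P];             /* partial sums, filled as on the previous slide */

void tree_reduce(void)
{
    #pragma omp parallel num_threads(P)
    {
        int Pn = omp_get_thread_num();
        int half = P;
        while (half > 1) {
            #pragma omp barrier                  /* synch(): wait for the previous step */
            if (half % 2 != 0 && Pn == 0)        /* odd count: processor 0 picks up */
                sum[0] = sum[0] + sum[half - 1]; /* the element that would be dropped */
            half = half / 2;                     /* dividing line on who sums */
            if (Pn < half)
                sum[Pn] = sum[Pn] + sum[Pn + half];
        }
    }
    /* the final total ends up in sum[0] */
}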
Distributed Memory
• Distributed shared memory (DSM)
– Memory distributed among processors
– Non-uniform memory access/latency (NUMA)
– Processors connected via direct (switched) and non-direct (multihop) interconnection networks
– Hardware sends/receives messages between processors
Distributed Memory
• URDC SGI ICE 8400
  – 179 nodes, 2112 cores
  – up to 768 cores in a single rack, scalable from 32 to tens of thousands of nodes
  – lower cost than SMP!
• Distributed memory system (cluster), typically programmed using MPI.
Sum Reduction (Again)
• Sum 100,000 numbers on 100 processors
• First distribute 1000 numbers to each processor
  – Then do the partial sums locally:
    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
      sum = sum + AN[i]; /* AN: this processor's local portion of the numbers */
• Reduction
  – Half the processors send, the other half receive and add
  – Then a quarter send, a quarter receive and add, …
Sum Reduction (Again)
• Given send() and receive() operations:
limit = 100; half = 100; /* 100 processors */
repeat
  half = (half+1)/2;     /* send vs. receive dividing line */
  if (Pn >= half && Pn < limit)
    send(Pn - half, sum);
  if (Pn < (limit/2))
    sum = sum + receive();
  limit = half;          /* upper limit of senders */
until (half == 1);       /* exit with final sum */
  – send/receive also provide synchronization
  – Assumes send/receive take a similar time to addition
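The same pattern can be expressed with MPI point-to-point calls (MPI itself is introduced a few slides later). A hedged sketch, assuming each process has already computed its local partial sum:

#include <mpi.h>

double reduce_sum(double sum)
{
    int Pn, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int limit = nprocs, half = nprocs;
    while (half > 1) {
        half = (half + 1) / 2;                /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)         /* upper half sends its sum down */
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit / 2) {                 /* lower half receives and adds */
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, Pn + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum = sum + other;
        }
        limit = half;                         /* upper limit of senders */
    }
    return sum;                               /* rank 0 ends up with the total */
}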
Matrix Multiplication
[Figure: 3×3 blocked view of C = A × B: result cells C0…C8 are computed from the rows of A (A0…A8) and the columns of B (B0…B8)]
Message Passing Interface
• language-independent communications
protocol
– provides a means to enable communication
between different CPUs
• point-to-point and collective communication
• is a specification, not an implementation
• standard for communication among processes on a distributed-memory system
  – though its usage is not restricted to such systems
• processes do not have anything in common,
and each has its own memory space.
Message Passing Interface
• set of subroutines used explicitly to
communicate between processes.
• MPI programs are truly "multi-processing"
• Parallelization cannot be done automatically or semi-automatically as in "multi-threading" programs
• Function and subroutine calls have to be inserted into the code
  – these calls alter the algorithm of the code with respect to the serial version
Is it a curse?
• The need to include the parallelism explicitly in the program is
  – a curse: more work, and it requires more planning than multithreading,
  – and a blessing: it often leads to more reliable and scalable code, since the behavior is in the hands of the programmer.
– Well-written MPI codes can be made to scale for
thousands of CPUs.
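In practice, the hand-coded send/receive reduction shown earlier is usually replaced by an MPI collective. A minimal, self-contained sketch (illustrative; the local work is a stand-in):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* stand-in for a locally computed partial sum over this rank's slice */
    double partial = (double)(rank + 1);
    double total = 0.0;

    /* one collective call combines all partial sums onto rank 0 */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}

Built and launched in the usual way, e.g., with mpicc and mpirun -np 100.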
OpenMP
• a system of so-called "compiler directives" that are
used to express parallelism on a shared-memory
machine.
• an industry standard
– most parallel enabled compilers that are used
on SMP machines are capable of processing
OpenMP directives.
• OpenMP is not a “language”
• Instead, OpenMP specifies a set of subroutines in
an existing language (FORTRAN, C) for parallel
programming on a shared memory machine
Systems using OpenMP
• SMP (Symmetric Multi-Processor)
– designed for shared-memory machines
• advantage of not requiring communication between
processors
• allow multi-threading,
– dynamic form of parallelism in which sub-processes are
created and destroyed during program execution.
• OpenMP will not work on distributed-memory
clusters
OpenMP
compiler directives are inserted by the programmer, which allows stepwise
parallelization of pre-existing serial programs
Multiplication
for (ii = 0; ii < nrows; ii++){
  for (jj = 0; jj < ncols; jj++){
    for (kk = 0; kk < nrows; kk++){
      array[ii][jj] = array[ii][kk] * array[kk][jj] + array[ii][jj];
    }
  }
}
[Figure: C = A × B]
Multiplication
Unified code: OpenMP constructs are treated as comments when sequential compilers are used.
#pragma omp parallel for shared(array, ncols, nrows) private(ii, jj, kk)
for (ii = 0; ii < nrows; ii++){
  for (jj = 0; jj < ncols; jj++){
    for (kk = 0; kk < nrows; kk++){
      array[ii][jj] = array[ii][kk] * array[kk][jj] + array[ii][jj];
    }
  }
}
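Usage note (assuming GCC as the compiler): the same file builds as a serial program with plain gcc and as a parallel one with gcc -fopenmp; the number of threads can then be chosen at run time through the OMP_NUM_THREADS environment variable.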
Why is OpenMP popular?
• The simplicity and ease of use
– No message passing
– data layout and decomposition is handled automatically
by directives.
• OpenMP directives may be incorporated
incrementally.
– program can be parallelized one portion after another
and thus no dramatic change to code is needed
– original (serial) code statements need not, in general, be
modified when parallelized with OpenMP. This reduces
the chance of inadvertently introducing bugs and helps
maintenance as well.
• The code is in effect a serial code and more
readable
• Code size increase is generally smaller.
OpenMP Tradeoffs
• Cons
– currently only runs efficiently in shared-memory
multiprocessor platforms
– requires a compiler that supports OpenMP.
– scalability is limited by memory architecture.
– reliable error handling is missing.
– synchronization between subsets of threads is not
allowed.
– mostly used for loop parallelization
MPI Tradeoffs
• Pros of MPI
– does not require shared memory architectures which
are more expensive than distributed memory
architectures
– can be used on a wider range of problems since it
exploits both task parallelism and data parallelism
– can run on both shared memory and distributed
memory architectures
– highly portable with specific optimization for the
implementation on most hardware
• Cons of MPI
– requires more programming changes to go from serial to
parallel version
– can be harder to debug
Another Example
• Consider the following code fragment that finds the sum of f(x) for 0 <= x < n.
for (ii = 0; ii < n; ii++){
  sum = sum + some_complex_long_function(a[ii]);
}
Solution
#pragma omp parallel for shared(sum, a, n) private(ii, value)
for (ii = 0; ii < n; ii++) {
  value = some_complex_long_function(a[ii]);
  #pragma omp critical
  sum = sum + value;
}
or better, you can use the reduction clause to get
#pragma omp parallel for reduction(+: sum)
for (ii = 0; ii < n; ii++){
  sum = sum + some_complex_long_function(a[ii]);
}
Measuring Performance
• Two primary metrics: wall clock time (response
time for a program) and throughput (jobs
performed in unit time)
– If we upgrade a machine with a new processor what do
we increase?
– If we add a new machine to the lab what do we
increase?
• Performance is measured with benchmark suites:
a collection of programs that are likely relevant to
the user
– SPEC CPU 2006: cpu-oriented (desktops)
– SPECweb, TPC: throughput-oriented (servers)
– EEMBC: for embedded processors/workloads
Measuring Performance
• Elapsed Time
  – counts everything (disk and memory accesses, I/O, etc.); a useful number, but often not good for comparison purposes
• CPU time
  – doesn't count I/O or time spent running other programs; can be broken up into system time and user time
Benchmark Games
• An embarrassed Intel Corp. acknowledged Friday that a
bug in a software program known as a compiler had led the
company to overstate the speed of its microprocessor chips
on an industry benchmark by 10 percent. However,
industry analysts said the coding error…was a sad
commentary on a common industry practice of “cheating”
on standardized performance tests…The error was pointed
out to Intel two days ago by a competitor, Motorola …came
in a test known as SPECint92…Intel acknowledged that it
had “optimized” its compiler to improve its test scores. The
company had also said that it did not like the practice but
felt compelled to make the optimizations because its
competitors were doing the same thing…At the heart of
Intel’s problem is the practice of “tuning” compiler programs
to recognize certain computing problems in the test and
then substituting special handwritten pieces of code…
Saturday, January 6, 1996 New York Times
SPEC CPU2000
Problems of Benchmarking
• Hard to evaluate real benchmarks:
– Machine not built yet, simulators too slow
– Benchmarks not ported
– Compilers not ready
• Benchmark performance is composition of
hardware and software (program, input,
compiler, OS) performance, which must all
be specified
Compiler and Performance
[Chart: performance across an application set (y-axis 0-500) for Xeon 3 GHz / 512 KB and Xeon 3.6 GHz / 1 MB systems built with different compiler settings (-O0, -O3, -xW)]
Amdahl's Law
• The performance enhancement of an
improvement is limited by how much the
improved feature is used. In other words:
Don’t expect an enhancement proportional
to how much you enhanced something.
• Example:
"Suppose a program runs in 100 seconds on a
machine, with multiply operations responsible for
80 seconds of this time. How much do we have to
improve the speed of multiplication if we want the
program to run 4 times faster?"
Amdahl's Law
1. Speedup = 4
2. Old execution time = 100 sec
3. New execution time = 100/4 = 25 sec
4. 80 seconds of that time is used by the affected part =>
5. Unaffected part = 100 - 80 = 20 sec
6. Execution time (new) = Execution time (unaffected) + Execution time (affected) / Improvement
7. 25 = 20 + 80/Improvement
8. Improvement = 16
How about a 5× speedup? The new time would be 100/5 = 20 sec, which equals the unaffected part alone, so the multiply time would have to drop to zero: impossible.
Amdahl's Law
• An application is “almost all” parallel: 90%.
Speedup using
– 10 processors => 5.3x
– 100 processors => 9.1x
– 1000 processors => 9.9x
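These figures follow from Amdahl's Law with parallel fraction P = 0.9:

Speedup(N) = 1 / ((1 - P) + P/N)

which gives 1/(0.1 + 0.09) ≈ 5.3 for N = 10, 1/(0.1 + 0.009) ≈ 9.1 for N = 100, and 1/(0.1 + 0.0009) ≈ 9.9 for N = 1000; the serial 10% caps the speedup at 10× no matter how many processors are used.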
Stream Processing
• Streaming data in and out of an execution core without using inter-thread communication, scattered (i.e., random) writes or reads, or local memory
  – hardware is drastically simplified
  – specialized chips (graphics processing units)
Parallelism
• ILP exploits implicit parallel operations within
a loop or straight-line code segment
• TLP explicitly represented by the use of
multiple threads of execution that are
inherently parallel
Multithreaded Categories
[Figure: issue-slot occupancy over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; shading distinguishes Threads 1-5 and idle slots]
Graphics Processing Units
• Few hundred $ = hundreds of parallel FPUs
– High performance computing more accessible
– Blossomed with easy programming environment
• GPUs and CPUs do not go back in computer
architecture genealogy to a common
ancestor
– Primary ancestors of GPUs: Graphics
accelerators
History of GPUs
• Early video cards
– Frame buffer memory with address generation for
video output
• 3D graphics processing
– Originally high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping,
ray tracing
Graphics in the System
Graphics Processing Units
• Given the hardware invested to do graphics
well, how can we supplement it to improve
performance of a wider range of applications?
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for GPU
– Unify all forms of GPU parallelism as CUDA thread
– Programming model: “Single Instruction Multiple
Thread”
Programming the GPU
• Compute Unified Device Architecture-CUDA
– Elegant solution to problem of expressing
parallelism
• Not all algorithms, but enough to matter
• Challenge: coordinating the HOST (CPU) and the DEVICE (GPU)
– Scheduling of computation
– Data transfer
• GPU offers every type of parallelism that can
be captured by the programming
environment
Programming Model
• CUDA’s design goals
– extend a standard sequential programming language,
specifically C/C++,
• focus on the important issues of parallelism—how to craft efficient
parallel algorithms—rather than grappling with the mechanics of
an unfamiliar and complicated language.
– minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of
thousands of concurrent threads and hundreds of processor
cores.
GTX570 GPU
• Global memory: 1,280 MB; L2 cache: 640 KB
• 15 SMs (SM 0 … SM 14), each with:
  – 32 cores
  – up to 1,536 threads/SM
  – shared memory: 48 KB; L1 cache: 16 KB
  – registers: 32,768
  – texture cache: 8 KB; constant cache: 8 KB
Programming the GPU
• CUDA Programming Model
– Single Instruction Multiple Thread (SIMT)
• A thread is associated with each data
element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management,
not applications or OS
GPU Threads in SM (GTX570)
• 32 threads within a block (a warp) work collectively
  – memory access optimization, latency hiding
GPU Threads in SM (GTX570)
[Figure: a kernel grid of 16 thread blocks (Block 0 … Block 15) distributed for execution across a device with 4 multiprocessors (MP 0 … MP 3)]
• Up to 1024 threads/block and 8 active blocks per SM
Programming the GPU
Matrix Multiplication
• For a 4096×4096 matrix multiplication, matrix C requires calculation of 16,777,216 cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (see the sketch below).
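A hedged CUDA sketch of the scheme described above (not the lecture's actual kernel): each thread computes one element of C, and 16×16 sub-matrices of A and B are staged in shared memory. Assumes square N×N matrices with N a multiple of 16:

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];   /* sub-matrix of A buffered in shared memory */
    __shared__ float Bs[TILE][TILE];   /* sub-matrix of B buffered in shared memory */

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        /* cooperative load: the 16x16 threads of the block fetch one tile of A and B */
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; k++)          /* partial dot product from the tiles */
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;                     /* one thread writes one element of C */
}

/* launch for N = 4096: dim3 block(TILE, TILE); dim3 grid(N/TILE, N/TILE);
   matmul_tiled<<<grid, block>>>(dA, dB, dC, N); */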
Solving Systems of Equations
Thread Organization
• If we expand to 4096 equations, we can process each row completely in parallel with 4096 threads.
• We will require 4096 kernel launches, one for each equation (see the sketch below).
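A hedged sketch of this launch pattern, assuming a Gauss-Jordan style elimination (without pivoting) on an N×(N+1) augmented matrix already in device memory, one kernel launch per equation and one thread per row:

#define N 4096

__global__ void eliminate(float *A, int pivot)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || row == pivot) return;

    /* remove the pivot variable from this row using the pivot row */
    float factor = A[row * (N + 1) + pivot] / A[pivot * (N + 1) + pivot];
    for (int col = pivot; col <= N; col++)
        A[row * (N + 1) + col] -= factor * A[pivot * (N + 1) + col];
}

void solve_on_device(float *dA)        /* dA: device pointer to the augmented matrix */
{
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    for (int pivot = 0; pivot < N; pivot++)        /* 4096 launches, one per equation */
        eliminate<<<blocks, threads>>>(dA, pivot);
    cudaDeviceSynchronize();
    /* afterwards each row holds only its diagonal term and the right-hand side:
       x[i] = A[i*(N+1)+N] / A[i*(N+1)+i] */
}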
Results
CPU configuration: Intel Xeon @ 2.33 GHz with 2 GB RAM
GPU configuration: NVIDIA Tesla C1060 @ 1.3 GHz
*For single precision, speedup improves by at least a factor of 2×
Execution time includes data transfer from host to device
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU (device); variables declared there are allocated in GPU memory
  – __host__ => system processor (HOST)
• Function call:
  – Name<<<dimGrid, dimBlock>>>(..parameter list..)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block
Programming the GPU
//Invoke DAXPY
daxpy(n, 2.0, x, y);

//DAXPY in C
void daxpy(int n, double a, double* x, double* y)
{
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}
Programming the GPU
//Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n+255)/256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

//DAXPY in CUDA
__global__
void daxpy(int n, double a, double* x, double* y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
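The kernel above assumes x and y already reside in GPU memory. A minimal host-side sketch of the allocation and transfer steps around the launch (not shown on the original slide; error checking omitted):

double *d_x, *d_y;
cudaMalloc((void**)&d_x, n * sizeof(double));
cudaMalloc((void**)&d_y, n * sizeof(double));
cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);  // host -> device
cudaMemcpy(d_y, y, n * sizeof(double), cudaMemcpyHostToDevice);

int nblocks = (n + 255)/256;
daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);                       // run the kernel

cudaMemcpy(y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_x);
cudaFree(d_y);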
Programming the GPU
• CUDA
  – Hardware handles thread management
  – Invisible to the programmer (productivity)
  – Performance programmers, however, need to know the operating principles of the threads!
  – Productivity vs. performance: how much power should be given to the programmer? CUDA is still evolving!
Efficiency Considerations
• Avoid execution divergence
  – divergence occurs when threads within a warp follow different execution paths
  – divergence between warps is OK
• Allow loading a block of data into the SM
  – process it there, and then write the final result back out to external memory
• Coalesce memory accesses
  – access consecutive words instead of gather-scatter (see the sketch below)
• Create enough parallel work
  – 5K to 10K threads
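A small illustrative pair of kernels (hypothetical, not from the slides) for the coalescing point above: consecutive threads touching consecutive words versus a strided pattern:

__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* thread k reads word k */
    if (i < n) out[i] = in[i];                       /* one wide transaction per warp */
}

__global__ void strided_copy(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  /* gaps between threads */
    if (i < n) out[i] = in[i];                       /* many partially used transactions */
}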
Efficiency Considerations
• GPU architecture
  – Each SM executes multiple warps in a time-sharing fashion while one or more warps are waiting for memory values
    • hiding the execution cost of warps that are executed concurrently
  – Key question: how many memory requests can be serviced and how many warps can be executed together while one warp is waiting for memory values
OpenMP vs CUDA
#pragma omp parallel for shared(A) private(i, j, value)
for (i = 0; i < 32; i++){
  for (j = 0; j < 32; j++)
    value = some_function(A[i][j]);
}
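A rough CUDA counterpart to the OpenMP loop above (hypothetical kernel; some_function stands in for the same per-element work, and value becomes a 32×32 output array):

__device__ float some_function(float x) { return x * x; }   /* placeholder */

__global__ void per_element(const float *A, float *value)   /* A, value: 32x32, row-major */
{
    int i = blockIdx.x;        /* one block per row (32 blocks) */
    int j = threadIdx.x;       /* one thread per column (32 threads) */
    value[i * 32 + j] = some_function(A[i * 32 + j]);
}

/* launch: per_element<<<32, 32>>>(dA, dValue); */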
GPU Architectures
• Processing is highly data-parallel
– GPUs are highly multithreaded
– Use thread switching to hide memory latency
• Less reliance on multi-level caches
– Graphics memory is wide and high-bandwidth
• Trend toward general purpose GPUs
– Heterogeneous CPU/GPU systems
– CPU for sequential code, GPU for parallel code
• Programming languages/APIs
  – DirectX, OpenGL
  – C for Graphics (Cg), High Level Shader Language (HLSL)
  – Compute Unified Device Architecture (CUDA)
  – Easy to learn, but takes time to master
Example Systems
2 × quad-core
Intel Xeon e5345
(Clovertown)
2 × quad-core
AMD Opteron X4 2356
(Barcelona)
Example Systems
2 × oct-core
Sun UltraSPARC
T2 5140 (Niagara 2)
2 × oct-core
IBM Cell QS20
IBM Cell Broadband Engine
[Figure: Cell BE block diagram: PPE with 32 KB L1 instruction and data caches and a 512 KB unified L2; SPEs, each with 128 128-bit registers, a Local Store (LS), and an MFC; 128-bit SIMD vector per cycle; 300-600 cycle latency to main memory]
Abbreviations: PPE: PowerPC Engine; SPE: Synergistic Processing Element; MFC: Memory Flow Controller; LS: Local Store; SIMD: Single Instruction Multiple Data
CELL BE Programming Model
• No direct access to DRAM from the LS of an SPE (data is moved in and out by DMA)
• Buffer size: 16 KB
Case Studies
White Spaces After Digital TV
• In telecommunications, white spaces refer to vacant
frequency bands between licensed broadcast
channels or services like wireless microphones.
• After the transition to digital TV in the U.S. in June
2009, the amount of white space exceeded the
amount of occupied spectrum even in major cities.
• Utilization of white spaces for digital communications
requires propagation loss models to detect occupied
frequencies in near real-time for operation without
causing harmful interference to a DTV signal, or other
wireless systems operating on a previously vacant
channel.
Challenge
• Irregular Terrain Model (ITM), also known as the
Longley-Rice model, is used to make predictions
of radio field strength based on the elevation
profile of terrains between the transmitter and the
receiver.
– Due to constant changes in terrain topography and variations in radio propagation,
there is a pressing need for computational resources capable of running hundreds
of thousands of transmission loss calculations per second.
ITM
• Given the path length (d) for a radio
transmitter T, a circle is drawn around T
with radius d.
• Along the contour line, 64 hypothetical
receivers (Ri) are placed with equal
distance from each other.
• Vector lines from T to each Ri are further
partitioned into 0.5 km sectors (Sj ).
• Atmospheric and geographic conditions
along each sector form the profile of that
terrain (used 256K profiles).
• For each profile, ITM involves independent
computations based on atmospheric and
geographic conditions followed by
transmission loss calculations.
GPU Strategies for ITM
• ITM requires 45 registers; the register count can be reduced to 37
• Each profile is 1 KB (radio frequency, path length, antenna heights, surface transfer impedance, plus 157 elevation points)
[Figure: available memories: 1.5 GB global memory per GPU; 8 KB constant cache and 8 KB texture cache per multiprocessor; 16 KB shared memory]
• How many threads / MP?
GPU Strategies for ITM
• Thread configurations evaluated: 16×16, 128×16, and 192×16 threads
IBM CELL BE
• Workload: 256K profiles
• Strategies:
  – Message Queue (MQ)
  – DMA and double buffering with various buffer sizes (DDB-n)
  – SIMD with a buffer size of 16 KB (DDB-16+SIMD); FG: fine grained, CG: coarse grained
• Profile-level SIMDization (CG) improves performance by 7.5× over MQ
Productivity Comparison
• Productivity from a code-development perspective
• Based on the personal experience of a Ph.D. student
  – with C/C++ knowledge and the serial version of the ITM code in hand,
  – without prior background on the Cell BE and GPU programming environments.
• Data logged for the "learning curve" and "design and debugging" times individually.
Instruction and Data Streams
• An alternate classification, by instruction and data streams:

                              Data Streams
                              Single                   Multiple
  Instruction     Single      SISD: Intel Pentium 4    SIMD: SSE instructions of x86
  Streams         Multiple    MISD: No examples today  MIMD: Intel Xeon e5345
• SPMD: Single Program Multiple Data
– A parallel program on a MIMD computer
SIMD
• Operate elementwise on vectors of data
– E.g., MMX and SSE instructions in x86
• Multiple data elements in 128-bit wide registers
• All processors execute the same
instruction at the same time
– Each with different data address, etc.
• Simplifies synchronization
• Reduced instruction control hardware
• Works best for highly data-parallel
applications
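A small C sketch of this elementwise idea using x86 SSE intrinsics (illustrative; assumes float arrays whose length is a multiple of 4):

#include <xmmintrin.h>

void add_arrays(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 packed floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));   /* 4 additions in one instruction */
    }
}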
Vector Processors
• Highly pipelined function units
• Stream data from/to vector registers to units
– Data collected from memory into registers
– Results stored from registers to memory
• Example: Vector extension
– 32 × 64-element registers (64-bit elements)
– Vector instructions
• lv, sv: load/store vector
• addv.d: add vectors of double
• addvs.d: add scalar to each element of vector of double
• Significantly reduces instruction-fetch
bandwidth
Vector Processors
Example: (Y = a × X + Y)
• Conventional MIPS code
      l.d    $f0,a($sp)      ;load scalar a
      addiu  r4,$s0,#512     ;upper bound of what to load
loop: l.d    $f2,0($s0)      ;load x(i)
      mul.d  $f2,$f2,$f0     ;a × x(i)
      l.d    $f4,0($s1)      ;load y(i)
      add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
      s.d    $f4,0($s1)      ;store into y(i)
      addiu  $s0,$s0,#8      ;increment index to x
      addiu  $s1,$s1,#8      ;increment index to y
      subu   $t0,r4,$s0      ;compute bound
      bne    $t0,$zero,loop  ;check if done
• Vector MIPS code
      l.d     $f0,a($sp)     ;load scalar a
      lv      $v1,0($s0)     ;load vector x
      mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
      lv      $v3,0($s1)     ;load vector y
      addv.d  $v4,$v2,$v3    ;add y to product
      sv      $v4,0($s1)     ;store the result
Matrix Multiplication
• CPU Configuration: Intel Xeon @2.33GHz with 2GB RAM
• GPU Configuration: NVIDIA Tesla C1060 @1.3GHz
• +For multiplication, matrix size larger than 4096x4096 stresses
host device’s RAM
• *For single precision, speedup improves by at least a factor of 2X
Matrix Size     CPU Time (sec)    GPU Time (sec)    GPU Speedup*
256x256               0.159             0.002               71
512x512               1.518             0.009              169
1024x1024            25.773             0.037              682
2048x2048           547.882             0.208             2623
4096x4096+         4556.700             1.362             3345