Parallel Computing Explained - Florida International University

Parallel Computing Explained
Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By
S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.1.1 Parallelism in our Daily Lives
1.1.2 Parallelism in Computer Programs
1.1.3 Parallelism in Computers
1.1.3.4 Disk Parallelism
1.1.4 Performance Measures
1.1.5 More Parallelism Issues
1.2 Comparison of Parallel Computers
1.3 Summary
Parallel Computing Overview
 Who should read this chapter?
 New Users – to learn concepts and terminology.
 Intermediate Users – for review or reference.
 Management Staff – to understand the basic concepts – even if
you don’t plan to do any programming.
 Note: Advanced users may opt to skip this chapter.
Introduction to Parallel Computing
 High performance parallel computers
 can solve large problems much faster than a desktop computer
 fast CPUs, large memory, high speed interconnects, and high speed
input/output
 able to speed up computations
 by making the sequential components run faster
 by doing more operations in parallel
 High performance parallel computers are in demand
 need for tremendous computational capabilities in science,
engineering, and business.
 require gigabytes/terabytes of memory and gigaflops/teraflops of
performance
 scientists are striving for petascale performance
Introduction to Parallel Computing
 HPPC are used in a wide variety of disciplines.
 Meteorologists: prediction of tornadoes and thunderstorms
 Computational biologists: analyze DNA sequences
 Pharmaceutical companies: design of new drugs
 Oil companies: seismic exploration
 Wall Street: analysis of financial markets
 NASA: aerospace vehicle design
 Entertainment industry: special effects in movies and commercials
 These complex scientific and business applications all need to
perform computations on large datasets or large equations.
Parallelism in our Daily Lives
 There are two types of processes that occur in computers and
in our daily lives:
 Sequential processes
 occur in a strict order
 it is not possible to do the next step until the current one is completed.
 Examples
 The passage of time: the sun rises and the sun sets.
 Writing a term paper: pick the topic, research, and write the paper.
 Parallel processes
 many events happen simultaneously
 Examples
 Plant growth in the springtime
 An orchestra
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.1.1 Parallelism in our Daily Lives
1.1.2 Parallelism in Computer Programs
1.1.2.1 Data Parallelism
1.1.2.2 Task Parallelism
1.1.3 Parallelism in Computers
1.1.3.4 Disk Parallelism
1.1.4 Performance Measures
1.1.5 More Parallelism Issues
1.2 Comparison of Parallel Computers
1.3 Summary
Parallelism in Computer Programs
 Conventional wisdom:
 Computer programs are sequential in nature
 Only a small subset of them lend themselves to parallelism.
 Algorithm: the "sequence of steps" necessary to do a computation.
 For the first 30 years of computer use, programs were run sequentially.
 The 1980's saw great successes with parallel computers.
 Dr. Geoffrey Fox published a book entitled Parallel Computing
Works!
 many scientific accomplishments resulting from parallel computing
 Computer programs are parallel in nature
 Only a small subset of them need to be run sequentially
Parallel Computing
 What a computer does when it carries out more than one
computation at a time using more than one processor.
 By using many processors at once, we can speed up the execution.
 If one processor can perform the arithmetic in time t, then ideally p processors can perform it in time t/p.
 What if I use 100 processors? What if I use 1000 processors?
 Almost every program has some form of parallelism.
 You need to determine whether your data or your program can be
partitioned into independent pieces that can be run simultaneously.
 Decomposition is the name given to this partitioning process.
 Types of parallelism:
 data parallelism
 task parallelism.
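The ideal-case arithmetic above can be tried out with a few lines of Python. This is only a sketch: the function name `ideal_time` and the one-hour serial run time are assumptions made for illustration, and real programs rarely reach this ideal because of overhead and any remaining sequential work.

```python
def ideal_time(t, p):
    """Ideal run time on p processors for work that takes time t on one."""
    if p < 1:
        raise ValueError("need at least one processor")
    return t / p

serial_seconds = 3600.0  # hypothetical: one hour on a single processor
for p in (1, 100, 1000):
    print(p, "processors:", ideal_time(serial_seconds, p), "seconds")
```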
Data Parallelism
 The same code segment runs concurrently on each processor,
but each processor is assigned its own part of the data to
work on.
 Do loops (in Fortran) define the parallelism.
 The iterations must be independent of each other.
 Data parallelism is called "fine grain parallelism" because the
computational work is spread into many small subtasks.
 Example
 Dense linear algebra, such as matrix multiplication, is a perfect
candidate for data parallelism.
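An example of data parallelism
Original Sequential Code
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
Parallel Code
! Every K iteration updates the same C(I,J), so C is declared a
! reduction variable: each thread sums into a private copy of C,
! and the copies are added together at the end of the loop.
!$OMP PARALLEL DO REDUCTION(+:C)
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
!$OMP END PARALLEL DO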
Quick Intro to OpenMP
 OpenMP is a portable standard for parallel directives
covering both data and task parallelism.
 More information about OpenMP is available on the OpenMP
website.
 We will have a lecture on Introduction to OpenMP later.
 With OpenMP, the loop that is performed in parallel is the
loop that immediately follows the Parallel Do directive.
 In our sample code, it's the K loop:
 DO K=1,N
OpenMP Loop Parallelism
Iteration-Processor Assignments
Processor   Iterations of K   Data Elements
proc0       K=1:5             A(I,1:5)     B(1:5,J)
proc1       K=6:10            A(I,6:10)    B(6:10,J)
proc2       K=11:15           A(I,11:15)   B(11:15,J)
proc3       K=16:20           A(I,16:20)   B(16:20,J)
The code segment running on each processor
DO J=1,N
  DO I=1,N
    C(I,J) = C(I,J) + A(I,K)*B(K,J)
  END DO
END DO
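The iteration-to-processor assignment in the table can be reproduced with a short Python sketch. It assumes N = 20 iterations, 4 processors, and a simple block distribution in which each processor receives one contiguous chunk; the function name `block_partition` is hypothetical, and an actual OpenMP runtime may schedule iterations differently.

```python
def block_partition(n_iterations, n_procs):
    """Split iterations 1..n_iterations into contiguous blocks, one per processor.

    Returns a list of (first, last) pairs, inclusive, in processor order.
    Any remainder is spread over the lowest-numbered processors.
    """
    base, extra = divmod(n_iterations, n_procs)
    blocks = []
    first = 1
    for proc in range(n_procs):
        size = base + (1 if proc < extra else 0)
        blocks.append((first, first + size - 1))
        first += size
    return blocks

for proc, (lo, hi) in enumerate(block_partition(20, 4)):
    print(f"proc{proc}: K={lo}:{hi}")
```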
OpenMP Style of Parallelism
 can be done incrementally as follows:
1. Parallelize the most computationally intensive loop.
2. Measure the performance of the code.
3. If performance is not satisfactory, parallelize another loop.
4. Repeat steps 2 and 3 as many times as needed.
 The ability to perform incremental parallelism is considered a
positive feature of data parallelism.
 It is contrasted with the MPI (Message Passing Interface)
style of parallelism, which is an "all or nothing" approach.
Task Parallelism
 Task parallelism may be thought of as the opposite of data parallelism.
 Instead of the same operations being performed on different parts of the data, each process performs different operations.
 You can use task parallelism when your program can be split into independent pieces, often subroutines, that can be assigned to different processors and run concurrently.
 Task parallelism is called "coarse grain" parallelism because the computational work is spread into just a few subtasks.
 More code is run in parallel because the parallelism is implemented at a higher level than in data parallelism.
 Task parallelism is often easier to implement and has less overhead than data parallelism.
Task Parallelism
 The abstract code shown in the diagram is decomposed into
4 independent code segments that are labeled A, B, C, and D.
The right hand side of the diagram illustrates the 4 code
segments running concurrently.
Task Parallelism
Original Code
program main
  code segment labeled A
  code segment labeled B
  code segment labeled C
  code segment labeled D
end
Parallel Code
program main
!$OMP PARALLEL
!$OMP SECTIONS
  code segment labeled A
!$OMP SECTION
  code segment labeled B
!$OMP SECTION
  code segment labeled C
!$OMP SECTION
  code segment labeled D
!$OMP END SECTIONS
!$OMP END PARALLEL
end
OpenMP Task Parallelism
 With OpenMP, the code that follows each SECTION(S)
directive is allocated to a different processor. In our sample
parallel code, the allocation of code segments to processors is
as follows.
Processor   Code
proc0       code segment labeled A
proc1       code segment labeled B
proc2       code segment labeled C
proc3       code segment labeled D
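The same idea can be sketched in Python, with a thread pool standing in for the OpenMP SECTIONS construct. The four functions below are hypothetical placeholders for code segments A through D, and they must not depend on each other's results. In CPython, threads show the structure of task parallelism but generally do not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the independent code segments A, B, C, and D.
def segment_a():
    return sum(range(1000))

def segment_b():
    return max(3, 1, 4, 1, 5)

def segment_c():
    return "c".upper()

def segment_d():
    return sorted([3, 1, 2])

def run_sections(segments, n_workers=4):
    """Submit each independent segment as its own task; return results in order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(seg) for seg in segments]
        return [f.result() for f in futures]

results = run_sections([segment_a, segment_b, segment_c, segment_d])
print(results)
```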
Parallelism in Computers
 How parallelism is exploited and enhanced within the
operating system and hardware components of a parallel
computer:
 operating system
 arithmetic
 memory
 disk
Operating System Parallelism
 All of the commonly used parallel computers run a version of the
Unix operating system. In the table below each OS listed is in fact
Unix, but the name of the Unix OS varies with each vendor.
Parallel Computer      OS
SGI Origin2000         IRIX
HP V-Class             HP-UX
Cray T3E               Unicos
IBM SP                 AIX
Workstation Clusters   Linux
 For more information about Unix, a collection of Unix documents
is available.
Two Unix Parallelism Features
 background processing facility
 With the Unix background processing facility you can run the
executable a.out in the background and simultaneously view the
man page for the etime function in the foreground. There are
two Unix commands that accomplish this:
a.out > results &
man etime
 cron feature
 With the Unix cron feature you can submit a job that will run at
a later time.
Arithmetic Parallelism
 Multiple execution units
 facilitate arithmetic parallelism.
 The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are
each done in a separate execution unit. This allows several execution units to be
used simultaneously, because the execution units operate independently.
 Fused multiply and add
 is another parallel arithmetic feature.
 Parallel computers are able to overlap multiply and add. This arithmetic is named
MultiplyADD (MADD) on SGI computers, and Fused Multiply Add (FMA) on
HP computers. In either case, the two arithmetic operations are overlapped and
can complete in hardware in one computer cycle.
 Superscalar arithmetic
 is the ability to issue several arithmetic operations per computer cycle.
 It makes use of the multiple, independent execution units. On superscalar
computers there are multiple slots per cycle that can be filled with work. This
gives rise to the name n-way superscalar, where n is the number of slots per
cycle. The SGI Origin2000 is called a 4-way superscalar computer.
Memory Parallelism
 memory interleaving
 memory is divided into multiple banks, and consecutive data elements are
interleaved among them. For example if your computer has 2 memory banks,
then data elements with even memory addresses would fall into one bank, and
data elements with odd memory addresses into the other.
 multiple memory ports
 Port means a bi-directional memory pathway. When the data elements that are
interleaved across the memory banks are needed, the multiple memory ports
allow them to be accessed and fetched in parallel, which increases the memory
bandwidth (MB/s or GB/s).
 multiple levels of the memory hierarchy
 There is global memory that any processor can access. There is memory that is
local to a partition of the processors. Finally there is memory that is local to a
single processor, that is, the cache memory and the memory elements held in
registers.
 Cache memory
 Cache is a small memory that has fast access compared with the larger main
memory and serves to keep the faster processor filled with data.
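Memory interleaving can be illustrated with a small Python sketch. It assumes the simplest possible mapping, in which the bank number is the element address modulo the number of banks; real hardware may interleave at a coarser granularity, and the function names are hypothetical.

```python
def bank_of(address, n_banks):
    """Bank holding a data element under simple modulo interleaving."""
    return address % n_banks

def banks_touched(addresses, n_banks):
    """Set of distinct banks used by a sequence of element addresses."""
    return {bank_of(a, n_banks) for a in addresses}

# Two banks: even addresses fall into bank 0, odd addresses into bank 1.
print([bank_of(a, 2) for a in range(8)])

# With 4 banks, consecutive elements spread over all banks,
# while a stride-4 access pattern keeps hitting the same bank.
print(banks_touched(range(0, 8), 4))
print(banks_touched(range(0, 32, 4), 4))
```

The stride-4 case suggests why access patterns matter: when every request lands in one bank, the banks cannot be read in parallel.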
Memory Parallelism
[Figures: Memory Hierarchy; Cache Memory]
Disk Parallelism
 RAID (Redundant Array of Inexpensive Disks)
 RAID disks are on most parallel computers.
 The advantage of a RAID disk system is that it provides a
measure of fault tolerance.
 If one of the disks goes down, it can be swapped out, and the
RAID disk system remains operational.
 Disk Striping
 When a data set is written to disk, it is striped across the RAID
disk system. That is, it is broken into pieces that are written
simultaneously to the different disks in the RAID disk system.
When the same data set is read back in, the pieces are read in
parallel, and the full data set is reassembled in memory.
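Disk striping can be mimicked in memory with the following Python sketch. Lists stand in for disks, the chunk size and function names are assumptions, and the writes happen one after another here, whereas a RAID system would perform them simultaneously. No redundancy or parity is modeled.

```python
def stripe(data, n_disks, chunk_size):
    """Break data into chunks and deal them round-robin onto n_disks lists."""
    disks = [[] for _ in range(n_disks)]
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for index, chunk in enumerate(chunks):
        disks[index % n_disks].append(chunk)
    return disks

def reassemble(disks):
    """Read the chunks back in their original order and join them."""
    n_disks = len(disks)
    total_chunks = sum(len(d) for d in disks)
    ordered = [disks[i % n_disks][i // n_disks] for i in range(total_chunks)]
    return b"".join(ordered)

data = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
disks = stripe(data, n_disks=4, chunk_size=4)
for number, contents in enumerate(disks):
    print("disk", number, contents)
```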
Performance Measures
 Peak Performance
 is the top speed at which the computer can operate.
 It is a theoretical upper limit on the computer's performance.
 Sustained Performance
 is the highest consistently achieved speed.
 It is a more realistic measure of computer performance.
 Cost Performance
 is used to determine if the computer is cost effective.
 MHz
 is a measure of the processor speed.
 The processor speed is commonly measured in millions of cycles per second,
where a computer cycle is defined as the shortest time in which some work can be
done.
 MIPS
 is a measure of how quickly the computer can issue instructions.
 Millions of instructions per second is abbreviated as MIPS, where the instructions
are computer instructions such as: memory reads and writes, logical operations,
floating point operations, integer operations, and branch instructions.
Performance Measures
 Mflops (Millions of floating point operations per second)
 measures how quickly a computer can perform floating-point operations
such as add, subtract, multiply, and divide.
 Speedup
 measures the benefit of parallelism.
 It shows how your program scales as you compute with more processors,
compared to the performance on one processor.
 Ideal speedup happens when the performance gain is linearly proportional to
the number of processors used.
 Benchmarks
 are used to rate the performance of parallel computers and parallel
programs.
 A well known benchmark that is used to compare parallel computers is the
Linpack benchmark.
 Based on the Linpack results, a list is produced of the Top 500
Supercomputer Sites. This list is maintained by the University of Tennessee
and the University of Mannheim.
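As a rough illustration of two of these measures, the Python sketch below computes Mflops from an operation count and a run time, and speedup from run times on one and on p processors. The timings are invented for the example, and the operation count uses the usual estimate of about 2*N**3 floating-point operations for an N x N matrix multiply.

```python
def mflops(flop_count, seconds):
    """Millions of floating-point operations per second."""
    return flop_count / seconds / 1.0e6

def speedup(time_on_one, time_on_p):
    """How many times faster the run on p processors is than the run on one."""
    return time_on_one / time_on_p

# Hypothetical wall-clock timings (seconds) for the same program.
timings = {1: 80.0, 2: 42.0, 4: 23.0, 8: 14.0}

for p, t in timings.items():
    print(f"p={p}: speedup {speedup(timings[1], t):.2f} (ideal {p})")

# An N x N matrix multiply performs about 2*N**3 floating-point operations.
n = 1000
print(f"{mflops(2 * n**3, timings[1]):.0f} Mflops on one processor")
```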
More Parallelism Issues
 Load balancing
 is the technique of evenly dividing the workload among the processors.
 For data parallelism it involves how iterations of loops are allocated to processors.
 Load balancing is important because the total time for the program to complete is
the time spent by the longest executing thread.
 The problem size
 must be large and must be able to grow as you compute with more processors.
 In order to get the performance you expect from a parallel computer you need to
run a large application with large data sizes, otherwise the overhead of passing
information between processors will dominate the calculation time.
 Good software tools
 are essential for users of high performance parallel computers.
 These tools include:
 parallel compilers
 parallel debuggers
 performance analysis tools
 parallel math software
 The availability of a broad set of application software is also important.
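The effect of load balancing on total run time can be seen in a small Python sketch. It assumes a loop in which iteration i costs i time units (as in a triangular loop nest) and compares two hypothetical ways of assigning 16 iterations to 4 processors; the function names are illustrative only.

```python
def block_assign(n_iterations, n_procs):
    """Contiguous chunks: processor 0 gets the first chunk, and so on."""
    size = -(-n_iterations // n_procs)  # ceiling division
    return [list(range(p * size + 1, min((p + 1) * size, n_iterations) + 1))
            for p in range(n_procs)]

def cyclic_assign(n_iterations, n_procs):
    """Round-robin: iteration i goes to processor (i - 1) mod n_procs."""
    return [list(range(p + 1, n_iterations + 1, n_procs)) for p in range(n_procs)]

def finish_time(assignment, cost):
    """The program finishes when the most heavily loaded processor finishes."""
    return max(sum(cost(i) for i in iters) for iters in assignment)

def cost(i):
    return i  # hypothetical: iteration i takes i time units

for name, assign in (("block", block_assign), ("cyclic", cyclic_assign)):
    print(name, finish_time(assign(16, 4), cost))
```

With these assumptions the total work is 136 units, so a perfect balance would finish in 34; the cyclic assignment comes closer to that than the block assignment because no processor receives only the expensive iterations.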
More Parallelism Issues
 The high performance computing market is risky and chaotic. Many
supercomputer vendors are no longer in business, making the
portability of your application very important.
 A workstation farm
 is defined as a fast network connecting heterogeneous workstations.
 The individual workstations serve as desktop systems for their owners.
 When they are idle, large problems can take advantage of the unused
cycles in the whole system.
 An application of this concept is the SETI project. You can participate in
searching for extraterrestrial intelligence with your home PC. More
information about this project is available at the SETI Institute.
 Condor
 is software that provides resource management services for applications that
run on heterogeneous collections of workstations.
 Miron Livny at the University of Wisconsin at Madison is the director of the
Condor project, and has coined the phrase high throughput computing to describe
this process of harnessing idle workstation cycles. More information is available
at the Condor Home Page.
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.2 Comparison of Parallel Computers
1.2.1 Processors
1.2.2 Memory Organization
1.2.3 Flow of Control
1.2.4 Interconnection Networks
1.2.4.1 Bus Network
1.2.4.2 Cross-Bar Switch Network
1.2.4.3 Hypercube Network
1.2.4.4 Tree Network
1.2.4.5 Interconnection Networks Self-test
1.2.5 Summary of Parallel Computer Characteristics
1.3 Summary
Comparison of Parallel Computers
 Now you can explore the hardware components of parallel
computers:
 kinds of processors
 types of memory organization
 flow of control
 interconnection networks
 You will see what is common to these parallel computers,
and what makes each one of them unique.
Kinds of Processors
 There are three types of parallel computers:
1. computers with a small number of powerful processors
 Typically have tens of processors.
 The cooling of these computers often requires very sophisticated and
expensive equipment, making these computers very expensive for computing
centers.
 They are general-purpose computers that perform especially well on
applications that have large vector lengths.
 Examples of this type of computer are the Cray SV1 and the Fujitsu
VPP5000.
Kinds of Processors
 There are three types of parallel computers:
2. computers with a large number of less powerful processors
 Named a Massively Parallel Processor (MPP), these typically have thousands of processors.
 The processors are usually proprietary and air-cooled.
 Because of the large number of processors, the distance between the furthest processors can be quite large, requiring a sophisticated internal network that allows distant processors to communicate with each other quickly.
 These computers are suitable for applications with a high degree of concurrency.
 The MPP type of computer was popular in the 1980s.
 Examples of this type of computer were the Thinking Machines CM-2 computer, and the computers made by the MasPar company.
Kinds of Processors
 There are three types of parallel computers:
3. computers that are medium scale in between the two extremes
 Typically have hundreds of processors.
 The processor chips are usually not proprietary; rather they are commodity
processors like the Pentium III.
 These are general-purpose computers that perform well on a wide range of
applications.
 The most common example of this class is the Linux Cluster.
Trends and Examples
 Processor trends:
Decade   Processor Type                    Computer Example
1970s    Pipelined, Proprietary            Cray-1
1980s    Massively Parallel, Proprietary   Thinking Machines CM-2
1990s    Superscalar, RISC, Commodity      SGI Origin2000
2000s    CISC, Commodity                   Workstation Clusters
 The processors on today’s commonly used parallel computers:
Computer               Processor
SGI Origin2000         MIPS RISC R12000
HP V-Class             HP PA 8200
Cray T3E               Compaq Alpha
IBM SP                 IBM Power3
Workstation Clusters   Intel Pentium III, Intel Itanium
Memory Organization
 The following paragraphs describe the three types of
memory organization found on parallel computers:
 distributed memory
 shared memory
 distributed shared memory
Distributed Memory
 In distributed memory computers, the total memory is partitioned
into memory that is private to each processor.
 There is a Non-Uniform Memory Access time (NUMA), which is
proportional to the distance between the two communicating
processors.
 On NUMA computers,
data is accessed the
quickest from a private
memory, while data from
the most distant
processor takes the
longest to access.
 Some examples are the
Cray T3E, the IBM SP,
and workstation clusters.
Distributed Memory
 When programming distributed memory computers, the
code and the data should be structured such that the bulk of
a processor’s data accesses are to its own private (local)
memory.
 This is called having
good data locality.
 Today's distributed
memory computers use
message passing such as
MPI to communicate
between processors as
shown in the following
example:
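A minimal Python sketch of the message passing idea follows. It is not MPI: threads stand in for processors, a `queue.Queue` stands in for the network, and all names are hypothetical. Each simulated processor works only on its own private partition of the data and sends its partial sum to rank 0, which receives the messages and combines them.

```python
import queue
import threading

def worker(rank, n_procs, local_data, to_root, results):
    """Sum the private data; ranks other than 0 send the result to rank 0."""
    partial = sum(local_data)                 # uses only this rank's private data
    if rank != 0:
        to_root.put((rank, partial))          # "send" the partial sum to rank 0
        return
    total = partial
    for _ in range(n_procs - 1):
        _sender, value = to_root.get()        # blocking "receive"
        total += value
    results["total"] = total

def run(n_procs=4, n=100):
    """Distribute 1..n over n_procs simulated processors and return the sum."""
    data = list(range(1, n + 1))
    chunk = n // n_procs
    to_root = queue.Queue()
    results = {}
    threads = []
    for rank in range(n_procs):
        stop = (rank + 1) * chunk if rank < n_procs - 1 else n
        local = data[rank * chunk:stop]       # private partition for this rank
        t = threading.Thread(target=worker,
                             args=(rank, n_procs, local, to_root, results))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return results["total"]

print(run())
```

In a real MPI program the put and get calls would correspond to send and receive operations between separate processes, each with its own memory.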
Distributed Memory
 One advantage of distributed memory computers is that they
are easy to scale. As the demand for resources grows,
computer centers can easily add more memory and
processors.
 This is often called the LEGO block approach.
 The drawback is that programming of distributed memory
computers can be quite complicated.
Shared Memory
 In shared memory computers, all processors have access to a single pool
of centralized memory with a uniform address space.
 Any processor can address any memory location at the same speed so
there is Uniform Memory Access time (UMA).
 Processors communicate with each other through the shared memory.
 The advantages and
disadvantages of shared
memory machines are
roughly the opposite of
distributed memory
computers.
 They are easier to program
because they resemble the
programming of single
processor machines
 But they don't scale like
their distributed memory
counterparts
Distributed Shared Memory
 In Distributed Shared Memory (DSM) computers, a cluster or partition of
processors has access to a common shared memory.
 It accesses the memory of a different processor cluster in a NUMA fashion.
 Memory is physically distributed but logically shared.
 Attention to data locality again is important.
 Distributed shared memory computers combine the best features of both distributed memory computers and shared memory computers.
 That is, DSM computers have both the scalability of distributed memory computers and the ease of programming of shared memory computers.
 Some examples of DSM computers are the SGI Origin2000 and the HP V-Class computers.
Trends and Examples
 Memory organization trends:
Decade   Memory Organization         Example
1970s    Shared Memory               Cray-1
1980s    Distributed Memory          Thinking Machines CM-2
1990s    Distributed Shared Memory   SGI Origin2000
2000s    Distributed Memory          Workstation Clusters
 The memory organization of today’s commonly used parallel computers:
Computer               Memory Organization
SGI Origin2000         DSM
HP V-Class             DSM
Cray T3E               Distributed
IBM SP                 Distributed
Workstation Clusters   Distributed
Flow of Control
 When you look at the flow of control you will see three types
of parallel computers:
 Single Instruction Multiple Data (SIMD)
 Multiple Instruction Multiple Data (MIMD)
 Single Program Multiple Data (SPMD)
Flynn’s Taxonomy
 Flynn’s Taxonomy, devised in 1972 by Michael Flynn of Stanford
University, describes computers by how streams of instructions interact
with streams of data.
 There can be single or multiple instruction streams, and there can be
single or multiple data streams. This gives rise to 4 types of computers as
shown in the diagram below:
 Flynn's taxonomy
names the 4 computer
types SISD, MISD,
SIMD and MIMD.
 Of these 4, only SIMD
and MIMD are
applicable to parallel
computers.
 Another computer
type, SPMD, is a special
case of MIMD.
SIMD Computers
 SIMD stands for Single Instruction Multiple Data.
 Each processor follows the same set of instructions, with different data elements allocated to each processor.
 SIMD computers have distributed memory with typically thousands of simple processors,
and the processors run in lock step.
 SIMD computers, popular in the 1980s, are useful for fine grain data parallel applications,
such as neural networks.
 Some examples of SIMD computers were the Thinking Machines CM-2 computer and the computers from the MasPar company.
 The processors are commanded by the
global controller that sends
instructions to the processors.
 It says add, and they all add.
 It says shift to the right, and they all
shift to the right.
 The processors are like obedient
soldiers, marching in unison.
MIMD Computers
 MIMD stands for Multiple Instruction Multiple Data.
 There are multiple instruction streams with separate code segments distributed among the processors.
 MIMD is actually a superset of SIMD, so that the processors can run the same instruction stream or different instruction streams.
 In addition, there are multiple data streams; different data elements are allocated to each processor.
 MIMD computers can have either distributed memory or shared memory.
 While the processors on SIMD computers run in lock step, the processors on MIMD computers run independently of each other.
 MIMD computers can be used for either data parallel or task parallel applications.
 Some examples of MIMD computers are the SGI Origin2000 computer and the HP V-Class computer.
SPMD Computers
 SPMD stands for Single Program Multiple Data.
 SPMD is a special case of MIMD.
 SPMD execution happens when a MIMD computer is programmed to have the
same set of instructions per processor.
 With SPMD computers, while the processors are running the same code
segment, each processor can run that code segment asynchronously.
 Unlike SIMD, the synchronous execution of instructions is relaxed.
 An example is the execution of an if statement on an SPMD computer.
 Because each processor computes with its own partition of the data elements, it
may evaluate the condition of the if statement differently from another
processor.
 One processor may take a certain branch of the if statement, and another
processor may take a different branch of the same if statement.
 Hence, even though each processor has the same set of instructions, those
instructions may be evaluated in a different order from one processor to the next.
 The analogies we used for describing SIMD computers can be modified for
MIMD computers.
 Instead of the SIMD obedient soldiers, all marching in unison, in the MIMD world
the processors march to the beat of their own drummer.
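The if-statement behavior described for SPMD can be sketched in Python. The kernel and the data partitions are hypothetical. The loop calls the kernel once per partition, one after another, purely to show the divergent branches; on an SPMD machine the calls would run concurrently on separate processors.

```python
def spmd_kernel(rank, local_data):
    """Same code on every processor; the branch taken depends on local data."""
    if sum(local_data) % 2 == 0:
        return (rank, "even branch", [x // 2 for x in local_data])
    else:
        return (rank, "odd branch", [x * 3 + 1 for x in local_data])

# Hypothetical data, one partition per processor.
partitions = [[2, 4], [1, 2], [5, 5], [7]]

for rank, local in enumerate(partitions):
    print(spmd_kernel(rank, local))
```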
Summary of SIMD versus MIMD
                    SIMD                      MIMD
Memory              distributed memory        distributed memory or shared memory
Code Segment        same per processor        same or different
Processors Run In   lock step                 asynchronously
Data Elements       different per processor   different per processor
Applications        data parallel             data parallel or task parallel
Trends and Examples
 Flow of control trends:
Decade   Flow of Control   Computer Example
1980s    SIMD              Thinking Machines CM-2
1990s    MIMD              SGI Origin2000
2000s    MIMD              Workstation Clusters
 The flow of control on today’s commonly used parallel computers:
Computer               Flow of Control
SGI Origin2000         MIMD
HP V-Class             MIMD
Cray T3E               MIMD
IBM SP                 MIMD
Workstation Clusters   MIMD
Interconnection Networks
 What exactly is the interconnection network?
 The interconnection network is made up of the wires and cables that define how the
multiple processors of a parallel computer are connected to each other and to the
memory units.
 The time required to transfer data is dependent upon the specific type of the
interconnection network.
 This transfer time is called the communication time.
 What network characteristics are important?
 Diameter: the maximum distance that data must travel for 2 processors to
communicate.
 Bandwidth: the amount of data that can be sent through a network connection.
 Latency: the delay on a network while a data packet is being stored and forwarded.
 Types of Interconnection Networks
The network topologies (geometric arrangements of the computer network
connections) are:
 Bus
 Cross-bar Switch
 Hypercube
 Tree
Interconnection Networks
 The aspects of network issues are:
 Cost
 Scalability
 Reliability
 Suitable Applications
 Data Rate
 Diameter
 Degree
 General Network Characteristics
 Some networks can be compared in terms of their degree and diameter.
 Degree: how many communicating wires are coming out of each processor.
 A large degree is a benefit because it provides multiple paths between processors.
 Diameter: This is the distance between the two processors that are farthest
apart.
 A small diameter corresponds to low latency.
Bus Network
 Bus topology is the original coaxial cable-based Local Area Network
(LAN) topology in which the medium forms a single bus to which all
stations are attached.
 The positive aspects
 It is a mature technology that is well known and reliable.
 The cost is very low.
 It is simple to construct.
 The negative aspects
 limited data transmission rate.
 not scalable in terms of performance.
 Example: SGI Power Challenge.
 Only scaled to 18 processors.
Cross-Bar Switch Network
 A cross-bar switch is a network that works through a switching mechanism to
access shared memory.
 It scales better than the bus network, but it costs significantly more.
 The telephone system uses this type of network. An example of a computer
with this type of network is the HP V-Class.
 Here is a diagram of a
cross-bar switch
network which shows
the processors talking
through the
switchboxes to store or
retrieve data in
memory.
 There are multiple
paths for a processor to
communicate with a
certain memory.
 The switches determine
the optimal route to
take.
Hypercube Network
 In a hypercube network, the processors are connected as if they
were corners of a multidimensional cube. Each node in an N
dimensional cube is directly connected to N other nodes.
 The fact that the number of directly connected, "nearest neighbor", nodes increases with the total size of the network is also highly desirable for a parallel computer.
 The degree of a hypercube network is log2 n and the diameter is log2 n, where n is the number of processors.
 Examples of computers with this type of network are the CM-2, NCUBE-2, and the Intel iPSC/860.
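The degree and diameter of a hypercube can be checked with a short Python sketch. It assumes the usual labeling in which the n = 2**d processors carry the binary addresses 0 to n-1 and two processors are linked when their addresses differ in exactly one bit; the function names are hypothetical.

```python
def neighbors(node, dimension):
    """Nodes directly connected to `node`: flip each of the address bits in turn."""
    return [node ^ (1 << bit) for bit in range(dimension)]

def hops(src, dst):
    """Minimum number of links between two nodes: the count of differing bits."""
    return bin(src ^ dst).count("1")

dimension = 3                      # a 3-dimensional cube has 2**3 = 8 processors
n_procs = 2 ** dimension

print("neighbors of node 0:", neighbors(0, dimension))
print("degree:", len(neighbors(0, dimension)))
print("diameter:", max(hops(a, b) for a in range(n_procs) for b in range(n_procs)))
```

For 8 processors both the degree and the diameter come out as 3, which is log2 8.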
Tree Network
 The processors are the bottom nodes of the tree. For a processor to retrieve data, it must go up in the network and then go back down.
 This is useful for decision making applications that can be mapped as trees.
 The degree of a tree network is 1. The diameter of the network is 2 log2(n+1) - 2, where n is the number of processors.
 The Thinking Machines CM-5 is an example of a parallel computer with this type of network.
 Tree networks are very suitable for database applications because they allow multiple searches through the database at a time.
Interconnected Networks
 Torus Network: A mesh with wrap-around connections in both the x and y directions.
 Multistage Network: A network with more than one networking unit.
 Fully Connected Network: A network where every processor is connected to every other processor.
 Hypercube Network: Processors are connected as if they were corners of a multidimensional cube.
 Mesh Network: A network where each interior processor is connected to its four nearest neighbors.
Interconnected Networks
 Bus Based Network: Coaxial cable based LAN topology in
which the medium forms a single bus to which all stations are
attached.
 Cross-bar Switch Network: A network that works through a
switching mechanism to access shared memory.
 Tree Network: The processors are the bottom nodes of the
tree.
 Ring Network: Each processor is connected to two others
and the line of connections forms a circle.
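Several of the topologies defined above can be compared by building them as small graphs and measuring degree and diameter directly. The Python sketch below does this for 16 processors; the graph builders and their names are assumptions made for illustration, and the diameter is found by breadth-first search rather than by a closed-form formula.

```python
from collections import deque

def ring(n):
    """Each processor is linked to its two neighbors on a circle."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def fully_connected(n):
    """Every processor is linked to every other processor."""
    return {i: set(range(n)) - {i} for i in range(n)}

def grid(rows, cols, wrap):
    """2-D mesh (wrap=False) or torus (wrap=True); nodes are (row, col) pairs."""
    graph = {}
    for r in range(rows):
        for c in range(cols):
            links = set()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if wrap:
                    links.add((rr % rows, cc % cols))
                elif 0 <= rr < rows and 0 <= cc < cols:
                    links.add((rr, cc))
            links.discard((r, c))
            graph[(r, c)] = links
    return graph

def degree(graph):
    """Largest number of links leaving any one node."""
    return max(len(links) for links in graph.values())

def diameter(graph):
    """Longest shortest path between any two nodes, by breadth-first search."""
    longest = 0
    for start in graph:
        dist = {start: 0}
        todo = deque([start])
        while todo:
            node = todo.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    todo.append(nxt)
        longest = max(longest, max(dist.values()))
    return longest

for name, g in (("ring", ring(16)), ("mesh", grid(4, 4, False)),
                ("torus", grid(4, 4, True)), ("fully connected", fully_connected(16))):
    print(f"{name}: degree {degree(g)}, diameter {diameter(g)}")
```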
Summary of Parallel Computer Characteristics
 How many processors does the computer have?
 10s?
 100s?
 1000s?
 How powerful are the processors?
 what's the MHz rate
 what's the MIPS rate
 What's the instruction set architecture?
 RISC
 CISC
Summary of Parallel Computer Characteristics
 How much memory is available?
 total memory
 memory per processor
 What kind of memory?
 distributed memory
 shared memory
 distributed shared memory
 What type of flow of control?
 SIMD
 MIMD
 SPMD
Summary of Parallel Computer Characteristics
 What is the interconnection network?
 Bus
 Crossbar
 Hypercube
 Tree
 Torus
 Multistage
 Fully Connected
 Mesh
 Ring
 Hybrid
Design decisions made by some of the major parallel computer vendors
Computer               Programming Style   OS       Processors          Memory        Flow of Control   Network
SGI Origin2000         OpenMP, MPI         IRIX     MIPS RISC R10000    DSM           MIMD              Crossbar, Hypercube
HP V-Class             OpenMP, MPI         HP-UX    HP PA 8200          DSM           MIMD              Crossbar, Ring
Cray T3E               SHMEM               Unicos   Compaq Alpha        Distributed   MIMD              Torus
IBM SP                 MPI                 AIX      IBM Power3          Distributed   MIMD              IBM Switch
Workstation Clusters   MPI                 Linux    Intel Pentium III   Distributed   MIMD              Myrinet, Tree
Summary
 This completes our introduction to parallel computing.
 You have learned about parallelism in computer programs, and
also about parallelism in the hardware components of parallel
computers.
 In addition, you have learned about the commonly used parallel
computers, and how these computers compare to each other.
 There are many good texts which provide an introductory
treatment of parallel computing. Here are two useful references:
Highly Parallel Computing, Second Edition
George S. Almasi and Allan Gottlieb
Benjamin/Cummings Publishers, 1994
Parallel Computing Theory and Practice
Michael J. Quinn
McGraw-Hill, Inc., 1994
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
2.1 Automatic Compiler Parallelism
2.2 Data Parallelism by Hand
2.3 Mixing Automatic and Hand Parallelism
2.4 Task Parallelism
2.5 Parallelism Issues
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
How to Parallelize a Code
 This chapter describes how to turn a single processor
program into a parallel one, focusing on shared memory
machines.
 Both automatic compiler parallelization and parallelization by
hand are covered.
 The details for accomplishing both data parallelism and task
parallelism are presented.
Automatic Compiler Parallelism
 Automatic compiler parallelism enables you to use a
single compiler option and let the compiler do the work.
 Its advantage is that it is easy to use.
 The disadvantages are:
 The compiler only does loop level parallelism, not task
parallelism.
 The compiler attempts to parallelize every do loop in your code.
If you have hundreds of do loops, this creates far too much
parallel overhead.
Automatic Compiler Parallelism
 To use automatic compiler parallelism on a Linux system
with the Intel compilers, specify the following.
ifort -parallel -O2 ... prog.f
 The compiler creates conditional code that will run with any
number of threads.
 Specify the number of threads with setenv, run the program, and
make sure you still get the right answers:
setenv OMP_NUM_THREADS 4
a.out > results
Data Parallelism by Hand
 First identify the loops that use most of the CPU time (the Profiling
lecture describes how to do this).
 By hand, insert into the code OpenMP directive(s) just before the
loop(s) you want to make parallel.
 Some code modifications may be needed to remove data dependencies
and other inhibitors of parallelism.
 Use your knowledge of the code and data to assist the compiler.
 For the SGI Origin2000 computer, insert into the code an OpenMP
directive just before the loop that you want to make parallel.
!$OMP PARALLEL DO
do i=1,n
… lots of computation ...
end do
!$OMP END PARALLEL DO
Data Parallelism by Hand
 Compile with the mp compiler option.
f90 -mp ... prog.f
 As before, the compiler generates conditional code that will run with any
number of threads.
 If you want to rerun your program with a different number of threads, you do
not need to recompile, just re-specify the setenv command.
setenv OMP_NUM_THREADS 8
a.out > results2
 The setenv command can be placed anywhere before the a.out command.
 The setenv command must be typed exactly as indicated. If you have a typo,
you will not receive a warning or error message. To make sure that the setenv
command is specified correctly, type:
setenv
 It produces a listing of your environment variable settings.
Mixing Automatic and Hand Parallelism
 You can have one source file parallelized automatically by the
compiler, and another source file parallelized by hand.
 Suppose you split your code into two files named prog1.f and
prog2.f.
f90 -c -apo … prog1.f (automatic parallelism for prog1.f)
f90 -c -mp … prog2.f (by hand parallelism for prog2.f)
f90 prog1.o prog2.o (creates one executable)
a.out > results (runs the executable)
Task Parallelism
 You can accomplish task parallelism as follows:
!$OMP PARALLEL
!$OMP SECTIONS
… lots of computation in part A …
!$OMP SECTION
… lots of computation in part B ...
!$OMP SECTION
… lots of computation in part C ...
!$OMP END SECTIONS
!$OMP END PARALLEL
 Compile with the mp compiler option.
f90 -mp … prog.f
 Use the setenv command to specify the number of threads.
setenv OMP_NUM_THREADS 3
a.out > results
Parallelism Issues
 There are some issues to consider when parallelizing a
program.
 Should data parallelism or task parallelism be used?
 Should automatic compiler parallelism or parallelism
by hand be used?
 Which loop in a nested loop situation should be the
one that becomes parallel?
 How many threads should be used?
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
3.1 Recompile
3.2 Word Length
3.3 Compiler Options for Debugging
3.4 Standards Violations
3.5 IEEE Arithmetic Differences
3.6 Math Library Differences
3.7 Compute Order Related Differences
3.8 Optimization Level Too High
3.9 Diagnostic Listings
3.10 Further Information
Porting Issues
 In order to run a computer program that presently runs on a
workstation, a mainframe, a vector computer, or another parallel
computer, on a new parallel computer you must first "port" the code.
 After porting the code, it is important to have some benchmark results
you can use for comparison.
 To do this, run the original program on a well-defined dataset, and save the
results from the old or “baseline” computer.
 Then run the ported code on the new computer and compare the results.
 If the results are different, don't automatically assume that the new results
are wrong – they may actually be better. There are several reasons why this
might be true, including:
 Precision Differences - the new results may actually be more accurate than the baseline
results.
 Code Flaws - porting your code to a new computer may have uncovered a hidden flaw in
the code that was already there.
 Detection methods for finding code flaws, solutions, and workarounds
are provided in this lecture.
Recompile
 Some codes just need to be recompiled to get accurate results.
 The compilers available on the NCSA computer platforms are
shown in the following table:
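Language | SGI Origin2000: MIPSpro | SGI Origin2000: Portland Group | IA-32 Linux: Intel | IA-32 Linux: GNU | IA-32 Linux: Portland Group | IA-64 Linux: Intel | IA-64 Linux: GNU
Fortran 77 | f77 | - | ifort | g77 | pgf77 | ifort | g77
Fortran 90 | f90 | - | ifort | - | pgf90 | ifort | -
Fortran 95 | f95 | - | ifort | - | - | ifort | -
High Performance Fortran | - | pghpf | - | - | pghpf | - | -
C | cc | - | icc | gcc | pgcc | icc | gcc
C++ | CC | - | icpc | g++ | pgCC | icpc | g++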
Word Length
 Code flaws can occur when you are porting your code to a
different word length computer.
 For C, the size of a long integer variable differs depending on the
machine and how the code is compiled (an int is 4 bytes on all of
these platforms). On the IA32 and IA64 Linux clusters, the size of a
long is 4 and 8 bytes, respectively. On the SGI Origin2000, the
corresponding value is 4 bytes if the code is compiled with the
-n32 flag, and 8 bytes if compiled without any flags or explicitly
with the -64 flag.
 For Fortran, the SGI MIPSpro and Intel compilers contain the
following flags to set default variable size.
 -in where n is a number: set the default INTEGER to INTEGER*n.
The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux
clusters.
 -rn where n is a number: set the default REAL to REAL*n. The value
of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.
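Before porting, it can help to print the sizes the current platform actually uses rather than rely on memory. The following is a minimal Python sketch using the standard ctypes module, which reports the sizes of the platform's C types; the dictionary name sizes is illustrative, and the output will differ between 32-bit and 64-bit systems.

```python
import ctypes

# Sizes, in bytes, of common C types as seen by this platform's C ABI.
sizes = {
    "int": ctypes.sizeof(ctypes.c_int),
    "long": ctypes.sizeof(ctypes.c_long),
    "pointer": ctypes.sizeof(ctypes.c_void_p),
}
for name, nbytes in sizes.items():
    print(f"{name:8s} {nbytes} bytes")
```

Running the same check on the baseline and the new computer shows at a glance whether a word length difference could explain changed results.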
Compiler Options for Debugging
 On the SGI Origin2000, the MIPSpro compilers include
debugging options via the –DEBUG:group. The syntax is as
follows:
-DEBUG:option1[=value1]:option2[=value2]...
 Two examples are:
 Array-bound checking: check for subscripts out of range at
runtime.
-DEBUG:subscript_check=ON
 Force all un-initialized stack, automatic and dynamically
allocated variables to be initialized.
-DEBUG:trap_uninitialized=ON
Compiler Options for Debugging
 On the IA32 Linux cluster, the Fortran compiler is
equipped with the following –C flags for runtime
diagnostics:
 -CA: pointers and allocatable references
 -CB: array and subscript bounds
 -CS: consistent shape of intrinsic procedure
 -CU: use of uninitialized variables
 -CV: correspondence between dummy and actual
arguments
Standards Violations
 Code flaws can occur when the program has non-ANSI
standard Fortran coding.
 ANSI standard Fortran is a set of rules for compiler writers that
specify, for example, the value of the do loop index upon exit
from the do loop.
 Standards Violations Detection
 To detect standards violations on the SGI Origin2000 computer
use the -ansi flag.
 This option generates a listing of warning messages for the use
of non-ANSI standard coding.
 On the Linux clusters, the -ansi[-] flag enables/disables
assumption of ANSI conformance.
IEEE Arithmetic Differences
 Code flaws can occur when the baseline computer conforms to the
IEEE arithmetic standard and the new computer does not.
 The IEEE Arithmetic Standard is a set of rules governing arithmetic
roundoff and overflow behavior.
 For example, it prohibits the compiler writer from replacing x/y
with x*recip(y), since the two results may differ slightly for some
operands. You can make your program strictly conform to the IEEE
standard.
 To make your program conform to the IEEE Arithmetic Standard
on the SGI Origin2000 computer use:
f90 -OPT:IEEE_arithmetic=n ... prog.f
where n is 1, 2, or 3.
 This option specifies the level of conformance to the IEEE
standard where 1 is the most stringent and 3 is the most liberal.
 On the Linux clusters, the Intel compilers can achieve
conformance to IEEE standard at a stringent level with the –mp
flag, or a slightly relaxed level with the –mp1 flag.
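The x/y versus x*recip(y) difference mentioned above is easy to observe in any IEEE double precision environment. The sketch below, in Python (whose floats are IEEE doubles on common platforms), searches small integers n for which n * (1/n) is not the same as n / n; the function name reciprocal_mismatches is illustrative.

```python
def reciprocal_mismatches(limit):
    """Return the integers n in 1..limit for which n * (1/n) differs from n / n."""
    found = []
    for n in range(1, limit + 1):
        x = float(n)
        exact = x / x              # a single correctly rounded division
        via_recip = x * (1.0 / x)  # reciprocal is rounded first, then multiplied
        if exact != via_recip:
            found.append(n)
    return found

print(reciprocal_mismatches(100))
```

The differences are in the last bit only, but they are enough to make a ported program's output disagree with the baseline when one compiler applies the substitution and the other does not.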
Math Library Differences
 Most high-performance parallel computers are equipped with
vendor-supplied math libraries.
 On the SGI Origin2000 platform, there are SGI/Cray Scientific
Library (SCSL) and Complib.sgimath.
 SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms
(BLAS), LAPACK and Fast Fourier Transform (FFT) routines.
 SCSL can be linked with -lscs for the serial version, or
-mp -lscs_mp for the parallel version.
 The complib library can be linked with –lcomplib.sgimath for the
serial version, or –mp –lcomplib.sgimath_mp for the parallel
version.
 The Intel Math Kernel Library (MKL) contains the complete set of
functions from BLAS, the extended BLAS (sparse), the complete
set of LAPACK routines, and Fast Fourier Transform (FFT)
routines.
Math Library Differences
 On the IA32 Linux cluster, the libraries to link to are:
 For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
 For LAPACK: -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide
-lpthread
 When calling MKL routines from C/C++ programs, you also
need to link with -lF90.
 On the IA64 Linux cluster, the corresponding libraries are:
 For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
 For LAPACK: -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp
-lpthread
 When calling MKL routines from C/C++ programs, you also
need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins
Compute Order Related Differences
 Code flaws can occur because of the non-deterministic computation of
data elements on a parallel computer. The compute order in which the
threads will run cannot be guaranteed.
 For example, in a data parallel program, the 50th index of a do loop may be
computed before the 10th index of the loop. Furthermore, the threads may
run in one order on the first run, and in another order on the next run of the
program.
 Note: If your algorithm depends on data being computed in a specific order,
your code is inappropriate for a parallel computer.
 Use the following method to detect compute order related differences:
 If your loop looks like
DO I = 1, N
change it to
DO I = N, 1, -1
 The results should not change if the iterations are independent.
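The loop reversal test above can be tried on a toy example. The Python sketch below runs the same loop body forward and backward and compares the results; the helper run_loop and the two sample loop bodies are hypothetical and stand in for a real do loop.

```python
def run_loop(n, reverse, body):
    """Apply body(a, i) for i = 1..n in forward or reverse order; return a."""
    a = [float(i) for i in range(n + 2)]   # same starting values on every run
    order = range(n, 0, -1) if reverse else range(1, n + 1)
    for i in order:
        body(a, i)
    return a

def independent(a, i):
    a[i] = 2.0 * a[i]          # touches only element i

def dependent(a, i):
    a[i] = a[i - 1] + a[i]     # running sum: reads what iteration i-1 wrote

for body in (independent, dependent):
    same = run_loop(10, False, body) == run_loop(10, True, body)
    print(body.__name__, "order-independent" if same else "order-dependent")
```

A loop that fails this test carries a dependency between iterations and cannot safely be made data parallel as written.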
Optimization Level Too High
 Code flaws can occur when the optimization level has been set too
high, thus trading accuracy for speed.
 The compiler reorders and optimizes your code based on
assumptions it makes about your program. This can sometimes cause
answers to change at higher optimization levels.
 Setting the Optimization Level
 Both the SGI Origin2000 computer and the IBM Linux clusters provide
Level 0 (no optimization) to Level 3 (most aggressive) optimization,
using the -O{0,1,2, or 3} flag. Bear in mind that Level 3
optimization may carry out loop transformations that affect the
correctness of calculations. Checking the correctness and precision
of the calculation is highly recommended when -O3 is used.
 For example, on the Origin2000
f90 -O0 … prog.f
turns off all optimizations.
Optimization Level Too High
 Isolating Optimization Level Problems
 You can sometimes isolate optimization level problems using
the method of binary chop.
 To do this, divide your program prog.f into halves. Name them prog1.f
and prog2.f.
 Compile the first half with -O0 and the second half with -O3
f90 -c -O0 prog1.f
f90 -c -O3 prog2.f
f90 prog1.o prog2.o
a.out > results
 If the results are correct, the optimization problem lies in prog1.f;
otherwise it lies in prog2.f.
 Next divide prog1.f into halves. Name them prog1a.f and prog1b.f
 Compile prog1a.f with -O0 and prog1b.f with -O3
f90 -c -O0 prog1a.f
f90 -c -O3 prog1b.f
f90 prog1a.o prog1b.o prog2.o
a.out > results
 Continue in this manner until you have isolated the section of code that is
producing incorrect results.
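The binary chop procedure above is a bisection search, which can be sketched in Python as follows. This assumes exactly one piece of code misbehaves at -O3. The callback results_ok is hypothetical: in practice it would compile the given files with -O0 and the rest with -O3, run the program, and compare against the baseline results.

```python
def binary_chop(units, results_ok):
    """Return the single unit that breaks when optimized.

    units      -- ordered list of source-file (or routine) names
    results_ok -- hypothetical callback: given the set of units built
                  with -O0 (all others at -O3), returns True if the run
                  reproduces the baseline results
    """
    suspects = list(units)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        # Build `half` unoptimized and everything else optimized.
        if results_ok(set(half)):
            suspects = half                   # flaw is in the -O0 half
        else:
            suspects = suspects[len(half):]   # flaw is in the other half
    return suspects[0]

files = ["prog1a.f", "prog1b.f", "prog2a.f", "prog2b.f"]
bad = "prog2a.f"                              # pretend this file misbehaves at -O3
print(binary_chop(files, lambda o0: bad in o0))
```

Each round halves the number of suspects, so even a program split into many files needs only a handful of compile-and-run cycles.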
Diagnostic Listings
 The SGI Origin 2000 compiler will generate all
kinds of diagnostic warnings and messages, but
not always by default. Some useful listing options
are:
f90 -listing ...
f90 -fullwarn ...
f90 -showdefaults ...
f90 -version ...
f90 -help ...
Further Information
 SGI
 man f77/f90/cc
 man debug_group
 man math
 man complib.sgimath
 MIPSpro 64-Bit Porting and Transition Guide
 Online Manuals
 Linux clusters pages
 ifort/icc/icpc –help (IA32, IA64, Intel64)
 Intel Fortran Compiler for Linux
 Intel C/C++ Compiler for Linux
Agenda
 1 Parallel Computing Overview
 2 How to Parallelize a Code
 3 Porting Issues
 4 Scalar Tuning
 4.1 Aggressive Compiler Options
 4.2 Compiler Optimizations
 4.3 Vendor Tuned Code
 4.4 Further Information
Scalar Tuning
 If you are not satisfied with the performance of your
program on the new computer, you can tune the scalar code
to decrease its runtime.
 This chapter describes many of these techniques:
 The use of the most aggressive compiler options
 The improvement of loop unrolling
 The use of subroutine inlining
 The use of vendor supplied tuned code
 The detection of cache problems and their solutions are
presented in the Cache Tuning chapter.
Aggressive Compiler Options
 For the SGI Origin2000 and the Linux clusters, the main
optimization switch is
-On where n ranges from 0 to 3.
-O0 turns off all optimizations.
-O1 and -O2 do beneficial optimizations that will not
affect the accuracy of results.
-O3 specifies the most aggressive optimizations. It takes
the most compile time, may produce changes in
accuracy, and turns on software pipelining.
Aggressive Compiler Options
 It should be noted that –O3 might carry out loop
transformations that produce incorrect results in some codes.
 It is recommended that one compare the answer obtained from
Level 3 optimization with one obtained from a lower-level
optimization.
 On the SGI Origin2000 and the Linux clusters, –O3 can be
used together with –OPT:IEEE_arithmetic=n (n=1,2, or 3)
and –mp (or –mp1), respectively, to enforce operation
conformance to IEEE standard at different levels.
 On the SGI Origin2000, the option
-Ofast=ip27
is also available. This option specifies the most aggressive
optimizations that are specifically tuned for the Origin2000
computer.
Agenda
 1 Parallel Computing Overview
 2 How to Parallelize a Code
 3 Porting Issues
 4 Scalar Tuning
 4.1 Aggressive Compiler Options
 4.2 Compiler Optimizations
 4.2.1 Statement Level
 4.2.2 Block Level
 4.2.3 Routine Level
 4.2.4 Software Pipelining
 4.2.5 Loop Unrolling
 4.2.6 Subroutine Inlining
 4.2.7 Optimization Report
 4.2.8 Profile-guided Optimization (PGO)
 4.3 Vendor Tuned Code
 4.4 Further Information
Compiler Optimizations
 The various compiler optimizations can be classified as
follows:
 Statement Level Optimizations
 Block Level Optimizations
 Routine Level Optimizations
 Software Pipelining
 Loop Unrolling
 Subroutine Inlining
 Each of these is described in the following sections.
Statement Level
 Constant Folding
 Replace simple arithmetic operations on constants with the
pre-computed result.
 y = 5+7 becomes y = 12
 Short Circuiting
 Avoid executing parts of conditional tests that are not necessary.
 if (I.eq.J .or. I.eq.K) expression
when I=J immediately compute the expression
 Register Assignment
 Put frequently used variables in registers.
Block Level
 Dead Code Elimination
 Remove unreachable code and code that is never executed or
used.
 Instruction Scheduling
 Reorder the instructions to improve memory pipelining.
Routine Level
 Strength Reduction
 Replace expressions in a loop with an expression that takes fewer
cycles.
 Common Subexpression Elimination
 Expressions that appear more than once are computed once, and the
result is substituted for each occurrence of the expression.
 Constant Propagation
 Compile time replacement of variables with constants.
 Loop Invariant Elimination
 Expressions inside a loop that don't change with the do loop index are
moved outside the loop.
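To make two of the routine level optimizations above concrete, the Python sketch below applies loop invariant elimination and strength reduction by hand and checks that the result is unchanged. A compiler does this on the generated machine code, not on the source; the function names and the sample expression are illustrative only.

```python
import math

def original(x, scale, n):
    out = []
    for i in range(n):
        # scale * math.pi is recomputed on every pass although it never changes
        out.append(x[i] * (scale * math.pi) + (i * 4))
    return out

def hand_optimized(x, scale, n):
    factor = scale * math.pi   # loop invariant, moved outside the loop
    offset = 0                 # strength reduction: i*4 becomes a running add
    out = []
    for i in range(n):
        out.append(x[i] * factor + offset)
        offset += 4
    return out

x = [0.5, 1.0, 1.5, 2.0, 2.5]
print(original(x, 1.5, 5) == hand_optimized(x, 1.5, 5))
```

Both versions perform the same floating point operations on the same operands, so the results agree exactly while the second version does less work per iteration.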
Software Pipelining
 Software pipelining allows the mixing of operations from
different loop iterations in each iteration of the hardware
loop. It is used to get the maximum work done per clock
cycle.
 Note: On the R10000s there is out-of-order execution of
instructions, and software pipelining may actually get in the
way of this feature.
Loop Unrolling
 The loop's stride (or step) value is increased, and the body of the loop is
replicated. It is used to improve the scheduling of the loop by giving a
longer sequence of straight line code. An example of loop unrolling
follows:
Original Loop
do I = 1, 99
c(I) = a(I) + b(I)
enddo
Unrolled Loop
do I = 1, 99, 3
c(I) = a(I) + b(I)
c(I+1) = a(I+1) + b(I+1)
c(I+2) = a(I+2) + b(I+2)
enddo
There is a limit to the amount of unrolling that can take place because there
are a limited number of registers.
 On the SGI Origin2000, loops are unrolled to a level of 8 by default.
You can unroll to a level of 12 by specifying:
f90 -O3 -OPT:unroll_times_max=12 ... prog.f
 On the IA32 Linux cluster, the corresponding flags are -unroll and -unroll0
for unrolling and no unrolling, respectively.
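The unrolling transformation described above can be checked with a small Python sketch that compares a rolled loop with a version unrolled by 3. The cleanup loop for iteration counts that are not a multiple of 3 is an addition not shown on the slide (where 99 divides evenly), and the function names are illustrative.

```python
def add_rolled(a, b):
    c = [0.0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

def add_unrolled_by_3(a, b):
    n = len(a)
    c = [0.0] * n
    main = n - n % 3                 # largest multiple of 3 not exceeding n
    for i in range(0, main, 3):      # stride 3, body replicated three times
        c[i] = a[i] + b[i]
        c[i + 1] = a[i + 1] + b[i + 1]
        c[i + 2] = a[i + 2] + b[i + 2]
    for i in range(main, n):         # cleanup loop for leftover iterations
        c[i] = a[i] + b[i]
    return c

a = [float(i) for i in range(99)]
b = [2.0 * v for v in a]
print(add_rolled(a, b) == add_unrolled_by_3(a, b))
```

The unrolled loop executes one third as many loop tests and branches, which is where the scheduling benefit comes from.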
Subroutine Inlining
 Subroutine inlining replaces a call to a subroutine with
the body of the subroutine itself.
 One reason for using subroutine inlining is that when a
subroutine is called inside a do loop that has a huge
iteration count, subroutine inlining may be more
efficient because it cuts down on subroutine call overhead.
 However, the chief reason for using it is that do loops
that contain subroutine calls may not parallelize.
Subroutine Inlining
 On the SGI Origin2000 computer, there are several options to
invoke inlining:
 Inline all routines except those specified to -INLINE:never:
f90 -O3 -INLINE:all … prog.f
 Inline no routines except those specified to -INLINE:must:
f90 -O3 -INLINE:none … prog.f
 Specify a list of routines to inline at every call:
f90 -O3 -INLINE:must=subrname … prog.f
 Specify a list of routines never to inline:
f90 -O3 -INLINE:never=subrname … prog.f
 On the Linux clusters, the following flags can invoke function inlining:
 -ip: inline function expansion for calls defined within the current source file
 -ipo: inline function expansion for calls defined in separate files
Optimization Report
 Intel 9.x and later compilers can generate reports that provide
useful information on optimization done on different parts of your
code.
 To generate such optimization reports in a file filename, add the flag
-opt-report-file filename.
 If you have a lot of source files to process simultaneously, and you use
a makefile to compile, you can also use make's "suffix" rules to have
optimization reports produced automatically, each with a unique
name. For example,
.f.o:
ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f
 creates optimization reports that are named identically to the original
Fortran source but with the suffix ".f" replaced by ".opt".
Optimization Report
 To help developers and performance analysts navigate through the
usually lengthy optimization reports, the NCSA program OptView is
designed to provide an easy-to-use and intuitive interface that allows the
user to browse through their own source code, cross-referenced with
the optimization reports.
 OptView is installed on NCSA's IA64 Linux cluster under the directory
/usr/apps/tools/bin. You can either add that directory to your UNIX
PATH or you can invoke optview using an absolute path name. You'll
need to be using the X-Window system and to have set your DISPLAY
environment variable correctly for OptView to work.
 OptView can provide a quick overview of which loops in a source code
or source codes among multiple files are highly optimized and which
might need further work. For a detailed description of how to use
OptView, see: http://perfsuite.ncsa.uiuc.edu/OptView/
Profile-guided Optimization (PGO)
 Profile-guided optimization allows Intel compilers to use
valuable runtime information to make better decisions about
function inlining and interprocedural optimizations to
generate faster codes. Its methodology is illustrated as
follows:
Profile-guided Optimization (PGO)
 First, you do an instrumented compilation by adding the -prof-gen
flag in the compile process:
icc -prof-gen -c a1.c a2.c a3.c
icc a1.o a2.o a3.o -lirc
 Then, you run the program with a representative set of data to
generate the dynamic information files given by the .dyn suffix.
 These files contain valuable runtime information for the compiler
to do better function inlining and other optimizations.
 Finally, the code is recompiled with the -prof-use flag to use
the runtime information.
icc -prof-use -ipo -c a1.c a2.c a3.c
 A profile-guided optimized executable is generated.
Vendor Tuned Code
 Vendor math libraries have codes that are optimized for their
specific machine.
 On the SGI Origin2000 platform, Complib.sgimath and SCSL
are available.
 On the Linux clusters, Intel MKL is available. Ways to link to
these libraries are described in Section 3 - Porting Issues.
Further Information
 SGI IRIX man and www pages
 man opt
 man lno
 man inline
 man ipa
 man perfex
 Performance Tuning for the Origin2000 at
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
 Linux clusters help and www pages
 ifort/icc/icpc –help (Intel)
 http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/
(Intel64)
 http://perfsuite.ncsa.uiuc.edu/OptView/
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
5.1 Sequential Code Limitation
5.2 Parallel Overhead
5.3 Load Balance
5.3.1 Loop Schedule Types
5.3.2 Chunk Size
Parallel Code Tuning
 This chapter describes several of the most common
techniques for parallel tuning, the type of programs that
benefit, and the details for implementing them.
 The majority of this chapter deals with improving load
balancing.
Sequential Code Limitation
 Sequential code is a part of the program that cannot be run with
multiple processors. Some reasons why it cannot be made data
parallel are:
 The code is not in a do loop.
 The do loop contains a read or write.
 The do loop contains a dependency.
 The do loop has an ambiguous subscript.
 The do loop has a call to a subroutine or a reference to a function
subprogram.
 Sequential Code Fraction
 As shown by Amdahl’s Law, if the sequential fraction is too large,
there is a limitation on speedup. If you think too much sequential
code is a problem, you can calculate the sequential fraction of code
using the Amdahl’s Law formula.
Sequential Code Limitation
 Measuring the Sequential Code Fraction
 Decide how many processors to use; this is p.
 Run and time the program with 1 processor to give T(1).
 Run and time the program with p processors to give T(p).
 Form the ratio of the 2 timings T(1)/T(p); this is the speedup SP.
 Substitute SP and p into the Amdahl’s Law formula:
 f = (1/SP - 1/p)/(1 - 1/p), where f is the fraction of sequential code.
 Evaluate the formula to obtain f.
 Decreasing the Sequential Code Fraction
 The compilation optimization reports list which loops could not be
parallelized and why. You can use this report as a guide to improve
performance on do loops by:
 Removing dependencies
 Removing I/O
 Removing calls to subroutines and function subprograms
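The measurement procedure for the sequential code fraction described above reduces to a short calculation. The Python sketch below implements the formula f = (1/SP - 1/p)/(1 - 1/p) and, as a consistency check, the Amdahl's Law speedup it is derived from. The timings in the example are made up for illustration, and the function names are not part of any library.

```python
def sequential_fraction(t1, tp, p):
    """Estimate the sequential fraction f from timings on 1 and p processors."""
    if p < 2:
        raise ValueError("p must be at least 2")
    sp = t1 / tp                              # measured speedup SP = T(1)/T(p)
    return (1.0 / sp - 1.0 / p) / (1.0 - 1.0 / p)

def amdahl_speedup(f, p):
    """Speedup predicted by Amdahl's Law for sequential fraction f."""
    return 1.0 / (f + (1.0 - f) / p)

f = sequential_fraction(t1=100.0, tp=40.0, p=4)   # made-up timings
print(f"f = {f:.3f}, predicted speedup on 16 processors = {amdahl_speedup(f, 16):.2f}")
```

With these made-up timings the speedup on 4 processors is 2.5, which corresponds to a sequential fraction of 0.2; feeding f back into amdahl_speedup shows how little additional processors would help.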
Parallel Overhead
 Parallel overhead is the processing time spent
 creating threads
 spinning/blocking threads
 starting and ending parallel regions
 synchronizing at the end of parallel regions
 When the computational work done by the parallel processes is too
small, the overhead time needed to create and control the parallel
processes can be disproportionately large, limiting the savings due to
parallelism.
 Measuring Parallel Overhead
 To get a rough under-estimate of parallel overhead:
 Run and time the code using 1 processor.
 Parallelize the code.
 Run and time the parallel code using only 1 processor.
 Subtract the 2 timings.
Parallel Overhead
 Reducing Parallel Overhead
 To reduce parallel overhead:
 Don't parallelize all the loops.
 Don't parallelize small loops.
 To benefit from parallelization, a loop needs about 1000 floating
point operations or 500 statements in the loop. You can use the IF
modifier in the OpenMP directive to control when loops are
parallelized.
!$OMP PARALLEL DO IF(n > 500)
do i=1,n
... body of loop ...
end do
!$OMP END PARALLEL DO
 Use task parallelism instead of data parallelism. It doesn't generate as
much parallel overhead and often more code runs in parallel.
 Don't use more threads than you need.
 Parallelize at the highest level possible.
Load Balance
 Load balance is the even assignment of subtasks to processors so as
to keep each processor busy doing useful work for as long as possible.
 Load balance is important for speedup because the end of a do loop is
a synchronization point where threads need to catch up with each
other.
 If processors have different work loads, some of the processors will
idle while others are still working.
 Measuring Load Balance
 On the SGI Origin, to measure load balance, use the perfex tool
which is a command line interface to the R10000 hardware counters.
The command
perfex -e16 -mp a.out > results
 reports per thread cycle counts. Compare the cycle counts to
determine load balance problems. The master thread (thread 0)
always uses more cycles than the slave threads. If the counts are vastly
different, it indicates load imbalance.
Load Balance
 For Linux systems, the thread CPU times can be compared
with ps. A thread with unusually high or low time compared
to the others may not be working efficiently (high CPU time
could be the result of a thread spinning while waiting for
other threads to catch up).
ps uH
 Improving Load Balance
 To improve load balance, try changing the way that loop
iterations are allocated to threads by
 changing the loop schedule type
 changing the chunk size
 These methods are discussed in the following sections.
Loop Schedule Types
 On the SGI Origin2000 computer, 4 different loop schedule
types can be specified by an OpenMP directive. They are:
 Static
 Dynamic
 Guided
 Runtime
 If you don't specify a schedule type, the default will be used.
 Default Schedule Type
 The default schedule type allocates 20 iterations on 4 threads as:
Loop Schedule Types
 Static Schedule Type
 The static schedule type is used when some of the iterations do more
work than others. With the static schedule type, iterations are
allocated in a round-robin fashion to the threads.
 An Example
 Suppose you are computing on the
upper triangle of a 100 x 100
matrix, and you use 2 threads,
named t0 and t1. With default
scheduling, workloads are uneven.
Loop Schedule Types
 Whereas with static scheduling, the columns of the matrix
are given to the threads in a round robin fashion, resulting in
better load balance.
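The upper triangle example above can be worked out numerically. The Python sketch below counts the matrix elements each of the 2 threads processes under a contiguous block assignment (assumed here to be what the default schedule does, since the slide's figure is not reproduced) and under the round robin assignment described for the static type. Function names are illustrative.

```python
def column_work(n):
    """Work per column of an n x n upper triangle: column j holds j elements."""
    return [j for j in range(1, n + 1)]

def block_assignment(work, nthreads):
    """Contiguous blocks of columns per thread (assumed default behaviour)."""
    size = -(-len(work) // nthreads)                  # ceiling division
    return [sum(work[t * size:(t + 1) * size]) for t in range(nthreads)]

def round_robin_assignment(work, nthreads):
    """Columns dealt out one at a time, as described for the static type."""
    return [sum(work[t::nthreads]) for t in range(nthreads)]

work = column_work(100)
print("blocks     :", block_assignment(work, 2))
print("round robin:", round_robin_assignment(work, 2))
```

With blocks, t1 gets roughly three times the work of t0 (3775 versus 1275 elements); with round robin the split is 2500 versus 2550, so neither thread waits long for the other.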
Loop Schedule Types
 Dynamic Schedule Type
 The iterations are dynamically allocated to threads at runtime. Each
thread is given a chunk of iterations. When a thread finishes its work,
it goes into a critical section where it’s given another chunk of
iterations to work on.
 This type is useful when you don’t know the iteration count or work
pattern ahead of time. Dynamic gives good load balance, but at a high
overhead cost.
 Guided Schedule Type
 The guided schedule type is dynamic scheduling that starts with large
chunks of iterations and ends with small chunks of iterations. That is,
the number of iterations given to each thread depends on the number
of iterations remaining. The guided schedule type reduces the number
of entries into the critical section, compared to the dynamic schedule
type. Guided gives good load balancing at a low overhead cost.
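The shrinking chunks of the guided schedule type can be illustrated with a small simulation. The Python sketch below assumes each request hands out the remaining iterations divided by the number of threads, rounded up; real OpenMP runtimes only promise a chunk size proportional to that quantity, so the exact sizes are an assumption. The function name guided_chunks is illustrative.

```python
def guided_chunks(n_iterations, n_threads, min_chunk=1):
    """Successive chunk sizes when each request takes remaining/n_threads."""
    chunks = []
    remaining = n_iterations
    while remaining > 0:
        size = max(min_chunk, -(-remaining // n_threads))   # ceiling division
        size = min(size, remaining)                         # never overshoot
        chunks.append(size)
        remaining -= size
    return chunks

print(guided_chunks(100, 4))
```

For 100 iterations on 4 threads this takes far fewer trips through the critical section than handing out single iterations dynamically, which is the source of the lower overhead.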
Chunk Size
 The word chunk refers to a grouping of iterations. Chunk size means
how many iterations are in the grouping. The static and dynamic
schedule types can be used with a chunk size. If a chunk size is not
specified, then the chunk size is 1.
 Suppose you specify a chunk size of 2 with the static schedule type.
Then 20 iterations are allocated on 4 threads:
 The schedule type and chunk size are specified as follows:
!$OMP PARALLEL DO SCHEDULE(type, chunk)
…
!$OMP END PARALLEL DO
 Where type is STATIC, or DYNAMIC, or GUIDED and chunk is any
positive integer.
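To see what a chunk size does, the following Python sketch lists which iterations each thread receives under SCHEDULE(STATIC, chunk), dealing chunks to the threads in round robin order. It models the allocation only and is not an OpenMP program; the function name static_schedule and the 1-based iteration numbering are choices made for this illustration.

```python
def static_schedule(n_iterations, n_threads, chunk):
    """Map each thread to the 1-based iterations it gets under SCHEDULE(STATIC, chunk)."""
    owner = {t: [] for t in range(n_threads)}
    for start in range(0, n_iterations, chunk):
        thread = (start // chunk) % n_threads        # chunks dealt round-robin
        last = min(start + chunk, n_iterations)
        owner[thread].extend(range(start + 1, last + 1))
    return owner

for t, its in static_schedule(20, 4, 2).items():
    print(f"thread {t}: {its}")
```

With 20 iterations, 4 threads, and a chunk size of 2, thread 0 receives iterations 1, 2, 9, 10, 17, and 18. Calling static_schedule(20, 4, 5) instead gives each thread one contiguous block of 5 iterations.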
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
6.1 Timing
6.1.1 Timing a Section of Code
6.1.1.1 CPU Time
6.1.1.2 Wall clock Time
6.1.2 Timing an Executable
6.1.3 Timing a Batch Job
6.2 Profiling
6.2.1 Profiling Tools
6.2.2 Profile Listings
6.2.3 Profiling Analysis
6.3 Further Information
Timing and Profiling
 Now that your program has been ported to the new
computer, you will want to know how fast it runs.
 This chapter describes how to measure the speed of a
program using various timing routines.
 The chapter also covers how to determine which parts of the
program account for the bulk of the computational load so
that you can concentrate your tuning efforts on those
computationally intensive parts of the program.
Timing
 In the following sections, we’ll discuss timers and review the
profiling tools ssrun and prof on the Origin and vprof and gprof
on the Linux Clusters. The specific timing functions described are:
 Timing a section of code
FORTRAN
 etime, dtime, cpu_time for CPU time
 time and f_time for wallclock time
C
 clock for CPU time
 gettimeofday for wallclock time
 Timing an executable
 time a.out
 Timing a batch run
 busage
 qstat
 qhist
CPU Time
 etime
 A section of code can be timed using etime.
 It returns the elapsed CPU time in seconds since the program
started.
real*4 tarray(2),time1,time2,timeres
… beginning of program
time1=etime(tarray)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=etime(tarray)
timeres=time2-time1
CPU Time
 dtime
 A section of code can also be timed using dtime.
 It returns the elapsed CPU time in seconds since the last call to
dtime.
real*4 tarray(2),timeres
… beginning of program
timeres=dtime(tarray)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
timeres=dtime(tarray)
… rest of program
CPU Time
The etime and dtime Functions
 User time.
 This is returned as the first element of tarray.
 It’s the CPU time spent executing user code.
 System time.
 This is returned as the second element of tarray.
 It’s the time spent executing system calls on behalf of your program.
 Sum of user and system time.
 This is the function value that is returned.
 It’s the time that is usually reported.
 Metric.
 Timings are reported in seconds.
 Timings are accurate to 1/100th of a second.
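For C programmers, a comparable user/system breakdown can be sketched with the POSIX getrusage call. This is an analogue of etime, not the Fortran routine itself. The function name cpu_seconds and the two-element array layout (chosen to mirror tarray) are assumptions made for this example.

```c
#include <sys/time.h>
#include <sys/resource.h>

/* Fills tarray[0] with user CPU seconds and tarray[1] with system CPU
   seconds used so far by the calling process.
   Returns their sum, or -1.0 if getrusage fails. */
double cpu_seconds(double tarray[2])
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    tarray[0] = (double)ru.ru_utime.tv_sec + 1.0e-6 * (double)ru.ru_utime.tv_usec;
    tarray[1] = (double)ru.ru_stime.tv_sec + 1.0e-6 * (double)ru.ru_stime.tv_usec;
    return tarray[0] + tarray[1];
}
```

Calling cpu_seconds before and after a section of code and subtracting the two return values gives the CPU time of that section, in the same way as the etime example.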
CPU Time
Timing Comparison Warnings
 For the SGI computers:
 The etime and dtime functions return the MAX time over all threads
for a parallel program.
 This is the time of the longest thread, which is usually the master
thread.
 For the Linux Clusters:
 The etime and dtime functions are contained in the VAX compatibility
library of the Intel FORTRAN Compiler.
 To use this library include the compiler flag -Vaxlib.
 Another warning: Do not put calls to etime and dtime inside a do
loop. The overhead is too large.
CPU Time
cpu_time
 The cpu_time routine is a standard Fortran 95 intrinsic subroutine; on the systems
covered here it is provided by the Intel FORTRAN compiler on the Linux clusters.
 It provides substantially higher resolution and has substantially
lower overhead than the older etime and dtime routines.
 It can be used as an elapsed CPU timer.
real*8 time1, time2, timeres
… beginning of program
call cpu_time (time1)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
call cpu_time(time2)
timeres=time2-time1
… rest of program
CPU Time
clock
 For C programmers, one can call the cpu_time routine using a
FORTRAN wrapper or call the standard library function clock, which can be
used to determine elapsed CPU time.
#include <time.h>
static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
double time1, time2, timeres;
…
time1=(clock()*iCPS);
…
/* do some work */
…
time2=(clock()*iCPS);
timeres=time2-time1;
Wall clock Time
time
 For the Origin, the function time returns the time since
00:00:00 GMT, Jan. 1, 1970.
 It is a means of getting the elapsed wall clock time.
 The wall clock time is reported in integer seconds.
external time
integer*4 time1,time2,timeres
… beginning of program
time1=time( )
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=time( )
timeres=time2 - time1
Wall clock Time
f_time
 For the Linux clusters, the appropriate FORTRAN function for elapsed
time is f_time.
integer*8 f_time
external f_time
integer*8 time1,time2,timeres
… beginning of program
time1=f_time()
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=f_time()
timeres=time2 - time1
 As above for etime and dtime, the f_time function is in the VAX
compatibility library of the Intel FORTRAN Compiler. To use this
library include the compiler flag -Vaxlib.
Wall clock Time
gettimeofday
 For C programmers, wallclock time can be obtained by using the very
portable routine gettimeofday.
#include <stddef.h> /* definition of NULL */
#include <sys/time.h> /* definition of timeval struct and
prototype of gettimeofday */
double t1,t2,elapsed;
struct timeval tp;
int rtn;
....
....
rtn=gettimeofday(&tp, NULL);
t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
....
/* do some work */
....
rtn=gettimeofday(&tp, NULL);
t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
elapsed=t2-t1;
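The same calculation can be wrapped in a small helper. The sketch below is one possible variant: the name elapsed_seconds is hypothetical, and it subtracts the seconds and microseconds fields separately before converting to double, which keeps the microsecond digits from being swamped by the large epoch value.

```c
#include <sys/time.h>

/* Seconds between two gettimeofday samples, with `end` taken after `start`. */
double elapsed_seconds(struct timeval start, struct timeval end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + 1.0e-6 * (double)(end.tv_usec - start.tv_usec);
}
```

Take one sample with gettimeofday before the work and one after it, then pass both to elapsed_seconds.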
Timing an Executable
 To time an executable (if using a csh or tcsh shell, explicitly
call /usr/bin/time)
time …options… a.out
 where options can be ‘-p’ for a simple output or ‘-f format’
which allows the user to display more than just time related
information.
 Consult the man pages on the time command for format
options.
Timing a Batch Job
 Time of a batch job running or completed.
 Origin
busage jobid
 Linux clusters
qstat jobid # for a running job
qhist jobid # for a completed job
Profiling
 Profiling determines where a program spends its time.
 It detects the computationally intensive parts of the code.
 Use profiling when you want to focus attention and
optimization efforts on those loops that are responsible for
the bulk of the computational load.
 Most codes follow the 90-10 Rule.
 That is, 90% of the computation is done in 10% of the code.
Profiling Tools
Profiling Tools on the Origin
 On the SGI Origin2000 computer there are profiling tools named
ssrun and prof.
 Used together they do profiling, or what is called hot spot analysis.
 They are useful for generating timing profiles.
 ssrun
 The ssrun utility collects performance data for an executable that you
specify.
 The performance data is written to a file named
"executablename.exptype.id".
 prof
 The prof utility analyzes the data file created by ssrun and produces a
report.
 Example
ssrun -fpcsamp a.out
prof -h a.out.fpcsamp.m12345 > prof.list
Profiling Tools
Profiling Tools on the Linux Clusters
 On the Linux clusters the profiling tools are still maturing. There are
currently several efforts to produce tools comparable to the ssrun,
prof and perfex tools.
 gprof
 Basic profiling information can be generated using the OS utility gprof.
 First, compile the code with the compiler flags -qp -g for the Intel
compiler (-g on the Intel compiler does not change the optimization
level) or -pg for the GNU compiler.
 Second, run the program.
 Finally analyze the resulting gmon.out file using the gprof utility: gprof
executable gmon.out.
efc -O -qp -g -o foo foo.f
./foo
gprof foo gmon.out
Profiling Tools
Profiling Tools on the Linux Clusters
 vprof
 On the IA32 platform there is a utility called vprof that provides
performance information using the PAPI instrumentation
library.
 To instrument the whole application requires recompiling and
linking to vprof and PAPI libraries.
setenv VMON PAPI_TOT_CYC
ifc -g -O -o md md.f \
/usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
./md
/usr/apps/tools/vprof/bin/cprof -e md vmon.out
Profile Listings
Profile Listings on the Origin
 Prof Output First Listing
  Cycles      %   Cum%   Secs  Proc
--------  -----  -----  -----  ------
42630984  58.47  58.47   0.57  VSUB
 6498294   8.91  67.38   0.09  PFSOR
 6141611   8.42  75.81   0.08  PBSOR
 3654120   5.01  80.82   0.05  PFSOR1
 2615860   3.59  84.41   0.03  VADD
 1580424   2.17  86.57   0.02  ITSRCG
 1144036   1.57  88.14   0.02  ITSRSI
  886044   1.22  89.36   0.01  ITJSI
  861136   1.18  90.54   0.01  ITJCG
 The first listing gives the number of cycles executed in each
procedure (or subroutine). The procedures are listed in
descending order of cycle count.
Profile Listings
Profile Listings on the Origin
 Prof Output Second Listing
  Cycles      %   Cum%  Line  Proc
--------  -----  -----  ----  ------
36556944  50.14  50.14  8106  VSUB
 5313198   7.29  57.43  6974  PFSOR
 4968804   6.81  64.24  6671  PBSOR
 2989882   4.10  68.34  8107  VSUB
 2564544   3.52  71.86  7097  PFSOR1
 1988420   2.73  74.59  8103  VSUB
 1629776   2.24  76.82  8045  VADD
  994210   1.36  78.19  8108  VSUB
  969056   1.33  79.52  8049  VADD
  483018   0.66  80.18  6972  PFSOR
 The second listing gives the number of cycles per source
code line.
 The lines are listed in descending order of cycle count.
Profile Listings
Profile Listings on the Linux Clusters
 gprof Output First Listing
Flat profile:
Each sample counts as 0.000976562 seconds.
  %   cumulative     self                 self      total
 time    seconds  seconds     calls    us/call    us/call  name
-----  ---------  -------  --------  ---------  ---------  -----------
38.07       5.67     5.67       101   56157.18  107450.88  compute_
34.72      10.84     5.17  25199500       0.21       0.21  dist_
25.48      14.64     3.80                                  SIND_SINCOS
 1.25      14.83     0.19                                  sin
 0.37      14.88     0.06                                  cos
 0.05      14.89     0.01     50500       0.15       0.15  dotr8_
 0.05      14.90     0.01       100      68.36      68.36  update_
 0.01      14.90     0.00                                  f_fioinit
 0.01      14.90     0.00                                  f_intorange
 0.01      14.90     0.00                                  mov
 0.00      14.90     0.00         1       0.00       0.00  initialize_
 The listing gives a 'flat' profile of functions and routines
encountered, sorted by 'self seconds' which is the number of
seconds accounted for by this function alone.
Profile Listings
Profile Listings on the Linux Clusters
 gprof Output Second Listing
Call graph:
index  % time   self  children  called             name
-----  ------   ----  --------  -----------------  ----------------
[1]      72.9   0.00     10.86                     main [1]
                5.67      5.18  101/101                compute_ [2]
                0.01      0.00  100/100                update_ [8]
                0.00      0.00  1/1                    initialize_ [12]
---------------------------------------------------------------------
                5.67      5.18  101/101                main [1]
[2]      72.8   5.67      5.18  101                compute_ [2]
                5.17      0.00  25199500/25199500      dist_ [3]
                0.01      0.00  50500/50500            dotr8_ [7]
---------------------------------------------------------------------
                5.17      0.00  25199500/25199500      compute_ [2]
[3]      34.7   5.17      0.00  25199500           dist_ [3]
---------------------------------------------------------------------
                                                       <spontaneous>
[4]      25.5   3.80      0.00                     SIND_SINCOS [4]
…
…
 The second listing gives a 'call-graph' profile of functions and routines encountered. The
definitions of the columns are specific to the line in question. Detailed information is
contained in the full output from gprof.
Profile Listings
Profile Listings on the Linux Clusters
 vprof Listing
Columns correspond to the following events:
PAPI_TOT_CYC - Total cycles (1956 events)
File Summary:
100.0% /u/ncsa/gbauer/temp/md.f
Function Summary:
84.4% compute
15.6% dist
Line Summary:
67.3% /u/ncsa/gbauer/temp/md.f:106
13.6% /u/ncsa/gbauer/temp/md.f:104
9.3% /u/ncsa/gbauer/temp/md.f:166
2.5% /u/ncsa/gbauer/temp/md.f:165
1.5% /u/ncsa/gbauer/temp/md.f:102
1.2% /u/ncsa/gbauer/temp/md.f:164
0.9% /u/ncsa/gbauer/temp/md.f:107
0.8% /u/ncsa/gbauer/temp/md.f:169
0.8% /u/ncsa/gbauer/temp/md.f:162
0.8% /u/ncsa/gbauer/temp/md.f:105
 The above listing (produced using the -e option to cprof) displays not only the cycles consumed by
functions (a flat profile) but also the source lines within those functions that account for them.
Profile Listings
Profile Listings on the Linux Clusters
 vprof Listing (cont.)
0.7% /u/ncsa/gbauer/temp/md.f:149
0.5% /u/ncsa/gbauer/temp/md.f:163
0.2% /u/ncsa/gbauer/temp/md.f:109
0.1% /u/ncsa/gbauer/temp/md.f:100
…
…
100   0.1%  do j=1,np
101           if (i .ne. j) then
102   1.5%      call dist(nd,box,pos(1,i),pos(1,j),rij,d)
103             ! attribute half of the potential energy to particle 'j'
104  13.6%      pot = pot + 0.5*v(d)
105   0.8%      do k=1,nd
106  67.3%        f(k,i) = f(k,i) - rij(k)*dv(d)/d
107   0.9%      enddo
108           endif
109   0.2%  enddo
Profiling Analysis
 The program being analyzed in the previous Origin example has
approximately 10000 source code lines, and consists of many
subroutines.
 The first profile listing shows that over 50% of the computation is done
inside the VSUB subroutine.
 The second profile listing shows that line 8106 in subroutine VSUB
accounted for 50% of the total computation.
 Going back to the source code, line 8106 is a line inside a do loop.
 Putting an OpenMP compiler directive in front of that do loop you can get
50% of the program to run in parallel with almost no work on your part.
 Since the compiler may have rearranged the source lines, the line numbers
given by ssrun/prof only indicate an area of the code to inspect.
 To view the rearranged source use the option
f90 … -FLIST:=ON
cc … -CLIST:=ON
 For the Intel compilers, the appropriate options are
ifort … -E …
icc … -E …
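To put the "50% of the program in parallel" figure in perspective, Amdahl's law gives the overall speedup that can be expected. The sketch below assumes the parallel part scales perfectly over p processors and ignores parallel overhead; the function name amdahl_speedup is hypothetical.

```c
/* Amdahl's law: overall speedup when a fraction f of the run time is
   spread perfectly over p processors and the rest stays serial. */
double amdahl_speedup(double f, int p)
{
    return 1.0 / ((1.0 - f) + f / (double)p);
}
```

With f = 0.5 and p = 4 the speedup is 1.6, and with f = 0.5 it can never reach 2 however many processors are used, which is why further loops from the profile listing are usually worth parallelizing as well.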
Further Information
 SGI Irix
 man etime
 man 3 time
 man 1 time
 man busage
 man timers
 man ssrun
 man prof
 Origin2000 Performance Tuning and Optimization Guide
 Linux Clusters
 man 3 clock
 man 2 gettimeofday
 man 1 time
 man 1 gprof
 man 1B qstat
 Intel Compilers Vprof on NCSA Linux Cluster
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.1.1 Memory Hierarchy
7.1.2 Cache Mapping
7.1.3 Cache Thrashing
7.1.4 Cache Coherence
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Cache Concepts
 The CPU time required to perform an operation is the sum of the
clock cycles executing instructions and the clock cycles waiting
for memory.
 The CPU cannot be performing useful work if it is waiting for
data to arrive from memory.
 Clearly then, the memory system is a major factor in determining
the performance of your program and a large part is your use of
the cache.
 The following sections will discuss the key concepts of cache
including:
 Memory subsystem hierarchy
 Cache mapping
 Cache thrashing
 Cache coherence
Memory Hierarchy
 The different subsystems in the memory hierarchy have different
speeds, sizes, and costs.
 Smaller memory is faster
 Slower memory is cheaper
 The hierarchy is set up so that the fastest memory is closest to the
CPU, and the slower memories are further away from the CPU.
Memory Hierarchy
 It's a hierarchy because every level is a subset of a level further away.
 All data in one level is found in the level below.
 The purpose of cache is to improve the memory access time to the
processor.
 There is an overhead associated with it, but the benefits outweigh the cost.
 Registers
 Registers are the sources and destinations of CPU data operations.
 They hold one data element each and are 32 bits or 64 bits wide.
 They are on-chip and built from SRAM.
 Computers usually have 32 or 64 registers.
 The Origin MIPS R10000 has 64 physical 64-bit registers of which 32
are available for floating-point operations.
 The Intel IA64 has 328 registers for general-purpose (64 bit),
floating-point (80 bit), predicate (1 bit), branch and other functions.
 Register access speeds are comparable to processor speeds.
Memory Hierarchy
 Main Memory Improvements
 A hardware improvement called interleaving reduces main memory access time.
 In interleaving, memory is divided into partitions or segments called
memory banks.
 Consecutive data elements are spread across the banks.
 Each bank supplies one data element per bank cycle.
 Multiple data elements are read in parallel, one from each bank.
 The drawback of interleaving is that it assumes that memory is
accessed sequentially.
 If there is 2-way memory interleaving, but the code accesses every other
location, there is no benefit.
 The bank cycle time is 4-8 times the CPU clock cycle time, so the main
memory can’t keep up with the fast CPU and keep it busy with data.
 Large main memory with a cycle time comparable to the processor is not
affordable.
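The effect of the access pattern on interleaving can be sketched as follows. The model assumes that element k lives in bank k mod nbanks, which is the usual way consecutive elements are spread across banks. The function names bank_of and banks_used are hypothetical.

```c
/* Bank holding element `index` when consecutive elements are spread
   round-robin over `nbanks` memory banks. */
int bank_of(long index, int nbanks)
{
    return (int)(index % nbanks);
}

/* Number of distinct banks touched by `count` accesses that start at
   element 0 and are `stride` elements apart. */
int banks_used(long stride, int count, int nbanks)
{
    int used = 0;

    for (int b = 0; b < nbanks; b++) {
        for (int k = 0; k < count; k++) {
            if (bank_of((long)k * stride, nbanks) == b) {
                used++;        /* bank b is hit at least once */
                break;
            }
        }
    }
    return used;
}
```

With 2 banks, a stride of 1 keeps both banks busy, while a stride of 2 sends every access to the same bank, so the interleaving gives no benefit.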
Memory Hierarchy
 Principle of Locality
 The way your program operates follows the Principle of Locality.
 Temporal Locality: When an item is referenced, it will be referenced again soon.
 Spatial Locality: When an item is referenced, items whose addresses are nearby
will tend to be referenced soon.
 Cache Line
 The overhead of the cache can be reduced by fetching a chunk or block of data
elements.
 When a main memory access is made, a cache line of data is brought into the
cache instead of a single data element.
 A cache line is defined in terms of a number of bytes.
 For example, a cache line is typically 32 or 128 bytes.
 This takes advantage of spatial locality.
 The additional elements in the cache line will most likely be needed soon.
 The cache miss rate falls as the size of the cache line increases, but there is a
point of negative returns on the cache line size.
 When the cache line size becomes too large, the transfer time increases.
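How much a cache line helps depends on the stride of the accesses. The sketch below counts how many distinct cache lines a strided sweep touches; it assumes the data starts on a cache line boundary and that the stride is at least 1. The function name lines_touched is hypothetical.

```c
/* Counts the distinct cache lines touched when reading `n` elements of
   `elem_size` bytes, `stride` elements apart, starting at byte offset 0.
   Offsets only increase, so a new line is counted whenever the line
   number changes. */
long lines_touched(long n, long elem_size, long stride, long line_size)
{
    long count = 0;
    long last_line = -1;

    for (long k = 0; k < n; k++) {
        long line = (k * stride * elem_size) / line_size;
        if (line != last_line) {
            count++;
            last_line = line;
        }
    }
    return count;
}
```

Sixteen 8-byte elements read with stride 1 fit in a single 128-byte line, whereas the same sixteen reads with stride 16 touch sixteen different lines and gain nothing from spatial locality.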
Memory Hierarchy
 Cache Hit
 A cache hit occurs when the data element requested by the
processor is in the cache.
 You want to maximize hits.
 The Cache Hit Rate is defined as the fraction of cache hits.
 It is the fraction of the requested data that is found in the cache.
 Cache Miss
 A cache miss occurs when the data element requested by the
processor is NOT in the cache.
 You want to minimize cache misses. Cache Miss Rate is defined as
1.0 - Hit Rate
 Cache Miss Penalty, or miss time, is the time needed to retrieve the data
from a lower level (downstream) of the memory hierarchy. (Recall
that the lower levels of the hierarchy have a slower access time.)
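The hit rate, miss rate and miss penalty are often combined into an average access time. The relation used below is the standard textbook one and does not come from these slides; the function name avg_access_time and the sample numbers are illustrative assumptions.

```c
/* Average access time = hit time + miss rate * miss penalty.
   All times must be in the same unit, for example clock cycles. */
double avg_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

For example, a 1-cycle hit time, a 5% miss rate and a 10-cycle miss penalty give an average of 1.5 cycles per access, so even a small miss rate adds noticeably to the cost of each access.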
Memory Hierarchy
 Levels of Cache
 It used to be that there were two levels of cache: on-chip and off-chip.
 L1/L2 is still true for the Origin MIPS and the Intel IA-32 processors.
 Caches closer to the CPU are called Upstream. Caches further from the CPU are called
Downstream.
 The on-chip cache is called First level, L1, or primary cache.
 An on-chip cache performs the fastest but the computer designer makes a trade-off between
die size and cache size. Hence, on-chip cache has a small size. When the on-chip cache has a
cache miss the time to access the slower main memory is very large.
 The off-chip cache is called Second Level, L2, or secondary cache.
 A cache miss is very costly. To solve this problem, computer designers have implemented a
larger, slower off-chip cache. This chip speeds up the on-chip cache miss time. L1 cache
misses are handled quickly. L2 cache misses have a larger performance penalty.
 The cache external to the chip is called Third Level, L3.
 The newer Intel IA-64 processor has 3 levels of cache
Memory Hierarchy
 Split or Unified Cache
 In unified cache, typically L2, the cache is a combined instruction-data
cache.
 A disadvantage of a unified cache is that when the data access and instruction access
conflict with each other, the cache may be thrashed, i.e. suffer a high cache miss rate.
 In split cache, typically L1, the cache is split into 2 parts:
 one for the instructions, called the instruction cache
 another for the data, called the data cache.
 The 2 caches are independent of each other, and they can have independent
properties.
 Memory Hierarchy Sizes
 Memory hierarchy sizes are specified in the following units:
 Cache Line: bytes
 L1 Cache: Kbytes
 L2 Cache: Mbytes
 Main Memory: Gbytes
Cache Mapping
 Cache mapping determines which cache location should be used
to store a copy of a data element from main memory. There are 3
mapping strategies:
 Direct mapped cache
 Set associative cache
 Fully associative cache
 Direct Mapped Cache
 In direct mapped cache, a line of main memory is mapped to only a
single line of cache.
 Consequently, a particular cache line can be filled from (size of main
memory / size of cache) different lines from main memory.
 Direct mapped cache is inexpensive but also inefficient and very
susceptible to cache thrashing.
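A common way to realize direct mapping is to take the memory line number modulo the number of cache lines. The sketch below assumes that scheme; the function name dm_slot is hypothetical.

```c
/* Cache line slot used by byte address `addr` in a direct mapped cache
   with `num_lines` lines of `line_size` bytes each. */
unsigned long dm_slot(unsigned long addr, unsigned long line_size,
                      unsigned long num_lines)
{
    return (addr / line_size) % num_lines;
}
```

A 4 MB cache with 128-byte lines has 32768 lines, so two addresses exactly 4 MB apart land in the same slot and evict each other, which is the root of the thrashing problem.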
Cache Mapping
 Direct Mapped Cache
http://larc.ee.nthu.edu.tw/~cthuang/courses/ee3450/lectures/07_memory.html
Cache Mapping
 Fully Associative Cache
 For fully associative cache, any line of cache can be loaded with any line from
main memory.
 This technology is very fast but also very expensive.
http://www.xbitlabs.com/images/video/radeon-x1000/caches.png
Cache Mapping
 Set Associative Cache
 For N-way set associative cache, you can think of cache as being divided into N sets
(usually N is 2 or 4).
 A line from main memory can then be written to its cache line in any of the N sets.
 This is a trade-off between direct mapped and fully associative cache.
http://www.alasir.com/articles/cache_principles/cache_way.png
Cache Mapping
 Cache Block Replacement
 With direct mapped cache, a cache line can only be mapped to one unique
place in the cache. The new cache line replaces the cache block at that
address. With set associative cache there is a choice of 3 strategies:
1. Random
 There is a uniform random replacement within the set of cache blocks. The
advantage of random replacement is that it’s simple and inexpensive to implement.
2. LRU (Least Recently Used)
 The block that gets replaced is the one that hasn’t been used for the longest time.
The principle of temporal locality tells us that recently used data blocks are likely
to be used again soon. An advantage of LRU is that it preserves temporal locality. A
disadvantage of LRU is that it’s expensive to keep track of cache access patterns. In
empirical studies, there was little performance difference between LRU and
Random.
3. FIFO (First In First Out)
 Replace the block that was brought in N accesses ago, regardless of the usage
pattern. In empirical studies, Random replacement generally outperformed FIFO.
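To make the LRU policy concrete, the sketch below simulates a single set of a set associative cache and counts the misses for a sequence of block numbers. It is a software model for illustration only, not how the hardware is built; the names lru_misses and MAX_WAYS are hypothetical.

```c
#define MAX_WAYS 8   /* largest associativity this sketch supports */

/* Counts misses for a sequence of block numbers that all map to one set
   of a `ways`-way set associative cache with LRU replacement.
   set[0] is the most recently used block, set[used-1] the least. */
int lru_misses(const int *blocks, int n, int ways)
{
    int set[MAX_WAYS];
    int used = 0;
    int misses = 0;

    if (ways < 1 || ways > MAX_WAYS)
        return -1;                      /* unsupported associativity */

    for (int k = 0; k < n; k++) {
        int pos = -1;

        for (int w = 0; w < used; w++)
            if (set[w] == blocks[k])
                pos = w;                /* hit: block already in the set */

        if (pos < 0) {                  /* miss */
            misses++;
            if (used < ways)
                used++;                 /* fill an empty way */
            pos = used - 1;             /* otherwise overwrite the LRU entry */
        }
        /* Shift more recent entries down and put this block at the front. */
        for (int w = pos; w > 0; w--)
            set[w] = set[w - 1];
        set[0] = blocks[k];
    }
    return misses;
}
```

For the sequence 1, 2, 1, 3, 2 in a 2-way set, block 2 is the least recently used when 3 arrives, so it is evicted and the final reference to 2 misses again: four misses in total.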
Cache Thrashing
 Cache thrashing is a problem that happens when a frequently used
cache line gets displaced by another frequently used cache line.
 Cache thrashing can happen for both instruction and data caches.
 The CPU can’t find the data element it wants in the cache and must
make another main memory cache line access.
 The same data elements are repeatedly fetched into and displaced
from the cache.
 Cache thrashing happens because the computational code
statements have too many variables and arrays for the needed data
elements to fit in cache.
 Cache lines are discarded and later retrieved.
 The arrays are dimensioned too large to fit in cache. The arrays are
accessed with indirect addressing, e.g. a(k(j)).
Cache Coherence
 Cache coherence
 is maintained by an agreement between data stored in cache,
other caches, and main memory.
 When the same data is being manipulated by different
processors, they must inform each other of their modification
of data.
 The term Protocol is used to describe how caches and main
memory communicate with each other.
 It is the means by which all the memory subsystems maintain
data coherence.
Cache Coherence
 Snoop Protocol
 All processors monitor the bus traffic to determine cache line
status.
 Directory Based Protocol
 Cache lines contain extra bits that indicate which other
processor has a copy of that cache line, and the status of the
cache line – clean (cache line does not need to be sent back to
main memory) or dirty (cache line needs to update main
memory with content of cache line).
 Hardware Cache Coherence
 Cache coherence on the Origin computer is maintained in the
hardware, transparent to the programmer.
Cache Coherence
 False sharing
 happens in a multiprocessor system as a result of maintaining
cache coherence.
 Both processor A and processor B have the same cache line.
 A modifies the first word of the cache line.
 B wants to modify the eighth word of the cache line.
 But A has sent a signal to B that B’s cache line is invalid.
 B must fetch the cache line again before writing to it.
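A common remedy for false sharing is to pad per-processor data so that each item occupies its own cache line. The sketch below assumes a 128-byte line and an array that starts on a cache line boundary; the names LINE_BYTES, same_cache_line and padded_counter are hypothetical.

```c
#include <stddef.h>

#define LINE_BYTES 128   /* assumed cache line size for this sketch */

/* Nonzero if elements i and j of an array of elem_size-byte items that
   starts on a cache line boundary fall in the same cache line. */
int same_cache_line(size_t i, size_t j, size_t elem_size)
{
    return (i * elem_size) / LINE_BYTES == (j * elem_size) / LINE_BYTES;
}

/* One counter per thread, padded out to a full cache line so that
   neighbouring counters never share a line. */
struct padded_counter {
    long value;
    char pad[LINE_BYTES - sizeof(long)];
};
```

With 8-byte words, the first and eighth words of an array share a line, so two processors updating them would invalidate each other's copies; an array of padded_counter avoids this at the cost of extra memory.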
Cache Coherence
 A cache miss creates a processor stall.
 The processor is stalled until the data is retrieved from the
memory.
 The stall is minimized by continuing to load and execute
instructions, until the data that is stalling is retrieved.
 These techniques are called:
 Prefetching
 Out of order execution
 Software pipelining
 Typically, the compiler will do these at -O3 optimization.
Cache Coherence
 The following is an example of software pipelining:
 Suppose you compute
Do I=1,N
y(I)=y(I) + a*x(I)
End Do
 In pseudo-assembly language, this is what the Origin compiler will do:
cycle  op  operand
t+0    ld  y(I+3)
t+1    ld  x(I+3)
t+2    st  y(I-4)
t+3    st  y(I-3)
t+4    st  y(I-2)
t+5    st  y(I-1)
t+6    ld  y(I+4)
t+7    ld  x(I+4)
t+8    ld  y(I+5)
t+9    ld  x(I+5)
t+10   ld  y(I+6)
t+11   ld  x(I+6)
madd for iterations I, I+1, I+2, I+3 (four multiply-adds issued during these 12 cycles)
Cache Coherence
 Since the Origin processor can only execute 1 load or 1 store
at a time, the compiler places loads in the instruction
pipeline well before the data is needed.
 It is then able to continue loading while simultaneously
performing a fused multiply-add (a+b*c).
 The code above gets 8 flops in 12 clock cycles (4 multiply-adds at 2 flops each).
 The peak is 24 flops in 12 clock cycles for the Origin.
 The Intel Pentium III (IA-32) and the Itanium (IA-64) will
have differing versions of the code above but the same
concepts apply.
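The idea of issuing loads ahead of the arithmetic that needs them can be imitated by hand at the source level. The sketch below is a simplified one-iteration-ahead version written for illustration; it is not the schedule the Origin compiler generates, the function name daxpy_pipelined is hypothetical, and x and y are assumed not to overlap.

```c
/* y = y + a*x, with the loads for iteration i+1 issued before the
   multiply-add and store of iteration i. */
void daxpy_pipelined(int n, double a, const double *x, double *y)
{
    double xn, yn;

    if (n <= 0)
        return;
    xn = x[0];                       /* prologue: loads for iteration 0 */
    yn = y[0];
    for (int i = 0; i < n - 1; i++) {
        double xc = xn, yc = yn;     /* operands loaded one iteration ago */
        xn = x[i + 1];               /* loads for the next iteration */
        yn = y[i + 1];
        y[i] = yc + a * xc;          /* multiply-add and store for iteration i */
    }
    y[n - 1] = yn + a * xn;          /* epilogue: last iteration */
}
```

In practice the compiler performs this transformation itself at higher optimization levels, so hand-written versions like this are mainly useful for understanding what the generated code does.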
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.2.1 Cache on the SGI Origin2000
7.2.2 Cache on the Intel Pentium III
7.2.3 Cache on the Intel Itanium
7.2.4 Cache Summary
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Cache on the SGI Origin2000
 L1 Cache (on-chip primary cache)
 Cache size: 32KB floating point data
 32KB integer data and instruction
 Cache line size: 32 bytes
 Associativity: 2-way set associative
 L2 Cache (off-chip secondary cache)
 Cache size: 4MB per processor
 Cache line size: 128 bytes
 Associativity: 2-way set associative
 Replacement: LRU
 Coherence: Directory based 2-way interleaved (2 banks)
Cache on the SGI Origin2000
 Bandwidth L1 cache-to-processor
 1.6 GB/s/bank
 3.2 GB/sec overall possible
 Latency: 1 cycle
 Bandwidth between L1 and L2 cache
 1GB/s
 Latency: 11 cycles
 Bandwidth between L2 cache and local memory
 0.5 GB/s
 Latency: 61 cycles
 Average 32 processor remote memory
 Latency: 150 cycles
Cache on the Intel Pentium III
 L1 Cache (on-chip primary cache)
 Cache size: 16KB floating point data
 16KB integer data and instruction
 Cache line size: 16 bytes
 Associativity: 4-way set associative
 L2 Cache (off-chip secondary cache)
 Cache size: 256 KB per processor
 Cache line size: 32 bytes
 Associativity: 8-way set associative
 Replacement: pseudo-LRU
 Coherence: interleaved (8 banks)
Cache on the Intel Pentium III
 Bandwidth L1 cache-to-processor
 16 GB/s
 Latency: 2 cycles
 Bandwidth between L1 and L2 cache
 11.7 GB/s
 Latency: 4-10 cycles
 Bandwidth between L2 cache and local memory
 1.0 GB/s
 Latency: 15-21 cycles
Cache on the Intel Itanium
 L1 Cache (on-chip primary cache)
 Cache size: 16KB floating point data
 16KB integer data and instruction
 Cache line size: 32 bytes
 Associativity: 4-way set associative
 L2 Cache (off-chip secondary cache)
 Cache size: 96KB unified data and instruction
 Cache line size: 64 bytes
 Associativity: 6-way set associative
 Replacement: LRU
 L3 Cache (off-chip tertiary cache)
 Cache size: 4MB per processor
 Cache line size: 64 bytes
 Associativity: 4-way set associative
 Replacement: LRU
Cache on the Intel Itanium
 Bandwidth L1 cache-to-processor
 25.6 GB/s
 Latency: 1 - 2 cycles
 Bandwidth between L1 and L2 cache
 25.6 GB/sec
 Latency: 6 - 9 cycles
 Bandwidth between L2 and L3 cache
 11.7 GB/sec
 Latency: 21 - 24 cycles
 Bandwidth between L3 cache and main memory
 2.1 GB/sec
 Latency: 50 cycles
Cache Summary
Chip           MIPS R10000    Pentium III     Itanium
#Caches        2              2               3
Associativity  2/2            4/8             4/6/4
Replacement    LRU            Pseudo-LRU      LRU
CPU MHz        195/250        1000            800
Peak Mflops    390/500        1000            3200
LD,ST/cycle    1 LD or 1 ST   1 LD and 1 ST   2 LD or 2 ST
 Only one load or store may be performed each CPU cycle on the R10000.
 This indicates that loads and stores may be a bottleneck.
 Efficient use of cache is extremely important.
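The Peak Mflops row follows from the clock rate and the number of floating-point operations completed per cycle. The per-cycle values used in the note below (2 for the R10000, 1 for the Pentium III, 4 for the Itanium) are inferred from the table itself and are not stated in it; the function name peak_mflops is hypothetical.

```c
/* Peak Mflops = clock rate in MHz * floating-point operations per cycle. */
int peak_mflops(int mhz, int flops_per_cycle)
{
    return mhz * flops_per_cycle;
}
```

For example, 195 MHz at 2 flops per cycle gives 390 Mflops and 800 MHz at 4 flops per cycle gives 3200 Mflops. Real codes reach these figures only if loads and stores can keep the floating-point units supplied with data.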
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.4.1 Measuring Cache Performance on the SGI Origin2000
7.4.2 Measuring Cache Performance on the Linux Clusters
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Code Optimization
 Gather statistics to find out where the bottlenecks are in your
code so you can identify what you need to optimize.
 The following questions can be useful to ask:
 How much time does the program take to execute?
 Use /usr/bin/time a.out for CPU time
 Which subroutines use the most time?
 Use ssrun and prof on the Origin or gprof and vprof on the Linux clusters.
 Which loop uses the most time?
 Put etime/dtime or other recommended timer calls around loops for CPU time.
 For more information on timers see Timing and Profiling section.
 What is contributing to the CPU time?
 Use the Perfex utility on the Origin or perfex or hpmcount on the Linux
clusters.
Code Optimization
 Some useful optimizing and profiling tools are
 etime/dtime/time
 perfex
 ssusage
 ssrun/prof
 gprof
 cvpav, cvd
 See the NCSA web pages on Compiler, Performance, and
Productivity Tools
http://www.ncsa.uiuc.edu/UserInfo/Resources/Software/Tools/
for information on which tools are available on NCSA platforms.
Measuring Cache Performance on the
SGI Origin2000
 The R10000 processors of NCSA’s Origin2000 computers
have hardware performance counters.
 There are 32 events that are measured and each event is
numbered.
0 = cycles
1 = Instructions issued
...
26 = Secondary data cache misses
...
 View man perfex for more information.
 The Perfex Utility
 The hardware performance counters can be measured using the
perfex utility.
perfex [options] command [arguments]
Measuring Cache Performance on the
SGI Origin2000
 where the options are:
-e counter1 -e counter2
This specifies which events are to be counted. You enter the number
of the event you want counted. (Remember to have a space in
between the "e" and the event number.)
-a
Sample ALL the events.
-mp
Report all results on a per thread basis.
-y
Report the results in seconds, not cycles.
-x
Gives extra summary info including Mflops.
command
Specify the name of the executable file.
arguments
Specify the input and output arguments to the executable file.
Measuring Cache Performance on the
SGI Origin2000
 Examples
 perfex -e 25 -e 26 a.out
- outputs the L1 and L2 cache misses
- the output is reported in cycles
 perfex -a -y a.out > results
- outputs ALL the hardware performance counters
- the output is reported in seconds
Measuring Cache Performance on the
Linux Clusters
 The Intel Pentium III and Itanium processors provide
hardware event counters that can be accessed from
several tools.
 perfex for the Pentium III and pfmon for the
Itanium
 To view usage and options for perfex and pfmon:
perfex -h
pfmon --help
 To measure L2 cache misses:
perfex -eP6_L2_LINES_IN a.out
pfmon --events=L2_MISSES a.out
Measuring Cache Performance on the
Linux Clusters
 psrun [soft add +perfsuite]
 Another tool that provides access to the hardware
event counter and also provides derived statistics is
perfsuite.
 To add perfsuite's psrun to the current shell
environment :
soft add +perfsuite
 To measure cache misses:
psrun a.out
psprocess a.out*.xml
Locating the Cache Problem
 For the Origin, the perfex output is a first-pass detection of a
cache problem.
 If you then use the CaseVision tools, you can locate the cache
problem in your code.
 The CaseVision tools are
 cvpav for performance analysis
 cvd for debugging
 CaseVision is not available on the Linux clusters.
 Tools like vprof and libhpm provide routines for users to
instrument their code.
 Using vprof with the PAPI cache events can provide detailed
information about where poor cache utilization is occurring.
Cache Tuning Strategy
 The strategy for performing cache tuning on your code is
based on data reuse.
 Temporal Reuse
 Use the same data elements on more than one iteration of the loop.
 Spatial Reuse
 Use data that is encached as a result of fetching nearby data elements from
downstream memory.
 Strategies that take advantage of the Principle of Locality will
improve performance.
Preserve Spatial Locality
 Check loop nesting to ensure stride-one memory access.
 The following code does not preserve spatial locality:
do I=1,n
  do K=1,n
    do J=1,n
      C(I,J)=C(I,J) + A(I,K) * B(K,J)
    end do
  end do
end do
 It is not wrong but runs much slower than it could.
 To ensure stride-one access modify the code using loop interchange.
do J=1,n
  do K=1,n
    do I=1,n
      C(I,J)=C(I,J) + A(I,K) * B(K,J)
    end do
  end do
end do
 For Fortran the innermost loop index should be the leftmost index of
the arrays. The code has been modified for spatial reuse.
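A minimal Python sketch of why the loop order matters, assuming n-by-n arrays stored in Fortran column-major order. The helper names (column_major_offset, inner_stride) and the value n = 1000 are illustrative and not part of the original course. It reports how far apart in memory two consecutive innermost-loop accesses land for each array.

```python
def column_major_offset(row, col, n):
    """Element offset of the 1-based entry (row, col) in an n-by-n Fortran array."""
    return (row - 1) + (col - 1) * n


def inner_stride(subscripts, inner, n):
    """Elements skipped in memory when the innermost loop variable advances by 1."""
    before = {"I": 1, "J": 1, "K": 1}
    after = dict(before)
    after[inner] += 1
    row_var, col_var = subscripts
    return (column_major_offset(after[row_var], after[col_var], n)
            - column_major_offset(before[row_var], before[col_var], n))


n = 1000  # illustrative array dimension
arrays = {"C": ("I", "J"), "A": ("I", "K"), "B": ("K", "J")}

# Original nesting: J is the innermost loop.
strides_j_inner = {name: inner_stride(sub, "J", n) for name, sub in arrays.items()}
# After loop interchange: I is the innermost loop.
strides_i_inner = {name: inner_stride(sub, "I", n) for name, sub in arrays.items()}

print("J innermost:", strides_j_inner)
print("I innermost:", strides_i_inner)
```

With J innermost, C and B are accessed with a stride of n elements. With I innermost, C and A are accessed with stride one and B(K,J) does not move at all, so it can stay in a register.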
Locality Problem
 Suppose your code looks like:
DO J=1,N
DO I=1,N
A(I,J)=B(J,I)
ENDDO
ENDDO
 The loop as it is typed above does not have unit-stride access
on loads.
 If you interchange the loops, the code doesn’t have unit-stride access on stores.
 Use the optimized, intrinsic-function transpose from the
FORTRAN compiler instead of hand-coding it.
Grouping Data Together
 Consider the following code segment:
d=0.0
do I=1,n
  j=index(I)
  d = d + sqrt(x(j)*x(j) + y(j)*y(j) + z(j)*z(j))
enddo
 Since the arrays are accessed with indirect accessing, it is likely
that 3 new cache lines need to be brought into the cache for each
iteration of the loop. Modify the code by grouping together x, y,
and z into a 2-dimensional array named r.
d=0.0
do I=1,n
  j=index(I)
  d = d + sqrt(r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j))
enddo
 Since r(1,j), r(2,j), and r(3,j) are contiguous in memory, it is likely
they will be in one cache line. Hence, 1 cache line, rather than 3,
is brought in for each iteration of I. The code has been modified
for cache reuse.
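The effect of grouping can be checked with a small Python sketch that counts how many distinct cache lines one loop iteration touches. The 128-byte line size, 4-byte elements, and the layout of x, y, and z one after another starting at address 0 are assumptions chosen for illustration.

```python
LINE_BYTES = 128   # assumed cache line size
REAL_BYTES = 4     # assumed size of one array element


def lines_touched(byte_addresses):
    """Number of distinct cache lines covered by the given byte addresses."""
    return len({addr // LINE_BYTES for addr in byte_addresses})


def separate_arrays(j, n):
    """Lines touched by x(j), y(j), z(j) when the arrays are stored back to back."""
    bases = [0, n * REAL_BYTES, 2 * n * REAL_BYTES]
    return lines_touched(base + (j - 1) * REAL_BYTES for base in bases)


def grouped_array(j):
    """Lines touched by r(1,j), r(2,j), r(3,j) in a column-major r(3,n)."""
    start = (j - 1) * 3 * REAL_BYTES
    return lines_touched(start + k * REAL_BYTES for k in range(3))


n = 100_000  # illustrative array length
print("separate x, y, z:", separate_arrays(5000, n), "lines")
print("grouped r(3,n):  ", grouped_array(5000), "line(s)")
print("grouped, j=11:   ", grouped_array(11), "line(s)")
```

The j = 11 case shows why the text says "likely": the three grouped values occasionally straddle a line boundary, but they never need more than two lines.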
Cache Thrashing Example
 This example thrashes a 4MB direct mapped cache.
parameter (max = 1024*1024)
common /xyz/ a(max), b(max)
do I=1,max
something = a(I) + b(I)
enddo
 The cache lines for both a and b have the same cache address.
 To avoid cache thrashing in this example, pad common with the
size of a cache line.
parameter (max = 1024*1024)
common /xyz/ a(max),extra(32),b(max)
do I=1,max
something=a(I) + b(I)
enddo
 Improving cache utilization is often the key to getting good
performance.
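The thrashing and its fix can be reproduced with a short Python sketch of a direct mapped cache. It assumes 4-byte array elements, a 128-byte cache line (32 elements, matching the size of the padding array), and a common block that starts at address 0; none of these are stated explicitly in the example above.

```python
CACHE_BYTES = 4 * 1024 * 1024   # 4MB direct mapped cache (from the example)
LINE_BYTES = 128                # assumed line size: 32 elements * 4 bytes
REAL_BYTES = 4                  # assumed size of one array element
MAX = 1024 * 1024               # array length (from the example)


def cache_line_index(byte_address):
    """Cache line that a byte address maps to in a direct mapped cache."""
    return (byte_address % CACHE_BYTES) // LINE_BYTES


def count_conflicts(pad_elements, sample_step=1024):
    """Count sampled iterations in which a(i) and b(i) map to the same line."""
    b_base = (MAX + pad_elements) * REAL_BYTES   # b follows a and the padding
    conflicts = 0
    for i in range(0, MAX, sample_step):
        a_line = cache_line_index(i * REAL_BYTES)
        b_line = cache_line_index(b_base + i * REAL_BYTES)
        if a_line == b_line:
            conflicts += 1
    return conflicts


print("conflicts without padding:", count_conflicts(0))
print("conflicts with extra(32): ", count_conflicts(32))
```

Without padding, every sampled pair a(i), b(i) competes for the same line, so each load evicts the other array's data. With one line of padding the two arrays are offset by one line and no longer collide.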
Not Enough Cache
 Ideally you want the inner loop’s arrays and variables to fit
into cache.
 If a scalar program won’t fit in cache, its parallel version may
fit in cache with a large enough number of processors.
 When that happens, the parallel run can show super-linear speedup.
Loop Blocking
 This technique is useful when the arrays are too large to fit
into the cache.
 Loop blocking uses strip mining of loops and loop interchange.
 A blocked loop accesses array elements in sections that
optimally fit in the cache.
 It allows for spatial and temporal reuse of data, thus minimizing
cache misses.
 The following example (next slide) illustrates loop blocking
of matrix multiplication.
 The code in the PRE column depicts the original code, the
POST column depicts the code when it is blocked.
Loop Blocking
PRE
do k=1,n
  do j=1,n
    do i=1,n
      c(i,j)=c(i,j)+a(i,k)*b(k,j)
    enddo
  enddo
enddo
POST
do kk=1,n,iblk
  do jj=1,n,iblk
    do ii=1,n,iblk
      do j=jj,jj+iblk-1
        do k=kk,kk+iblk-1
          do i=ii,ii+iblk-1
            c(i,j)=c(i,j)+a(i,k)*b(k,j)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo
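As a correctness check on the blocked loop nest, the following Python sketch runs both versions on a small matrix and compares the results. It is an illustration only: Python lists do not show the cache benefit, the function names are hypothetical, and the min() bounds are an addition so that n need not be a multiple of iblk (the blocked loops above assume it is).

```python
def matmul_naive(a, b, n):
    """Unblocked triple loop in the same k, j, i order as the PRE code."""
    c = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            for i in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, iblk):
    """Blocked version: strip-mined loops outside, block-sized loops inside."""
    c = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, iblk):
        for jj in range(0, n, iblk):
            for ii in range(0, n, iblk):
                # min() clips the last block when iblk does not divide n.
                for j in range(jj, min(jj + iblk, n)):
                    for k in range(kk, min(kk + iblk, n)):
                        for i in range(ii, min(ii + iblk, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


n, iblk = 7, 3  # illustrative sizes; 3 does not divide 7
a = [[float(i + 2 * k) for k in range(n)] for i in range(n)]
b = [[float(k - j) for j in range(n)] for k in range(n)]

same = matmul_blocked(a, b, n, iblk) == matmul_naive(a, b, n)
print("blocked result matches unblocked result:", same)
```

Blocking only changes the order in which the (i, j, k) triples are visited, so both versions compute the same products and sums.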
Further Information
 Computer Organization and Design: The Hardware/Software Interface, David A.
Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc.
 Computer Architecture: A Quantitative Approach, John L. Hennessy and David A.
Patterson, Morgan Kaufmann Publishers, Inc.
 The Cache Memory Book, Jim Handy, Academic Press
 High Performance Computing, Charles Severance, O’Reilly and Associates, Inc.
 A Practitioner’s Guide to RISC Microprocessor Architecture, Patrick H.
Stakem, John Wiley & Sons, Inc.
 Tutorial on Optimization of Fortran, John Levesque, Applied Parallel Research
 Intel® Architecture Optimization Reference Manual
 Intel® Itanium® Processor Manuals
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
8.1 Speedup
8.2 Speedup Extremes
8.3 Efficiency
8.4 Amdahl's Law
8.5 Speedup Limitations
8.6 Benchmarks
8.7 Summary
9 About the IBM Regatta P690
Parallel Performance Analysis
 Now that you have parallelized your code, and have run it on
a parallel computer using multiple processors, you may want
to know the performance gain that parallelization has
achieved.
 This chapter describes how to compute parallel code
performance.
 Often the performance gain is not perfect, and this chapter
also explains some of the reasons for limitations on parallel
performance.
 Finally, this chapter covers the kinds of information you
should provide in a benchmark, and some sample
benchmarks are given.
Speedup
 The speedup of your code tells you how much performance gain is
achieved by running your program in parallel on multiple
processors.
 A simple definition is that it is the length of time it takes a program to
run on a single processor, divided by the time it takes to run on
multiple processors.
 Speedup generally ranges between 0 and p, where p is the number of
processors.
 Scalability
 When you compute with multiple processors in a parallel
environment, you will also want to know how your code scales.
 The scalability of a parallel code is defined as its ability to achieve
performance proportional to the number of processors used.
 As you run your code with more and more processors, you want to
see the performance of the code continue to improve.
 Computing speedup is a good way to measure how a program scales
as more processors are used.
Speedup
 Linear Speedup
 If it takes one processor an amount of time t to do a task and if
p processors can do the task in time t / p, then you have perfect
or linear speedup (Sp= p).
 That is, running with 4 processors improves the time by a factor of 4,
running with 8 processors improves the time by a factor of 8, and so on.
 This is shown in the following illustration.
Speedup Extremes
 The extremes of speedup happen when speedup is
 greater than p, called super-linear speedup,
 less than 1.
 Super-Linear Speedup
 You might wonder how super-linear speedup can occur. How can
speedup be greater than the number of processors used?
 The answer usually lies with the program's memory use. When using multiple
processors, each processor only gets part of the problem compared to the
single processor case. It is possible that the smaller problem can make better
use of the memory hierarchy, that is, the cache and the registers. For
example, the smaller problem may fit in cache when the entire problem
would not.
 When super-linear speedup is achieved, it is often an indication that the
sequential code, run on one processor, had serious cache miss problems.
 The most common programs that achieve super-linear speedup
are those that solve dense linear algebra problems.
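A rough Python sketch of the cache argument for super-linear speedup: it estimates how many processors are needed before each processor's share of the data fits in its own cache. The 24 MB data size and 4 MB cache are hypothetical, and the sketch assumes the data is split evenly and that nothing else competes for the cache.

```python
MB = 1024 * 1024


def bytes_per_processor(total_bytes, p):
    """Share of the data held by each of p processors (rounded up)."""
    return -(-total_bytes // p)


def min_processors_to_fit(total_bytes, cache_bytes):
    """Smallest processor count whose per-processor share fits in one cache."""
    return -(-total_bytes // cache_bytes)


total_bytes = 24 * MB   # hypothetical working set of the whole problem
cache_bytes = 4 * MB    # hypothetical cache size per processor

p_fit = min_processors_to_fit(total_bytes, cache_bytes)
for p in (1, 2, 4, p_fit):
    share = bytes_per_processor(total_bytes, p)
    print(p, "processors:", share // MB, "MB each, fits:", share <= cache_bytes)
```

Below p_fit processors each share still spills out of cache. From p_fit onward the per-processor data is cache resident, which is the situation in which speedup greater than p can appear.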
Speedup Extremes
 Parallel Code Slower than Sequential Code
 When speedup is less than one, it means that the parallel code
runs slower than the sequential code.
 This happens when there isn't enough computation to be done
by each processor.
 The overhead of creating and controlling the parallel threads
outweighs the benefits of parallel computation, and it causes the
code to run slower.
 To eliminate this problem you can try to increase the problem
size or run with fewer processors.
Efficiency
 Efficiency is a measure of parallel performance that is closely
related to speedup and is often also presented in a description
of the performance of a parallel program.
 Efficiency with p processors is defined as the ratio of speedup
with p processors to p.
 Efficiency is a fraction that usually ranges between 0 and 1.
 Ep=1 corresponds to perfect speedup of Sp= p.
 You can think of efficiency as describing the average speedup
per processor.
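The definitions of speedup and efficiency translate directly into a few lines of Python. The timings below are hypothetical wall-clock seconds invented for the illustration, not measurements from any machine in this course.

```python
def speedup(t1, tp):
    """Speedup: time on one processor divided by time on p processors."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Efficiency: speedup with p processors divided by p."""
    return speedup(t1, tp) / p


# Hypothetical timings: processor count -> run time in seconds.
timings = {1: 100.0, 2: 52.0, 4: 27.0, 8: 16.0}

t1 = timings[1]
for p, tp in sorted(timings.items()):
    print(f"p={p}  Sp={speedup(t1, tp):.2f}  Ep={efficiency(t1, tp, p):.2f}")
```

In this made-up series the speedup keeps growing while the efficiency drops from 1.0 toward 0.78, which is the typical pattern when parallel overhead grows with the processor count.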
Amdahl's Law
 An alternative formula for speedup is named Amdahl's Law attributed to
Gene Amdahl, one of America's great computer scientists.
 This formula, introduced in 1967, states that no matter how many
processors are used in a parallel run, a program's speedup will be limited by its
fraction of sequential code.
 That is, almost every program has a fraction of the code that doesn't lend itself to
parallelism.
 This is the fraction of code that will have to be run with just one processor, even
in a parallel run.
 Amdahl's Law defines speedup with p processors as follows:
Sp = 1 / (f + (1 - f) / p)
 Where the term f stands for the fraction of operations done sequentially
with just one processor, and the term (1 - f) stands for the fraction of
operations done in perfect parallelism with p processors.
Amdahl's Law
 The sequential fraction of code, f, is a unitless measure
ranging between 0 and 1.
 When f is 0, meaning there is no sequential code, then speedup
is p, or perfect parallelism. This can be seen by substituting f =
0 in the formula above, which results in Sp = p.
 When f is 1, meaning there is no parallel code, then speedup is
1, or there is no benefit from parallelism. This can be seen by
substituting f = 1 in the formula above, which results in Sp = 1.
 This shows that Amdahl's speedup ranges between 1 and
p, where p is the number of processors used in a parallel
processing run.
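The two limiting cases can be checked numerically with a short Python sketch of Amdahl's Law in its standard form, Sp = 1 / (f + (1 - f) / p). The function name and the sample values of f and p are illustrative.

```python
def amdahl_speedup(f, p):
    """Amdahl's Law: speedup with p processors and sequential fraction f."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (f + (1.0 - f) / p)


p = 8
for f in (0.0, 0.1, 0.5, 1.0):
    print(f"f={f:.1f}  p={p}  Sp={amdahl_speedup(f, p):.2f}")
```

With f = 0 the function returns p, with f = 1 it returns 1, and every intermediate f gives a value between those two bounds.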
Amdahl's Law
 The interpretation of Amdahl's Law is that speedup is limited
by the fact that not all parts of a code can be run in parallel.
 Substituting in the formula, when the number of processors goes to
infinity, your code's speedup is still limited by 1 / f.
 Amdahl's Law shows that the sequential fraction of code has a
strong effect on speedup.
 This helps to explain the need for large problem sizes when using
parallel computers.
 It is well known in the parallel computing community, that you
cannot take a small application and expect it to show good
performance on a parallel computer.
 To get good performance, you need to run large applications, with
large data array sizes, and lots of computation.
 The reason for this is that as the problem size increases the
opportunity for parallelism grows, and the sequential fraction
shrinks, and it shrinks in its importance for speedup.
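The link between problem size and the sequential fraction can be sketched in Python under a simple assumption: the sequential work stays fixed while the parallelizable work grows with the problem size. The operation counts below are hypothetical, and real codes need not follow this model.

```python
def sequential_fraction(serial_ops, parallel_ops):
    """Fraction f of all operations that must run on one processor."""
    return serial_ops / (serial_ops + parallel_ops)


def amdahl_speedup(f, p):
    """Amdahl's Law: Sp = 1 / (f + (1 - f) / p)."""
    return 1.0 / (f + (1.0 - f) / p)


SERIAL_OPS = 1_000   # hypothetical fixed sequential work
P = 16               # processors used in every run

results = []
for parallel_ops in (9_000, 99_000, 999_000):   # growing problem size
    f = sequential_fraction(SERIAL_OPS, parallel_ops)
    sp = amdahl_speedup(f, P)
    results.append((f, sp))
    print(f"parallel ops={parallel_ops:>7}  f={f:.3f}  Sp={sp:.2f}  limit 1/f={1 / f:.0f}")
```

As the problem grows, f falls from 0.1 to 0.001 and the speedup on 16 processors moves from well below 16 toward 16, while the ceiling 1 / f rises from 10 to 1000.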
Agenda
8 Parallel Performance Analysis
8.1 Speedup
8.2 Speedup Extremes
8.3 Efficiency
8.4 Amdahl's Law
8.5 Speedup Limitations
8.5.1 Memory Contention Limitation
8.5.2 Problem Size Limitation
8.6 Benchmarks
8.7 Summary
Speedup Limitations
 This section covers some of the reasons why a program
doesn't get perfect Speedup. Some of the reasons for
limitations on speedup are:
 Too much I/O
 Speedup is limited when the code is I/O bound.
 That is, when there is too much input or output compared to the amount
of computation.
 Wrong algorithm
 Speedup is limited when the numerical algorithm is not suitable for a
parallel computer.
 You need to replace it with a parallel algorithm.
 Too much memory contention
 Speedup is limited when there is too much memory contention.
 You need to redesign the code with attention to data locality.
 Cache reutilization techniques will help here.
Speedup Limitations
 Wrong problem size
 Speedup is limited when the problem size is too small to take best advantage
of a parallel computer.
 In addition, speedup is limited when the problem size is fixed.
 That is, when the problem size doesn't grow as you compute with more
processors.
 Too much sequential code
 Speedup is limited when there's too much sequential code.
 This is shown by Amdahl's Law.
 Too much parallel overhead
 Speedup is limited when there is too much parallel overhead compared to the
amount of computation.
 These are the additional CPU cycles accumulated in creating parallel regions,
creating threads, synchronizing threads, spin/blocking threads, and ending
parallel regions.
 Load imbalance
 Speedup is limited when the processors have different workloads.
 The processors that finish early will be idle while they are waiting for the
other processors to catch up.
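The cost of load imbalance can be estimated with a small Python sketch. It assumes the parallel run finishes only when the most heavily loaded processor finishes and ignores all other overhead. The workloads are hypothetical time units.

```python
def imbalance_speedup(work_per_processor):
    """Speedup when the run time equals the largest per-processor workload."""
    return sum(work_per_processor) / max(work_per_processor)


def idle_time(work_per_processor):
    """Total time processors spend waiting for the slowest one."""
    longest = max(work_per_processor)
    return sum(longest - w for w in work_per_processor)


balanced = [25, 25, 25, 25]      # hypothetical: 100 units split evenly
imbalanced = [40, 20, 20, 20]    # same 100 units, split unevenly

print("balanced:   Sp =", imbalance_speedup(balanced), " idle =", idle_time(balanced))
print("imbalanced: Sp =", imbalance_speedup(imbalanced), " idle =", idle_time(imbalanced))
```

Both runs do the same total work on four processors, but the uneven split caps the speedup at 2.5 instead of 4 because three processors sit idle for 20 units each.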
Memory Contention Limitation
 Gene Golub, a professor of Computer Science at Stanford University,
writes in his book on parallel computing that the best way to define
memory contention is with the word delay.
 When different processors all want to read or write into the main memory,
there is a delay until the memory is free.
 On the SGI Origin2000 computer, you can determine whether your
code has memory contention problems by using SGI's perfex utility.
 The perfex utility is covered in the Cache Tuning lecture in this course.
 You can also refer to SGI's manual page, man perfex, for more details.
 On the Linux clusters, you can use the hardware performance counter
tools to get information on memory performance.
 On the IA32 platform, use perfex, vprof, hpmcount, psrun/perfsuite.
 On the IA64 platform, use vprof, pfmon, psrun/perfsuite.
Memory Contention Limitation
 Many of these tools can be used with the PAPI performance counter
interface.
 Be sure to refer to the man pages and webpages on the NCSA website for
more information.
 If the output of the utility shows that memory contention is a problem, you
will want to use some programming techniques for reducing memory
contention.
 A good way to reduce memory contention is to access elements from the
processor's cache memory instead of the main memory.
 Some programming techniques for doing this are:
 Access arrays with unit stride.
 Order nested do loops (in Fortran) so that the innermost loop index is the leftmost
index of the arrays in the loop. For the C language, the order is the opposite of
Fortran.
 Avoid specific array sizes that are the same as the size of the data cache or that are
exact fractions or exact multiples of the size of the data cache.
 Pad common blocks.
 These techniques are called cache tuning optimizations. The details for
performing these code modifications are covered in the Cache Tuning
chapter of this course.
Problem Size Limitation
 Small Problem Size
 Speedup is almost always an increasing function of problem size.
 If there's not enough work to be done by the available
processors, the code will show limited speedup.
 The effect of small problem size on speedup is shown in the
following illustration.
Problem Size Limitation
 Fixed Problem Size
 When the problem size is fixed, you can reach a point of
negative returns when using additional processors.
 As you compute with more and more processors, each
processor has less and less computation to perform.
 The additional parallel overhead, compared to the amount of
computation, causes the speedup curve to start turning
downward as shown in the following figure.
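The downward turn of the speedup curve can be reproduced with a toy Python model. It assumes the compute time divides perfectly among the processors and that each additional processor adds a fixed amount of overhead. Both the model and the numbers are hypothetical, chosen only to show the shape of the curve.

```python
def modeled_time(t1, p, overhead):
    """Run time on p processors: ideal compute time plus per-processor overhead."""
    return t1 / p + overhead * (p - 1)


def modeled_speedup(t1, p, overhead):
    """Speedup relative to the one-processor run time t1."""
    return t1 / modeled_time(t1, p, overhead)


T1 = 100.0        # hypothetical one-processor time for the fixed problem
OVERHEAD = 1.0    # hypothetical overhead added by each extra processor

counts = [1, 2, 4, 8, 16, 32]
speedups = {p: modeled_speedup(T1, p, OVERHEAD) for p in counts}
best_p = max(speedups, key=speedups.get)

for p in counts:
    print(f"p={p:>2}  Sp={speedups[p]:.2f}")
print("best processor count in this model:", best_p)
```

In this model the speedup peaks at 8 processors and then falls, because the overhead term keeps growing while the compute term per processor shrinks toward zero.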
Benchmarks
 It will finally be time to report the parallel performance
of your application code.
 You will want to show a speedup graph with the
number of processors on the x axis, and speedup on
the y axis.
 Some other things you should report and record are:
 the date you obtained the results
 the problem size
 the computer model
 the compiler and the version number of the compiler
 any special compiler options you used
Benchmarks
 When doing computational science, it is often helpful to find
out what kind of performance your colleagues are obtaining.
 In this regard, NCSA has a compilation of parallel performance
benchmarks online at
http://www.ncsa.uiuc.edu/UserInfo/Perf/NCSAbench/.
 You might be interested in looking at these benchmarks to
see how other people report their parallel performance.
 In particular, the NAMD benchmark is a report about the
performance of the NAMD program that does molecular
dynamics simulations.
Summary
 There are many good texts on parallel computing which treat
the subject of parallel performance analysis. Here are two
useful references:
 Scientific Computing An Introduction with Parallel Computing, Gene
Golub and James Ortega, Academic Press, Inc.
 Parallel Computing Theory and Practice, Michael J. Quinn,
McGraw-Hill, Inc.
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information
About the IBM Regatta P690
 To obtain your program’s top performance, it is important to
understand the architecture of the computer system on
which the code runs.
 This chapter describes the architecture of NCSA's IBM p690.
 Technical details on the size and design of the processors,
memory, cache, and the interconnect network are covered
along with technical specifications for the compute rate,
memory size and speed, and interconnect bandwidth.
IBM p690 General Overview
 The p690 is IBM's latest Symmetric Multi-Processor (SMP)
machine with Distributed Shared Memory (DSM).
 This means that memory is physically distributed and logically
shared.
 It is based on the Power4 architecture and is a successor to the
Power3-II based RS/6000 SP system.
 IBM p690 Scalability
 The IBM p690 is a flexible, modular, and scalable architecture.
 It scales in these terms:
 Number of processors
 Memory size
 I/O and memory bandwidth and the Interconnect bandwidth
Agenda
 9 About the IBM Regatta P690
 9.1 IBM p690 General Overview
 9.2 IBM p690 Building Blocks
 9.2.1 Power4 Core
 9.2.2 Multi-Chip Modules
 9.2.3 The Processor
 9.2.4 Cache Architecture
 9.2.5 Memory Subsystem
 9.3 Features Performed by the Hardware
 9.4 The Operating System
 9.5 Further Information
IBM p690 Building Blocks
 An IBM p690 system is built from a number of fundamental
building blocks.
 The first of these building blocks is the Power4 Core, which
includes the processors and L1 and L2 caches.
 At NCSA, four of these Power4 Cores are linked to form a
Multi-Chip Module.
 This module includes the L3 cache.
 Four Multi-Chip Modules are linked to form a 32-processor system (see figure
on the next slide).
 Each of these components will be described in the following
sections.
32-processor IBM p690 configuration
(Image courtesy of IBM)
Power4 Core
 The Power4 Chip contains:
 Two processors
 Local caches (L1)
 External cache shared by the two processors (L2)
 I/O and Interconnect interfaces
The POWER4 chip
(Image courtesy of IBM)
Multi-Chip Modules
 Four Power4 Chips are assembled to form a Multi-Chip
Module (MCM) that contains 8 processors.
 Each MCM also supports the L3 cache for each Power4 chip.
 Multiple MCM interconnection (Image courtesy of IBM)
The Processor
 The processors at the heart of the Power4 Core are speculative
superscalar out of order execution chips.
 The Power4 is a 4-way superscalar RISC architecture running
instructions on its 8 pipelined execution units.
 Speed of the Processor
 The NCSA IBM p690 has CPUs running at 1.3 GHz.
 64-Bit Processor Execution Units
 There are 8 independent fully pipelined execution units.
 2 load/store units for memory access
 2 identical floating point execution units capable of fused multiply/add
 2 fixed point execution units
 1 branch execution unit
 1 logic operation unit
The Processor
 The units are capable of 4 floating point operations, fetching 8
instructions and completing 5 instructions per cycle.
 It is capable of handling up to 200 in-flight instructions.
 Performance Numbers
 Peak Performance:
 4 floating point operations per cycle
 1.3 Gcycles/sec * 4 flop/cycle yields 5.2 GFLOPS
 MIPS Rating:
 5 instructions per cycle
 1.3 Gcycles/sec * 5 instructions/cycle yields 6500 MIPS
 Instruction Set
 The instruction set (ISA) on the IBM p690 is the PowerPC AS
Instruction set.
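The peak figures follow from simple arithmetic, sketched here in Python with integer MHz values to avoid rounding. The per-processor inputs (1.3 GHz, 4 floating point operations and 5 completed instructions per cycle) come from the slides above. The per-module and per-system totals are derived from the 8 processors per Multi-Chip Module and the 32-processor configuration described earlier, and they are theoretical peaks only.

```python
CLOCK_MHZ = 1300              # 1.3 GHz processor clock
FLOPS_PER_CYCLE = 4           # two fused multiply/add units
INSTRUCTIONS_PER_CYCLE = 5    # instructions completed per cycle
PROCESSORS_PER_MCM = 8
PROCESSORS_PER_SYSTEM = 32

peak_mflops = CLOCK_MHZ * FLOPS_PER_CYCLE          # per processor
peak_mips = CLOCK_MHZ * INSTRUCTIONS_PER_CYCLE     # per processor
mcm_mflops = peak_mflops * PROCESSORS_PER_MCM
system_mflops = peak_mflops * PROCESSORS_PER_SYSTEM

print("per processor:", peak_mflops / 1000, "GFLOPS,", peak_mips, "MIPS")
print("per MCM:      ", mcm_mflops / 1000, "GFLOPS")
print("32 processors:", system_mflops / 1000, "GFLOPS")
```

Five instructions per cycle at 1.3 GHz is 6.5 billion instructions per second, that is, 6500 MIPS per processor.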
Cache Architecture
 Each Power4 Core has both a primary (L1) cache associated with each processor and
a secondary (L2) cache shared between the two processors. In addition, each Multi-Chip Module has a L3 cache.
 Level 1 Cache
 The Level 1 cache is in the processor core. It has split instruction and data caches.
 L1 Instruction Cache
 The properties of the Instruction Cache are:
 64KB in size
 direct mapped
 cache line size is 128 bytes
 L1 Data Cache
 The properties of the L1 Data Cache are:
 32KB in size
 2-way set associative
 FIFO replacement policy
 2-way interleaved
 cache line size is 128 bytes
 Peak speed is achieved when the data accessed in a loop is entirely contained in the L1
data cache.
Cache Architecture
 Level 2 Cache on the Power4 Chip
 When the processor can't find a data element in the L1
cache, it looks in the L2 cache. The properties of the L2
Cache are:
 external from the processor
 unified instruction and data cache
 1.41MB per Power4 chip (2 processors)
 8-way set associative
 split between 3 controllers
 cache line size is 128 bytes
 pseudo LRU replacement policy for cache coherence
 124.8 GB/s peak bandwidth from L2
Cache Architecture
 Level 3 Cache on the Multi-Chip Module
 When the processor can't find a data element in the L2
cache, it looks in the L3 cache. The properties of the L3
Cache are:
 external from the Power4 Core
 unified instruction and data cache
 128MB per Multi-Chip Module (8 processors)
 8-way set associative
 cache line size is 512 bytes
 55.5 GB/s peak bandwidth from L3
Memory Subsystem
 The total memory is physically distributed among the
Multi-Chip Modules of the p690 system (see the
diagram in the next slide).
 Memory Latencies
 The latency penalties for each of the levels of the
memory hierarchy are:
 L1 Cache - 4 cycles
 L2 Cache - 14 cycles
 L3 Cache - 102 cycles
 Main Memory - 400 cycles
Memory distribution within an MCM
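To see how the latencies listed above combine, here is a small Python sketch that computes an average access latency from the fraction of loads served by each level. It treats each listed penalty as the total cycles for a load satisfied at that level, which is an assumption, and the hit fractions are hypothetical rather than measured.

```python
# Latencies in cycles, taken from the list above.
LATENCY_CYCLES = {"L1": 4, "L2": 14, "L3": 102, "memory": 400}


def average_latency(served_fraction):
    """Weighted average latency; served_fraction maps level -> share of loads."""
    total = sum(served_fraction.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(share * LATENCY_CYCLES[level]
               for level, share in served_fraction.items())


# Hypothetical mixes: a cache-friendly loop and a cache-unfriendly one.
friendly = {"L1": 0.90, "L2": 0.06, "L3": 0.03, "memory": 0.01}
unfriendly = {"L1": 0.50, "L2": 0.20, "L3": 0.15, "memory": 0.15}

print("cache-friendly:  ", round(average_latency(friendly), 2), "cycles per load")
print("cache-unfriendly:", round(average_latency(unfriendly), 2), "cycles per load")
```

Even with 90% of loads served by L1, the 1% that reach main memory contribute about a third of the average, which is why the cache tuning techniques in this course pay off.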
Agenda
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information
Features Performed by the Hardware
 The following is done completely by the hardware,
transparent to the user:
 Global memory addressing (makes the system memory shared)
 Address resolution
 Maintaining cache coherency
 Automatic page migration from remote to local memory (to
reduce interconnect memory transactions)
The Operating System
 The operating system is AIX. NCSA's p690 system is
currently running version 5.1 of AIX. Version 5.1 is a full 64-bit file system.
 Compatibility
 AIX 5.1 is highly compatible with both BSD and System V Unix.
Further Information
 Computer Architecture: A Quantitative Approach
 John Hennessy, et al. Morgan Kaufmann Publishers, 2nd Edition, 1996
 Computer Organization and Design: The Hardware/Software Interface
 David A. Patterson, et al. Morgan Kaufmann Publishers, 2nd Edition, 1997
 IBM P Series [595] at the URL:
 http://www-03.ibm.com/systems/p/hardware/highend/590/index.html
 IBM p690 Documentation at NCSA at the URL:
 http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/IBMp690/