Parallel Computing Explained
Slides Prepared from the CI-Tutor Courses at NCSA http://ci-tutor.ncsa.uiuc.edu/
By S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University, March 2009

Agenda
1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 5 Parallel Code Tuning 6 Timing and Profiling 7 Cache Tuning 8 Parallel Performance Analysis 9 About the IBM Regatta P690

Agenda
1 Parallel Computing Overview 1.1 Introduction to Parallel Computing 1.1.1 Parallelism in our Daily Lives 1.1.2 Parallelism in Computer Programs 1.1.3 Parallelism in Computers 1.1.3.4 Disk Parallelism 1.1.4 Performance Measures 1.1.5 More Parallelism Issues 1.2 Comparison of Parallel Computers 1.3 Summary

Parallel Computing Overview
Who should read this chapter? New Users – to learn concepts and terminology. Intermediate Users – for review or reference. Management Staff – to understand the basic concepts, even if you don't plan to do any programming. Note: Advanced users may opt to skip this chapter.

Introduction to Parallel Computing
High performance parallel computers can solve large problems much faster than a desktop computer. They have fast CPUs, large memory, high speed interconnects, and high speed input/output, and they are able to speed up computations by making the sequential components run faster and by doing more operations in parallel. High performance parallel computers are in demand: there is a need for tremendous computational capabilities in science, engineering, and business. These applications require gigabytes/terabytes of memory and gigaflops/teraflops of performance, and scientists are striving for petascale performance.

Introduction to Parallel Computing
High performance parallel computers are used in a wide variety of disciplines. Meteorologists: prediction of tornadoes and thunderstorms. Computational biologists: analysis of DNA sequences. Pharmaceutical companies: design of new drugs. Oil companies: seismic exploration. Wall Street: analysis of financial markets. NASA: aerospace vehicle design. Entertainment industry: special effects in movies and commercials. These complex scientific and business applications all need to perform computations on large datasets or large equations.

Parallelism in our Daily Lives
There are two types of processes that occur in computers and in our daily lives. Sequential processes occur in a strict order; it is not possible to do the next step until the current one is completed. Examples: the passage of time (the sun rises and the sun sets) and writing a term paper (pick the topic, research, and write the paper). In parallel processes, many events happen simultaneously. Examples: plant growth in the springtime and an orchestra.

Agenda
1 Parallel Computing Overview 1.1 Introduction to Parallel Computing 1.1.1 Parallelism in our Daily Lives 1.1.2 Parallelism in Computer Programs 1.1.2.1 Data Parallelism 1.1.2.2 Task Parallelism 1.1.3 Parallelism in Computers 1.1.3.4 Disk Parallelism 1.1.4 Performance Measures 1.1.5 More Parallelism Issues 1.2 Comparison of Parallel Computers 1.3 Summary

Parallelism in Computer Programs
Conventional wisdom: computer programs are sequential in nature, and only a small subset of them lend themselves to parallelism. Algorithm: the "sequence of steps" necessary to do a computation. For the first 30 years of computer use, programs were run sequentially. The 1980s saw great successes with parallel computers. Dr. Geoffrey Fox published a book entitled Parallel Computing Works!
It described the many scientific accomplishments resulting from parallel computing. The new view: computer programs are parallel in nature, and only a small subset of them need to be run sequentially.

Parallel Computing
Parallel computing is what a computer does when it carries out more than one computation at a time using more than one processor. By using many processors at once, we can speed up the execution. If one processor can perform the arithmetic in time t, then ideally p processors can perform the arithmetic in time t/p. What if I use 100 processors? What if I use 1000 processors? Almost every program has some form of parallelism. You need to determine whether your data or your program can be partitioned into independent pieces that can be run simultaneously. Decomposition is the name given to this partitioning process. Types of parallelism: data parallelism and task parallelism.

Data Parallelism
The same code segment runs concurrently on each processor, but each processor is assigned its own part of the data to work on. Do loops (in Fortran) define the parallelism. The iterations must be independent of each other. Data parallelism is called "fine grain parallelism" because the computational work is spread into many small subtasks. Example: dense linear algebra, such as matrix multiplication, is a perfect candidate for data parallelism.

An example of data parallelism
Original Sequential Code:
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
Parallel Code:
!$OMP PARALLEL DO
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
!$OMP END PARALLEL DO

Quick Intro to OpenMP
OpenMP is a portable standard for parallel directives covering both data and task parallelism. More information about OpenMP is available on the OpenMP website. We will have a lecture on Introduction to OpenMP later. With OpenMP, the loop that is performed in parallel is the loop that immediately follows the Parallel Do directive. In our sample code, it's the K loop: DO K=1,N

OpenMP Loop Parallelism
Iteration-processor assignments (for N=20 on 4 processors):
Processor | Iterations of K | Data Elements
proc0 | K=1:5 | A(I, 1:5), B(1:5, J)
proc1 | K=6:10 | A(I, 6:10), B(6:10, J)
proc2 | K=11:15 | A(I, 11:15), B(11:15, J)
proc3 | K=16:20 | A(I, 16:20), B(16:20, J)
The code segment running on each processor:
DO J=1,N
  DO I=1,N
    C(I,J) = C(I,J) + A(I,K)*B(K,J)
  END DO
END DO

OpenMP Style of Parallelism
OpenMP parallelism can be done incrementally as follows:
1. Parallelize the most computationally intensive loop.
2. Compute performance of the code.
3. If performance is not satisfactory, parallelize another loop.
4. Repeat steps 2 and 3 as many times as needed.
The ability to perform incremental parallelism is considered a positive feature of data parallelism. It is contrasted with the MPI (Message Passing Interface) style of parallelism, which is an "all or nothing" approach.

Task Parallelism
Task parallelism may be thought of as the opposite of data parallelism. Instead of the same operations being performed on different parts of the data, each process performs different operations. You can use task parallelism when your program can be split into independent pieces, often subroutines, that can be assigned to different processors and run concurrently. Task parallelism is called "coarse grain" parallelism because the computational work is spread into just a few subtasks. More code is run in parallel because the parallelism is implemented at a higher level than in data parallelism. Task parallelism is often easier to implement and has less overhead than data parallelism.
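Before moving on to the task parallel code example on the next slides, here is a small, self-contained sketch of the data parallel style just described. It is not part of the original course material; the program and variable names are made up for illustration, and it assumes an OpenMP-aware Fortran compiler (for example, f90 -mp on the Origin2000).

      program vector_add
      ! Minimal data parallel sketch (hypothetical example): the loop
      ! iterations are independent, so OpenMP may divide them among threads.
      implicit none
      integer, parameter :: n = 100000
      real :: a(n), b(n), c(n)
      integer :: i

      a = 1.0
      b = 2.0

!$OMP PARALLEL DO
      do i = 1, n
         c(i) = a(i) + b(i)
      end do
!$OMP END PARALLEL DO

      print *, 'c(1) =', c(1), '  c(n) =', c(n)
      end program vector_add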
Task Parallelism
The abstract code shown in the diagram is decomposed into 4 independent code segments that are labeled A, B, C, and D. The right hand side of the diagram illustrates the 4 code segments running concurrently.

Task Parallelism
Original Code:
program main
  code segment labeled A
  code segment labeled B
  code segment labeled C
  code segment labeled D
end
Parallel Code:
program main
!$OMP PARALLEL
!$OMP SECTIONS
  code segment labeled A
!$OMP SECTION
  code segment labeled B
!$OMP SECTION
  code segment labeled C
!$OMP SECTION
  code segment labeled D
!$OMP END SECTIONS
!$OMP END PARALLEL
end

OpenMP Task Parallelism
With OpenMP, the code that follows each SECTION(S) directive is allocated to a different processor. In our sample parallel code, the allocation of code segments to processors is as follows:
Processor | Code
proc0 | code segment labeled A
proc1 | code segment labeled B
proc2 | code segment labeled C
proc3 | code segment labeled D

Parallelism in Computers
How parallelism is exploited and enhanced within the operating system and hardware components of a parallel computer: operating system, arithmetic, memory, and disk.

Operating System Parallelism
All of the commonly used parallel computers run a version of the Unix operating system. In the table below each OS listed is in fact Unix, but the name of the Unix OS varies with each vendor.
Parallel Computer | OS
SGI Origin2000 | IRIX
HP V-Class | HP-UX
Cray T3E | Unicos
IBM SP | AIX
Workstation Clusters | Linux
For more information about Unix, a collection of Unix documents is available.

Two Unix Parallelism Features
Background processing facility: with the Unix background processing facility you can run the executable a.out in the background and simultaneously view the man page for the etime function in the foreground. There are two Unix commands that accomplish this:
a.out > results &
man etime
Cron feature: with the Unix cron feature you can submit a job that will run at a later time.

Arithmetic Parallelism
Multiple execution units facilitate arithmetic parallelism. The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are each done in a separate execution unit. This allows several execution units to be used simultaneously, because the execution units operate independently. Fused multiply and add is another parallel arithmetic feature. Parallel computers are able to overlap multiply and add. This arithmetic is named MultiplyADD (MADD) on SGI computers, and Fused Multiply Add (FMA) on HP computers. In either case, the two arithmetic operations are overlapped and can complete in hardware in one computer cycle. Superscalar arithmetic is the ability to issue several arithmetic operations per computer cycle. It makes use of the multiple, independent execution units. On superscalar computers there are multiple slots per cycle that can be filled with work. This gives rise to the name n-way superscalar, where n is the number of slots per cycle. The SGI Origin2000 is called a 4-way superscalar computer.

Memory Parallelism
Memory interleaving: memory is divided into multiple banks, and consecutive data elements are interleaved among them. For example, if your computer has 2 memory banks, then data elements with even memory addresses would fall into one bank, and data elements with odd memory addresses into the other.
Multiple memory ports: port means a bi-directional memory pathway.
When the data elements that are interleaved across the memory banks are needed, the multiple memory ports allow them to be accessed and fetched in parallel, which increases the memory bandwidth (MB/s or GB/s). multiple levels of the memory hierarchy There is global memory that any processor can access. There is memory that is local to a partition of the processors. Finally there is memory that is local to a single processor, that is, the cache memory and the memory elements held in registers. Cache memory Cache is a small memory that has fast access compared with the larger main memory and serves to keep the faster processor filled with data. Memory Parallelism Memory Hierarchy Cache Memory Disk Parallelism RAID (Redundant Array of Inexpensive Disk) RAID disks are on most parallel computers. The advantage of a RAID disk system is that it provides a measure of fault tolerance. If one of the disks goes down, it can be swapped out, and the RAID disk system remains operational. Disk Striping When a data set is written to disk, it is striped across the RAID disk system. That is, it is broken into pieces that are written simultaneously to the different disks in the RAID disk system. When the same data set is read back in, the pieces are read in parallel, and the full data set is reassembled in memory. Agenda 1 Parallel Computing Overview 1.1 Introduction to Parallel Computing 1.1.1 Parallelism in our Daily Lives 1.1.2 Parallelism in Computer Programs 1.1.3 Parallelism in Computers 1.1.3.4 Disk Parallelism 1.1.4 Performance Measures 1.1.5 More Parallelism Issues 1.2 Comparison of Parallel Computers 1.3 Summary Performance Measures Peak Performance is the top speed at which the computer can operate. It is a theoretical upper limit on the computer's performance. Sustained Performance is the highest consistently achieved speed. It is a more realistic measure of computer performance. Cost Performance is used to determine if the computer is cost effective. MHz is a measure of the processor speed. The processor speed is commonly measured in millions of cycles per second, where a computer cycle is defined as the shortest time in which some work can be done. MIPS is a measure of how quickly the computer can issue instructions. Millions of instructions per second is abbreviated as MIPS, where the instructions are computer instructions such as: memory reads and writes, logical operations , floating point operations, integer operations, and branch instructions. Performance Measures Mflops (Millions of floating point operations per second) measures how quickly a computer can perform floating-point operations such as add, subtract, multiply, and divide. Speedup measures the benefit of parallelism. It shows how your program scales as you compute with more processors, compared to the performance on one processor. Ideal speedup happens when the performance gain is linearly proportional to the number of processors used. Benchmarks are used to rate the performance of parallel computers and parallel programs. A well known benchmark that is used to compare parallel computers is the Linpack benchmark. Based on the Linpack results, a list is produced of the Top 500 Supercomputer Sites. This list is maintained by the University of Tennessee and the University of Mannheim. More Parallelism Issues Load balancing is the technique of evenly dividing the workload among the processors. For data parallelism it involves how iterations of loops are allocated to processors. 
Load balancing is important because the total time for the program to complete is the time spent by the longest executing thread. The problem size must be large and must be able to grow as you compute with more processors. In order to get the performance you expect from a parallel computer you need to run a large application with large data sizes; otherwise the overhead of passing information between processors will dominate the calculation time. Good software tools are essential for users of high performance parallel computers. These tools include: parallel compilers, parallel debuggers, performance analysis tools, and parallel math software. The availability of a broad set of application software is also important.

More Parallelism Issues
The high performance computing market is risky and chaotic. Many supercomputer vendors are no longer in business, making the portability of your application very important. A workstation farm is defined as a fast network connecting heterogeneous workstations. The individual workstations serve as desktop systems for their owners. When they are idle, large problems can take advantage of the unused cycles in the whole system. An application of this concept is the SETI project. You can participate in searching for extraterrestrial intelligence with your home PC. More information about this project is available at the SETI Institute. Condor is software that provides resource management services for applications that run on heterogeneous collections of workstations. Miron Livny at the University of Wisconsin at Madison is the director of the Condor project, and has coined the phrase high throughput computing to describe this process of harnessing idle workstation cycles. More information is available at the Condor Home Page.

Agenda
1 Parallel Computing Overview 1.1 Introduction to Parallel Computing 1.2 Comparison of Parallel Computers 1.2.1 Processors 1.2.2 Memory Organization 1.2.3 Flow of Control 1.2.4 Interconnection Networks 1.2.4.1 Bus Network 1.2.4.2 Cross-Bar Switch Network 1.2.4.3 Hypercube Network 1.2.4.4 Tree Network 1.2.4.5 Interconnection Networks Self-test 1.2.5 Summary of Parallel Computer Characteristics 1.3 Summary

Comparison of Parallel Computers
Now you can explore the hardware components of parallel computers: kinds of processors, types of memory organization, flow of control, and interconnection networks. You will see what is common to these parallel computers, and what makes each one of them unique.

Kinds of Processors
There are three types of parallel computers:
1. Computers with a small number of powerful processors. These typically have tens of processors. The cooling of these computers often requires very sophisticated and expensive equipment, making these computers very expensive for computing centers. They are general-purpose computers that perform especially well on applications that have large vector lengths. Examples of this type of computer are the Cray SV1 and the Fujitsu VPP5000.

Kinds of Processors
There are three types of parallel computers:
2. Computers with a large number of less powerful processors. Named Massively Parallel Processors (MPPs), these typically have thousands of processors. The processors are usually proprietary and air-cooled. Because of the large number of processors, the distance between the furthest processors can be quite large, requiring a sophisticated internal network that allows distant processors to communicate with each other quickly. These computers are suitable for applications with a high degree of concurrency.
The MPP type of computer was popular in the 1980s. Examples of this type of computer were the Thinking Machines CM-2 computer, and the computers made by the MassPar company.

Kinds of Processors
There are three types of parallel computers:
3. Computers that are medium scale, in between the two extremes. These typically have hundreds of processors. The processor chips are usually not proprietary; rather they are commodity processors like the Pentium III. These are general-purpose computers that perform well on a wide range of applications. The most common example of this class is the Linux Cluster.

Trends and Examples
Processor trends:
Decade | Processor Type | Computer Example
1970s | Pipelined, Proprietary | Cray-1
1980s | Massively Parallel, Proprietary | Thinking Machines CM2
1990s | Superscalar, RISC, Commodity | SGI Origin2000
2000s | CISC, Commodity | Workstation Clusters
The processors on today's commonly used parallel computers:
Computer | Processor
SGI Origin2000 | MIPS RISC R12000
HP V-Class | HP PA 8200
Cray T3E | Compaq Alpha
IBM SP | IBM Power3
Workstation Clusters | Intel Pentium III, Intel Itanium

Memory Organization
The following paragraphs describe the three types of memory organization found on parallel computers: distributed memory, shared memory, and distributed shared memory.

Distributed Memory
In distributed memory computers, the total memory is partitioned into memory that is private to each processor. There is a Non-Uniform Memory Access time (NUMA), which is proportional to the distance between the two communicating processors. On NUMA computers, data is accessed the quickest from a private memory, while data from the most distant processor takes the longest to access. Some examples are the Cray T3E, the IBM SP, and workstation clusters.

Distributed Memory
When programming distributed memory computers, the code and the data should be structured such that the bulk of a processor's data accesses are to its own private (local) memory. This is called having good data locality. Today's distributed memory computers use message passing such as MPI to communicate between processors as shown in the following example:

Distributed Memory
One advantage of distributed memory computers is that they are easy to scale. As the demand for resources grows, computer centers can easily add more memory and processors. This is often called the LEGO block approach. The drawback is that programming of distributed memory computers can be quite complicated.

Shared Memory
In shared memory computers, all processors have access to a single pool of centralized memory with a uniform address space. Any processor can address any memory location at the same speed, so there is Uniform Memory Access time (UMA). Processors communicate with each other through the shared memory. The advantages and disadvantages of shared memory machines are roughly the opposite of distributed memory computers. They are easier to program because they resemble the programming of single processor machines, but they don't scale like their distributed memory counterparts.

Distributed Shared Memory
In Distributed Shared Memory (DSM) computers, a cluster or partition of processors has access to a common shared memory. It accesses the memory of a different processor cluster in a NUMA fashion. Memory is physically distributed but logically shared. Attention to data locality again is important. Distributed shared memory computers combine the best features of both distributed memory computers and shared memory computers.
That is, DSM computers have both the scalability of distributed memory computers and the ease of programming of shared memory computers. Some examples of DSM computers are the SGI Origin2000 and the HP V-Class computers.

Trends and Examples
Memory organization trends:
Decade | Memory Organization | Example
1970s | Shared Memory | Cray-1
1980s | Distributed Memory | Thinking Machines CM-2
1990s | Distributed Shared Memory | SGI Origin2000
2000s | Distributed Memory | Workstation Clusters
The memory organization of today's commonly used parallel computers:
Computer | Memory Organization
SGI Origin2000 | DSM
HP V-Class | DSM
Cray T3E | Distributed
IBM SP | Distributed
Workstation Clusters | Distributed

Flow of Control
When you look at the flow of control you will see three types of parallel computers: Single Instruction Multiple Data (SIMD), Multiple Instruction Multiple Data (MIMD), and Single Program Multiple Data (SPMD).

Flynn's Taxonomy
Flynn's Taxonomy, devised in 1972 by Michael Flynn of Stanford University, describes computers by how streams of instructions interact with streams of data. There can be single or multiple instruction streams, and there can be single or multiple data streams. This gives rise to 4 types of computers as shown in the diagram below. Flynn's taxonomy names the 4 computer types SISD, MISD, SIMD and MIMD. Of these 4, only SIMD and MIMD are applicable to parallel computers. Another computer type, SPMD, is a special case of MIMD.

SIMD Computers
SIMD stands for Single Instruction Multiple Data. Each processor follows the same set of instructions, with different data elements being allocated to each processor. SIMD computers have distributed memory with typically thousands of simple processors, and the processors run in lock step. SIMD computers, popular in the 1980s, are useful for fine grain data parallel applications, such as neural networks. Some examples of SIMD computers were the Thinking Machines CM-2 computer and the computers from the MassPar company. The processors are commanded by the global controller that sends instructions to the processors. It says add, and they all add. It says shift to the right, and they all shift to the right. The processors are like obedient soldiers, marching in unison.

MIMD Computers
MIMD stands for Multiple Instruction Multiple Data. There are multiple instruction streams with separate code segments distributed among the processors. MIMD is actually a superset of SIMD, so the processors can run the same instruction stream or different instruction streams. In addition, there are multiple data streams; different data elements are allocated to each processor. MIMD computers can have either distributed memory or shared memory. While the processors on SIMD computers run in lock step, the processors on MIMD computers run independently of each other. MIMD computers can be used for either data parallel or task parallel applications. Some examples of MIMD computers are the SGI Origin2000 computer and the HP V-Class computer.

SPMD Computers
SPMD stands for Single Program Multiple Data. SPMD is a special case of MIMD. SPMD execution happens when a MIMD computer is programmed to have the same set of instructions per processor. With SPMD computers, while the processors are running the same code segment, each processor can run that code segment asynchronously. Unlike SIMD, the synchronous execution of instructions is relaxed. An example is the execution of an if statement on an SPMD computer.
Because each processor computes with its own partition of the data elements, it may evaluate the condition of the if statement differently from another processor. One processor may take a certain branch of the if statement, and another processor may take a different branch of the same if statement. Hence, even though each processor has the same set of instructions, those instructions may be evaluated in a different order from one processor to the next. The analogies we used for describing SIMD computers can be modified for MIMD computers. Instead of the SIMD obedient soldiers, all marching in unison, in the MIMD world the processors march to the beat of their own drummer.

Summary of SIMD versus MIMD
Characteristic | SIMD | MIMD
Memory | distributed memory | distributed memory or shared memory
Code Segment | same per processor | same or different
Processors Run | in lock step | asynchronously
Data Elements | different per processor | different per processor
Applications | data parallel | data parallel or task parallel

Trends and Examples
Flow of control trends:
Decade | Flow of Control | Computer Example
1980s | SIMD | Thinking Machines CM-2
1990s | MIMD | SGI Origin2000
2000s | MIMD | Workstation Clusters
The flow of control on today's commonly used parallel computers:
Computer | Flow of Control
SGI Origin2000 | MIMD
HP V-Class | MIMD
Cray T3E | MIMD
IBM SP | MIMD
Workstation Clusters | MIMD

Agenda
1 Parallel Computing Overview 1.1 Introduction to Parallel Computing 1.2 Comparison of Parallel Computers 1.2.1 Processors 1.2.2 Memory Organization 1.2.3 Flow of Control 1.2.4 Interconnection Networks 1.2.4.1 Bus Network 1.2.4.2 Cross-Bar Switch Network 1.2.4.3 Hypercube Network 1.2.4.4 Tree Network 1.2.4.5 Interconnection Networks Self-test 1.2.5 Summary of Parallel Computer Characteristics 1.3 Summary

Interconnection Networks
What exactly is the interconnection network? The interconnection network is made up of the wires and cables that define how the multiple processors of a parallel computer are connected to each other and to the memory units. The time required to transfer data is dependent upon the specific type of the interconnection network. This transfer time is called the communication time. What network characteristics are important? Diameter: the maximum distance that data must travel for 2 processors to communicate. Bandwidth: the amount of data that can be sent through a network connection. Latency: the delay on a network while a data packet is being stored and forwarded.

Types of Interconnection Networks
The network topologies (geometric arrangements of the computer network connections) are: Bus, Cross-bar Switch, Hypercube, and Tree.

Interconnection Networks
The aspects of network issues are: cost, scalability, reliability, suitable applications, data rate, diameter, and degree.

General Network Characteristics
Some networks can be compared in terms of their degree and diameter. Degree: how many communicating wires are coming out of each processor. A large degree is a benefit because it provides multiple paths. Diameter: this is the distance between the two processors that are farthest apart. A small diameter corresponds to low latency.

Bus Network
Bus topology is the original coaxial cable-based Local Area Network (LAN) topology in which the medium forms a single bus to which all stations are attached. The positive aspects: it is a mature technology that is well known and reliable, the cost is very low, and it is simple to construct. The negative aspects: limited data transmission rate, and not scalable in terms of performance. Example: the SGI Power Challenge, which only scaled to 18 processors.
Cross-Bar Switch Network
A cross-bar switch is a network that works through a switching mechanism to access shared memory. It scales better than the bus network but it costs significantly more. The telephone system uses this type of network. An example of a computer with this type of network is the HP V-Class. Here is a diagram of a cross-bar switch network which shows the processors talking through the switchboxes to store or retrieve data in memory. There are multiple paths for a processor to communicate with a certain memory. The switches determine the optimal route to take.

Hypercube Network
In a hypercube network, the processors are connected as if they were corners of a multidimensional cube. Each node in an N dimensional cube is directly connected to N other nodes. The fact that the number of directly connected, "nearest neighbor", nodes increases with the total size of the network is also highly desirable for a parallel computer. The degree of a hypercube network is log n and the diameter is log n, where n is the number of processors. Examples of computers with this type of network are the CM-2, NCUBE-2, and the Intel iPSC860.

Tree Network
The processors are the bottom nodes of the tree. For a processor to retrieve data, it must go up in the network and then go back down. This is useful for decision making applications that can be mapped as trees. The degree of a tree network is 1. The diameter of the network is 2 log (n+1)-2 where n is the number of processors. The Thinking Machines CM-5 is an example of a parallel computer with this type of network. Tree networks are very suitable for database applications because they allow multiple searches through the database at a time.

Interconnection Networks
Torus Network: a mesh with wrap-around connections in both the x and y directions. Multistage Network: a network with more than one networking unit. Fully Connected Network: a network where every processor is connected to every other processor. Hypercube Network: processors are connected as if they were corners of a multidimensional cube. Mesh Network: a network where each interior processor is connected to its four nearest neighbors.

Interconnection Networks
Bus Based Network: coaxial cable based LAN topology in which the medium forms a single bus to which all stations are attached. Cross-bar Switch Network: a network that works through a switching mechanism to access shared memory. Tree Network: the processors are the bottom nodes of the tree. Ring Network: each processor is connected to two others and the line of connections forms a circle.

Summary of Parallel Computer Characteristics
How many processors does the computer have? 10s? 100s? 1000s? How powerful are the processors? What's the MHz rate? What's the MIPS rate? What's the instruction set architecture? RISC or CISC?

Summary of Parallel Computer Characteristics
How much memory is available? Total memory and memory per processor. What kind of memory? Distributed memory, shared memory, or distributed shared memory. What type of flow of control? SIMD, MIMD, or SPMD.

Summary of Parallel Computer Characteristics
What is the interconnection network?
Bus, Crossbar, Hypercube, Tree, Torus, Multistage, Fully Connected, Mesh, Ring, or Hybrid.

Design decisions made by some of the major parallel computer vendors:
Computer | Programming Style | OS | Processors | Memory | Flow of Control | Network
SGI Origin2000 | OpenMP, MPI | IRIX | MIPS RISC R10000 | DSM | MIMD | Crossbar, Hypercube
HP V-Class | OpenMP, MPI | HP-UX | HP PA 8200 | DSM | MIMD | Crossbar, Ring
Cray T3E | SHMEM | Unicos | Compaq Alpha | Distributed | MIMD | Torus
IBM SP | MPI | AIX | IBM Power3 | Distributed | MIMD | IBM Switch
Workstation Clusters | MPI | Linux | Intel Pentium III | Distributed | MIMD | Myrinet, Tree

Summary
This completes our introduction to parallel computing. You have learned about parallelism in computer programs, and also about parallelism in the hardware components of parallel computers. In addition, you have learned about the commonly used parallel computers, and how these computers compare to each other. There are many good texts which provide an introductory treatment of parallel computing. Here are two useful references:
Highly Parallel Computing, Second Edition, George S. Almasi and Allan Gottlieb, Benjamin/Cummings Publishers, 1994.
Parallel Computing Theory and Practice, Michael J. Quinn, McGraw-Hill, Inc., 1994.

Agenda
1 Parallel Computing Overview 2 How to Parallelize a Code 2.1 Automatic Compiler Parallelism 2.2 Data Parallelism by Hand 2.3 Mixing Automatic and Hand Parallelism 2.4 Task Parallelism 2.5 Parallelism Issues 3 Porting Issues 4 Scalar Tuning 5 Parallel Code Tuning 6 Timing and Profiling 7 Cache Tuning 8 Parallel Performance Analysis 9 About the IBM Regatta P690

How to Parallelize a Code
This chapter describes how to turn a single processor program into a parallel one, focusing on shared memory machines. Both automatic compiler parallelization and parallelization by hand are covered. The details for accomplishing both data parallelism and task parallelism are presented.

Automatic Compiler Parallelism
Automatic compiler parallelism enables you to use a single compiler option and let the compiler do the work. The advantage of it is that it's easy to use. The disadvantages are: the compiler only does loop level parallelism, not task parallelism, and the compiler wants to parallelize every do loop in your code. If you have hundreds of do loops this creates way too much parallel overhead.

Automatic Compiler Parallelism
To use automatic compiler parallelism on a Linux system with the Intel compilers, specify the following:
ifort -parallel -O2 ... prog.f
The compiler creates conditional code that will run with any number of threads. Specify the number of threads with setenv and make sure you still get the right answers:
setenv OMP_NUM_THREADS 4
a.out > results

Data Parallelism by Hand
First identify the loops that use most of the CPU time (the Profiling lecture describes how to do this). By hand, insert into the code OpenMP directive(s) just before the loop(s) you want to make parallel. Some code modifications may be needed to remove data dependencies and other inhibitors of parallelism. Use your knowledge of the code and data to assist the compiler. For the SGI Origin2000 computer, insert into the code an OpenMP directive just before the loop that you want to make parallel:
!$OMP PARALLEL DO
do i=1,n
   ... lots of computation ...
end do
!$OMP END PARALLEL DO

Data Parallelism by Hand
Compile with the mp compiler option:
f90 -mp ... prog.f
As before, the compiler generates conditional code that will run with any number of threads.
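As a quick way to confirm how many threads the runtime actually gives you, a small helper program like the following can be compiled the same way. This is a hypothetical sketch, not part of the original course; it relies only on the standard omp_lib module and the omp_get_num_threads routine.

      program check_threads
      ! Hypothetical helper: reports how many OpenMP threads are active
      ! inside a parallel region, so you can confirm that your setenv
      ! OMP_NUM_THREADS setting took effect.
      use omp_lib
      implicit none
!$OMP PARALLEL
!$OMP MASTER
      print *, 'running with', omp_get_num_threads(), 'threads'
!$OMP END MASTER
!$OMP END PARALLEL
      end program check_threads

Run it once after each change to OMP_NUM_THREADS; the count it prints should match the value you set.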
If you want to rerun your program with a different number of threads, you do not need to recompile; just re-specify the setenv command:
setenv OMP_NUM_THREADS 8
a.out > results2
The setenv command can be placed anywhere before the a.out command. The setenv command must be typed exactly as indicated. If you have a typo, you will not receive a warning or error message. To make sure that the setenv command is specified correctly, type:
setenv
It produces a listing of your environment variable settings.

Mixing Automatic and Hand Parallelism
You can have one source file parallelized automatically by the compiler, and another source file parallelized by hand. Suppose you split your code into two files named prog1.f and prog2.f.
f90 -c -apo ... prog1.f   (automatic parallelization for prog1.f)
f90 -c -mp ... prog2.f    (parallelization by hand for prog2.f)
f90 prog1.o prog2.o       (creates one executable)
a.out > results           (runs the executable)

Task Parallelism
You can accomplish task parallelism as follows:
!$OMP PARALLEL
!$OMP SECTIONS
   ... lots of computation in part A ...
!$OMP SECTION
   ... lots of computation in part B ...
!$OMP SECTION
   ... lots of computation in part C ...
!$OMP END SECTIONS
!$OMP END PARALLEL
Compile with the mp compiler option:
f90 -mp ... prog.f
Use the setenv command to specify the number of threads:
setenv OMP_NUM_THREADS 3
a.out > results

Parallelism Issues
There are some issues to consider when parallelizing a program. Should data parallelism or task parallelism be used? Should automatic compiler parallelism or parallelism by hand be used? Which loop in a nested loop situation should be the one that becomes parallel? How many threads should be used?

Agenda
1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 3.1 Recompile 3.2 Word Length 3.3 Compiler Options for Debugging 3.4 Standards Violations 3.5 IEEE Arithmetic Differences 3.6 Math Library Differences 3.7 Compute Order Related Differences 3.8 Optimization Level Too High 3.9 Diagnostic Listings 3.10 Further Information

Porting Issues
In order to run a computer program that presently runs on a workstation, a mainframe, a vector computer, or another parallel computer, on a new parallel computer you must first "port" the code. After porting the code, it is important to have some benchmark results you can use for comparison. To do this, run the original program on a well-defined dataset, and save the results from the old or "baseline" computer. Then run the ported code on the new computer and compare the results. If the results are different, don't automatically assume that the new results are wrong – they may actually be better. There are several reasons why this might be true, including: Precision Differences - the new results may actually be more accurate than the baseline results. Code Flaws - porting your code to a new computer may have uncovered a hidden flaw in the code that was already there. Detection methods for finding code flaws, solutions, and workarounds are provided in this lecture.

Recompile
Some codes just need to be recompiled to get accurate results.
The compilers available on the NCSA computer platforms are shown in the following table (grouped by platform):
SGI Origin2000 (MIPSpro) | Fortran 77: f77 | Fortran 90: f90 | Fortran 95: f95 | C: cc | C++: CC
IA-32 Linux | Intel: ifort, icc, icpc | GNU: g77, gcc, g++ | Portland Group: pgf77, pgf90, pghpf (High Performance Fortran), pgcc, pgCC
IA-64 Linux | Intel: ifort, icc, icpc | GNU: g77, gcc, g++ | High Performance Fortran: pghpf

Word Length
Code flaws can occur when you are porting your code to a different word length computer. For C, the size of an integer variable differs depending on the machine and how the variable is generated. On the IA32 and IA64 Linux clusters, the size of an integer variable is 4 and 8 bytes, respectively. On the SGI Origin2000, the corresponding value is 4 bytes if the code is compiled with the -n32 flag, and 8 bytes if compiled without any flags or explicitly with the -64 flag. For Fortran, the SGI MIPSpro and Intel compilers contain the following flags to set default variable size:
-in, where n is a number: set the default INTEGER to INTEGER*n. The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux clusters.
-rn, where n is a number: set the default REAL to REAL*n. The value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.

Compiler Options for Debugging
On the SGI Origin2000, the MIPSpro compilers include debugging options via the -DEBUG:group option. The syntax is as follows:
-DEBUG:option1[=value1]:option2[=value2]...
Two examples are:
Array-bound checking (check for subscripts out of range at runtime): -DEBUG:subscript_check=ON
Force all un-initialized stack, automatic and dynamically allocated variables to be initialized: -DEBUG:trap_uninitialized=ON

Compiler Options for Debugging
On the IA32 Linux cluster, the Fortran compiler is equipped with the following -C flags for runtime diagnostics:
-CA: pointers and allocatable references
-CB: array and subscript bounds
-CS: consistent shape of intrinsic procedure
-CU: use of uninitialized variables
-CV: correspondence between dummy and actual arguments

Standards Violations
Code flaws can occur when the program has non-ANSI standard Fortran coding. ANSI standard Fortran is a set of rules for compiler writers that specify, for example, the value of the do loop index upon exit from the do loop.

Standards Violations Detection
To detect standards violations on the SGI Origin2000 computer use the -ansi flag. This option generates a listing of warning messages for the use of non-ANSI standard coding. On the Linux clusters, the -ansi[-] flag enables/disables assumption of ANSI conformance.

IEEE Arithmetic Differences
Code flaws occur when the baseline computer conforms to the IEEE arithmetic standard and the new computer does not. The IEEE Arithmetic Standard is a set of rules governing arithmetic roundoff and overflow behavior. For example, it prohibits the compiler writer from replacing x/y with x*recip(y), since the two results may differ slightly for some operands. You can make your program strictly conform to the IEEE standard. To make your program conform to the IEEE Arithmetic Standard on the SGI Origin2000 computer use:
f90 -OPT:IEEE_arithmetic=n ... prog.f
where n is 1, 2, or 3. This option specifies the level of conformance to the IEEE standard, where 1 is the most stringent and 3 is the most liberal. On the Linux clusters, the Intel compilers can achieve conformance to the IEEE standard at a stringent level with the -mp flag, or a slightly relaxed level with the -mp1 flag.
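To see why the standard cares about this substitution, the following small sketch (hypothetical, not from the original course) compares x/y with x*(1/y) over a handful of operands. For many values the two agree exactly; when they differ, it is only in the last bits, which is exactly the kind of drift the IEEE rules are meant to control.

      program recip_demo
      ! Hypothetical illustration: dividing by y and multiplying by a
      ! precomputed reciprocal round differently, so the two results can
      ! disagree in the last bits for some operands.
      implicit none
      integer :: i
      real*8  :: x, y, direct, via_recip
      y = 3.0d0
      do i = 1, 20
         x = 1.0d0 + 0.1d0*dble(i)
         direct    = x / y
         via_recip = x * (1.0d0 / y)
         if (direct .ne. via_recip) then
            print *, 'x =', x, ' difference =', direct - via_recip
         end if
      end do
      end program recip_demo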
Math Library Differences
Most high-performance parallel computers are equipped with vendor-supplied math libraries. On the SGI Origin2000 platform, there are the SGI/Cray Scientific Library (SCSL) and Complib.sgimath. SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), LAPACK and Fast Fourier Transform (FFT) routines. SCSL can be linked with -lscs for the serial version, or -mp -lscs_mp for the parallel version. The complib library can be linked with -lcomplib.sgimath for the serial version, or -mp -lcomplib.sgimath_mp for the parallel version. The Intel Math Kernel Library (MKL) contains the complete set of functions from BLAS, the extended BLAS (sparse), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines.

Math Library Differences
On the IA32 Linux cluster, the libraries to link to are:
For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
For LAPACK: -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
When calling MKL routines from C/C++ programs, you also need to link with -lF90.
On the IA64 Linux cluster, the corresponding libraries are:
For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
For LAPACK: -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
When calling MKL routines from C/C++ programs, you also need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins

Compute Order Related Differences
Code flaws can occur because of the non-deterministic computation of data elements on a parallel computer. The compute order in which the threads will run cannot be guaranteed. For example, in a data parallel program, the 50th index of a do loop may be computed before the 10th index of the loop. Furthermore, the threads may run in one order on the first run, and in another order on the next run of the program. Note: if your algorithm depends on data being compared in a specific order, your code is inappropriate for a parallel computer. Use the following method to detect compute order related differences: if your loop looks like
DO I = 1, N
change it to
DO I = N, 1, -1
The results should not change if the iterations are independent.

Optimization Level Too High
Code flaws can occur when the optimization level has been set too high, thus trading speed for accuracy. The compiler reorders and optimizes your code based on assumptions it makes about your program. This can sometimes cause answers to change at higher optimization levels.

Setting the Optimization Level
Both the SGI Origin2000 computer and the IBM Linux clusters provide Level 0 (no optimization) to Level 3 (most aggressive) optimization, using the -O{0,1,2, or 3} flag. One should bear in mind that Level 3 optimization may carry out loop transformations that affect the correctness of calculations. Checking correctness and precision of calculation is highly recommended when -O3 is used. For example, on the Origin2000,
f90 -O0 ... prog.f
turns off all optimizations.

Optimization Level Too High
Isolating Optimization Level Problems
You can sometimes isolate optimization level problems using the method of binary chop. To do this, divide your program prog.f into halves. Name them prog1.f and prog2.f. Compile the first half with -O0 and the second half with -O3:
f90 -c -O0 prog1.f
f90 -c -O3 prog2.f
f90 prog1.o prog2.o
a.out > results
If the results are correct, the optimization problem lies in prog1.f. Next divide prog1.f into halves.
Name them prog1a.f and prog1b.f. Compile prog1a.f with -O0 and prog1b.f with -O3:
f90 -c -O0 prog1a.f
f90 -c -O3 prog1b.f
f90 prog1a.o prog1b.o prog2.o
a.out > results
Continue in this manner until you have isolated the section of code that is producing incorrect results.

Diagnostic Listings
The SGI Origin2000 compiler will generate all kinds of diagnostic warnings and messages, but not always by default. Some useful listing options are:
f90 -listing ...
f90 -fullwarn ...
f90 -showdefaults ...
f90 -version ...
f90 -help ...

Further Information
SGI:
man f77/f90/cc
man debug_group
man math
man complib.sgimath
MIPSpro 64-Bit Porting and Transition Guide
Online Manuals
Linux clusters pages:
ifort/icc/icpc -help (IA32, IA64, Intel64)
Intel Fortran Compiler for Linux
Intel C/C++ Compiler for Linux

Agenda
1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 4.1 Aggressive Compiler Options 4.2 Compiler Optimizations 4.3 Vendor Tuned Code 4.4 Further Information

Scalar Tuning
If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime. This chapter describes many of these techniques: the use of the most aggressive compiler options, the improvement of loop unrolling, the use of subroutine inlining, and the use of vendor supplied tuned code. The detection of cache problems and their solution are presented in the Cache Tuning chapter.

Aggressive Compiler Options
For the SGI Origin2000 and the Linux clusters, the main optimization switch is -On, where n ranges from 0 to 3. -O0 turns off all optimizations. -O1 and -O2 do beneficial optimizations that will not affect the accuracy of results. -O3 specifies the most aggressive optimizations. It takes the most compile time, may produce changes in accuracy, and turns on software pipelining.

Aggressive Compiler Options
It should be noted that -O3 might carry out loop transformations that produce incorrect results in some codes. It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower-level optimization. On the SGI Origin2000 and the Linux clusters, -O3 can be used together with -OPT:IEEE_arithmetic=n (n=1, 2, or 3) and -mp (or -mp1), respectively, to enforce operation conformance to the IEEE standard at different levels. On the SGI Origin2000, the option -Ofast=ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.

Agenda
1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 4.1 Aggressive Compiler Options 4.2 Compiler Optimizations 4.2.1 Statement Level 4.2.2 Block Level 4.2.3 Routine Level 4.2.4 Software Pipelining 4.2.5 Loop Unrolling 4.2.6 Subroutine Inlining 4.2.7 Optimization Report 4.2.8 Profile-guided Optimization (PGO) 4.3 Vendor Tuned Code 4.4 Further Information

Compiler Optimizations
The various compiler optimizations can be classified as follows: Statement Level Optimizations, Block Level Optimizations, Routine Level Optimizations, Software Pipelining, Loop Unrolling, and Subroutine Inlining. Each of these is described in the following sections.

Statement Level
Constant Folding: replace simple arithmetic operations on constants with the pre-computed result. y = 5+7 becomes y = 12.
Short Circuiting: avoid executing parts of conditional tests that are not necessary. For example, given
if (I.eq.J .or. I.eq.K) expression, when I=J the expression can be computed immediately without evaluating I.eq.K.
Register Assignment: put frequently used variables in registers.

Block Level
Dead Code Elimination: remove unreachable code and code that is never executed or used.
Instruction Scheduling: reorder the instructions to improve memory pipelining.

Routine Level
Strength Reduction: replace expressions in a loop with an expression that takes fewer cycles.
Common Subexpression Elimination: expressions that appear more than once are computed once, and the result is substituted for each occurrence of the expression.
Constant Propagation: compile time replacement of variables with constants.
Loop Invariant Elimination: expressions inside a loop that don't change with the do loop index are moved outside the loop.

Software Pipelining
Software pipelining allows the mixing of operations from different loop iterations in each iteration of the hardware loop. It is used to get the maximum work done per clock cycle. Note: on the R10000s there is out-of-order execution of instructions, and software pipelining may actually get in the way of this feature.

Loop Unrolling
The loop's stride (or step) value is increased, and the body of the loop is replicated. It is used to improve the scheduling of the loop by giving a longer sequence of straight line code. An example of loop unrolling follows:
Original Loop:
do I = 1, 99
   c(I) = a(I) + b(I)
enddo
Unrolled Loop:
do I = 1, 99, 3
   c(I)   = a(I)   + b(I)
   c(I+1) = a(I+1) + b(I+1)
   c(I+2) = a(I+2) + b(I+2)
enddo
There is a limit to the amount of unrolling that can take place because there are a limited number of registers. On the SGI Origin2000, loops are unrolled to a level of 8 by default. You can unroll to a level of 12 by specifying:
f90 -O3 -OPT:unroll_times_max=12 ... prog.f
On the IA32 Linux cluster, the corresponding flags are -unroll and -unroll0 for unrolling and no unrolling, respectively.

Subroutine Inlining
Subroutine inlining replaces a call to a subroutine with the body of the subroutine itself. One reason for using subroutine inlining is that when a subroutine is called inside a do loop that has a huge iteration count, subroutine inlining may be more efficient because it cuts down on loop overhead. However, the chief reason for using it is that do loops that contain subroutine calls may not parallelize.

Subroutine Inlining
On the SGI Origin2000 computer, there are several options to invoke inlining:
Inline all routines except those specified to -INLINE:never: f90 -O3 -INLINE:all ... prog.f
Inline no routines except those specified to -INLINE:must: f90 -O3 -INLINE:none ... prog.f
Specify a list of routines to inline at every call: f90 -O3 -INLINE:must=subrname ... prog.f
Specify a list of routines never to inline: f90 -O3 -INLINE:never=subrname ... prog.f
On the Linux clusters, the following flags can invoke function inlining:
-ip: inline function expansion for calls defined within the current source file
-ipo: inline function expansion for calls defined in separate files

Optimization Report
Intel 9.x and later compilers can generate reports that provide useful information on optimization done on different parts of your code. To generate such optimization reports in a file filename, add the flag -opt-report-file filename. If you have a lot of source files to process simultaneously, and you use a makefile to compile, you can also use make's "suffix" rules to have optimization reports produced automatically, each with a unique name.
For example,
.f.o:
	ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f
creates optimization reports that are named identically to the original Fortran source but with the suffix ".f" replaced by ".opt".

Optimization Report
To help developers and performance analysts navigate through the usually lengthy optimization reports, the NCSA program OptView is designed to provide an easy-to-use and intuitive interface that allows the user to browse through their own source code, cross-referenced with the optimization reports. OptView is installed on NCSA's IA64 Linux cluster under the directory /usr/apps/tools/bin. You can either add that directory to your UNIX PATH or you can invoke OptView using an absolute path name. You'll need to be using the X-Window system and to have set your DISPLAY environment variable correctly for OptView to work. OptView can provide a quick overview of which loops in a source code, or in source codes among multiple files, are highly optimized and which might need further work. For a detailed description of the use of OptView, see: http://perfsuite.ncsa.uiuc.edu/OptView/

Profile-guided Optimization (PGO)
Profile-guided optimization allows Intel compilers to use valuable runtime information to make better decisions about function inlining and interprocedural optimizations to generate faster codes. Its methodology is illustrated as follows:

Profile-guided Optimization (PGO)
First, you do an instrumented compilation by adding the -prof-gen flag in the compile process:
icc -prof-gen -c a1.c a2.c a3.c
icc a1.o a2.o a3.o -lirc
Then, you run the program with a representative set of data to generate the dynamic information files given by the .dyn suffix. These files contain valuable runtime information for the compiler to do better function inlining and other optimizations. Finally, the code is recompiled again with the -prof-use flag to use the runtime information:
icc -prof-use -ipo -c a1.c a2.c a3.c
A profile-guided optimized executable is generated.

Vendor Tuned Code
Vendor math libraries have codes that are optimized for their specific machine. On the SGI Origin2000 platform, Complib.sgimath and SCSL are available. On the Linux clusters, Intel MKL is available. Ways to link to these libraries are described in Section 3 - Porting Issues.

Further Information
SGI IRIX man and www pages:
man opt
man lno
man inline
man ipa
man perfex
Performance Tuning for the Origin2000 at http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
Linux clusters help and www pages:
ifort/icc/icpc -help (Intel)
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/ (Intel64)
http://perfsuite.ncsa.uiuc.edu/OptView/

Agenda
1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 5 Parallel Code Tuning 5.1 Sequential Code Limitation 5.2 Parallel Overhead 5.3 Load Balance 5.3.1 Loop Schedule Types 5.3.2 Chunk Size

Parallel Code Tuning
This chapter describes several of the most common techniques for parallel tuning, the type of programs that benefit, and the details for implementing them. The majority of this chapter deals with improving load balancing.
The do loop has a call to a subroutine or a reference to a function subprogram.

Sequential Code Fraction
As shown by Amdahl's Law, if the sequential fraction is too large, there is a limitation on speedup. If you think too much sequential code is a problem, you can calculate the sequential fraction of code using the Amdahl's Law formula.

Sequential Code Limitation
Measuring the Sequential Code Fraction
Decide how many processors to use; this is p. Run and time the program with 1 processor to give T(1). Run and time the program with p processors to give T(p). Form a ratio of the 2 timings, T(1)/T(p); this is SP. Substitute SP and p into the Amdahl's Law formula: f = (1/SP - 1/p)/(1 - 1/p), where f is the fraction of sequential code. Solve for f; this is the fraction of sequential code. For example, with p = 4, T(1) = 100 s, and T(4) = 40 s, SP = 2.5 and f = (1/2.5 - 1/4)/(1 - 1/4) = 0.2, so about 20% of the runtime is sequential.
Decreasing the Sequential Code Fraction
The compilation optimization reports list which loops could not be parallelized and why. You can use this report as a guide to improve performance on do loops by: removing dependencies, removing I/O, and removing calls to subroutines and function subprograms.

Parallel Overhead
Parallel overhead is the processing time spent creating threads, spin/blocking threads, starting and ending parallel regions, and synchronizing at the end of parallel regions. When the computational work done by the parallel processes is too small, the overhead time needed to create and control the parallel processes can be disproportionately large, limiting the savings due to parallelism.
Measuring Parallel Overhead
To get a rough under-estimate of parallel overhead: run and time the code using 1 processor, parallelize the code, run and time the parallel code using only 1 processor, and subtract the 2 timings.

Parallel Overhead
Reducing Parallel Overhead
To reduce parallel overhead: Don't parallelize all the loops. Don't parallelize small loops. To benefit from parallelization, a loop needs about 1000 floating point operations or 500 statements in the loop. You can use the IF modifier in the OpenMP directive to control when loops are parallelized:
!$OMP PARALLEL DO IF(n > 500)
do i=1,n
   ... body of loop ...
end do
!$OMP END PARALLEL DO
Use task parallelism instead of data parallelism. It doesn't generate as much parallel overhead and often more code runs in parallel. Don't use more threads than you need. Parallelize at the highest level possible.

Load Balance
Load balance is the even assignment of subtasks to processors so as to keep each processor busy doing useful work for as long as possible. Load balance is important for speedup because the end of a do loop is a synchronization point where threads need to catch up with each other. If processors have different workloads, some of the processors will idle while others are still working.
Measuring Load Balance
On the SGI Origin, to measure load balance, use the perfex tool, which is a command line interface to the R10000 hardware counters. The command
perfex -e16 -mp a.out > results
reports per thread cycle counts. Compare the cycle counts to determine load balance problems. The master thread (thread 0) always uses more cycles than the slave threads. If the counts are vastly different, it indicates load imbalance.

Load Balance
For Linux systems, the thread cpu times can be compared with ps. A thread with unusually high or low time compared to the others may not be working efficiently (high cputime could be the result of a thread spinning while waiting for other threads to catch up).
ps uH Improving Load Balance To improve load balance, try changing the way that loop iterations are allocated to threads by changing the loop schedule type changing the chunk size These methods are discussed in the following sections. Loop Schedule Types On the SGI Origin2000 computer, 4 different loop schedule types can be specified by an OpenMP directive. They are: Static Dynamic Guided Runtime If you don't specify a schedule type, the default will be used. Default Schedule Type The default schedule type allocates 20 iterations on 4 threads as: Loop Schedule Types Static Schedule Type The static schedule type is used when some of the iterations do more work than others. With the static schedule type, iterations are allocated in a round-robin fashion to the threads. An Example Suppose you are computing on the upper triangle of a 100 x 100 matrix, and you use 2 threads, named t0 and t1. With default scheduling, workloads are uneven. Loop Schedule Types Whereas with static scheduling, the columns of the matrix are given to the threads in a round robin fashion, resulting in better load balance. Loop Schedule Types Dynamic Schedule Type The iterations are dynamically allocated to threads at runtime. Each thread is given a chunk of iterations. When a thread finishes its work, it goes into a critical section where it’s given another chunk of iterations to work on. This type is useful when you don’t know the iteration count or work pattern ahead of time. Dynamic gives good load balance, but at a high overhead cost. Guided Schedule Type The guided schedule type is dynamic scheduling that starts with large chunks of iterations and ends with small chunks of iterations. That is, the number of iterations given to each thread depends on the number of iterations remaining. The guided schedule type reduces the number of entries into the critical section, compared to the dynamic schedule type. Guided gives good load balancing at a low overhead cost. Chunk Size The word chunk refers to a grouping of iterations. Chunk size means how many iterations are in the grouping. The static and dynamic schedule types can be used with a chunk size. If a chunk size is not specified, then the chunk size is 1. Suppose you specify a chunk size of 2 with the static schedule type. Then 20 iterations are allocated on 4 threads: The schedule type and chunk size are specified as follows: !$OMP PARALLEL DO SCHEDULE(type, chunk) … !$OMP END PARALLEL DO Where type is STATIC, or DYNAMIC, or GUIDED and chunk is any positive integer. Agenda 1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 5 Parallel Code Tuning 6 Timing and Profiling 6.1 Timing 6.1.1 Timing a Section of Code 6.1.1.1 CPU Time 6.1.1.2 Wall clock Time 6.1.2 Timing an Executable 6.1.3 Timing a Batch Job 6.2 Profiling 6.2.1 Profiling Tools 6.2.2 Profile Listings 6.2.3 Profiling Analysis 6.3 Further Information Timing and Profiling Now that your program has been ported to the new computer, you will want to know how fast it runs. This chapter describes how to measure the speed of a program using various timing routines. The chapter also covers how to determine which parts of the program account for the bulk of the computational load so that you can concentrate your tuning efforts on those computationally intensive parts of the program. Timing In the following sections, we’ll discuss timers and review the profiling tools ssrun and prof on the Origin and vprof and gprof on the Linux Clusters. 
The specific timing functions described are:
Timing a section of code
FORTRAN: etime, dtime, cpu_time for CPU time; time and f_time for wall clock time
C: clock for CPU time; gettimeofday for wall clock time
Timing an executable
time a.out
Timing a batch run
busage, qstat, qhist
CPU Time
etime
A section of code can be timed using etime. It returns the elapsed CPU time in seconds since the program started.
real*4 tarray(2),time1,time2,timeres
… beginning of program
time1=etime(tarray)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=etime(tarray)
timeres=time2-time1
dtime
A section of code can also be timed using dtime. It returns the elapsed CPU time in seconds since the last call to dtime.
real*4 tarray(2),timeres
… beginning of program
timeres=dtime(tarray)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
timeres=dtime(tarray)
… rest of program
The etime and dtime Functions
User time. This is returned as the first element of tarray. It's the CPU time spent executing user code.
System time. This is returned as the second element of tarray. It's the time spent executing system calls on behalf of your program.
Sum of user and system time. This is the function value that is returned. It's the time that is usually reported.
Metric. Timings are reported in seconds. Timings are accurate to 1/100th of a second.
Timing Comparison Warnings
For the SGI computers: the etime and dtime functions return the MAX time over all threads for a parallel program. This is the time of the longest thread, which is usually the master thread.
For the Linux clusters: the etime and dtime functions are contained in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library, include the compiler flag -Vaxlib.
Another warning: do not put calls to etime and dtime inside a do loop. The overhead is too large.
cpu_time
The cpu_time routine is available only on the Linux clusters, as it is a component of the Intel FORTRAN compiler library. It provides substantially higher resolution and substantially lower overhead than the older etime and dtime routines. It can be used as an elapsed CPU timer.
real*8 time1, time2, timeres
… beginning of program
call cpu_time(time1)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
call cpu_time(time2)
timeres=time2-time1
… rest of program
clock
C programmers can call the cpu_time routine through a FORTRAN wrapper, or use the standard library function clock to determine elapsed CPU time.
#include <time.h>
static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
double time1, time2, timeres;
…
time1=(clock()*iCPS);
… /* do some work */
time2=(clock()*iCPS);
timeres=time2-time1;
Wall clock Time
time
For the Origin, the function time returns the time since 00:00:00 GMT, Jan. 1, 1970. It is a means of getting the elapsed wall clock time. The wall clock time is reported in integer seconds.
external time
integer*4 time1,time2,timeres
… beginning of program
time1=time( )
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=time( )
timeres=time2 - time1
f_time
For the Linux clusters, the appropriate FORTRAN function for elapsed wall clock time is f_time.
integer*8 f_time external f_time integer*8 time1,time2,timeres … beginning of program time1=f_time() … start of section of code to be timed … lots of computation … end of section of code to be timed time2=f_time() timeres=time2 - time1 As above for etime and dtime, the f_time function is in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library include the compiler flag -Vaxlib. Wall clock Time gettimeofday For C programmers, wallclock time can be obtained by using the very portable routine gettimeofday. #include <stddef.h> /* definition of NULL */ #include <sys/time.h> /* definition of timeval struct and protyping of gettimeofday */ double t1,t2,elapsed; struct timeval tp; int rtn; .... .... rtn=gettimeofday(&tp, NULL); t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec; .... /* do some work */ .... rtn=gettimeofday(&tp, NULL); t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec; elapsed=t2-t1; Timing an Executable To time an executable (if using a csh or tcsh shell, explicitly call /usr/bin/time) time …options… a.out where options can be ‘-p’ for a simple output or ‘-f format’ which allows the user to display more than just time related information. Consult the man pages on the time command for format options. Timing a Batch Job Time of a batch job running or completed. Origin busage jobid Linux clusters qstat jobid # for a running job qhist jobid # for a completed job Agenda 1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 5 Parallel Code Tuning 6 Timing and Profiling 6.1 Timing 6.1.1 Timing a Section of Code 6.1.1.1 CPU Time 6.1.1.2 Wall clock Time 6.1.2 Timing an Executable 6.1.3 Timing a Batch Job 6.2 Profiling 6.2.1 Profiling Tools 6.2.2 Profile Listings 6.2.3 Profiling Analysis 6.3 Further Information Profiling Profiling determines where a program spends its time. It detects the computationally intensive parts of the code. Use profiling when you want to focus attention and optimization efforts on those loops that are responsible for the bulk of the computational load. Most codes follow the 90-10 Rule. That is, 90% of the computation is done in 10% of the code. Profiling Tools Profiling Tools on the Origin On the SGI Origin2000 computer there are profiling tools named ssrun and prof. Used together they do profiling, or what is called hot spot analysis. They are useful for generating timing profiles. ssrun The ssrun utility collects performance data for an executable that you specify. The performance data is written to a file named "executablename.exptype.id". prof The prof utility analyzes the data file created by ssrun and produces a report. Example ssrun -fpcsamp a.out prof -h a.out.fpcsamp.m12345 > prof.list Profiling Tools Profiling Tools on the Linux Clusters On the Linux clusters the profiling tools are still maturing. There are currently several efforts to produce tools comparable to the ssrun, prof and perfex tools. . gprof Basic profiling information can be generated using the OS utility gprof. First, compile the code with the compiler flags -qp -g for the Intel compiler (-g on the Intel compiler does not change the optimization level) or -pg for the GNU compiler. Second, run the program. Finally analyze the resulting gmon.out file using the gprof utility: gprof executable gmon.out. efc -O -qp -g -o foo foo.f ./foo gprof foo gmon.out Profiling Tools Profiling Tools on the Linux Clusters vprof On the IA32 platform there is a utility called vprof that provides performance information using the PAPI instrumentation library. 
To instrument the whole application requires recompiling and linking to the vprof and PAPI libraries:
setenv VMON PAPI_TOT_CYC
ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
./md
/usr/apps/tools/vprof/bin/cprof -e md vmon.out
Profile Listings
Profile Listings on the Origin
prof Output, First Listing

  Cycles      %     Cum%   Secs   Proc
42630984   58.47   58.47   0.57   VSUB
 6498294    8.91   67.38   0.09   PFSOR
 6141611    8.42   75.81   0.08   PBSOR
 3654120    5.01   80.82   0.05   PFSOR1
 2615860    3.59   84.41   0.03   VADD
 1580424    2.17   86.57   0.02   ITSRCG
 1144036    1.57   88.14   0.02   ITSRSI
  886044    1.22   89.36   0.01   ITJSI
  861136    1.18   90.54   0.01   ITJCG

The first listing gives the number of cycles executed in each procedure (or subroutine). The procedures are listed in descending order of cycle count.
prof Output, Second Listing

  Cycles      %     Cum%   Line   Proc
36556944   50.14   50.14   8106   VSUB
 5313198    7.29   57.43   6974   PFSOR
 4968804    6.81   64.24   6671   PBSOR
 2989882    4.10   68.34   8107   VSUB
 2564544    3.52   71.86   7097   PFSOR1
 1988420    2.73   74.59   8103   VSUB
 1629776    2.24   76.82   8045   VADD
  994210    1.36   78.19   8108   VSUB
  969056    1.33   79.52   8049   VADD
  483018    0.66   80.18   6972   PFSOR

The second listing gives the number of cycles per source code line. The lines are listed in descending order of cycle count.
Profile Listings on the Linux Clusters
gprof Output, First Listing

Flat profile: Each sample counts as 0.000976562 seconds.
  %   cumulative     self                 self      total
 time    seconds  seconds      calls   us/call    us/call  name
38.07       5.67     5.67        101  56157.18  107450.88  compute_
34.72      10.84     5.17   25199500      0.21       0.21  dist_
25.48      14.64     3.80                                  SIND_SINCOS
 1.25      14.83     0.19                                  sin
 0.37      14.88     0.06                                  cos
 0.05      14.89     0.01      50500      0.15       0.15  dotr8_
 0.05      14.90     0.01        100     68.36      68.36  update_
 0.01      14.90     0.00                                  f_fioinit
 0.01      14.90     0.00                                  f_intorange
 0.01      14.90     0.00                                  mov
 0.00      14.90     0.00          1      0.00       0.00  initialize_

The listing gives a 'flat' profile of the functions and routines encountered, sorted by 'self seconds', which is the number of seconds accounted for by each function alone.
gprof Output, Second Listing

Call graph:
index  % time    self  children    called              name
[1]      72.9    0.00     10.86                        main [1]
                 5.67      5.18       101/101              compute_ [2]
                 0.01      0.00       100/100              update_ [8]
                 0.00      0.00         1/1                initialize_ [12]
-----------------------------------------------------------------------
                 5.67      5.18       101/101          main [1]
[2]      72.8    5.67      5.18       101              compute_ [2]
                 5.17      0.00  25199500/25199500         dist_ [3]
                 0.01      0.00     50500/50500            dotr8_ [7]
-----------------------------------------------------------------------
                 5.17      0.00  25199500/25199500     compute_ [2]
[3]      34.7    5.17      0.00  25199500              dist_ [3]
-----------------------------------------------------------------------
                                                       <spontaneous>
[4]      25.5    3.80      0.00                        SIND_SINCOS [4]
… …

The second listing gives a 'call-graph' profile of the functions and routines encountered. The definitions of the columns are specific to the line in question. Detailed information is contained in the full output from gprof.
Profile Listings on the Linux Clusters
vprof Listing
Columns correspond to the following events:
PAPI_TOT_CYC - Total cycles (1956 events)
File Summary:
 100.0% /u/ncsa/gbauer/temp/md.f
Function Summary:
  84.4% compute
  15.6% dist
Line Summary:
  67.3% /u/ncsa/gbauer/temp/md.f:106
  13.6% /u/ncsa/gbauer/temp/md.f:104
   9.3% /u/ncsa/gbauer/temp/md.f:166
   2.5% /u/ncsa/gbauer/temp/md.f:165
   1.5% /u/ncsa/gbauer/temp/md.f:102
   1.2% /u/ncsa/gbauer/temp/md.f:164
   0.9% /u/ncsa/gbauer/temp/md.f:107
   0.8% /u/ncsa/gbauer/temp/md.f:169
   0.8% /u/ncsa/gbauer/temp/md.f:162
   0.8% /u/ncsa/gbauer/temp/md.f:105
   0.7% /u/ncsa/gbauer/temp/md.f:149
   0.5% /u/ncsa/gbauer/temp/md.f:163
   0.2% /u/ncsa/gbauer/temp/md.f:109
   0.1% /u/ncsa/gbauer/temp/md.f:100
… …
The listing, produced using the -e option to cprof, displays not only the cycles consumed by functions (a flat profile) but also the lines in the code that contribute to those functions:

 100   0.1%      do j=1,np
 101             if (i .ne. j) then
 102   1.5%        call dist(nd,box,pos(1,i),pos(1,j),rij,d)
 103               ! attribute half of the potential energy to particle 'j'
 104  13.6%        pot = pot + 0.5*v(d)
 105   0.8%        do k=1,nd
 106  67.3%          f(k,i) = f(k,i) - rij(k)*dv(d)/d
 107   0.9%        enddo
 108             endif
 109   0.2%      enddo

Profiling Analysis
The program being analyzed in the previous Origin example has approximately 10,000 source code lines and consists of many subroutines. The first profile listing shows that over 50% of the computation is done inside the VSUB subroutine. The second profile listing shows that line 8106 in subroutine VSUB accounted for 50% of the total computation. Going back to the source code, line 8106 is a line inside a do loop. By putting an OpenMP compiler directive in front of that do loop, you can get 50% of the program to run in parallel with almost no work on your part.
Since the compiler has rearranged the source lines, the line numbers given by ssrun/prof give you an area of the code to inspect. To view the rearranged source, use the options
f90 … -FLIST:=ON
cc … -CLIST:=ON
For the Intel compilers, the appropriate options are
ifort … -E …
icc … -E …
Further Information
SGI Irix
man etime
man 3 time
man 1 time
man busage
man timers
man ssrun
man prof
Origin2000 Performance Tuning and Optimization Guide
Linux Clusters
man 3 clock
man 2 gettimeofday
man 1 time
man 1 gprof
man 1B qstat
Intel Compilers
Vprof on NCSA Linux Cluster
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.1.1 Memory Hierarchy
7.1.2 Cache Mapping
7.1.3 Cache Thrashing
7.1.4 Cache Coherence
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Cache Concepts
The CPU time required to perform an operation is the sum of the clock cycles spent executing instructions and the clock cycles spent waiting for memory. The CPU cannot be performing useful work if it is waiting for data to arrive from memory. Clearly then, the memory system is a major factor in determining the performance of your program, and a large part of that is your use of the cache.
The following sections will discuss the key concepts of cache including: Memory subsystem hierarchy Cache mapping Cache thrashing Cache coherence Memory Hierarchy The different subsystems in the memory hierarchy have different speeds, sizes, and costs. Smaller memory is faster Slower memory is cheaper The hierarchy is set up so that the fastest memory is closest to the CPU, and the slower memories are further away from the CPU. Memory Hierarchy It's a hierarchy because every level is a subset of a level further away. All data in one level is found in the level below. The purpose of cache is to improve the memory access time to the processor. There is an overhead associated with it, but the benefits outweigh the cost. Registers Registers are the sources and destinations of CPU data operations. They hold one data element each and are 32 bits or 64 bits wide. They are on-chip and built from SRAM. Computers usually have 32 or 64 registers. The Origin MIPS R10000 has 64 physical 64-bit registers of which 32 are available for floating-point operations. The Intel IA64 has 328 registers for general-purpose (64 bit), floating-point (80 bit), predicate (1 bit), branch and other functions. Register access speeds are comparable to processor speeds. Memory Hierarchy Main Memory Improvements A hardware improvement called interleaving reduces main memory access time. In interleaving, memory is divided into partitions or segments called memory banks. Consecutive data elements are spread across the banks. Each bank supplies one data element per bank cycle. Multiple data elements are read in parallel, one from each bank. The problem with interleaving is that the memory interleaving improvement assumes that memory is accessed sequentially. If there is 2-way memory interleaving, but the code accesses every other location, there is no benefit. The bank cycle time is 4-8 times the CPU clock cycle time so the main memory can’t keep up with the fast CPU and keep it busy with data. Large main memory with a cycle time comparable to the processor is not affordable. Memory Hierarchy Principle of Locality The way your program operates follows the Principle of Locality. Temporal Locality: When an item is referenced, it will be referenced again soon. Spatial Locality: When an item is referenced, items whose addresses are nearby will tend to be referenced soon. Cache Line The overhead of the cache can be reduced by fetching a chunk or block of data elements. When a main memory access is made, a cache line of data is brought into the cache instead of a single data element. A cache line is defined in terms of a number of bytes. For example, a cache line is typically 32 or 128 bytes. This takes advantage of spatial locality. The additional elements in the cache line will most likely be needed soon. The cache miss rate falls as the size of the cache line increases, but there is a point of negative returns on the cache line size. When the cache line size becomes too large, the transfer time increases. Memory Hierarchy Cache Hit A cache hit occurs when the data element requested by the processor is in the cache. You want to maximize hits. The Cache Hit Rate is defined as the fraction of cache hits. It is the fraction of the requested data that is found in the cache. Cache Miss A cache miss occurs when the data element requested by the processor is NOT in the cache. You want to minimize cache misses. 
Cache Miss Rate is defined as 1.0 - Hit Rate Cache Miss Penalty, or miss time, is the time needed to retrieve the data from a lower level (downstream) of the memory hierarchy. (Recall that the lower levels of the hierarchy have a slower access time.) Memory Hierarchy Levels of Cache It used to be that there were two levels of cache: on-chip and offchip. L1/L2 is still true for the Origin MIPS and the Intel IA-32 processors. Caches closer to the CPU are called Upstream. Caches further from the CPU are called Downstream. The on-chip cache is called First level, L1, or primary cache. An on-chip cache performs the fastest but the computer designer makes a trade-off between die size and cache size. Hence, on-chip cache has a small size. When the on-chip cache has a cache miss the time to access the slower main memory is very large. The off-chip cache is called Second Level, L2, or secondary cache. A cache miss is very costly. To solve this problem, computer designers have implemented a larger, slower off-chip cache. This chip speeds up the on-chip cache miss time. L1 cache misses are handled quickly. L2 cache misses have a larger performance penalty. The cache external to the chip is called Third Level, L3. The newer Intel IA-64 processor has 3 levels of cache Memory Hierarchy Split or Unified Cache In unified cache, typically L2, the cache is a combined instruction-data cache. A disadvantage of a unified cache is that when the data access and instruction access conflict with each other, the cache may be thrashed, e.g. a high cache miss rate. In split cache, typically L1, the cache is split into 2 parts: one for the instructions, called the instruction cache another for the data, called the data cache. The 2 caches are independent of each other, and they can have independent properties. Memory Hierarchy Sizes Memory hierarchy sizes are specified in the following units: Cache Line: bytes L1 Cache: Kbytes L2 Cache: Mbytes Main Memory: Gbytes Cache Mapping Cache mapping determines which cache location should be used to store a copy of a data element from main memory. There are 3 mapping strategies: Direct mapped cache Set associative cache Fully associative cache Direct Mapped Cache In direct mapped cache, a line of main memory is mapped to only a single line of cache. Consequently, a particular cache line can be filled from (size of main memory mod size of cache) different lines from main memory. Direct mapped cache is inexpensive but also inefficient and very susceptible to cache thrashing. Cache Mapping Direct Mapped Cache http://larc.ee.nthu.edu.tw/~cthuang/courses/ee3450/lectures/07_memory.html Cache Mapping Fully Associative Cache For fully associative cache, any line of cache can be loaded with any line from main memory. This technology is very fast but also very expensive. http://www.xbitlabs.com/images/video/radeon-x1000/caches.png Cache Mapping Set Associative Cache For N-way set associative cache, you can think of cache as being divided into N sets (usually N is 2 or 4). A line from main memory can then be written to its cache line in any of the N sets. This is a trade-off between direct mapped and fully associative cache. http://www.alasir.com/articles/cache_principles/cache_way.png Cache Mapping Cache Block Replacement With direct mapped cache, a cache line can only be mapped to one unique place in the cache. The new cache line replaces the cache block at that address. With set associative cache there is a choice of 3 strategies: 1. 
Random There is a uniform random replacement within the set of cache blocks. The advantage of random replacement is that it’s simple and inexpensive to implement. 2. LRU (Least Recently Used) The block that gets replaced is the one that hasn’t been used for the longest time. The principle of temporal locality tells us that recently used data blocks are likely to be used again soon. An advantage of LRU is that it preserves temporal locality. A disadvantage of LRU is that it’s expensive to keep track of cache access patterns. In empirical studies, there was little performance difference between LRU and Random. 3. FIFO (First In First Out) Replace the block that was brought in N accesses ago, regardless of the usage pattern. In empirical studies, Random replacement generally outperformed FIFO. Cache Thrashing Cache thrashing is a problem that happens when a frequently used cache line gets displaced by another frequently used cache line. Cache thrashing can happen for both instruction and data caches. The CPU can’t find the data element it wants in the cache and must make another main memory cache line access. The same data elements are repeatedly fetched into and displaced from the cache. Cache thrashing happens because the computational code statements have too many variables and arrays for the needed data elements to fit in cache. Cache lines are discarded and later retrieved. The arrays are dimensioned too large to fit in cache. The arrays are accessed with indirect addressing, e.g. a(k(j)). Cache Coherence Cache coherence is maintained by an agreement between data stored in cache, other caches, and main memory. When the same data is being manipulated by different processors, they must inform each other of their modification of data. The term Protocol is used to describe how caches and main memory communicate with each other. It is the means by which all the memory subsystems maintain data coherence. Cache Coherence Snoop Protocol All processors monitor the bus traffic to determine cache line status. Directory Based Protocol Cache lines contain extra bits that indicate which other processor has a copy of that cache line, and the status of the cache line – clean (cache line does not need to be sent back to main memory) or dirty (cache line needs to update main memory with content of cache line). Hardware Cache Coherence Cache coherence on the Origin computer is maintained in the hardware, transparent to the programmer. Cache Coherence False sharing happens in a multiprocessor system as a result of maintaining cache coherence. Both processor A and processor B have the same cache line. A modifies the first word of the cache line. B wants to modify the eighth word of the cache line. But A has sent a signal to B that B’s cache line is invalid. B must fetch the cache line again before writing to it. Cache Coherence A cache miss creates a processor stall. The processor is stalled until the data is retrieved from the memory. The stall is minimized by continuing to load and execute instructions, until the data that is stalling is retrieved. These techniques are called: Prefetching Out of order execution Software pipelining Typically, the compiler will do these at -O3 optimization. 
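Before moving on, here is a minimal OpenMP Fortran sketch of the false-sharing situation described above and of the usual fix; the thread count, the loop length, and the assumed 128-byte cache line are illustrative choices, not values taken from this course.

      program false_sharing_fix
      ! Each thread accumulates into partial(1,t).  Padding each accumulator
      ! out to 16 real*8 words (128 bytes, one assumed cache line) keeps the
      ! accumulators of different threads in different cache lines, so one
      ! thread's update does not invalidate the line the others are using.
      integer, parameter :: nthreads = 4, pad = 16
      real*8 partial(pad, nthreads), total
      integer i, t
      partial = 0.0d0
!$OMP PARALLEL DO PRIVATE(i)
      do t = 1, nthreads
         do i = 1, 1000000
            partial(1,t) = partial(1,t) + 1.0d0
         end do
      end do
!$OMP END PARALLEL DO
      total = sum(partial(1,:))
      print *, 'total =', total
      end

With pad set to 1, the four accumulators would sit in the same cache line and every update would force the line to ping-pong between processors, which is exactly the false-sharing penalty described above.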
Cache Coherence
The following is an example of software pipelining. Suppose you compute
Do I=1,N
   y(I)=y(I) + a*x(I)
End Do
In pseudo-assembly language, this is what the Origin compiler will do:

cycle        memory operation   floating-point operation
cycle t+0    ld  y(I+3)
cycle t+1    ld  x(I+3)
cycle t+2    st  y(I-4)         madd I
cycle t+3    st  y(I-3)         madd I+1
cycle t+4    st  y(I-2)         madd I+2
cycle t+5    st  y(I-1)         madd I+3
cycle t+6    ld  y(I+4)
cycle t+7    ld  x(I+4)
cycle t+8    ld  y(I+5)
cycle t+9    ld  x(I+5)
cycle t+10   ld  y(I+6)
cycle t+11   ld  x(I+6)

Since the Origin processor can only execute 1 load or 1 store at a time, the compiler places loads in the instruction pipeline well before the data is needed. It is then able to continue loading while simultaneously performing a fused multiply-add (a+b*c). The code above gets 8 flops in 12 clock cycles. The peak is 24 flops in 12 clock cycles for the Origin. The Intel Pentium III (IA-32) and the Itanium (IA-64) will have differing versions of the code above, but the same concepts apply.
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.2.1 Cache on the SGI Origin2000
7.2.2 Cache on the Intel Pentium III
7.2.3 Cache on the Intel Itanium
7.2.4 Cache Summary
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Cache on the SGI Origin2000
L1 Cache (on-chip primary cache)
Cache size: 32KB floating point data; 32KB integer data and instruction
Cache line size: 32 bytes
Associativity: 2-way set associative
L2 Cache (off-chip secondary cache)
Cache size: 4MB per processor
Cache line size: 128 bytes
Associativity: 2-way set associative
Replacement: LRU
Coherence: directory based
2-way interleaved (2 banks)
Bandwidth
L1 cache to processor: 1.6 GB/s/bank, 3.2 GB/s overall possible; latency: 1 cycle
Between L1 and L2 cache: 1 GB/s; latency: 11 cycles
Between L2 cache and local memory: 0.5 GB/s; latency: 61 cycles
Average 32-processor remote memory latency: 150 cycles
Cache on the Intel Pentium III
L1 Cache (on-chip primary cache)
Cache size: 16KB floating point data; 16KB integer data and instruction
Cache line size: 16 bytes
Associativity: 4-way set associative
L2 Cache (off-chip secondary cache)
Cache size: 256 KB per processor
Cache line size: 32 bytes
Associativity: 8-way set associative
Replacement: pseudo-LRU
Coherence: interleaved (8 banks)
Bandwidth
L1 cache to processor: 16 GB/s; latency: 2 cycles
Between L1 and L2 cache: 11.7 GB/s; latency: 4-10 cycles
Between L2 cache and local memory: 1.0 GB/s; latency: 15-21 cycles
Cache on the Intel Itanium
L1 Cache (on-chip primary cache)
Cache size: 16KB floating point data; 16KB integer data and instruction
Cache line size: 32 bytes
Associativity: 4-way set associative
L2 Cache (off-chip secondary cache)
Cache size: 96KB unified data and instruction
Cache line size: 64 bytes
Associativity: 6-way set associative
Replacement: LRU
L3 Cache (off-chip tertiary cache)
Cache size: 4MB per processor
Cache line size: 64 bytes
Associativity: 4-way set associative
Replacement: LRU
Bandwidth
L1 cache to processor: 25.6 GB/s; latency: 1-2 cycles
Between L1 and L2 cache: 25.6 GB/s; latency: 6-9 cycles
Between L2 and L3 cache: 11.7 GB/s; latency: 21-24 cycles
Between L3 cache and main memory: 2.1 GB/s; latency: 50 cycles
Cache Summary

Chip          #Caches   Associativity   Replacement   CPU MHz   Peak Mflops   LD,ST/cycle
MIPS R10000      2           2/2            LRU       195/250     390/500     1 LD or 1 ST
Pentium III      2           4/8         pseudo-LRU     1000        1000      1 LD and 1 ST
Itanium          3          4/6/4           LRU          800        3200      2 LD or 2 ST

Only one load or store may be performed each CPU cycle on the R10000. This indicates that loads and stores may be a bottleneck. Efficient use of cache is extremely important.
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.4.1 Measuring Cache Performance on the SGI Origin2000
7.4.2 Measuring Cache Performance on the Linux Clusters
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Code Optimization
Gather statistics to find out where the bottlenecks are in your code so you can identify what you need to optimize. The following questions can be useful to ask:
How much time does the program take to execute? Use /usr/bin/time a.out for CPU time.
Which subroutines use the most time? Use ssrun and prof on the Origin, or gprof and vprof on the Linux clusters.
Which loop uses the most time? Put etime/dtime or other recommended timer calls around loops for CPU time. For more information on timers, see the Timing and Profiling section.
What is contributing to the CPU time? Use the perfex utility on the Origin, or perfex or hpmcount on the Linux clusters.
Some useful optimizing and profiling tools are:
etime/dtime/time
perfex
ssusage
ssrun/prof
gprof
cvpav, cvd
See the NCSA web pages on Compiler, Performance, and Productivity Tools at http://www.ncsa.uiuc.edu/UserInfo/Resources/Software/Tools/ for information on which tools are available on NCSA platforms.
Measuring Cache Performance on the SGI Origin2000
The R10000 processors of NCSA's Origin2000 computers have hardware performance counters. There are 32 events that are measured, and each event is numbered:
0 = cycles
1 = instructions issued
...
26 = secondary data cache misses
...
View man perfex for more information.
The Perfex Utility
The hardware performance counters can be measured using the perfex utility:
perfex [options] command [arguments]
where the options are:
-e counter1 -e counter2   Specifies which events are to be counted. You enter the number of the event you want counted. (Remember to have a space between the "e" and the event number.)
-a    Sample ALL the events.
-mp   Report all results on a per-thread basis.
-y    Report the results in seconds, not cycles.
-x    Gives extra summary info, including Mflops.
command     Specify the name of the executable file.
arguments   Specify the input and output arguments to the executable file.
Examples
perfex -e 25 -e 26 a.out
- outputs the L1 and L2 cache misses
- the output is reported in cycles
perfex -a -y a.out > results
- outputs ALL the hardware performance counters
- the output is reported in seconds
Measuring Cache Performance on the Linux Clusters
The Intel Pentium III and Itanium processors provide hardware event counters that can be accessed from several tools.
Use perfex for the Pentium III and pfmon for the Itanium.
To view usage and options for perfex and pfmon:
perfex -h
pfmon --help
To measure L2 cache misses:
perfex -eP6_L2_LINES_IN a.out
pfmon --events=L2_MISSES a.out
psrun
Another tool that provides access to the hardware event counters and also provides derived statistics is PerfSuite. To add PerfSuite's psrun to the current shell environment:
soft add +perfsuite
To measure cache misses:
psrun a.out
psprocess a.out*.xml
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Locating the Cache Problem
For the Origin, the perfex output is a first-pass detection of a cache problem. If you then use the CaseVision tools, you can locate the cache problem in your code. The CaseVision tools are:
cvpav for performance analysis
cvd for debugging
CaseVision is not available on the Linux clusters. There, tools like vprof and libhpm provide routines for users to instrument their code. Using vprof with the PAPI cache events can provide detailed information about where poor cache utilization is occurring.
Cache Tuning Strategy
The strategy for performing cache tuning on your code is based on data reuse.
Temporal reuse: use the same data elements on more than one iteration of the loop.
Spatial reuse: use data that is encached as a result of fetching nearby data elements from downstream memory.
Strategies that take advantage of the Principle of Locality will improve performance.
Preserve Spatial Locality
Check loop nesting to ensure stride-one memory access. The following code does not preserve spatial locality:
do I=1,n
  do K=1,n
    do J=1,n
      C(I,J)=C(I,J) + A(I,K) * B(K,J)
    end do
…
It is not wrong, but it runs much slower than it could. To ensure stride-one access, modify the code using loop interchange:
do J=1,n
  do K=1,n
    do I=1,n
      C(I,J)=C(I,J) + A(I,K) * B(K,J)
    end do
…
For Fortran, the innermost loop index should be the leftmost index of the arrays. The code has been modified for spatial reuse.
Locality Problem
Suppose your code looks like:
DO J=1,N
  DO I=1,N
    A(I,J)=B(J,I)
  ENDDO
ENDDO
The loop as written above does not have unit-stride access on loads. If you interchange the loops, the code doesn't have unit-stride access on stores. Use the optimized TRANSPOSE intrinsic function provided by the Fortran compiler instead of hand-coding the transpose.
Grouping Data Together
Consider the following code segment:
d=0.0
do I=1,n
  j=index(I)
  d = d + sqrt(x(j)*x(j) + y(j)*y(j) + z(j)*z(j))
end do
Since the arrays are accessed with indirect addressing, it is likely that 3 new cache lines need to be brought into the cache for each iteration of the loop. Modify the code by grouping x, y, and z together into a 2-dimensional array named r:
d=0.0
do I=1,n
  j=index(I)
  d = d + sqrt(r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j))
end do
Since r(1,j), r(2,j), and r(3,j) are contiguous in memory, it is likely they will be in one cache line. Hence, 1 cache line, rather than 3, is brought in for each iteration of I. The code has been modified for cache reuse.
Cache Thrashing Example
This example thrashes a 4MB direct mapped cache.
parameter (max = 1024*1024) common /xyz/ a(max), b(max) do I=1,max something = a(I) + b(I) enddo The cache lines for both a and b have the same cache address. To avoid cache thrashing in this example, pad common with the size of a cache line. parameter (max = 1024*1024) common /xyz/ a(max),extra(32),b(max) do I=1,max something=a(I) + b(I) enddo Improving cache utilization is often the key to getting good performance. Not Enough Cache Ideally you want the inner loop’s arrays and variables to fit into cache. If a scalar program won’t fit in cache, its parallel version may fit in cache with a large enough number of processors. This often results in super-linear speedup. Loop Blocking This technique is useful when the arrays are too large to fit into the cache. Loop blocking uses strip mining of loops and loop interchange. A blocked loop accesses array elements in sections that optimally fit in the cache. It allows for spatial and temporal reuse of data, thus minimizing cache misses. The following example (next slide) illustrates loop blocking of matrix multiplication. The code in the PRE column depicts the original code, the POST column depicts the code when it is blocked. Loop Blocking PRE POST do k=1,n do j=1,n do i=1,n c(i,j)=c(i,j)+a(i,k) *b(k,j) enddo enddo enddo do kk=1,n,iblk do jj=1,n,iblk do ii=1,n,iblk do j=jj,jj+iblk-1 do k=kk,kk+iblk-1 do i=ii,ii+iblk-1 c(i,j)=c(i,j)+a(i,k) *b(k,j) enddo enddo enddo enddo enddo enddo Further Information Computer Organization and Design The Hardware/Software Interface, David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc. Computer Architecture A Quantitative Approach, John L. Hennessy and David A. Patterson, Morgan Kaufmann Publishers, Inc. The Cache Memory Book, Jim Handy, Academic Press High Performance Computing, Charles Severance, O’Reilly and Associates, Inc. A Practitioner’s Guide to RISC Microprocessor Architecture, Patrick H. Stakem, John Wiley & Sons, Inc. Tutorial on Optimization of Fortran, John Levesque, Applied Parallel Research Intel® Architecture Optimization Reference Manual Intel® Itanium® Processor Manuals Agenda 1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 5 Parallel Code Tuning 6 Timing and Profiling 7 Cache Tuning 8 Parallel Performance Analysis 8.1 Speedup 8.2 Speedup Extremes 8.3 Efficiency 8.4 Amdahl's Law 8.5 Speedup Limitations 8.6 Benchmarks 8.7 Summary 9 About the IBM Regatta P690 Parallel Performance Analysis Now that you have parallelized your code, and have run it on a parallel computer using multiple processors you may want to know the performance gain that parallelization has achieved. This chapter describes how to compute parallel code performance. Often the performance gain is not perfect, and this chapter also explains some of the reasons for limitations on parallel performance. Finally, this chapter covers the kinds of information you should provide in a benchmark, and some sample benchmarks are given. Speedup The speedup of your code tells you how much performance gain is achieved by running your program in parallel on multiple processors. A simple definition is that it is the length of time it takes a program to run on a single processor, divided by the time it takes to run on a multiple processors. Speedup generally ranges between 0 and p, where p is the number of processors. Scalability When you compute with multiple processors in a parallel environment, you will also want to know how your code scales. 
The scalability of a parallel code is defined as its ability to achieve performance proportional to the number of processors used. As you run your code with more and more processors, you want to see the performance of the code continue to improve. Computing speedup is a good way to measure how a program scales as more processors are used. Speedup Linear Speedup If it takes one processor an amount of time t to do a task and if p processors can do the task in time t / p, then you have perfect or linear speedup (Sp= p). That is, running with 4 processors improves the time by a factor of 4, running with 8 processors improves the time by a factor of 8, and so on. This is shown in the following illustration. Speedup Extremes The extremes of speedup happen when speedup is greater than p, called super-linear speedup, less than 1. Super-Linear Speedup You might wonder how super-linear speedup can occur. How can speedup be greater than the number of processors used? The answer usually lies with the program's memory use. When using multiple processors, each processor only gets part of the problem compared to the single processor case. It is possible that the smaller problem can make better use of the memory hierarchy, that is, the cache and the registers. For example, the smaller problem may fit in cache when the entire problem would not. When super-linear speedup is achieved, it is often an indication that the sequential code, run on one processor, had serious cache miss problems. The most common programs that achieve super-linear speedup are those that solve dense linear algebra problems. Speedup Extremes Parallel Code Slower than Sequential Code When speedup is less than one, it means that the parallel code runs slower than the sequential code. This happens when there isn't enough computation to be done by each processor. The overhead of creating and controlling the parallel threads outweighs the benefits of parallel computation, and it causes the code to run slower. To eliminate this problem you can try to increase the problem size or run with fewer processors. Efficiency Efficiency is a measure of parallel performance that is closely related to speedup and is often also presented in a description of the performance of a parallel program. Efficiency with p processors is defined as the ratio of speedup with p processors to p. Efficiency is a fraction that usually ranges between 0 and 1. Ep=1 corresponds to perfect speedup of Sp= p. You can think of efficiency as describing the average speedup per processor. Amdahl's Law An alternative formula for speedup is named Amdahl's Law attributed to Gene Amdahl, one of America's great computer scientists. This formula, introduced in the 1980s, states that no matter how many processors are used in a parallel run, a program's speedup will be limited by its fraction of sequential code. That is, almost every program has a fraction of the code that doesn't lend itself to parallelism. This is the fraction of code that will have to be run with just one processor, even in a parallel run. Amdahl's Law defines speedup with p processors as follows: Where the term f stands for the fraction of operations done sequentially with just one processor, and the term (1 - f) stands for the fraction of operations done in perfect parallelism with p processors. Amdahl's Law The sequential fraction of code, f, is a unitless measure ranging between 0 and 1. When f is 0, meaning there is no sequential code, then speedup is p, or perfect parallelism. 
This can be seen by substituting f = 0 in the formula above, which results in Sp = p. When f is 1, meaning there is no parallel code, then speedup is 1, or there is no benefit from parallelism. This can be seen by substituting f = 1 in the formula above, which results in Sp = 1. This shows that Amdahl's speedup ranges between 1 and p, where p is the number of processors used in a parallel processing run. Amdahl's Law The interpretation of Amdahl's Law is that speedup is limited by the fact that not all parts of a code can be run in parallel. Substituting in the formula, when the number of processors goes to infinity, your code's speedup is still limited by 1 / f. Amdahl's Law shows that the sequential fraction of code has a strong effect on speedup. This helps to explain the need for large problem sizes when using parallel computers. It is well known in the parallel computing community, that you cannot take a small application and expect it to show good performance on a parallel computer. To get good performance, you need to run large applications, with large data array sizes, and lots of computation. The reason for this is that as the problem size increases the opportunity for parallelism grows, and the sequential fraction shrinks, and it shrinks in its importance for speedup. Agenda 8 Parallel Performance Analysis 8.1 Speedup 8.2 Speedup Extremes 8.3 Efficiency 8.4 Amdahl's Law 8.5Speedup Limitations 8.5.1 Memory Contention Limitation 8.5.2 Problem Size Limitation 8.6 Benchmarks 8.7 Summary Speedup Limitations This section covers some of the reasons why a program doesn't get perfect Speedup. Some of the reasons for limitations on speedup are: Too much I/O Speedup is limited when the code is I/O bound. That is, when there is too much input or output compared to the amount of computation. Wrong algorithm Speedup is limited when the numerical algorithm is not suitable for a parallel computer. You need to replace it with a parallel algorithm. Too much memory contention Speedup is limited when there is too much memory contention. You need to redesign the code with attention to data locality. Cache reutilization techniques will help here. Speedup Limitations Wrong problem size Speedup is limited when the problem size is too small to take best advantage of a parallel computer. In addition, speedup is limited when the problem size is fixed. That is, when the problem size doesn't grow as you compute with more processors. Too much sequential code Speedup is limited when there's too much sequential code. This is shown by Amdahl's Law. Too much parallel overhead Speedup is limited when there is too much parallel overhead compared to the amount of computation. These are the additional CPU cycles accumulated in creating parallel regions, creating threads, synchronizing threads, spin/blocking threads, and ending parallel regions. Load imbalance Speedup is limited when the processors have different workloads. The processors that finish early will be idle while they are waiting for the other processors to catch up. Memory Contention Limitation Gene Golub, a professor of Computer Science at Stanford University, writes in his book on parallel computing that the best way to define memory contention is with the word delay. When different processors all want to read or write into the main memory, there is a delay until the memory is free. On the SGI Origin2000 computer, you can determine whether your code has memory contention problems by using SGI's perfex utility. 
The perfex utility is covered in the Cache Tuning lecture in this course. You can also refer to SGI's manual page, man perfex, for more details. On the Linux clusters, you can use the hardware performance counter tools to get information on memory performance. On the IA32 platform, use perfex, vprof, hmpcount, psrun/perfsuite. On the IA64 platform, use vprof, pfmon, psrun/perfsuite. Memory Contention Limitation Many of these tools can be used with the PAPI performance counter interface. Be sure to refer to the man pages and webpages on the NCSA website for more information. If the output of the utility shows that memory contention is a problem, you will want to use some programming techniques for reducing memory contention. A good way to reduce memory contention is to access elements from the processor's cache memory instead of the main memory. Some programming techniques for doing this are: Access arrays with unit `. Order nested do loops (in Fortran) so that the innermost loop index is the leftmost index of the arrays in the loop. For the C language, the order is the opposite of Fortran. Avoid specific array sizes that are the same as the size of the data cache or that are exact fractions or exact multiples of the size of the data cache. Pad common blocks. These techniques are called cache tuning optimizations. The details for performing these code modifications are covered in the section on Cache Optimization of this lecture. Problem Size Limitation Small Problem Size Speedup is almost always an increasing function of problem size. If there's not enough work to be done by the available processors, the code will show limited speedup. The effect of small problem size on speedup is shown in the following illustration. Problem Size Limitation Fixed Problem Size When the problem size is fixed, you can reach a point of negative returns when using additional processors. As you compute with more and more processors, each processor has less and less amount of computation to perform. The additional parallel overhead, compared to the amount of computation, causes the speedup curve to start turning downward as shown in the following figure. Benchmarks It will finally be time to report the parallel performance of your application code. You will want to show a speedup graph with the number of processors on the x axis, and speedup on the y axis. Some other things you should report and record are: the date you obtained the results the problem size the computer model the compiler and the version number of the compiler any special compiler options you used Benchmarks When doing computational science, it is often helpful to find out what kind of performance your colleagues are obtaining. In this regard, NCSA has a compilation of parallel performance benchmarks online at http://www.ncsa.uiuc.edu/UserInfo/Perf/NCSAbench/. You might be interested in looking at these benchmarks to see how other people report their parallel performance. In particular, the NAMD benchmark is a report about the performance of the NAMD program that does molecular dynamics simulations. Summary There are many good texts on parallel computing which treat the subject of parallel performance analysis. Here are two useful references: Scientific Computing An Introduction with Parallel Computing, Gene Golub and James Ortega, Academic Press, Inc. Parallel Computing Theory and Practice, Michael J. Quinn, McGraw-Hill, Inc. 
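To tie the formulas of this chapter together, here is a small Fortran sketch that computes speedup, efficiency, and the Amdahl's Law sequential fraction from a pair of run times; the timings (100 seconds on 1 processor, 40 seconds on 4 processors) are made up purely for illustration.

      program speedup_summary
      ! Hypothetical timings: T(1) = 100 s on 1 processor,
      ! T(p) = 40 s on p = 4 processors.
      real*8 t1, tp, sp, ep, f
      integer p
      p  = 4
      t1 = 100.0d0
      tp = 40.0d0
      ! speedup Sp = T(1)/T(p) and efficiency Ep = Sp/p
      sp = t1 / tp
      ep = sp / p
      ! sequential fraction from Amdahl's Law (the same formula
      ! used in the Parallel Code Tuning chapter)
      f  = (1.0d0/sp - 1.0d0/p) / (1.0d0 - 1.0d0/p)
      print *, 'speedup              =', sp
      print *, 'efficiency           =', ep
      print *, 'sequential fraction  =', f
      ! the best possible speedup, no matter how many
      ! processors are used, is 1/f
      print *, 'Amdahl speedup bound =', 1.0d0/f
      end

For these made-up numbers the sketch reports a speedup of 2.5, an efficiency of 0.625, and a sequential fraction of 0.2, so no matter how many processors were added this code could never run more than 1/f = 5 times faster than the sequential version, which is why reducing the sequential fraction matters so much.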
Agenda 1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 5 Parallel Code Tuning 6 Timing and Profiling 7 Cache Tuning 8 Parallel Performance Analysis 9 About the IBM Regatta P690 9.1 IBM p690 General Overview 9.2 IBM p690 Building Blocks 9.3 Features Performed by the Hardware 9.4 The Operating System 9.5 Further Information About the IBM Regatta P690 To obtain your program’s top performance, it is important to understand the architecture of the computer system on which the code runs. This chapter describes the architecture of NCSA's IBM p690. Technical details on the size and design of the processors, memory, cache, and the interconnect network are covered along with technical specifications for the compute rate, memory size and speed, and interconnect bandwidth. IBM p690 General Overview The p690 is IBM's latest Symmetric Multi-Processor (SMP) machine with Distributed Shared Memory (DSM). This means that memory is physically distributed and logically shared. It is based on the Power4 architecture and is a successor to the Power3-II based RS/6000 SP system. IBM p690 Scalability The IBM p690 is a flexible, modular, and scalable architecture. It scales in these terms: Number of processors Memory size I/O and memory bandwidth and the Interconnect bandwidth Agenda 9 About the IBM Regatta P690 9.1 IBM p690 General Overview 9.2 IBM p690 Building Blocks 9.2.1 Power4 Core 9.2.2 Multi-Chip Modules 9.2.3 The Processor 9.2.4 Cache Architecture 9.2.5 Memory Subsystem 9.3 Features Performed by the Hardware 9.4 The Operating System 9.5 Further Information IBM p690 Building Blocks An IBM p690 system is built from a number of fundamental building blocks. The first of these building blocks is the Power4 Core, which includes the processors and L1 and L2 caches. At NCSA, four of these Power4 Cores are linked to form a Multi-Chip Module. This module includes the L3 cache and four Multi-Chip Modules are linked to form a 32 processor system (see figure on the next slide). Each of these components will be described in the following sections. 32-processor IBM p690 configuration (Image courtesy of IBM) Power4 Core The Power4 Chip contains: Two processors Local caches (L1) External cache for each processor (L2) I/O and Interconnect interfaces The POWER4 chip (Image curtsey of IBM) Multi-Chip Modules Four Power4 Chips are assembled to form a Multi-Chip Module (MCM) that contains 8 processors. Each MCM also supports the L3 cache for each Power4 chip. Multiple MCM interconnection (Image courtesy of IBM) The Processor The processors at the heart of the Power4 Core are speculative superscalar out of order execution chips. The Power4 is a 4-way superscalar RISC architecture running instructions on its 8 pipelined execution units. Speed of the Processor The NCSA IBM p690 has CPUs running at 1.3 GHz. 64-Bit Processor Execution Units There are 8 independent fully pipelined execution units. 2 load/store units for memory access 2 identical floating point execution units capable of fused multiply/add 2 fixed point execution units 1 branch execution unit 1 logic operation unit The Processor The units are capable of 4 floating point operations, fetching 8 instructions and completing 5 instructions per cycle. It is capable of handling up to 200 in-flight instructions. 
Performance Numbers Peak Performance: 4 floating point instructions per cycle 1.3 Gcycles/sec * 4 flop/cycle yields 5.2 GFLOPS MIPS Rating: 5 instructions per cycle 1.3 Gcycles/sec * 5 instructions/cycle yields 65 MIPS Instruction Set The instruction set (ISA) on the IBM p690 is the PowerPC AS Instruction set. Cache Architecture Each Power4 Core has both a primary (L1) cache associated with each processor and a secondary (L2) cache shared between the two processors. In addition, each MultiChip Module has a L3 cache. Level 1 Cache The Level 1 cache is in the processor core. It has split instruction and data caches. L1 Instruction Cache The properties of the Instruction Cache are: 64KB in size direct mapped cache line size is 128 bytes L1 Data Cache The properties of the L1 Data Cache are: 32KB in size 2-way set associative FIFO replacement policy 2-way interleaved cache line size is 128 bytes Peak speed is achieved when the data accessed in a loop is entirely contained in the L1 data cache. Cache Architecture Level 2 Cache on the Power4 Chip When the processor can't find a data element in the L1 cache, it looks in the L2 cache. The properties of the L2 Cache are: external from the processor unified instruction and data cache 1.41MB per Power4 chip (2 processors) 8-way set associative split between 3 controllers cache line size is 128 bytes pseudo LRU replacement policy for cache coherence 124.8 GB/s peak bandwidth from L2 Cache Architecture Level 3 Cache on the Multi-Chip Module When the processor can't find a data element in the L2 cache, it looks in the L3 cache. The properties of the L3 Cache are: external from the Power4 Core unified instruction and data cache 128MB per Multi-Chip Module (8 processors) 8-way set associative cache line size is 512 bytes 55.5 GB/s peak bandwidth from L2 Memory Subsystem The total memory is physically distributed among the Multi-Chip Modules of the p690 system (see the diagram in the next slide). Memory Latencies The latency penalties for each of the levels of the memory hierarchy are: L1 Cache - 4 cycles L2 Cache - 14 cycles L3 Cache - 102 cycles Main Memory - 400 cycles Memory distribution within an MCM Agenda 9 About the IBM Regatta P690 9.1 IBM p690 General Overview 9.2 IBM p690 Building Blocks 9.3 Features Performed by the Hardware 9.4 The Operating System 9.5 Further Information Features Performed by the Hardware The following is done completely by the hardware, transparent to the user: Global memory addressing (makes the system memory shared) Address resolution Maintaining cache coherency Automatic page migration from remote to local memory (to reduce interconnect memory transactions) The Operating System The operating system is AIX. NCSA's p690 system is currently running version 5.1 of AIX. Version 5.1 is a full 64bit file system. Compatibility AIX 5.1 is highly compatible to both BSD and System V Unix Further Information Computer Architecture: A Quantitative Approach John Hennessy, et al. Morgan Kaufman Publishers, 2nd Edition, 1996 Computer Hardware and Design:The Hardware/Software Interface David A. Patterson, et al. Morgan Kaufman Publishers, 2nd Edition, 1997 IBM P Series [595] at the URL: http://www-03.ibm.com/systems/p/hardware/highend/590/index.html IBM p690 Documentation at NCSA at the URL: http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/IBMp690/