MESSAGE PASSING MULTIPROCESSING SYSTEM SIMULATION USING SIMICS A Project Presented to the faculty of the Department of Computer Science California State University, Sacramento Submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in Computer Science by Sandra Guija FALL 2012 © 2012 Sandra Guija ALL RIGHTS RESERVED ii MESSAGE PASSING MULTIPROCESSING SYSTEM SIMULATION USING SIMICS A Project by Sandra Guija Approved by: __________________________________, Committee Chair Nikrouz Faroughi, Ph.D. __________________________________, Second Reader William Mitchell, Ph.D. ____________________________ Date iii Student: Sandra Guija I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project. __________________________, Graduate Coordinator Nikrouz Faroughi Department of Computer Science iv ___________________ Date Abstract of MESSAGE PASSING MULTIPROCESSING SYSTEM SIMULATION USING SIMICS by Sandra Guija Parallel processing uses multiple processors to compute a large computer problem. Two main multiprocessing programming models are shared memory and message passing. In the latter model, processes communicate by exchanging messages using the network. The project consisted of two parts: 1) To investigate the performance of a multithreaded matrix multiplication program, and 2) To create a user guide for how to setup a message passing multiprocessor simulation environment using Simics including MPI (message passing interface) installation, OS craff file creation, memory caches addition and python scripts usage. v The simulation results and performance analysis indicate as matrix size increases and the number of processing nodes increases, the rate at which bytes communicated and the number of packets increase is faster than the rates at which processing time per node decreases. _______________________, Committee Chair Nikrouz Faroughi, Ph.D. _______________________ Date vi ACKNOWLEDGEMENTS Success is the ability to go from one failure to another with no loss of enthusiasm. [Sir Winston Churchill] To God who gave me strength, enthusiasm, and health to be able to complete my project. To my husband Allan who said, “you can do this”, I would like to thank him for being there for me. I would like to thank my parents Lucho and Normita and my sister Giovana for their love and support despite the distance. I would like to thank Dr. Nikrouz Faroughi for his guidance during this project, his knowledge, time and constant feedback. I would also like to thank Dr. William Mitchell, who was kind enough to be my second reader. I would like to thank these special people: Cara for her always-sincere advice, Tom Pratt for his kindness, dedication, patience and time and Sandhya for being helpful, my manager Jay Rossi and my co-workers for their support. I truly believe their help has had a significant and positive impact on my project. vii TABLE OF CONTENTS Page Acknowledgments............................................................................................................. vii List of Tables ...................................................................................................................... x List of Figures .................................................................................................................... 
xi

Chapter

INTRODUCTION .............................................................................................................. 1
    1.1 Shared Memory Multiprocessor Architecture ........................................................... 2
    1.2 Message Passing Multiprocessor Architecture .......................................................... 3
    1.3 Project Overview ....................................................................................................... 3
MESSAGE PASSING SYSTEM MODELING ................................................................. 5
    2.1 Simics Overview ........................................................................................................ 5
    2.2 Message Passing Interface ......................................................................................... 6
        2.2.1 MPICH2 ..................................................................................................... 10
        2.2.2 Open MPI .................................................................................................. 11
    2.3 MPI Overview .......................................................................................................... 12
        2.3.1 Beowulf Cluster and MPI Cluster .............................................................. 12
        2.3.2 MPI Network Simulation ........................................................................... 12
    2.4 Simulation of Matrix Multiplication ........................................................................ 13
SIMULATION RESULTS AND PERFORMANCE ANALYSIS .................................. 15
    3.1 Simulation Parameters ............................................................................................. 15
    3.2 Data Analysis ........................................................................................................... 16
CONCLUSION ................................................................................................................. 25
Appendix A. Simics Script to Run a 16-node MPI Network Simulation ......................... 26
Appendix B. MPI Program for Matrix Multiplication ...................................................... 28
Appendix C. Simics Script to Add L1 and L2 Cache Memories ...................................... 33
Appendix D. Python Script to Collect Simulation Data ................................................... 42
Appendix E. SSH-Agent Script ......................................................................................... 43
Appendix F. User Guide ................................................................................................... 45
Appendix G. Simulation Data ........................................................................................... 67
Bibliography ..................................................................................................................... 76

LIST OF TABLES

Tables                                                                                              Page

1. Table 1 MPI_Init and MPI_Finalize Functions .......................................................7
2. Table 2 MPI_Comm Functions ................................................................................8
3. Table 3 MPI Send and Receive Functions ...............................................................9
4. Table 4 MPI Broadcast Function .............................................................................9
5. Table 5 Configuration Information ........................................................................15
6. Table 6 Processing Time and Network Traffic Data Collected .............................74
7. Table 7 Processing Time, Total Bytes and Number of Packets Ratios .................75
8. Table 8 Time before the start of 1st slave ..............................................................75

LIST OF FIGURES

Figures                                                                                             Page

1. Figure 1 Shared memory multiprocessor interconnected via bus ............................2
2. Figure 2 Scalable Shared Memory Multiprocessor .................................................3
3. Figure 3 Processing Time per node .......................................................................17
4. Figure 4 Time before the start of 1st slave ............................................................18
5. Figure 5 Total Bytes per node ................................................................................19
6. Figure 6 Number of Packets per node ....................................................................21
7. Figure 7 Processing Time Ratio .............................................................................22
8. Figure 8 Bytes Ratio ...............................................................................................23
9. Figure 9 Number of Packets Ratio .........................................................................24

Chapter 1

INTRODUCTION

A parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast" [Almasi and Gottlieb, Highly Parallel Computing, 1989]. Parallel computing is the main approach to processing massive amounts of data and solving complex problems. It is used in a wide range of applications, including galaxy formation, weather forecasting, quantum physics, climate research, manufacturing processes, chemical reactions, and planetary movements.

Parallel processing divides a workload into subtasks that are completed concurrently, which requires communication between the processing elements. Parallel programming models, such as Shared Address Space (SAS) and Message Passing (MP), define how a set of parallel processes communicates, shares information, and coordinates its activities [1].

1.1 Shared Memory Multiprocessor Architecture

In a shared memory multiprocessor, multiple processes access a shared memory space using standard load and store instructions. Each thread/process accesses a portion of the shared data address space. The threads communicate with each other by reading and writing shared variables. Synchronization functions are used to prevent multiple threads from updating the same shared variable at the same time and to coordinate the threads' activities.

A shared memory system is implemented using a bus or an interconnection network to interconnect the processors. Figure 1 illustrates a bus-based multiprocessor system called UMA (Uniform Memory Access), because all memory accesses have the same latency. A NUMA (Non-Uniform Memory Access) multiprocessor, on the other hand, is designed by distributing the shared memory space among the different processors, as illustrated in Figure 2. The processors are interconnected using an interconnection network, making the architecture scalable.

Figure 1 Shared memory multiprocessor interconnected via bus [1]

Figure 2 Scalable Shared Memory Multiprocessor [1]

1.2 Message Passing Multiprocessor Architecture

In a message passing system, processes communicate by sending and receiving messages through the network.
To send a message, a processor executes a system call to request the operating system to send the message to a destination process. A common message passing system is a cluster network. A message passing architecture is similar to the NUMA architecture shown in Figure 2, except that each processor can access only its own memory and communicates with the other processors by sending and receiving data.

1.3 Project Overview

Chapter 2 covers the tools and concepts used to model a message passing system. Chapter 3 describes the simulation data collection and analysis, and Chapter 4 presents the conclusion and future work. Appendix A presents the Simics script to start a 16-node MPI network simulation. Appendix B includes an MPI program for matrix multiplication. Appendix C presents the Simics script to add L1 and L2 caches to the simulated machines. Appendix D presents the Python script to collect the processing time and network traffic data from Simics. Appendix E presents the SSH-Agent script. Appendix F contains a step-by-step User Guide to configure and simulate a message passing system model using Simics.

Chapter 2

MESSAGE PASSING SYSTEM MODELING

This chapter presents a description of the Simics simulation environment, MPI, and the multithreaded message passing matrix multiplication program.

2.1 Simics Overview

Simics is a complete machine simulator that models all the hardware components found in a typical computer system. It is used by software developers to simulate any target hardware, from a single processor to large and complex systems [2]. Simics facilitates software integration and testing by providing the same experience as a real hardware system, and it offers a user-friendly interface with many tools and options.

Among the many products and features of Simics are the craff utility, SimicsFS, Ethernet networking, and scripting with Python. These are the main Simics functionalities used in this project to simulate a message passing multiprocessor system. Each processor is modeled as a stand-alone processing node with its own copy of the OS. The craff utility allows users to create an operating system image from a simulated machine and use it to simulate multiple identical nodes. This utility saves significant time: only one target machine is set up with all the software and configuration features, and its image is then replicated to the remaining nodes. SimicsFS allows users to copy files from a host directory to a simulated node. Ethernet networking provides network connectivity between the simulated machines inside one Simics session. Scripting with Python is very simple and can be used to access system configuration parameters, invoke command line functions, define hap (event) callbacks, and interface with Simics API functions. The primary use of the hap and Simics API functions in this project is for collecting simulation data.
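The data-collection pattern is simple: a Python callback is registered on the Core_Magic_Instruction hap and fires whenever the instrumented program executes a MAGIC(n) instruction. The following is a simplified sketch of the script listed in full in Appendix D (the printed fields here are illustrative):

from cli import *
from simics import *

def hap_callback(user_arg, cpu, arg):
    # 'arg' is the value n passed to MAGIC(n) by the instrumented program
    print "cpu name: ", cpu.name
    print "virtual time: ", SIM_time(cpu)

SIM_hap_add_callback("Core_Magic_Instruction", hap_callback, None)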
2.2 Message Passing Interface

Message Passing Interface (MPI) is a standardized message passing library developed to support practical, portable, efficient, and flexible message passing programs [4]. The MPI standardization process commenced in 1992, when a group of researchers from academia and industry worked together to exploit the most advantageous features of the existing message passing systems. The MPI standard consists of two publications: MPI-1 (1994) and MPI-2 (1996). MPI-2 mainly contains additions and extensions to MPI-1.

The MPI standard includes point-to-point communication, collective operations, process groups, communication contexts, and process topologies, with interfaces for Fortran, C, and C++. Processes/threads communicate by calling MPI library routines to send messages to and receive messages from other processes. All programs using MPI require the mpi.h header file to make MPI library calls.

MPI includes over one hundred different functions. The first MPI function that an MPI-based message passing program must call is MPI_Init, which initializes an MPI execution. The last function called is MPI_Finalize, which terminates the MPI execution. Both functions are called exactly once during a program's execution. Table 1 lists the declaration and description of the MPI_Init and MPI_Finalize functions.

Table 1 MPI_Init and MPI_Finalize Functions

MPI_Init(int *argc, char ***argv)
    First MPI function called in a program. Common arguments taken from the command line include the number of processes, a list of hosts or a hostfile (a text file listing the hosts), and the directory of the program. Initializes MPI variables and forms the MPI_COMM_WORLD communicator. Opens TCP connections.

MPI_Finalize()
    Terminates the MPI execution environment. Called last by all processes. Closes TCP connections and cleans up.

The two basic concepts needed to program with MPI are groups and communicators. A group is an ordered set of processes, where each process has its own rank number. "A communicator determines the scope and the 'communication universe' in which a point-to-point or collective operation is to operate. Each communicator is associated with a group" [3]. MPI_COMM_WORLD is a communicator predefined by MPI that refers to all the processes. Groups and communicators are dynamic objects that may be created and destroyed during program execution. MPI provides the flexibility to create groups and communicators for applications that require communication among selected subgroups of processes.

MPI_Comm_size and MPI_Comm_rank are the most commonly used communicator functions in an MPI program. MPI_Comm_size determines the size of the group, that is, the number of processes associated with a communicator. MPI_Comm_rank determines the rank of the calling process within the communicator. The matrix multiplication MPI program uses MPI_COMM_WORLD as the communicator. Table 2 lists the declaration and description of these functions.

Table 2 MPI_Comm Functions

MPI_Comm_size(MPI_Comm comm, int *size)
    Determines the number of processes within a communicator. In this study the MPI_Comm argument is MPI_COMM_WORLD.

MPI_Comm_rank(MPI_Comm comm, int *rank)
    Returns the process identifier of the process that invokes it. The rank is an integer between 0 and size-1.

In MPI, point-to-point communication is fundamental for sending and receiving operations. MPI defines two models of communication: blocking and non-blocking. Non-blocking functions return immediately, even if the communication has not finished yet, while blocking functions do not return until the communication is finished. Using non-blocking functions allows communication and computation to proceed simultaneously. For this study, we use the non-blocking MPI_Isend function together with the blocking MPI_Recv function. Table 3 lists the declaration and description of the MPI_Isend and MPI_Recv functions.

Table 3 MPI Send and Receive Functions

MPI_Isend(void *buffer, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
    Sends a message. A non-blocking MPI call: the caller can proceed immediately, allowing communication and computation to proceed concurrently. MPI supports messages of all the basic datatypes.

MPI_Recv(void *buffer, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
    Receives a message. The count argument indicates the maximum length of the message. The tag argument must match between sender and receiver.
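As a brief illustration of how these calls fit together, the following is a minimal point-to-point sketch (a hypothetical example, not the project program; the complete matrix multiplication program is listed in Appendix B). Rank 0 posts a non-blocking MPI_Isend to rank 1, and rank 1 receives the value with a blocking MPI_Recv:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, value = 42;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);                  /* first MPI call */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes */

    if (rank == 0 && size > 1) {
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);         /* send buffer may be reused after this */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank %d received %d\n", rank, value);
    }

    MPI_Finalize();                          /* last MPI call */
    return 0;
}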
Many applications require communication that involves more than two processes. MPI includes collective communication operations, which involve the participation of all processes in a communicator. Broadcast is one of the most common collective operations and is used in this study. Broadcast is defined as MPI_Bcast and is used by one process, the root, to send a message to all the members of the communicator. Table 4 lists the declaration and description of the MPI_Bcast function.

Table 4 MPI Broadcast Function

MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int master, MPI_Comm comm)
    Broadcasts a message from the process with rank "master" to all other processes of the communicator.

The MPI standard defines the set of functions and capabilities that any implementation of the message passing library must follow. The two leading open source implementations of MPI are MPICH2 and Open MPI. Both implementations are available for different versions of Unix, Linux, Mac OS X, and MS Windows.

2.2.1 MPICH2

MPICH2 is a broadly used MPI implementation developed at Argonne National Laboratory (ANL) and Mississippi State University (MSU). MPICH2 is a high-performance and widely portable implementation of the Message Passing Interface (MPI) standard (both MPI-1 and MPI-2). The "CH" comes from "Chameleon", the portability layer used in the original MPICH; the founder of MPICH developed the Chameleon parallel programming library.

MPICH2 uses an external process manager that spawns and manages parallel jobs; the process manager communicates with MPICH2 through PMI (the process management interface). MPD was long the default process manager, and it involves starting an mpd daemon on each of the worker nodes. Starting with version 1.3, Hydra, a more robust and reliable process manager, is the default MPICH2 process manager.

2.2.2 Open MPI

Open MPI evolved from the merger of three established MPI implementations, FT-MPI, LA-MPI, and LAM/MPI, plus contributions from PACX-MPI. Open MPI was developed using the best practices of these established MPI implementations. Open MPI runs on top of the Open Run-Time Environment (ORTE), open source software developed to support distributed high-performance applications and transparent scalability. ORTE starts MPI jobs and provides status information to the upper-layer Open MPI [5].

The Open MPI project's goal is to work with and for the supercomputer community to support an MPI implementation for a large number and variety of systems. The K computer, a supercomputer produced by Fujitsu and currently the world's second fastest, uses a Tofu-optimized MPI based on Open MPI. MPICH2 and Open MPI are the most common MPI implementations used by supercomputers. Open MPI does not require a separate process manager, which makes installation, configuration, and execution simpler. Open MPI is the MPI implementation used in this project.
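The broadcast operation described above is what the matrix multiplication program in Appendix B uses to distribute matrix B. As a minimal standalone sketch (a hypothetical example, not the project code), every process makes the same MPI_Bcast call and the root's buffer contents arrive at all ranks:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, data[4] = {0, 0, 0, 0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                         /* only the root fills the buffer ... */
        data[0] = 1; data[1] = 2; data[2] = 3; data[3] = 4;
    }
    /* ... but every process, root included, calls MPI_Bcast with root rank 0 */
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d has %d %d %d %d\n", rank, data[0], data[1], data[2], data[3]);

    MPI_Finalize();
    return 0;
}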
2.3 MPI Overview

2.3.1 Beowulf Cluster and MPI Cluster

The Beowulf Project started in 1994 at NASA's Goddard Space Flight Center. A result of this research was the Beowulf cluster, a scalable combination of hardware and software that provides a sophisticated and robust environment to support a wide range of applications [6]. The name "Beowulf" comes from the mythical Old English hero with extraordinary strength who defeats the monster Grendel. Because it was conceived as a cost-efficient solution, a Beowulf cluster is attainable by almost anyone. The three required components are a collection of stand-alone computers networked together, an open source operating system such as Linux, and an implementation of a message passing interface such as MPI or PVM (Parallel Virtual Machine). Based on these requirements, the components selected for this project are 4, 8, and 16 simulated Pentium PCs running Fedora 5, TCP/IP network connectivity, and the Open MPI implementation. A Beowulf cluster is known as an "MPI cluster" when MPI is used for communication and coordination between the processing nodes.

2.3.2 MPI Network Simulation

One early consideration when setting up an MPI network is whether or not to use a Network File System (NFS). NFS is a protocol that provides access to files over the network as if they were local. With NFS, a folder on the master node containing the Open MPI installation can be shared with all the slave nodes. However, NFS can become a bottleneck when all the nodes use the shared directory. NFS is not used in this project; instead, Open MPI is installed on the local drive of each node.

A second consideration is setting up the SSH protocol on the master node, because MPI uses SSH to communicate with the other nodes. Simics Tango targets come loaded with OpenSSH, a widely used implementation of the SSH protocol, configured with password protection. Because Open MPI relies on OpenSSH at execution time, additional commands are run to enable connections without a password. The last setting performed for this simulation is to create a user with the same username, user ID, and directory path on each node, so that common files can be accessed at the same location. Now that all the required components have been introduced, the MPI matrix multiplication program is described next.

2.4 Simulation of Matrix Multiplication

In this project, Simics scripts are used to configure and simulate a 4-, 8-, or 16-node message passing system. Each node is configured as a complete single-processor system and includes a copy of the executable matrix multiplication code. A file listing the hostnames of all the nodes must also be created. When entering the execution command, two arguments are passed: 1) the number of processes ("np"), which specifies how many processing nodes to use, and 2) the file that lists the names of the processing nodes (an example launch is sketched below). However, in this project the node names are not referenced explicitly in the program; only the node IDs (also called ranks) 0, 1, 2, etc. are referenced.

One of the nodes is the master node, which coordinates the task of multiplying two matrices A and B. The master partitions matrix A among the different (slave) nodes and then broadcasts matrix B to all the slaves. Each slave node multiplies its portion of matrix A with matrix B and sends the results to the master, which combines them to produce the final product of A and B.
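For example, launching the program on the 4-node system could look like the following (a hedged sketch: the hostfile name "mpi_hosts" and the executable name "matrix_mult" are illustrative, not the exact names used in this project; the actual steps are documented in Appendix F):

# mpi_hosts - one processing node per line
master
slave1
slave2
slave3

# launch with 4 processes, one per node listed in the hostfile
mpirun -np 4 --hostfile mpi_hosts ./matrix_mult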
Chapter 3

SIMULATION RESULTS AND PERFORMANCE ANALYSIS

As described in the previous chapter, to model a message passing system, Open MPI was installed and configured on a Simics target machine. This chapter presents the simulation results and performance analysis of running a message passing matrix multiplication program.

3.1 Simulation Parameters

The matrix multiplication program is executed in three simulated MP systems with 4, 8, and 16 processing nodes. The nodes are identical in terms of processor type, clock frequency, and the size of main and cache memories. Table 5 displays the configuration data of each node. The master and slave nodes are interconnected by Ethernet with the MTU (Maximum Transmission Unit) set to the default value of 1500 B. The nodes are independent and each includes a copy of the test program. Seven different matrix sizes were used in the simulation.

Table 5 Configuration Information

Nodes   Cores   MHz    Main Memory   L1 Data Cache   L1 Instruction Cache   L2 Cache
4       1       2000   1 GB          32 KB           32 KB                  256 KB
8       1       2000   1 GB          32 KB           32 KB                  256 KB
16      1       2000   1 GB          32 KB           32 KB                  256 KB

3.2 Data Analysis

Figure 3 shows the average processing time per node using 100x100, 200x200, etc., matrices. As expected, the average processing time per node decreases as the number of nodes increases. Also as expected, as the matrix size increases, the average processing time per node in each system also increases. In the 4-node system, the average processing time increases linearly with the matrix size. In the 8-node and 16-node systems, on the other hand, the increase in the average processing time per node is not linear as the matrix size increases. In the 8-node system, when the matrix size reaches 800x800, the average processing time climbs from 34.23 to 98.80. In the 16-node system, when the matrix size reaches 1000x1000, the average processing time per node climbs from 24.59 to 94.48. This jump in the average processing times is due to the increased delay from the time the program starts running until the first slave starts multiplying its portion of the matrix, as illustrated in Figure 4. In the 8-node and 16-node systems, the delay to start the first slave node jumps when the matrix size is 800x800 and 1000x1000, respectively. One can conclude that the communication delay increases at a higher rate as the matrix size and the number of processing nodes increase.

Figure 3 Processing Time per node

Figure 4 Time before the start of 1st slave

Figure 5 shows the average total bytes communicated per node; as expected, the larger the matrix size, the larger the number of bytes transmitted. This increase is proportional to the number of elements in each matrix. For example, in the 16-node system, the number of transmitted bytes for the 500x500 matrix is 2,284,426 and for the 1000x1000 matrix is 9,097,049, a ratio of 3.98, which is approximately equal to the ratio of the number of elements in the two matrices (1000x1000 / 500x500 = 4).

Figure 5 Total Bytes per node

Figure 6 shows the average number of packets per node; as expected, the larger the matrix size, the larger the number of packets incurred during program execution. In general, for a given matrix size, there are more packets per node when there are fewer nodes, because each node must receive a bigger section of matrix A.
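A quick way to check the proportionality noted for Figure 5 is to compare the ratio of matrix elements with the measured byte ratio (a small illustrative Python calculation using the figures reported above):

# 16-node system: bytes per node should scale with the number of matrix elements
ratio_elements = (1000 * 1000) / float(500 * 500)   # = 4.0
ratio_measured = 9097049 / float(2284426)           # = 3.98, matching the data above
print ratio_elements, ratio_measured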
Figure 7 through Figure 9 illustrate the processing time, the number of bytes communicated, and the number of packets of the 8-node and 16-node systems compared with those of the 4-node system. While the ratios of the number of bytes communicated and of the number of packets for 8 vs. 4 nodes and for 16 vs. 4 nodes remain the same as the matrix size increases, the 16-node system has the lowest processing time per node. However, the ratios of the 8 vs. 4 and 16 vs. 4 processing times per node decrease as the matrix size becomes larger.

Figure 6 Number of Packets per node

Figure 7 Processing Time Ratio

Figure 8 Bytes Ratio

Figure 9 Number of Packets Ratio

Chapter 4

CONCLUSION

This project simulates a message passing multiprocessor system using Simics. Using an MPI matrix multiplication program, processing time and network traffic information were collected to evaluate the performance of three separate systems: 4-node, 8-node, and 16-node. Several iterations of Simics simulations were performed to study the performance while varying the matrix size. The results indicate that as the matrix size gets larger and there are more processing nodes, there is a rapid increase in the processing time per node. However, the average processing time per node is lower when there are more nodes.

This project serves as a base for future research. Further studies may include performance analysis of a different problem. Other studies may include the simulation of alternative interconnection networks in Simics; for example, multiple Ethernet connections per node could be used to implement a hypercube interconnection network.

APPENDIX A. Simics Script to Run a 16-node MPI Network Simulation

if not defined create_network {$create_network = "yes"}
if not defined disk_image {$disk_image="tango-openmpi.craff"}
load-module std-components
load-module eth-links

$host_name = "master"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:31"
$ip_address = "10.10.0.13"
$host_name = "slave1"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:32"
$ip_address = "10.10.0.14"
$host_name = "slave2"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:33"
$ip_address = "10.10.0.15"
$host_name = "slave3"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:34"
$ip_address = "10.10.0.16"
$host_name = "slave4"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:35"
$ip_address = "10.10.0.17"
$host_name = "slave5"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:36"
$ip_address = "10.10.0.18"
$host_name = "slave6"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:37"
$ip_address = "10.10.0.19"
$host_name = "slave7"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:38"
$ip_address = "10.10.0.20"
$host_name = "slave8"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:39"
$ip_address = "10.10.0.21"
$host_name = "slave9"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:40"
$ip_address = "10.10.0.22"
$host_name = "slave10"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:41"
$ip_address = "10.10.0.23"
$host_name = "slave11"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:42"
$ip_address = "10.10.0.24"
$host_name = "slave12"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:43"
$ip_address = "10.10.0.25"
$host_name = "slave13"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:44"
$ip_address = "10.10.0.26"
$host_name =
"slave14" run-command-file "%script%/tango-common.simics" $mac_address = "10:10:10:10:10:45" $ip_address = "10.10.0.27" $host_name = "slave15" run-command-file "%script%/tango-common.simics" set-memory-limit 980 28 APPENDIX B. MPI Program for Matrix Multiplication The Matrix Multiplication MPI program was found on the Internet in the following website URL: http://www.daniweb.com/software-development/c/code/334470/matrixmultiplication-using-mpi-parallel-programming-approach. A request to Viraj Brian Wijesuriya, the author of the code, was submitted asking authorization to use his code in this study. Below are the screenshots of the email requesting and authorizing permission to use the Matrix Multiplication MPI program. Email Sent to Request Permission to Use Matrix Multiplication Program using MPI. Email Received from Viraj Brian Wijesuriya granting authorization to use his program. 29 A Simics MAGIC(n) function has been added to the Matrix Multiplication Program to insert a breakpoint to invoke a callback function to collect simulation data. MAGIC (1) and MAGIC(2) are executed by the master node to dump start and end processing time and to Start and Stop network traffic capture. MAGIC(3) and MAGIC(4) are executed by each slaves to dump start and end processing time. /*********************************************************************** * Matrix Multiplication Program using MPI. * Viraj Brian Wijesuriya - University of Colombo School of Computing, Sri Lanka. * Works with any type of two matrixes [A], [B] which could be multiplied to produce * a matrix [c]. * Master process initializes the multiplication operands, distributes the multiplication * operation to worker processes and reduces the worker results to construct the final * output. ***********************************************************************/ #include<stdio.h> #include<mpi.h> #include <magic-instruction.h> //part of Simics SW #define NUM_ROWS_A 12 //rows of input [A] #define NUM_COLUMNS_A 12 //columns of input [A] #define NUM_ROWS_B 12 //rows of input [B] #define NUM_COLUMNS_B 12 //columns of input [B] #define MASTER_TO_SLAVE_TAG 1 //tag for messages sent from master to slaves #define SLAVE_TO_MASTER_TAG 4 //tag for messages sent from slaves to master void makeAB(); void printArray(); //makes the [A] and [B] matrixes //print the content of output matrix [C]; int rank; //process rank int size; //number of processes int i, j, k; //helper variables double mat_a[NUM_ROWS_A][NUM_COLUMNS_A]; //declare input [A] double mat_b[NUM_ROWS_B][NUM_COLUMNS_B]; //declare input [B] double mat_result[NUM_ROWS_A][NUM_COLUMNS_B];//declare output [C] double start_time; //hold start time double end_time; // hold end time int low_bound; //low bound of the number of rows of [A] allocated to a slave int upper_bound; //upper bound of the number of rows of [A] allocated to a slave int portion; //portion of the number of rows of [A] allocated to a slave MPI_Status status; MPI_Request request; int main(int argc, char *argv[]) // store status of an MPI_Recv //capture request of an MPI_Isend 30 { MPI_Init(&argc, &argv); //initialize MPI operations MPI_Comm_rank(MPI_COMM_WORLD, &rank); //get the rank MPI_Comm_size(MPI_COMM_WORLD, &size); //get number of processes /* master initializes work*/ if (rank == 0) { MAGIC (1); makeAB(); start_time = MPI_Wtime(); for (i = 1; i < size; i++) { //for each slave other than the master portion = (NUM_ROWS_A / (size - 1)); // calculate portion without master low_bound = (i - 1) * portion; if (((i + 1) == size) && 
((NUM_ROWS_A % (size - 1)) != 0)) { //if rows of [A] cannot be equally divided among slaves upper_bound = NUM_ROWS_A; //last slave gets all the remaining rows } else { //rows of [A] are equally divisable among slaves upper_bound = low_bound + portion; } //send the low bound first without blocking, to the intended slave MPI_Isend(&low_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD, &request); //next send the upper bound without blocking, to the intended slave MPI_Isend(&upper_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG + 1, MPI_COMM_WORLD, &request); //finally send the allocated row portion of [A] without blocking, to the intended slave MPI_Isend(&mat_a[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_A, MPI_DOUBLE, i, MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &request); } } //broadcast [B] to all the slaves MPI_Bcast(&mat_b, NUM_ROWS_B*NUM_COLUMNS_B, MPI_DOUBLE, 0, MPI_COMM_WORLD); /* work done by slaves*/ if (rank > 0) { MAGIC(3); //receive low bound from the master MPI_Recv(&low_bound, 1, MPI_INT, 0, MASTER_TO_SLAVE_TAG, 31 MPI_COMM_WORLD, &status); //next receive upper bound from the master MPI_Recv(&upper_bound, 1, MPI_INT, 0, MASTER_TO_SLAVE_TAG + 1, MPI_COMM_WORLD, &status); //finally receive row portion of [A] to be processed from the master MPI_Recv(&mat_a[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_A, MPI_DOUBLE, 0, MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &status); for (i = low_bound; i < upper_bound; i++) { //iterate through a given set of rows of [A] for (j = 0; j < NUM_COLUMNS_B; j++) { //iterate through columns of [B] for (k = 0; k < NUM_ROWS_B; k++) { //iterate through rows of [B] mat_result[i][j] += (mat_a[i][k] * mat_b[k][j]); } } } //send back the low bound first without blocking, to the master MPI_Isend(&low_bound, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD, &request); //send the upper bound next without blocking, to the master MPI_Isend(&upper_bound, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG + 1, MPI_COMM_WORLD, &request); //finally send the processed portion of data without blocking, to the master MPI_Isend(&mat_result[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_B, MPI_DOUBLE, 0, SLAVE_TO_MASTER_TAG + 2, MPI_COMM_WORLD, &request); MAGIC(4); } /* master gathers processed work*/ if (rank == 0) { for (i = 1; i < size; i++) { // untill all slaves have handed back the processed data //receive low bound from a slave MPI_Recv(&low_bound, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD, &status); //receive upper bound from a slave 32 MPI_Recv(&upper_bound, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG + 1, MPI_COMM_WORLD, &status); //receive processed data from a slave MPI_Recv(&mat_result[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_B, MPI_DOUBLE, i, SLAVE_TO_MASTER_TAG + 2, MPI_COMM_WORLD, &status); } printArray(); end_time = MPI_Wtime(); printf("\nRunning Time = %f\n\n", end_time - start_time); } MPI_Finalize(); MAGIC(2); return 0; } //finalize MPI operations void makeAB() { for (i = 0; i < NUM_ROWS_A; i++) { for (j = 0; j < NUM_COLUMNS_A; j++) { mat_a[i][j] = i + j; } } for (i = 0; i < NUM_ROWS_B; i++) { for (j = 0; j < NUM_COLUMNS_B; j++) { mat_b[i][j] = i*j; } } } void printArray() { for (i = 0; i < NUM_ROWS_A; i++) { printf("\n"); for (j = 0; j < NUM_COLUMNS_B; j++) printf("%8.2f ", mat_result[i][j]); } printf ("Done.\n"); end_time = MPI_Wtime(); printf("\nRunning Time = %f\n\n", end_time - start_time); } 33 APPENDIX C. 
Simics Script to Add L1 and L2 Cache Memories This script adds L1 and L2 cache memory to each simulated machine in a 4-node network simulation. Each processor has a 32KB write-through L1 data cache, a 32KB L1 instruction cache and a 256KB L2 cache with write-back policy. Instruction and data accesses are separated out by id-splitters and are sent to the respective caches. The splitter allows the correctly aligned accesses to go through and splits the incorrectly aligned ones into two accesses. The transaction staller (trans-staller) simulates main memory latency [11]. ##Add L1 and L2 caches to Master Node ## Transaction staller to represent memory latency. Stall instructions 239 cycles to simulate memory latency @master_staller = pre_conf_object("master_staller", "trans-staller", stall_time = 239) ##Latency of (L2 + RAM) in CPU cycles ## Master core @master_cpu0 = conf.master.motherboard.processor0.core[0][0] ## L2 cache(l2c0) for cpu0: 256KB with write-back @master_l2c0 = pre_conf_object("master_l2c0", "g-cache") @master_l2c0.cpus = master_cpu0 @master_l2c0.config_line_number = 4096 @master_l2c0.config_line_size = 64 ##64 blocks. Implies 512 lines @master_l2c0.config_assoc = 8 @master_l2c0.config_virtual_index = 0 @master_l2c0.config_virtual_tag = 0 @master_l2c0.config_write_back = 1 @master_l2c0.config_write_allocate = 1 @master_l2c0.config_replacement_policy = 'lru' @master_l2c0.penalty_read =37 ##Stall penalty (in cycles) for any incoming read transaction @master_l2c0.penalty_write =37 ##Stall penalty (in cycles) for any incoming write transaction @master_l2c0.penalty_read_next =22 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @master_l2c0.penalty_write_next =22 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7 @master_l2c0.timing_model = master_staller ##L1- Instruction Cache (ic0) : 32Kb @master_ic0 = pre_conf_object("master_ic0", "g-cache") @master_ic0.cpus = master_cpu0 34 @master_ic0.config_line_number = 512 @master_ic0.config_line_size = 64 ##64 blocks. Implies 512 lines @master_ic0.config_assoc = 8 @master_ic0.config_virtual_index = 0 @master_ic0.config_virtual_tag = 0 @master_ic0.config_write_back = 0 @master_ic0.config_write_allocate = 0 @master_ic0.config_replacement_policy = 'lru' @master_ic0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @master_ic0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @master_ic0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @master_ic0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7 @master_ic0.timing_model = master_l2c0 # L1 - Data Cache (dc0) : 32KB Write Through @master_dc0 = pre_conf_object("master_dc0", "g-cache") @master_dc0.cpus = master_cpu0 @master_dc0.config_line_number = 512 @master_dc0.config_line_size = 64 ##64 blocks. 
Implies 512 lines @master_dc0.config_assoc = 8 @master_dc0.config_virtual_index = 0 @master_dc0.config_virtual_tag = 0 @master_dc0.config_write_back = 0 @master_dc0.config_write_allocate = 0 @master_dc0.config_replacement_policy = 'lru' @master_dc0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @master_dc0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @master_dc0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @master_dc0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7 @master_dc0.timing_model = master_l2c0 # Transaction splitter for L1 instruction cache for master_cpu0 @master_ts_i0 = pre_conf_object("master_ts_i0", "trans-splitter") @master_ts_i0.cache = master_ic0 @master_ts_i0.timing_model = master_ic0 @master_ts_i0.next_cache_line_size = 64 35 # transaction splitter for L1 data cache for master_cpu0 @master_ts_d0 = pre_conf_object("master_ts_d0", "trans-splitter") @master_ts_d0.cache = master_dc0 @master_ts_d0.timing_model = master_dc0 @master_ts_d0.next_cache_line_size = 64 # ID splitter for L1 cache for master_cpu0 @master_id0 = pre_conf_object("master_id0", "id-splitter") @master_id0.ibranch = master_ts_i0 @master_id0.ibranch = master_ts_d0 #Add Configuration @SIM_add_configuration([master_staller, master_l2c0, master_ic0, master_dc0, master_ts_i0, master_ts_d0, master_id0], None); @master_cpu0.physical_memory.timing_model = conf.master_id0 #End of master ##Add L1 and L2 caches to slave1 Node ## transaction staller to represent memory latency. Stall instructions 239 cycles to simulate memory latency @slave1_staller = pre_conf_object("slave1_staller", "trans-staller", stall_time = 239) ##Latency of (L2 + RAM) in CPU cycles ## Slave1 core @slave1_cpu0 = conf.slave1.motherboard.processor0.core[0][0] ## L2 cache(l2c0) for cpu0: 256KB with write-back @slave1_l2c0 = pre_conf_object("slave1_l2c0", "g-cache") @slave1_l2c0.cpus = slave1_cpu0 @slave1_l2c0.config_line_number = 4096 @slave1_l2c0.config_line_size = 64 ##64 blocks. Implies 512 lines @slave1_l2c0.config_assoc = 8 @slave1_l2c0.config_virtual_index = 0 @slave1_l2c0.config_virtual_tag = 0 @slave1_l2c0.config_write_back = 1 @slave1_l2c0.config_write_allocate = 1 @slave1_l2c0.config_replacement_policy = 'lru' @slave1_l2c0.penalty_read = 37 ##Stall penalty (in cycles) for any incoming read transaction @slave1_l2c0.penalty_write = 37 ##Stall penalty (in cycles) for any incoming write transaction 36 @slave1_l2c0.penalty_read_next = 22 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @slave1_l2c0.penalty_write_next = 22 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7 @slave1_l2c0.timing_model = slave1_staller ##L1- Instruction Cache (ic0) : 32Kb @slave1_ic0 = pre_conf_object("slave1_ic0", "g-cache") @slave1_ic0.cpus = slave1_cpu0 @slave1_ic0.config_line_number = 512 @slave1_ic0.config_line_size = 64 ##64 blocks. 
Implies 512 lines @slave1_ic0.config_assoc = 8 @slave1_ic0.config_virtual_index = 0 @slave1_ic0.config_virtual_tag = 0 @slave1_ic0.config_write_back = 0 @slave1_ic0.config_write_allocate = 0 @slave1_ic0.config_replacement_policy = 'lru' @slave1_ic0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @slave1_ic0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @slave1_ic0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @slave1_ic0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7 @slave1_ic0.timing_model = slave1_l2c0 # L1 - Data Cache (dc0) : 32KB Write Through @slave1_dc0 = pre_conf_object("slave1_dc0", "g-cache") @slave1_dc0.cpus = slave1_cpu0 @slave1_dc0.config_line_number = 512 @slave1_dc0.config_line_size = 64 ##64 blocks. Implies 512 lines @slave1_dc0.config_assoc = 8 @slave1_dc0.config_virtual_index = 0 @slave1_dc0.config_virtual_tag = 0 @slave1_dc0.config_write_back = 0 @slave1_dc0.config_write_allocate = 0 @slave1_dc0.config_replacement_policy = 'lru' @slave1_dc0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @slave1_dc0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @slave1_dc0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. 37 @slave1_dc0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7 @slave1_dc0.timing_model = slave1_l2c0 # Transaction splitter for L1 instruction cache for slave1_cpu0 @slave1_ts_i0 = pre_conf_object("slave1_ts_i0", "trans-splitter") @slave1_ts_i0.cache = slave1_ic0 @slave1_ts_i0.timing_model = slave1_ic0 @slave1_ts_i0.next_cache_line_size = 64 # transaction splitter for L1 data cache for slave1_cpu0 @slave1_ts_d0 = pre_conf_object("slave1_ts_d0", "trans-splitter") @slave1_ts_d0.cache = slave1_dc0 @slave1_ts_d0.timing_model = slave1_dc0 @slave1_ts_d0.next_cache_line_size = 64 # ID splitter for L1 cache for slave1_cpu0 @slave1_id0 = pre_conf_object("slave1_id0", "id-splitter") @slave1_id0.ibranch = slave1_ts_i0 @slave1_id0.ibranch = slave1_ts_d0 #Add Configuration @SIM_add_configuration([slave1_staller, slave1_l2c0, slave1_ic0, slave1_dc0, slave1_ts_i0, slave1_ts_d0, slave1_id0], None); @slave1_cpu0.physical_memory.timing_model = conf.slave1_id0 #End of slave1 ##Add L1 and L2 caches to slave2 Node ## Transaction staller to represent memory latency. Stall instructions 239 cycles to simulate memory latency @slave2_staller = pre_conf_object("slave2_staller", "trans-staller", stall_time = 239) ##Latency of (L2 + RAM) in CPU cycles ## Slave2 core @slave2_cpu0 = conf.slave2.motherboard.processor0.core[0][0] ## L2 cache(l2c0) for cpu0: 256KB with write-back @slave2_l2c0 = pre_conf_object("slave2_l2c0", "g-cache") @slave2_l2c0.cpus = slave2_cpu0 @slave2_l2c0.config_line_number = 4096 @slave2_l2c0.config_line_size = 64 ##64 blocks. 
Implies 512 lines @slave2_l2c0.config_assoc = 8 @slave2_l2c0.config_virtual_index = 0 38 @slave2_l2c0.config_virtual_tag = 0 @slave2_l2c0.config_write_back = 1 @slave2_l2c0.config_write_allocate = 1 @slave2_l2c0.config_replacement_policy = 'lru' @slave2_l2c0.penalty_read = 37 ##Stall penalty (in cycles) for any incoming read transaction @slave2_l2c0.penalty_write = 37 ##Stall penalty (in cycles) for any incoming write transaction @slave2_l2c0.penalty_read_next = 22 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @slave2_l2c0.penalty_write_next = 22 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7 @slave2_l2c0.timing_model = slave2_staller ##L1- Instruction Cache (ic0) : 32Kb @slave2_ic0 = pre_conf_object("slave2_ic0", "g-cache") @slave2_ic0.cpus = slave2_cpu0 @slave2_ic0.config_line_number = 512 @slave2_ic0.config_line_size = 64 ##64 blocks. Implies 512 lines @slave2_ic0.config_assoc = 8 @slave2_ic0.config_virtual_index = 0 @slave2_ic0.config_virtual_tag = 0 @slave2_ic0.config_write_back = 0 @slave2_ic0.config_write_allocate = 0 @slave2_ic0.config_replacement_policy = 'lru' @slave2_ic0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @slave2_ic0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @slave2_ic0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @slave2_ic0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7 @slave2_ic0.timing_model = slave2_l2c0 # L1 - Data Cache (dc0) : 32KB Write Through @slave2_dc0 = pre_conf_object("slave2_dc0", "g-cache") @slave2_dc0.cpus = slave2_cpu0 @slave2_dc0.config_line_number = 512 @slave2_dc0.config_line_size = 64 ##64 blocks. Implies 512 lines @slave2_dc0.config_assoc = 8 @slave2_dc0.config_virtual_index = 0 @slave2_dc0.config_virtual_tag = 0 @slave2_dc0.config_write_back = 0 39 @slave2_dc0.config_write_allocate = 0 @slave2_dc0.config_replacement_policy = 'lru' @slave2_dc0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @slave2_dc0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @slave2_dc0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @slave2_dc0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. 
Rounding error, value should be 7 @slave2_dc0.timing_model = slave2_l2c0 # Transaction splitter for L1 instruction cache for slave2_cpu0 @slave2_ts_i0 = pre_conf_object("slave2_ts_i0", "trans-splitter") @slave2_ts_i0.cache = slave2_ic0 @slave2_ts_i0.timing_model = slave2_ic0 @slave2_ts_i0.next_cache_line_size = 64 # transaction splitter for L1 data cache for slave2_cpu0 @slave2_ts_d0 = pre_conf_object("slave2_ts_d0", "trans-splitter") @slave2_ts_d0.cache = slave2_dc0 @slave2_ts_d0.timing_model = slave2_dc0 @slave2_ts_d0.next_cache_line_size = 64 # ID splitter for L1 cache for slave2_cpu0 @slave2_id0 = pre_conf_object("slave2_id0", "id-splitter") @slave2_id0.ibranch = slave2_ts_i0 @slave2_id0.ibranch = slave2_ts_d0 #Add Configuration @SIM_add_configuration([slave2_staller, slave2_l2c0, slave2_ic0, slave2_dc0, slave2_ts_i0, slave2_ts_d0, slave2_id0], None); @slave2_cpu0.physical_memory.timing_model = conf.slave2_id0 #End of slave2 ##Add L1 and L2 caches to slave3 Node ## Transaction staller to represent memory latency. Stall instructions 239 cycles to simulate memory latency @slave3_staller = pre_conf_object("slave3_staller", "trans-staller", stall_time = 239) ##Latency of (L2 + RAM) in CPU cycles ## Slave3 core @slave3_cpu0 = conf.slave3.motherboard.processor0.core[0][0] 40 ## L2 cache(l2c0) for cpu0: 256KB with write-back @slave3_l2c0 = pre_conf_object("slave3_l2c0", "g-cache") @slave3_l2c0.cpus = slave3_cpu0 @slave3_l2c0.config_line_number = 4096 @slave3_l2c0.config_line_size = 64 ##64 blocks. Implies 512 lines @slave3_l2c0.config_assoc = 8 @slave3_l2c0.config_virtual_index = 0 @slave3_l2c0.config_virtual_tag = 0 @slave3_l2c0.config_write_back = 1 @slave3_l2c0.config_write_allocate = 1 @slave3_l2c0.config_replacement_policy = 'lru' @slave3_l2c0.penalty_read = 37 ##Stall penalty (in cycles) for any incoming read transaction @slave3_l2c0.penalty_write = 37 ##Stall penalty (in cycles) for any incoming write transaction @slave3_l2c0.penalty_read_next = 22 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @slave3_l2c0.penalty_write_next = 22 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7 @slave3_l2c0.timing_model = slave3_staller ##L1- Instruction Cache (ic0) : 32Kb @slave3_ic0 = pre_conf_object("slave3_ic0", "g-cache") @slave3_ic0.cpus = slave3_cpu0 @slave3_ic0.config_line_number = 512 @slave3_ic0.config_line_size = 64 ##64 blocks. Implies 512 lines @slave3_ic0.config_assoc = 8 @slave3_ic0.config_virtual_index = 0 @slave3_ic0.config_virtual_tag = 0 @slave3_ic0.config_write_back = 0 @slave3_ic0.config_write_allocate = 0 @slave3_ic0.config_replacement_policy = 'lru' @slave3_ic0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @slave3_ic0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @slave3_ic0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @slave3_ic0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7 @slave3_ic0.timing_model = slave3_l2c0 41 # L1 - Data Cache (dc0) : 32KB Write Through @slave3_dc0 = pre_conf_object("slave3_dc0", "g-cache") @slave3_dc0.cpus = slave3_cpu0 @slave3_dc0.config_line_number = 512 @slave3_dc0.config_line_size = 64 ##64 blocks. 
Implies 512 lines @slave3_dc0.config_assoc = 8 @slave3_dc0.config_virtual_index = 0 @slave3_dc0.config_virtual_tag = 0 @slave3_dc0.config_write_back = 0 @slave3_dc0.config_write_allocate = 0 @slave3_dc0.config_replacement_policy = 'lru' @slave3_dc0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read transaction @slave3_dc0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write transaction @slave3_dc0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7. @slave3_dc0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions issued by the cache to the next level cache. Rounding error, value should be 7 @slave3_dc0.timing_model = slave3_l2c0 # Transaction splitter for L1 instruction cache for slave3_cpu0 @slave3_ts_i0 = pre_conf_object("slave3_ts_i0", "trans-splitter") @slave3_ts_i0.cache = slave3_ic0 @slave3_ts_i0.timing_model = slave3_ic0 @slave3_ts_i0.next_cache_line_size = 64 # transaction splitter for L1 data cache for slave3_cpu0 @slave3_ts_d0 = pre_conf_object("slave3_ts_d0", "trans-splitter") @slave3_ts_d0.cache = slave3_dc0 @slave3_ts_d0.timing_model = slave3_dc0 @slave3_ts_d0.next_cache_line_size = 64 # ID splitter for L1 cache for slave3_cpu0 @slave3_id0 = pre_conf_object("slave3_id0", "id-splitter") @slave3_id0.ibranch = slave3_ts_i0 @slave3_id0.ibranch = slave3_ts_d0 #Add Configuration @SIM_add_configuration([slave3_staller, slave3_l2c0, slave3_ic0, slave3_dc0, slave3_ts_i0, slave3_ts_d0, slave3_id0], None); @slave3_cpu0.physical_memory.timing_model = conf.slave3_id0 #End of slave3

APPENDIX D. Python Script to Collect Simulation Data

This script defines a hap callback function, which is invoked by the magic instructions included in the matrix multiplication program. The script uses the Simics API to get the CPU time and to run the commands that start and stop capturing the network traffic.

Python script to collect processor and network traffic statistics (matrix_100.py)

from cli import *
from simics import *

def hap_callback(user_arg, cpu, arg):
    if arg == 1:
        print "cpu name: ", cpu.name
        print "Start at= ", SIM_time(cpu)
        SIM_run_alone(run_command, "ethernet_switch0.pcap-dump matrix_100.txt")
    if arg == 2:
        print "cpu name: ", cpu.name
        print "End at= ", SIM_time(cpu)
    if arg == 3:
        print "cpu name: ", cpu.name
        print "Start at= ", SIM_time(cpu)
    if arg == 4:
        print "cpu name: ", cpu.name
        print "End at= ", SIM_time(cpu)
        SIM_run_alone(run_command, "ethernet_switch0.pcap-dump-stop")

SIM_hap_add_callback("Core_Magic_Instruction", hap_callback, None)

APPENDIX E. SSH-Agent Script

The ssh-agent script was found on the Internet at the following URL: http://www.cygwin.com/ml/cygwin/2001-06/msg00537.html. The script is added to the Linux shell startup file of the MPI user. A request was submitted to Joseph Reagle, the author of the script, asking for authorization to use it to automate ssh-agent at login time. Below are screenshots of the emails requesting and granting permission to use the ssh-agent script.

Email Sent to Request Permission to Use the ssh-agent Script.

Email Received from Joseph Reagle granting authorization to use his script.

The .bash_profile file contains the ssh-agent script, which is executed at login time. In addition, the .bash_profile and .bashrc files include the lines that add the Open MPI libraries and executables to the user's path.
.bash_profile File

# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs
SSH_ENV="$HOME/.ssh/environment"

function start_agent {
        echo "Initialising new SSH agent..."
        /usr/bin/ssh-agent | sed 's/^echo/#echo/' > "${SSH_ENV}"
        echo succeeded
        chmod 600 "${SSH_ENV}"
        . "${SSH_ENV}" > /dev/null
        /usr/bin/ssh-add;
}

# Source SSH settings, if applicable
if [ -f "${SSH_ENV}" ]; then
        . "${SSH_ENV}" > /dev/null
        # ps ${SSH_AGENT_PID} doesn't work under cygwin
        ps -ef | grep ${SSH_AGENT_PID} | grep ssh-agent$ > /dev/null || {
                start_agent;
        }
else
        start_agent;
fi

PATH=$PATH:$HOME/bin
export PATH=/home/mpiu/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/mpiu/openmpi/lib:$LD_LIBRARY_PATH
export PATH

.bashrc File

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# User specific aliases and functions
export PATH=/home/mpiu/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/mpiu/openmpi/lib:$LD_LIBRARY_PATH

45 APPENDIX F. User Guide

MP Simulation System Using Simics - User's Guide

This user guide describes the steps to install Open MPI in a Simics simulated target machine. The installation is saved as a craff file, which is then used to open several simulated machines connected through a network inside one Simics session. The guide has been prepared to avoid extra steps by automating configuration and settings that are used repeatedly.

Table of Contents
I. INSTALL SIMICS
II. SIMICS SUPPORT
III. NETWORK SIMULATION IN SIMICS
IV. REQUIRED COMPONENTS AND PREREQUISITES
V. OPEN MPI INSTALLATION AND CONFIGURATION
VI. CREATING A NEW CRAFF FILE
VII. STARTING MPI NETWORK SIMULATION WITH SIMICS SCRIPTS
VIII. RUNNING MPI PROGRAMS

I. INSTALL SIMICS

1. Download Simics files
Go to https://www.simics.net/pub/simics/4.6_wzl263/
You can come back to this link after your first installation to check for new versions and repeat the steps below. A newer version of Simics will be installed inside the Simics directory in a new separate directory. You will need to update the Simics icon to access the newer version. Download the following packages based on your operating system:
- Simics Base: simics-pkg-1000. This is the base product package that contains Simics Hindsight, Simics Accelerator, Simics Ethernet Networking, Simics Analyzer, and other functionality. Other packages are optional add-on products. 46
- x86-440BX Target: simics-pkg-2018. This package allows users to model various PC systems based on the Intel 440BX AGPset.
- Firststeps: simics-pkg-4005. Installation of this package is recommended if you are just starting with Simics. It allows users to model systems based on the Freescale MPC8641; most of the examples in the Simics Hindsight and Simics Networking documentation refer to the virtual systems from this package.

2. Simics Package Installation
Run the packages in the following order and enter their required "Decryption key":
1. simics-pkg-1000
2. simics-pkg-2018
3. simics-pkg-4005

II. SIMICS SUPPORT
- For documentation, use the Help option on the Simics Control window.
- For support about Simics 4.0 (and later versions): https://www.simics.net/mwf/board_show.pl?bid=401
- For support related to licensing and installing licenses: https://www.simics.net/mwf/board_show.pl?bid=400

III. NETWORK SIMULATION IN SIMICS
Simics provides a variety of pre-built virtual platform models. One of the target processor architectures available to academic users is Tango.
Tango is one of the models included in the x86-440bx target group, which models various PCs with x86 or AMD64 processors based on the Intel 440BX AGPset [8]. The default Tango configuration is the Fedora 5 operating system, a single 2000 MHz Pentium 4 processor, 256 MB of memory, a 19 GB IDE disk, an IDE CD-ROM, and a DEC21143 fast Ethernet controller with a direct interface to the PCI bus. A single simulated Tango machine is used to install the MPI implementation and to configure everything needed to run MPI programs. We create a craff file from this first simulated machine, containing all the software, programs and settings, and use it with a Simics script to open all the nodes of the simulated network, so the disk image for all the nodes is exactly the same. The Simics script also specifies whether 4, 8 or 16 nodes are opened, together with the MAC address, IP address and hostname of each individual node. 47
Simics allows the interconnection of several simulated computers using Ethernet links inside the same session. The Simics eth-links module provides an Ethernet switch component. The Ethernet switch works at the frame level and behaves like a real switch by learning which MAC addresses the different computers have. The create_network parameter must be set to yes to create an Ethernet link and to connect the primary adapter of each node to it. We can use a Simics script to set this parameter; see Appendix A. The following commands are used in the Simics script to create the network switch and its connections:
if not defined create_network {$create_network = "yes"}
load-module eth-links
Simics allows running network services using a service node. The Simics std-components module provides a service-node component. The service node is a virtual network node that acts as a server for a number of TCP/IP-based protocols and as an IP router between simulated networks. One service node is used, connected to the Ethernet switch. The following command is used in the Simics script in Appendix A to load the std-components module:
load-module std-components
As described above, Simics provides all the required components to simulate an entire virtual network. For more detailed information about the simulation components, the Configuration Browser and Object Browser tools can be used. These tools are accessible via the Tools menu of the Simics Control window.

IV. REQUIRED COMPONENTS AND PREREQUISITES
To successfully complete the steps in this guide, basic knowledge of Linux command-line usage, network protocols, Simics commands, MPI and clustering concepts is recommended. By carefully following the step-by-step instructions in this guide you should not encounter any problems in achieving a successful MPI cluster installation and setup. In this documentation, interaction with Simics in the Command Line Window and with the simulated target is presented in "consolas font". User input is presented in "bold font". Comments are presented in "italic font". For this user guide you need the Simics Command Line: start Simics and select Tools > Command Line Window to open it. 48

1. Enabling Video Text Consoles
A Simics limitation is that a Windows host cannot run more than one graphic console at a time. To run multiple machines on Windows we need to switch from the graphic console to the text console. Video text consoles are enabled with the text_console variable.
simics> $text_console = "yes"
2. Setting the Real-Time Clock
The pre-built simulated machines provided by Simics have their date set to 2008. To update the date and time of the real-time clock at boot we use the rtc_time parameter. If we skip this step we will get an error during the Open MPI installation specifying that the configuration files created during the Open MPI configuration are older than the binary files.
simics> $rtc_time = "2012-10-27 00:10:00 UTC"

3. Running a machine configuration script
All the Simics machine configuration scripts are located in the "targets" folder inside the workspace. Simics can load a Simics script with the run-command-file command. The following command is used to start a Tango machine.
simics> run-command-file targets/x86-440bx/tango-common.simics

4. Starting the simulated machine
Now you can start the simulation by clicking Run Forward, or by entering the command "continue" (or just "c") at the prompt.
simics> c
Figure F.1 shows the commands entered prior to starting a simulated machine. Figure F.1 Required Simics Commands 49

5. Log in on the Target Machine
The target OS will be loaded on the target machine. It will take a few minutes; then you will be presented with a login prompt.
tango login: root
Password: simics

6. Enabling Real-Time Mode
In some cases, simulated time may run faster than real time. This can happen after the OS is loaded and the machine is idle; if you attempt to type the password, the machine may time out too quickly. To avoid having the virtual time progress too quickly, you can activate the real-time mode feature.
simics> enable-real-time-mode
The enable-real-time-mode command prevents the virtual time from progressing faster than real time. Once you have your environment the way you want it, you can turn off real-time mode with disable-real-time-mode.

7. Creating the MPI user
To run MPI programs, each machine must have a user with the same name and the same home folder on all the machines. In this user guide, "mpiu" is the name given to the user. Type the following commands to create a new user with a password. I chose to enter "simics" as the mpiu user password. UNIX will give you a warning prompt if your password is weak, but you can ignore the message.
root@tango# useradd mpiu
root@tango# passwd mpiu
New UNIX password: simics   // you will be required to enter the password twice
Figure F.2 shows the commands used to create the MPI user and set up its password. Figure F.2 Creating the mpiu user 50

8. Getting the required files into the Simulated Machine
In order to install Open MPI and run MPI programs we need to transfer the Open MPI installation file and all the required source code from the host machine to the target machine. When a target machine mounts the host machine, the mounting is done at the root level, so it is a good idea to arrange the files to be mounted on the host machine first. A very accessible location to place the Open MPI binaries is the c:\ folder. It is also recommended that all the source code be placed inside one folder. Figure F.3 displays a screenshot of Windows Explorer showing the file arrangement in the c:\ folder. Notice that a "programs" folder contains all the code to be mounted in the target machine. Figure F.3 Organization of the Open MPI binaries and source codes in Windows host machine
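The arrangement shown in Figure F.3 corresponds roughly to the following layout on the host; the exact set of program files is whatever you plan to compile and use later (the matrix multiplication programs, the magic-instruction.h header and the modified startup files from Appendix E), so the file names below are illustrative:

C:\
   openmpi-1.2.9.tar
   programs\
      matrix_100.c
      matrix_200.c
      ...  (one matrix program per matrix size)
      magic-instruction.h
      .bash_profile
      .bashrc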
9. Mounting the host machine
SimicsFS allows users to access the file system of the host computer from the simulated machine. SimicsFS is already installed on the simulated machines distributed with Simics. To be able to run the mount command the user must have administrative privileges, so logging in with the root account is required.
root@tango# mount /host
51

10. Logging in on the target machine as user "mpiu"
To avoid permission problems later when running Open MPI commands or accessing files, a safe recommendation is to copy all needed files and to perform the Open MPI installation after logging in as the MPI user (e.g. mpiu).
root@tango# su - mpiu

11. Creating new directories
It is recommended that two working directories be created: 1) an "openmpi" directory where Open MPI will be installed, and 2) a "programs" directory where the working files will be placed.
mpiu@tango$ mkdir openmpi
mpiu@tango$ mkdir programs

12. Copying files to the MPI user's home directory
We copy the Open MPI tar file directly to the mpiu user's home directory. We also copy the content of the host machine's programs folder to the simulated machine's programs directory.
mpiu@tango$ cp /host/openmpi-1.2.9.tar /home/mpiu
mpiu@tango$ cp /host/programs/* /home/mpiu/programs/

13. Unmounting the host machine file system
We need to log in as the root user in order to unmount the host machine from the simulated target machine. Once logged in as root we can enter the umount command.
mpiu@tango$ su - root
root@tango# umount /host

14. Setting up SSH for communication between nodes
MPI uses the Secure Shell (SSH) network protocol to send and receive data to and from the different machines. You must log in with the mpiu account to configure SSH.
root@tango# su - mpiu
A personal private/public key pair is generated using the ssh-keygen command. When prompted for the file in which to save the SSH key, press Enter to use the default location, and enter your own passphrase. In this user guide we use "simics" as the passphrase. The "-t" option specifies the type of key to create; RSA keys are recommended, as they are considered more secure than DSA keys. 52
mpiu@tango$ ssh-keygen -t rsa   <takes a few minutes>
Enter file in which to save the key (/home/mpiu/.ssh/id_rsa): <enter>
Enter passphrase (empty for no passphrase): simics
Enter same passphrase again: simics
Next we append the public key generated by the ssh-keygen command to the authorized_keys file inside the ~/.ssh directory.
mpiu@tango$ cd .ssh
mpiu@tango$ cat id_rsa.pub >> authorized_keys
mpiu@tango$ cd ..
We also need to correct the file permissions to allow the user to connect remotely to the other nodes.
mpiu@tango$ chmod 700 ~/.ssh
mpiu@tango$ chmod 644 ~/.ssh/authorized_keys
Figure F.4 shows a screenshot of the simulated machine running all the commands to configure SSH. Figure F.4 Setting SSH in the simulated machine
The first time SSH is used to connect to a target machine, host authentication is required. Because the simulated network consists of several nodes, performing the host authentication on each node could be time consuming. To avoid this step, the SSH configuration file must be modified to set StrictHostKeyChecking to no. In order to change this configuration we must log in as the root user.
mpiu@tango$ su - root
Password: simics
53
Then we need to edit the SSH configuration file: locate the StrictHostKeyChecking option, uncomment it and set it to no. To edit the ssh_config file use the command below.
root@tango# vi /etc/ssh/ssh_config
Figure F.5 shows a screenshot of the ssh_config file being edited to set StrictHostKeyChecking to no. Figure F.5 Editing SSH Configuration File
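After the edit, the relevant lines of /etc/ssh/ssh_config look roughly like this (only the StrictHostKeyChecking setting matters for this guide):

Host *
        StrictHostKeyChecking no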
15. Setting ssh-agent to run upon login
Because Open MPI will use SSH to connect to each of the machines to run MPI programs, we need to ensure that the passphrase does not have to be entered for each connection. The ssh-agent program allows us to type the passphrase once; after that, all the following SSH invocations are authenticated automatically. Appendix E presents a modified .bash_profile that includes the script to run ssh-agent automatically when you log in as the MPI user. It is recommended to copy the files from Appendix E into separate files in the "programs" folder on the host machine. These startup files will be mounted into the target machine; you then need to replace the original startup files ".bash_profile" and ".bashrc" with their respective modified versions. Figure F.3 shows the startup files placed inside the programs folder on the host machine. You can use the following commands to replace the startup files.
mpiu@tango$ cd programs
mpiu@tango$ cp .bash_profile /home/mpiu/.bash_profile
mpiu@tango$ cp .bashrc /home/mpiu/.bashrc
mpiu@tango$ cd ..
54

V. OPEN MPI INSTALLATION AND CONFIGURATION
Open MPI installation files can be downloaded directly from the Open MPI site at URL: http://www.open-mpi.org/software/ompi/v1.6/. SimicsFS is required to copy the downloaded Open MPI file from the host to the Tango target machine. SimicsFS is already available in the Tango craff file, and it allows you to mount the host into the target machine and copy files from the host to the target.
The OS version on the Tango target machine is Fedora 5, which is an old distribution, and the GNU Compiler Collection version is 2.96. The recommendation from the Open MPI forum is to upgrade the Linux OS to something more up to date rather than just upgrading the GCC version; installing a new version of GCC can open a "can of worms" of package and library dependencies that are hard to resolve. Upgrading the Linux version, however, would also require extra work in Simics. Consequently, the approach used to install Open MPI on the Tango target was to start with the latest Open MPI version (1.6.2) and work backwards through the release series to see which versions work. Versions 1.6.2 and 1.4.5 failed to install; version 1.2.9 installed successfully. The installation process is the same as for any package installation in Linux: download, extract, configure and install. The main difference is the amount of time spent; configuration and installation take about three hours in Simics. The final step needed to run an Open MPI program is to add the Open MPI executables and libraries to the MPI user's shell startup files. This step is very important because Open MPI must be able to find its executables in the MPI user's PATH on every node. In this user guide, the host where the MPI program is invoked is called "master" and the rest of the nodes are identified as "slaves".

1. Installing Open MPI
The Open MPI installation consists of three steps: 1) unpack the tar file, 2) run the provided configure script, and 3) run the "make all install" command. Enter the tar command to decompress the Open MPI file.
root@tango# su - mpiu
mpiu@tango$ tar xf openmpi-1.2.9.tar
The tar command creates a directory with the same name as the tar file, into which all the installation files are decompressed.
We need to change into the new directory in order to configure the Open MPI installation. 55 The configure script supports different command-line options. The "--prefix" option tells the Open MPI installer where to install the Open MPI library. In this user guide we install the Open MPI libraries under the "openmpi" directory that we created in Section IV step 11.
mpiu@tango$ cd openmpi-1.2.9
mpiu@tango$ ./configure --prefix=/home/mpiu/openmpi
<...lots of output...>   takes about 35 minutes
Figure F.6 is a screenshot of entering the tar and configure commands. Figure F.6 Decompressing and Configuring the Open MPI installation files
The last step to install the Open MPI libraries is to run the "make all install" command. This step collects all the required executables and scripts in the bin subdirectory of the directory specified by the prefix option of the configure command.
mpiu@tango$ make all install
<...lots of output...>   takes about 1hr 50 minutes

2. Adding Open MPI to the user's PATH
Open MPI requires that its executables are in the MPI user's PATH on every node on which you run an MPI program. Because we installed Open MPI with the prefix /home/mpiu/openmpi in Step 1, the following should be in the mpiu user's PATH and LD_LIBRARY_PATH.
export PATH=/home/mpiu/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/mpiu/openmpi/lib:$LD_LIBRARY_PATH
You can use the "vi" editor as below to add the above two lines.
mpiu@tango$ vi .bash_profile
mpiu@tango$ vi .bashrc
56

3. Moving the magic-instruction.h header file
The magic-instruction.h file is found in the simics/src/include installation directory. This file was already placed in the "programs" folder when it was copied into the target in Section IV step 12. The magic instructions have to be compiled into the MPI program binaries, so this header file must be moved into the openmpi/include directory.
mpiu@tango$ cd programs
mpiu@tango$ mv magic-instruction.h /home/mpiu/openmpi/include/

4. Testing the Open MPI Installation
In order to run any MPI command we need the binaries and libraries in the mpiu user's PATH, which was set in Section V step 2, so we need to log out and log in again for the new PATH to take effect. Figure F.7 shows the commands used to verify that the installation finished successfully.
mpiu@tango$ su - mpiu
Password: simics
mpiu@tango$ which mpicc
mpiu@tango$ which mpirun
Figure F.7 Commands to Test Open MPI installation

5. Compiling the MPI programs
In this step we compile all the MPI programs so that they are ready to execute when the 4-, 8- or 16-node machines are created. It is recommended to have a copy of the program for each matrix size to avoid making changes on each target machine later. Figure F.8 shows the matrix multiplication MPI programs, one for each matrix size. Figure F.8 MPI programs 57
The command used to compile is mpicc. Figure F.9 shows the mpicc command being entered to compile an MPI program.
mpiu@tango$ mpicc /home/mpiu/programs/matrix_100.c -o matrix1_100
Figure F.9 Compiling MPI program
Figure F.10 shows the contents of the mpiu user's home directory: the hidden shell profile files, the executable MPI programs, the openmpi and programs directories, and the secure shell directory created after configuring SSH. Figure F.10 Content of the mpiu user's home directory
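The full matrix multiplication program is listed in Appendix B. As a quick orientation, the sketch below shows the general master/slave structure such a program follows and where the MAGIC() calls from magic-instruction.h (used by the data-collection script in Appendix D and in Section VIII step 2) might be placed. The matrix size, variable names, work distribution and exact placement of the magic instructions are illustrative only and do not reproduce Appendix B.

/*
 * Minimal sketch of an MPI matrix multiplication program with Simics
 * magic instructions; illustrative only, not the project code.
 */
#include <stdio.h>
#include <mpi.h>
#include "magic-instruction.h"       /* provides the MAGIC(n) macro          */

#define N 100                        /* matrix dimension, as in matrix_100.c */

static double a[N][N], b[N][N], c[N][N];

int main(int argc, char *argv[])
{
    int rank, size, i, j, k, rows;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    rows = N / (size - 1);           /* rows handled by each slave (sketch)  */

    if (rank == 0) {                 /* master node                          */
        /* ... initialize a and b here ... */
        MAGIC(1);                    /* arg 1: callback prints start time and starts the pcap capture */
        for (i = 1; i < size; i++)
            MPI_Send(&a[(i - 1) * rows][0], rows * N, MPI_DOUBLE,
                     i, 0, MPI_COMM_WORLD);
        MPI_Bcast(&b[0][0], N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        for (i = 1; i < size; i++)
            MPI_Recv(&c[(i - 1) * rows][0], rows * N, MPI_DOUBLE,
                     i, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MAGIC(4);                    /* arg 4: callback prints end time and stops the pcap capture */
    } else {                         /* slave nodes                          */
        MPI_Recv(&a[0][0], rows * N, MPI_DOUBLE, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Bcast(&b[0][0], N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MAGIC(2);                    /* arg 2: callback prints this node's start time */
        for (i = 0; i < rows; i++)
            for (j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
        MAGIC(3);                    /* arg 3: callback prints this node's end time */
        MPI_Send(&c[0][0], rows * N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

A program of this shape is compiled with mpicc exactly as shown in Figure F.9 and is run with mpirun as described in Section VIII.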
58
VI. CREATING A NEW CRAFF FILE
In Simics, images are read-only. This means that modifications made in a target machine are not written to the image file. A way to save the changes made to an image is to shut down the target machine and then use the save-persistent-state command; this creates a craff file with all the changes, which we will use to create a new disk image. Most of the OS images provided by Simics are files in the "craff" format. The craff utility is used to convert files to and from the craff format, and to merge multiple craff files into a single craff file. In this project, we use the craff utility to merge the original simulated target machine OS image with the persistent-state file that contains the Open MPI installation and the configuration necessary to run MPI programs. The merged output file is used as the new OS image for all the nodes in the simulated MPI cluster network. By using a new craff file we only have to install and configure one simulated machine, instead of repeating the entire configuration on each individual node.

1. Saving a Persistent State with Open MPI installed
The first step to create a craff file is to shut down the target properly using the appropriate commands for the target OS. Shutting down the system flushes all target changes to the simulated disk, and Simics stops after the target system is powered off. At this point the save-persistent-state command is used to save the state of the machine with the Open MPI installation and the settings performed previously. In order to shut down the target machine, log in as the root user, because the mpiu user has not been granted administrative privileges. Figure F.11 shows the shutdown command.
mpiu@tango$ su - root
Password: simics
root@tango# shutdown -h 1
Figure F.11 Shutting down the target machine 59
The save-persistent-state command dumps the entire disk image of the target machine to the host disk. You can run this command using the Simics Command Line Window or the Simics Control Window.
simics> save-persistent-state <file name>
In the Simics Control Window: go to File, select Save Persistent State, give a name to the file and exit Simics.

2. Using the craff utility to create a craff file
The craff utility is found inside the bin folder of each Simics workspace. You will need the craff program file, the original target machine craff file (downloaded from the Simics website), and a copy of the saved disk image from the target machine you shut down in the previous step. Figure F.12 shows the required files placed in the same directory prior to running the craff utility. Figure F.12 Files needed to create a new CRAFF file
On your Windows host, open a Windows command line, go to the folder where you placed the three files mentioned above, and execute the following command.
c:\path_to_your_directory> craff -o <new-file-name.craff> tango1-fedora5.craff tango.disk.hd_image.craff
Figure F.13 shows the craff utility command and its completion. Figure F.13 Running the CRAFF Utility 60
Once the new craff file is created, move it into the images folder inside the Simics installation path. Figure F.14 shows the new "tango-openmpi.craff" file inside the images folder. Figure F.14 The new craff file inside the images folder
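For example, to produce the tango-openmpi.craff image that is used in the rest of this guide, the merge command might look like this (the working directory name is hypothetical; the two input files are the ones described above):

c:\craff-files> craff -o tango-openmpi.craff tango1-fedora5.craff tango.disk.hd_image.craff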
3. Using the new craff file
You can directly use this new craff file by entering the following commands in the Simics Command Line:
simics> $disk_image="tango-openmpi.craff"
simics> run-command-file targets/x86-440bx/tango-common.simics
simics> c
If you need to make modifications to this new craff file and create a second craff file from it, you can repeat the steps of this section. 61

VII. STARTING MPI NETWORK SIMULATION WITH SIMICS SCRIPTS
Simics provides scripting capabilities using the Python language; all Simics commands are implemented as Python functions, and target machines are configured using Python scripts. There are two ways of scripting in Simics. One way is to write scripts that contain Simics commands, similar to typing commands at the command-line interface. The other way is to write scripts in the Python language. These two types of scripting can be combined, because Python instructions can be invoked from the command-line interface, and command-line instructions can be issued from Python. All target machine setup scripts are located in the read-only Simics installation folder; these scripts should not be modified. However, Simics allows users to add new components and modify configuration parameters in scripts placed inside the "targets" folder of the user workspace. Appendix A contains the new machine script used for this project. This script changes configuration settings and uses the new disk image we created in Section VI of this user guide. This section explains the configuration parameters used in that script and covers how to run it.

1. Using a new disk image
To use the new disk image that contains the Open MPI installation and settings, we use the command below, indicating the name of the new craff file.
$disk_image="tango-openmpi.craff"

2. Changing the simulated machine parameters
Before running the simulation, assign to each target machine in the simulated network its respective hostname, MAC address and IP address. The following parameters are used to specify the individual settings.
$host_name = "master"
$mac_address = "10:10:10:10:10:31"
$ip_address = "10.10.0.13"

3. Changing other configuration parameters
This step is optional. You can change default configuration values such as the memory size, clock frequency or disk size. You can do this at a later point, but you will save time by doing it now if you decide to change them. You can enter the following parameters to accomplish these changes. 62
simics> $memory_megs = 1024
simics> $freq_mhz = 2800
The amount of memory is expressed in MB and the clock frequency in MHz.

4. Running the Simics simulated machine script
The tango-common.simics script defines a complete simulated machine. The common script calls three different scripts: 1) the "system.include" script to define the hardware, 2) the "setup.include" script to define the software configuration, and 3) the "eth-link.include" script to define the network settings.
run-command-file "%script%/tango-common.simics"
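The script in Appendix A gives each node its own hostname, MAC address and IP address, as described in Section III. As a rough sketch, and assuming the same pattern as the master parameters shown in step 2 above, an additional node might be described like this (the hostname, MAC address and IP address here are hypothetical; the actual per-node values are listed in Appendix A):

$disk_image = "tango-openmpi.craff"
$host_name = "slave1"
$mac_address = "10:10:10:10:10:32"
$ip_address = "10.10.0.14"
run-command-file "%script%/tango-common.simics"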
5. Setting the Memory Limit
Simics can run out of host memory if very big images are used or if the software running on the simulated system is bigger than the host memory. To prevent these kinds of problems, Simics implements a global image memory limitation controlled by the set-memory-limit command [7]. Simics sets a default memory limit at startup based on the amount of memory and the number of processors of the host machine. The set-memory-limit command shows the amount of memory available to run Simics on your host machine. Figure F.15 shows the set-memory-limit command and its output.
simics> set-memory-limit
Figure F.15 Set-memory-limit command
To prevent simulation crashes, it is recommended that you check the amount of memory available and always set a memory limit lower than the default. In the Simics script in Appendix A the memory limit is set to 980 MB; you should change this value based on your host memory. The amount of memory is specified as follows.
set-memory-limit 980
63

6. Starting up a 16-node MPI Cluster Network in Simics
To start a 16-node networked simulation you can copy and paste the script from Appendix A. Otherwise, you can modify the script and change the number of nodes as desired. It is recommended that the Simics script be placed inside the "targets" folder of the workspace being used.

7. Logging in to each node
In Section VI step 1 the simulated machine was shut down in order to create the craff file, so when we start running the simulation the OS is booted on each node. After the operating system is loaded on each node, log into each simulated machine as root first, and then log in as mpiu.
login: root
Password: <simics>
root@master# su - mpiu
A way to establish an SSH connection without re-entering the passphrase is to use ssh-agent, a program that remembers the passphrase while you are logged in as a specific user. To ensure that SSH does not ask for the passphrase when running the MPI programs, it is suggested that ssh-agent be used while logged in as mpiu. The following commands would have to be entered each time you log in as the mpiu user.
mpiu@master$ eval `ssh-agent`
mpiu@master$ ssh-add ~/.ssh/id_rsa
The best way to start ssh-agent is to add the above commands to the mpiu user's .bash_profile. In this way, all programs started in the mpiu user's login shell see the environment variables and are able to locate ssh-agent and query it for keys as needed. Appendix E contains the .bash_profile file including the ssh-agent program; the instructions to load this file into the target machine were given in Section IV step 15. Then, when the su - mpiu command is entered, the ssh-agent program located in the mpiu user's shell startup file runs, and you are required to enter the passphrase set up in Section IV step 14. Figure F.16 shows ssh-agent running after logging in as the mpiu user. 64 Figure F.16 Logging into the simulated machine as mpiu user

8. Saving a Checkpoint
Instead of booting the nodes each time we run the script and repeating the steps in this section, use the checkpointing feature. A checkpoint contains the entire state of the system, and Simics can load the checkpoint at the exact place the simulation was stopped when the checkpoint was saved. To save a checkpoint, stop the simulation, click on the Save Checkpoint icon and give it a name. A checkpoint directory will be created containing multiple configuration files.

9. Adding L1 and L2 Cache Memory to the Simulated target machines
The suggested way to add memory caches to Simics simulated machines is to use a checkpoint of a fully booted and configured machine; adding the caches while booting a simulation also takes a significant amount of time. Appendix C includes a Simics script to add an L1 and an L2 memory cache to the simulated target machines. The script simulates a system with separate instruction and data caches at level 1, backed by a level 2 cache, with a memory latency of 239 cycles.
The values of cache memory size, cache line size, number of blocks, and read and write penalty cycles have been taken from the "Performance Analysis Of A Hardware Queue In Simics" project prepared by Mukta Siddharth Jain in Summer 2012 [17]. To add the memory caches, open the checkpoint saved in the previous step and, before starting the simulation, click on File, select Append from Script, browse to the Simics script and select it. Then start the simulation. You can verify that the memory caches have been added by looking at the Object Browser tool or by running the following commands for each simulated target machine.
simics> master_l2c0.status
simics> master_l2c0.statistics
simics> master_dc0.statistics
65

VIII. RUNNING MPI PROGRAMS

1. Create a host file
To let Open MPI know on which processors to run MPI programs, a file with the machine names must be created. You can use the "vi" editor to create the file and add the hostnames.
mpiu@master$ vi nodes
Figure F.17 shows the "cat nodes" command displaying the content of the hostfile. Figure F.17 The hostfile indicates which hosts will run MPI programs

2. Collecting Simics Statistics
To collect data, run a Python file prior to starting the program execution. From the File menu in the Simics Control Window, click on "Run Python File" to run a Python script in the current session. Appendix D contains a Python script that uses the Simics API to define a callback that is triggered by the Core_Magic_Instruction hap; it collects the CPU processing time and starts and stops the network traffic capture. Four magic instruction calls have been added to the matrix multiplication code to mark where the master and slave nodes start and finish their program execution tasks. Load the Python script before starting the execution of the MPI program so that the CPU time and network traffic can be captured. The CPU processing time is displayed on the Simics command line. See Figure F.18. 66 Figure F.18 Simics output data

3. Running MPI programs
On the master node, type the following commands to run MPI programs. Figure F.19 shows the mpirun input command.
mpiu@master$ mpirun -np 16 -hostfile nodes matrix1_500
Figure F.19 Running an MPI program with 16 processes
To assign processes to the nodes in a round-robin fashion until the processes are exhausted, the "--bynode" option can be added to the mpirun command. See Figure F.20.
mpiu@master$ mpirun -np 16 -hostfile nodes --bynode matrix1_500
Figure F.20 Running an MPI program using the --bynode option
67 APPENDIX G.
Simulation Data 4cpu MPI Network Simulation Data Matrix: 100x100 MPI_Start Master Slave3 Slave1 Slave2 722.365702 746.419807 751.385091 752.385222 End Computation 746.475810 751.440992 752.441118 MPI_End 766.083543 767.082269 767.081550 767.081832 Computation Time 0.05600260 0.05590115 0.05589621 MPI_Time 43.71784104 20.66246190 15.69645875 14.69661042 Matrix: 200x200 MPI_Start Master Slave3 Slave2 Slave1 2378.286509 2414.397635 2415.370481 2416.370238 End Computation 2418.888566 2427.850499 2428.861047 MPI_End 2474.091269 2475.090210 2475.090070 2475.089694 Computation Time 4.49093051 12.48001824 12.49080936 MPI_Time 95.80475925 60.69257428 59.71958884 58.71945600 Matrix: 400x400 MPI_Start Master Slave1 Slave2 Slave3 1489.546897 1590.701789 1595.709734 1598.709066 End Computation 1616.960821 1617.970315 1620.963406 MPI_End 1686.406596 1687.405984 1687.405984 1687.406260 Computation Time 26.25903226 22.26058160 22.25433983 MPI_Time 196.85969970 96.70419566 91.69625050 88.69719363 Matrix: 500x500 MPI_Start Master Slave1 Slave2 Slave3 862.173132 1005.441582 1030.452400 1039.453641 End Computation 1033.636250 1058.638103 1067.639563 MPI_End 1116.978143 1117.974979 1117.975812 1117.975760 Computation Time 28.19466734 28.18570287 28.18592278 MPI_Time 254.80501086 112.53339692 87.52341195 78.52211974 Matrix: 600x600 MPI_Start Master Slave1 Slave2 Slave3 1533.081170 1792.468321 1797.479235 1800.480287 End Computation 1870.902267 1837.878533 1840.880464 MPI_End 1972.819810 1973.818846 1973.819214 1973.819494 Computation Time 78.43394612 40.39929791 40.40017697 MPI_Time 439.73863987 181.35052495 176.33997826 173.33920792 68 Matrix: 800x800 MPI_Start Master Slave1 Slave2 Slave3 1778.569250 2062.089247 2065.096370 2068.091930 End Computation 2130.028067 2133.378945 2138.029637 MPI_End 2273.258913 2274.258198 2274.258586 2274.258911 Computation Time 67.93881903 68.28257501 69.93770745 MPI_Time 494.68966245 212.16895092 209.16221590 206.16698079 Matrix: 1000x1000 MPI_Start Master Slave1 Slave2 Slave3 3125.391112 3489.433002 3492.447349 3495.442630 End Computation 3590.165339 3609.212790 3610.223270 MPI_End 3792.837398 3793.835218 3793.836341 3793.836196 Computation Time 100.73233659 116.76544179 114.78063929 MPI_Time 667.44628658 304.40221575 301.38899256 298.39356539 8cpu MPI Network Simulation Data Matrix: 100x100 MPI_Start Master Slave7 Slave2 Slave1 Slave4 Slave6 Slave3 Slave5 1229.67042716 1258.68981003 1260.77170613 1261.69382450 1267.77225009 1267.80813873 1268.69427626 1268.73028337 End Computation 1258.71646696 1260.79805126 1261.72032408 1267.79865797 1267.83451904 1268.72075278 1268.75665251 MPI_End 1280.35195825 1281.35212409 1281.35024746 1281.34990063 1281.35081107 1281.35130992 1281.35048816 1281.35133556 Computation Time 0.02665693 0.02634513 0.02649958 0.02640788 0.02638031 0.02647652 0.02636914 MPI_Time 50.68153109 22.66231406 20.57854133 19.65607613 13.57856098 13.54317119 12.65621190 12.62105219 Matrix: 200x200 MPI_Start Master Slave1 Slave2 Slave3 Slave4 Slave5 Slave6 Slave7 1626.29713568 1672.36681399 1671.36703746 1683.35151857 1682.35178777 1683.38737713 1682.38766433 1677.34618447 End Computation 1672.59335311 1671.59338341 1683.57789223 1682.58220647 1683.61851696 1682.61424883 1677.57204794 MPI_End 1700.53169460 1701.52885618 1701.52931159 1701.52967234 1701.53018634 1701.53007522 1701.53104728 1701.53139983 Computation Time 0.22653912 0.22634595 0.22637366 0.23041870 0.23113983 0.22658450 0.22586347 MPI_Time 74.23455892 29.16204219 30.16227413 18.17815377 19.17839857 
18.14269809 19.14338295 24.18521536 69 Matrix: 400x400 MPI_Start Master Slave7 Slave2 Slave5 Slave1 Slave6 Slave3 Slave4 871.91066166 937.03865279 939.05092996 939.05759105 940.05103014 940.05757316 942.04219551 943.04201990 End Computation 942.88445734 956.89924832 954.89942531 957.89987632 955.89884380 957.88374382 958.88396380 MPI_End 1070.10285269 1071.10191833 1071.10004529 1071.10119812 1071.09969739 1071.10154341 1071.10039223 1071.10072383 Computation Time 5.84580455 17.84831836 15.84183426 17.84884617 15.84127064 15.84154831 15.84194390 MPI_Time 198.19219103 134.06326554 132.04911533 132.04360707 131.04866725 131.04397026 129.05819672 128.05870393 Matrix: 500x500 MPI_Start Master Slave7 Slave2 Slave6 Slave1 Slave3 Slave5 Slave4 1719.73229724 1794.95598571 1796.96934522 1797.95127388 1797.95470917 1799.94924857 1800.93787157 1800.94680752 End Computation 1804.55938086 1822.58502584 1817.55571333 1823.56646722 1819.55218409 1820.54165492 1820.55046424 MPI_End 1952.07326800 1953.07216773 1953.07030933 1953.07217221 1953.06980916 1953.07077363 1953.07117095 1953.07106586 Computation Time 9.60339515 25.61568062 19.60443945 25.61175805 19.60293552 19.60378335 19.60365672 MPI_Time 232.34097076 158.11618202 156.10096411 155.12089833 155.11509999 153.12152506 152.13329938 152.12425834 Matrix: 600x600 MPI_Start Master Slave7 Slave2 Slave1 Slave6 Slave4 Slave5 Slave3 3123.34844119 3208.59624951 3212.58802805 3213.58839260 3213.58925662 3213.59907751 3214.58960817 3214.59065264 End Computation 3221.07851834 3247.07913131 3248.07189130 3238.06224079 3238.06807530 3239.06457682 3239.07085680 MPI_End 3397.21104070 3398.20994055 3398.20807244 3398.20761698 3398.20956417 3398.20875652 3398.20908689 3398.20841173 Computation Time 12.48226883 34.49110326 34.48349870 24.47298417 24.46899779 24.47496865 24.48020416 MPI_Time 273.86259951 189.61369104 185.62004439 184.61922438 184.62030755 184.60967901 183.61947872 183.61775909 Matrix: 800x800 MPI_Start Master Slave1 Slave2 Slave3 Slave4 Slave5 Slave6 Slave7 284.25757219 801.93260784 826.94300260 845.95492160 854.96371434 863.95408786 870.96506228 873.96771140 End Computation 876.65523592 863.63877207 882.81892302 891.66248462 900.73314571 907.87970827 910.78251641 MPI_End 1074.64012311 1075.63734757 1075.63742559 1075.63799198 1075.63783111 1075.63858965 1075.64687475 1075.64746584 Computation Time 74.72262808 36.69576947 36.86400143 36.69877028 36.77905786 36.91464599 36.81480501 MPI_Time 790.38255092 273.70473973 248.69442299 229.68307039 220.67411677 211.68450179 204.68181247 201.67975444 70 Matrix: 1000x1000 MPI_Start Master Slave1 Slave2 Slave3 Slave4 Slave5 Slave6 Slave7 1522.91702460 2336.94109024 2359.94818475 2394.96030872 2411.97553305 2430.98142645 2439.98709711 2444.98853204 End Computation 2442.54765154 2417.52660733 2454.51962727 2469.53793908 2488.53526851 2497.55376958 2504.54846994 MPI_End 2700.00905902 2701.00549542 2701.00568689 2701.00570714 2701.00671076 2701.00719029 2701.00731751 2701.00798103 Computation Time 105.60656130 57.57842258 59.55931855 57.56240603 57.55384206 57.56667247 59.55993790 MPI_Time 1177.09203442 364.06440518 341.05750214 306.04539842 289.03117771 270.02576384 261.02022040 256.01944899 16cpu MPI Network Simulation Data Matrix: 100x100 MPI_Start Master Slave2 Slave1 Slave15 Slave4 Slave3 Slave6 Slave5 Slave8 Slave12 Slave7 Slave11 Slave10 Slave14 Slave9 Slave13 1993.58129168 2026.72543563 2027.64725516 2031.64490296 2033.72933284 2034.65115793 2039.73346672 2040.65528163 2040.72675336 2040.76566321 
2041.64871663 2041.68769368 2046.73361315 2046.77024893 2047.65574569 2047.69219107 End Computation 2026.75168710 2027.67361931 2031.67160120 2033.75552625 2034.67754021 2039.75995439 2040.68164647 2040.75321101 2040.79192126 2041.67496412 2041.71407727 2046.75989281 2046.79652894 2047.68213582 2047.71860572 MPI_End 2059.32665899 2060.32033448 2060.31981172 2060.32516928 2060.32068622 2060.32099322 2060.32167049 2060.32140158 2060.32231414 2060.32399194 2060.32193627 2060.32372970 2060.32315891 2060.32465165 2060.32320862 2060.32487018 Computation Time 0.02625147 0.02636415 0.02669824 0.02619341 0.02638228 0.02648767 0.02636484 0.02645765 0.02625805 0.02624749 0.02638359 0.02627966 0.02628001 0.02639013 0.02641465 MPI_Time 65.74536731 33.59489885 32.67255656 28.68026632 26.59135338 25.66983529 20.58820377 19.66611995 19.59556078 19.55832873 18.67321964 18.63603602 13.58954576 13.55440272 12.66746293 12.63267911 Matrix: 200x200 MPI_Start Master Slave2 Slave1 Slave4 Slave3 Slave15 Slave6 Slave5 Slave8 Slave12 Slave7 3268.08385629 3313.19804672 3314.19161181 3324.18439300 3325.18392085 3326.17686595 3326.20413662 3327.20410713 3331.18546993 3331.22084630 3332.18526357 End Computation 3313.32585749 3314.32037508 3324.31313181 3325.31685032 3326.31164634 3326.33584576 3327.33245527 3331.31426741 3331.34848416 3332.31902701 MPI_End 3353.22010746 3354.21475174 3354.21387719 3354.21556876 3354.21542403 3354.21910366 3354.21580929 3354.21547729 3354.21681386 3354.21902169 3354.21654440 Computation Time 0.12781077 0.12876327 0.12873881 0.13292947 0.13478039 0.13170914 0.12834814 0.12879748 0.12763786 0.13376344 MPI_Time 85.13625117 41.01670502 40.02226538 30.03117576 29.03150318 28.04223771 28.01167267 27.01137016 23.03134393 22.99817539 22.03128083 71 Matrix: 200x200 MPI_Start Slave11 Slave10 Slave14 Slave9 Slave13 3332.22040924 3337.19866067 3337.22516527 3338.19849702 3338.22499663 End Computation 3332.34901736 3337.33340938 3337.35897701 3338.33143329 3338.35316491 MPI_End 3354.21783930 3354.21760452 3354.26858100 3354.21750009 3354.21911963 Computation Time 0.12860812 0.13474871 0.13381174 0.13293627 0.12816828 MPI_Time 21.99743006 17.01894385 17.04341573 16.01900307 15.99412300 Matrix: 400x400 MPI_Start Master Slave1 Slave2 Slave3 Slave4 Slave5 Slave6 Slave7 Slave8 Slave9 Slave10 Slave11 Slave12 Slave13 Slave14 Slave15 1669.96760274 1713.10472271 1722.12646982 1731.13064021 1740.13596937 1749.14096623 1758.14673722 1765.26355876 1776.15687086 1785.15863479 1794.16359585 1803.16815393 1812.17588832 1821.17720943 1830.17268992 1831.18145679 End Computation 1722.24884089 1733.27656146 1742.27521609 1751.28567462 1760.28568895 1769.29819731 1776.40793429 1787.30687274 1796.30358620 1805.31369708 1814.31288272 1832.32146277 1832.32146277 1841.32786388 1832.33624252 MPI_End 1906.29251744 1907.26392218 1907.26421054 1907.26436907 1907.26516969 1907.26528472 1907.26584324 1907.26660398 1907.26657200 1907.26674351 1907.26708432 1907.26768403 1907.26792267 1907.26893542 1907.26916532 1906.30553260 Computation Time 9.14411818 11.15009164 11.14457588 11.14970525 11.14472272 11.15146009 11.14437553 11.15000188 11.14495141 11.15010123 11.14472879 20.14557445 11.14425334 11.15517396 1.15478573 MPI_Time 236.32491470 194.15919947 185.13774072 176.13372886 167.12920032 158.12431849 149.11910602 142.00304522 131.10970114 122.10810872 113.10348847 104.09953010 95.09203435 86.09172599 77.09647540 75.12407581 Matrix: 500x500 MPI_Start Master Slave1 Slave2 Slave3 Slave4 Slave5 Slave6 Slave7 Slave8 Slave9 Slave10 
Slave11 Slave12 Slave13 Slave14 Slave15 2231.67023532 2284.88848109 2291.90391551 2300.90452053 2309.91079733 2314.93637724 2325.92303026 2332.91889714 2343.92107338 2350.93480421 2359.93753845 2368.94149554 2377.94327473 2384.94717535 2393.93566879 2394.93668118 End Computation 2296.74258867 2303.75805757 2312.75872769 2321.76539874 2326.79081288 2337.77682365 2344.77278215 2355.77498362 2362.78990935 2371.79206804 2380.79551390 2389.79703215 2396.80154269 2405.79032685 2408.78922192 MPI_End 2492.36249418 2493.35673763 2493.35749369 2493.35783404 2493.35824820 2493.35880853 2493.35871283 2493.35896358 2493.35997552 2493.35987774 2493.36033007 2493.36087274 2493.36110547 2493.36172079 2493.36199705 2493.36249887 Computation Time 11.85410758 11.85414206 11.85420716 11.85460141 11.85443564 11.85379339 11.85388501 11.85391024 11.85510514 11.85452959 11.85401836 11.85375742 11.85436734 11.85465806 13.85254074 MPI_Time 260.69225886 208.46825654 201.45357818 192.45331351 183.44745087 178.42243129 167.43568257 160.44006644 149.43890214 142.42507353 133.42279162 124.41937720 115.41783074 108.41454544 99.42632826 98.42581769 72 Matrix: 600x600 MPI_Start Master Slave1 Slave2 Slave3 Slave4 Slave5 Slave6 Slave7 Slave8 Slave9 Slave10 Slave11 Slave12 Slave13 Slave14 Slave15 2321.48705911 2382.80632360 2395.78427512 2406.79969996 2417.79990693 2428.80836619 2439.81930768 2452.81946747 2461.82359410 2470.86287964 2485.84165891 2496.84633005 2507.85812801 2518.94034520 2529.91874286 2530.91914360 End Computation 2391.69300389 2412.66287500 2423.67547568 2434.67689186 2445.69091754 2456.69751829 2469.69696428 2478.70001509 2487.73904893 2502.72455471 2513.72216210 2524.73543172 2535.82257599 2546.79605027 2547.80207608 MPI_End 2630.98981395 2631.98380440 2631.98439996 2631.98491515 2631.98507321 2631.98570722 2631.98621068 2631.98633674 2631.98647162 2631.98670718 2631.98752783 2631.98784259 2631.98810105 2631.98868509 2631.98892925 2631.98930952 Computation Time 8.88668029 16.87859988 16.87577572 16.87698493 16.88255135 16.87821061 16.87749681 16.87642099 16.87616929 16.88289580 16.87583205 16.87730371 16.88223079 16.87730741 16.88293248 MPI_Time 309.50275484 249.17748080 236.20012484 225.18521519 214.18516628 203.17734103 192.16690300 179.16686927 170.16287752 161.12382754 146.14586892 135.14151254 124.12997304 113.04833989 102.07018639 101.07016592 Matrix: 800x800 MPI_Start Master Slave1 Slave2 Slave3 Slave4 Slave5 Slave6 Slave7 Slave8 Slave9 Slave10 Slave11 Slave12 Slave13 Slave14 Slave15 1764.57617370 1854.07358432 1865.07911709 1876.08806460 1887.09279447 1898.09497187 1909.09998482 1918.11403623 1931.12276182 1938.15990048 1953.13520228 1962.11486616 1975.15379495 1982.18493073 1997.15278888 1998.15525624 End Computation 1882.14500514 1888.44694720 1899.79853059 1911.56699445 1921.49781456 1932.94678633 1942.21379390 1956.58445661 1961.65359396 1977.63628562 1985.99358173 1998.67850110 2005.85221011 2021.03744474 2021.66745458 MPI_End 2158.02067293 2159.01369455 2159.01447136 2159.01431210 2159.01480510 2159.01532997 2159.01581722 2159.01617790 2159.01685689 2159.01701940 2159.01731192 2159.01804465 2159.01833244 2159.01836916 2159.01890707 2159.01944119 Computation Time 28.07142082 23.36783011 23.71046599 24.47419998 23.40284269 23.84680151 24.09975767 25.46169479 23.49369348 24.50108334 23.87871557 23.52470615 23.66727938 23.88465586 23.51219834 MPI_Time 393.44449923 304.94011023 293.93535427 282.92624750 271.92201063 260.92035810 249.91583240 240.90214167 227.89409507 220.85711892 205.88210964 
196.90317849 183.86453749 176.83343843 161.86611819 160.86418495 73 Matrix: 1000x1000 MPI_Start Master Slave1 Slave2 Slave3 Slave4 Slave5 Slave6 Slave7 Slave8 Slave9 Slave10 Slave11 Slave12 Slave13 Slave14 Slave15 3275.68051529 4395.80169101 4410.80356001 4421.80934211 4445.97720510 4463.82758659 4472.82831643 4487.83763063 4504.84396442 4521.84643371 4536.85516959 4543.85991717 4552.86620094 4567.87296207 4584.87956604 4591.88133015 End Computation 4440.97843196 4445.97720510 4456.97700815 4485.99339561 4499.00344678 4508.00208222 4523.00864127 4540.01671372 4557.01450955 4572.02277033 4579.05846781 4588.03586814 4603.04761829 4620.04697404 4627.06450868 MPI_End 4787.31447475 4788.30805880 4788.30900741 4788.30921830 4788.30985496 4788.31030996 4788.31071203 4788.31111848 4788.31134976 4788.31160182 4788.31197106 4788.31261940 4788.31291667 4788.31361662 4788.31353722 4788.31351373 Computation Time 45.17674095 35.17364509 35.16766604 40.01619051 35.17586019 35.17376579 35.17101064 35.17274930 35.16807584 35.16760074 35.19855064 35.16966720 35.17465622 35.16740800 35.18317853 MPI_Time 1511.63395946 392.50636779 377.50544740 366.49987619 342.33264986 324.48272337 315.48239560 300.47348785 283.46738534 266.46516811 251.45680147 244.45270223 235.44671573 220.44065455 203.43397118 196.43218358 Table 6 Processing Time and Network Traffic Data Collected 100x100 100x100 100x100 200x200 200x200 200x200 400x400 400x400 400x400 500x500 500x500 500x500 600x600 600x600 600x600 800x800 800x800 800x800 1000x1000 1000x1000 1000x1000 Nodes Matrix 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 cpu_time (sec) 43.72 50.68 65.75 95.80 74.23 85.14 196.86 198.19 234.35 254.81 232.34 260.69 439.74 273.86 309.50 494.69 790.38 393.44 667.45 1,177.09 1,511.63 Process ing Time per nodes 10.93 6.34 4.11 23.95 9.28 5.32 49.21 24.77 14.65 63.70 29.04 16.29 109.93 34.23 19.34 123.67 98.80 24.59 166.86 147.14 94.48 Avg. Comp utation Time (sec) 0.06 0.03 0.03 9.82 0.23 0.13 23.59 14.99 11.01 28.19 19.89 10.95 53.08 25.62 16.35 68.72 42.21 24.19 110.76 25.62 24.15 Total Bytes 453,918 834,269 1,713,104 1,756,748 3,209,844 6,139,355 6,800,694 12,306,152 23,753,325 10,597,448 19,196,017 36,550,814 15,219,443 27,651,090 52,020,906 27,045,266 48,799,556 92,816,860 42,197,191 76,366,413 145,552,789 Bytes per node No. of Packet s 113,480 641 104,284 1,256 107,069 2,910 439,187 1,860 401,231 3,492 383,710 6,974 1,700,174 5,487 1,538,269 10,534 1,484,583 20,661 2,649,362 8,118 2,399,502 15,493 2,284,426 30,445 3,804,861 11,457 3,456,386 21,911 3,251,307 42,209 6,761,317 19,897 6,099,945 36,856 5,801,054 73,645 10,549,298 30,574 9,545,802 56,899 9,097,049 114,989 No. of Packe ts per Node 160 157 182 465 437 436 1,372 1,317 1,291 2,030 1,937 1,903 2,864 2,739 2,638 4,974 4,607 4,603 7,644 7,112 7,187 Time between first/last packet (sec) 42.68 48.64 64.70 94.75 74.18 84.50 195.79 197.12 233.23 253.64 230.18 259.53 438.55 272.71 308.28 493.43 789.02 392.08 665.91 1,175.55 1,510.35 Avg Avg. packet packet /sec /size 15 26 45 20 47 83 28 53 89 32 67 117 26 80 137 40 47 188 46 48 76 708 664 589 944 919 880 1,239 1,168 1,150 1,305 1,239 1,201 1,328 1,262 1,232 1,359 1,324 1,260 1,380 1,342 1,266 Avg. bytes/sec Avg. 
Mbit /sec 10,636 17,152 26,478 18,541 43,270 72,657 34,735 62,431 101,846 41,781 83,396 140,837 34,704 101,394 168,745 54,811 61,848 236,728 63,368 64,962 96,370 0.09 0.14 0.21 0.15 0.35 0.58 0.28 0.50 0.82 0.33 0.67 1.13 0.28 0.81 1.35 0.44 0.50 1.89 0.51 0.52 0.77 74 75

Table 7 Processing Time, Total Bytes and Number of Packets Ratios

Matrix      Ratios     Processing Time Ratio   Bytes Ratio   No. of Packets Ratio
100x100     8 vs. 4    0.5796                  0.9190        0.9797
100x100     16 vs. 4   0.3760                  0.9435        1.1349
200x200     8 vs. 4    0.3874                  0.9136        0.9387
200x200     16 vs. 4   0.2222                  0.8737        0.9374
400x400     8 vs. 4    0.5034                  0.9048        0.9599
400x400     16 vs. 4   0.2976                  0.8732        0.9414
500x500     8 vs. 4    0.4559                  0.9057        0.9542
500x500     16 vs. 4   0.2557                  0.8622        0.9375
600x600     8 vs. 4    0.3114                  0.9084        0.9562
600x600     16 vs. 4   0.1760                  0.8545        0.9210
800x800     8 vs. 4    0.7988                  0.9021        0.9261
800x800     16 vs. 4   0.1988                  0.8580        0.9253
1000x1000   8 vs. 4    0.8818                  0.9049        0.9305
1000x1000   16 vs. 4   0.5662                  0.8623        0.9403

Table 8 Time before the start of the 1st slave (seconds)

Matrix      4-node    8-node    16-node
100x100     24.05     29.02     33.14
200x200     36.11     46.07     45.11
400x400     101.15    65.13     43.14
500x500     143.27    75.22     53.22
600x600     259.39    85.25     61.32
800x800     283.52    517.68    89.50
1000x1000   364.04    814.02    1120.12

76 BIBLIOGRAPHY

[1] David E. Culler, Jaswinder Pal Singh, "Parallel Computer Architecture: A Hardware/Software Approach". Morgan Kaufmann Publishers, 1999.
[2] Wind River Simics, URL: http://www.simics.net.
[3] MPI Group Management & Communicator, URL: http://static.msi.umn.edu/tutorial/scicomp/general/MPI/communicator.html
[4] Message Passing Interface (MPI): Overview and Goals, URL: www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report1.1/node2.htm#Node2
[5] FAQ: General information about the Open MPI Project, Section 3: What are the goals of the Open MPI Project? URL: http://www.open-mpi.org/faq/?category=general
[6] An Assessment of Beowulf-class Computing for NASA Requirements: Initial Findings from the First NASA Workshop on Beowulf-class Clustered Computing.
[7] Wind River Simics, "Hindsight User Guide.pdf", Simics version 4.6, Revision 4076, Date 2012-10-11, pp. 207, URL: http://www.simics.net.
[8] Message Passing Interface (MPI), URL: https://computing.llnl.gov/tutorials/mpi/
[9] Wind River Simics, "Target Guide x86.pdf", Simics version 4.6, Revision 4071, Date 2012-09-06, pp. 9, URL: http://www.simics.net.
[10] Simics Forum, URL: https://www.simics.net/mwf/forum_show.pl
[11] Wind River Simics, "Ethernet Networking User Guide.pdf", Simics version 4.6, Revision 4076, Date 2012-10-11, pp. 207, URL: http://www.simics.net.
[12] K computer, Specifications: Network, URL: http://en.wikipedia.org/wiki/K_computer
[13] MPICH2 Frequently Asked Questions, URL: http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions
[14] Setting up a Beowulf Cluster Using Open MPI on Linux, URL: http://techtinkering.com/2009/12/02/setting-up-a-beowulf-cluster-using-openmpi-on-linux/
[15] Considerations in Specifying Beowulf Clusters, URL: http://h18002.www1.hp.com/alphaserver/download/Beowulf_Clusters.PDF
[16] FAQ: What kinds of systems / networks / run-time environments does Open MPI support? Section 4: What run-time environments does Open MPI support? URL: http://www.open-mpi.org/faq/?category=supported-systems
[17] Jain, Mukta Siddharth, "Performance analysis of a hardware queue in Simics", URL: http://csus-dspace.calstate.edu/xmlui/handle/10211.9/1857
[18] MPI example programs, URL: http://users.abo.fi/Mats.Aspnas/PP2010/examples/MPI/