APPENDIX B. MPI Program for Matrix Multiplication

MESSAGE PASSING MULTIPROCESSING SYSTEM SIMULATION
USING SIMICS
A Project
Presented to the faculty of the Department of Computer Science
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
Computer Science
by
Sandra Guija
FALL
2012
© 2012
Sandra Guija
ALL RIGHTS RESERVED
ii
MESSAGE PASSING MULTIPROCESSING SYSTEM SIMULATION
USING SIMICS
A Project
by
Sandra Guija
Approved by:
__________________________________, Committee Chair
Nikrouz Faroughi, Ph.D.
__________________________________, Second Reader
William Mitchell, Ph.D.
____________________________
Date
iii
Student: Sandra Guija
I certify that this student has met the requirements for format contained in the University
format manual, and that this project is suitable for shelving in the Library and credit is to
be awarded for the project.
__________________________, Graduate Coordinator
Nikrouz Faroughi
Department of Computer Science
iv
___________________
Date
Abstract
of
MESSAGE PASSING MULTIPROCESSING SYSTEM SIMULATION
USING SIMICS
by
Sandra Guija
Parallel processing uses multiple processors to compute a large computer problem. Two
main multiprocessing programming models are shared memory and message passing. In
the latter model, processes communicate by exchanging messages using the network.
The project consisted of two parts:
1) To investigate the performance of a multithreaded matrix multiplication program,
and
2) To create a user guide for how to setup a message passing multiprocessor
simulation environment using Simics including MPI (message passing interface)
installation, OS craff file creation, memory caches addition and python scripts
usage.
v
The simulation results and performance analysis indicate as matrix size increases and the
number of processing nodes increases, the rate at which bytes communicated and the
number of packets increase is faster than the rates at which processing time per node
decreases.
_______________________, Committee Chair
Nikrouz Faroughi, Ph.D.
_______________________
Date
vi
ACKNOWLEDGEMENTS
Success is the ability to go from one failure to another with no loss of enthusiasm.
[Sir Winston Churchill]
To God who gave me strength, enthusiasm, and health to be able to complete my project.
To my husband Allan who said, “you can do this”, I would like to thank him for being
there for me. I would like to thank my parents Lucho and Normita and my sister Giovana
for their love and support despite the distance.
I would like to thank Dr. Nikrouz Faroughi for his guidance during this project, his
knowledge, time and constant feedback. I would also like to thank Dr. William Mitchell,
who was kind enough to be my second reader.
I would like to thank these special people: Cara for her always-sincere advice, Tom Pratt
for his kindness, dedication, patience and time and Sandhya for being helpful, my
manager Jay Rossi and my co-workers for their support. I truly believe their help has had
a significant and positive impact on my project.
vii
TABLE OF CONTENTS
Page
Acknowledgments............................................................................................................. vii
List of Tables ...................................................................................................................... x
List of Figures .................................................................................................................... xi
Chapter
INTRODUCTION .............................................................................................................. 1
1.1
Shared Memory Multiprocessor Architecture ........................................................... 2
1.2
Message Passing Multiprocessor Architecture .......................................................... 3
1.3
Project Overview ....................................................................................................... 3
MESSAGE PASSING SYSTEM MODELING ................................................................. 5
2.1
Simics Overview ........................................................................................................ 5
2.2
Message Passing Interface ......................................................................................... 6
2.3
2.4
2.2.1
MPICH2 ..................................................................................................... 10
2.2.2
Open MPI ................................................................................................... 11
MPI Overview .......................................................................................................... 12
2.3.1
Beowulf Cluster and MPI Cluster .............................................................. 12
2.3.2
MPI Network Simulation ........................................................................... 12
Simulation of Matrix Multiplication ........................................................................ 13
viii
SIMULATION RESULTS AND PERFORMANCE ANALYSIS .................................. 15
3.1
Simulation Parameters ............................................................................................. 15
3.2
Data Analysis ........................................................................................................... 16
CONCLUSION ................................................................................................................. 25
Appendix A. Simics Script to Run an 16-node MPI Network Simulation ....................... 26
Appendix B. MPI Program for Matrix Multiplication ...................................................... 28
Appendix C. Simics Script to Add L1 and L2 Cache Memories ...................................... 33
Appendix D. Python Script to Collect Simulation Data ................................................... 42
Appendix E. SSH-Agent script ......................................................................................... 43
Appendix F. User Guide ................................................................................................... 45
Appendix G. Simulation Data ........................................................................................... 67
Bibliography ..................................................................................................................... 76
ix
LIST OF TABLES
Tables
Page
1. Table 1 MPI_Init and MPI_Finalize Functions .......................................................7
2. Table 2 MPI_Comm Functions ................................................................................8
3. Table 3 MPI Send and Receive Functions ...............................................................9
4. Table 4 MPI Broadcast Function .............................................................................9
5. Table 5 Configuration Information ........................................................................15
6. Table 6 Processing Time and Network Traffic Data Collected .............................74
7. Table 7 Processing Time, Total Bytes and Number of Packets Ratios .................75
8. Table 8 Time before the start of 1st slave ..............................................................75
x
LIST OF FIGURES
Figures
Page
1. Figure 1 Shared memory multiprocessor interconnected via bus ............................2
2. Figure 2 Scalable Shared Memory Multiprocessor .................................................3
3. Figure 3 Processing Time per node .......................................................................17
4. Figure 4 Time before the start of 1st slave ............................................................18
5. Figure 5 Total Bytes per node................................................................................19
6. Figure 6 Number of Packets per node....................................................................21
7. Figure 7 Processing Time Ratio .............................................................................22
8. Figure 8 Bytes Ratio ..............................................................................................23
9. Figure 9 Number of Packets Ratio .........................................................................24
xi
1
Chapter 1
INTRODUCTION
A parallel computer is a “collection of processing elements that
communicate and cooperate to solve large problems fast”
[Almasi and Gollieb, Highly Parallel Computing, 1989]
Parallel Computing is the main approach to process massive data and to solve complex
problems. Parallel computing is used in a wide range of applications including galaxy
formation, weather forecasting, quantum physics, climate research, manufacturing
processes, chemical reactions and planetary movements.
Parallel processing means to divide a workload into subtasks and complete the subtasks
concurrently. In order to achieve that, communication between processing elements is
required.
Parallel Programming Models such as, Shared Address Space (SAS) and Message
Passing (MP) will define how a set of parallel processes communicate, share information
and coordinate their activities [1].
2
1.1
Shared Memory Multiprocessor Architecture
In this case multiple processes access a shared memory space using standard load and
store instructions. Each thread/process accesses a portion of the shared data address
space. The threads communicate with each other by reading and writing shared variables.
Synchronization functions are used to prevent a thread from updating the same-shared
variable at the same time or for the threads to coordinate their activities.
A shared memory system is implemented using a bus or interconnection network to
interconnect the processors. Figure 1 illustrates a bus based multiprocessor system called
UMA (Uniform Memory Access) because all memory accesses have the same latency. A
NUMA (Non-uniform Memory Access) multiprocessor, on the other hand, is designed by
distributing the shared memory space among the different processors as illustrated in
Figure 2. The processors are interconnected using an interconnection network, making
the architecture scalable.
Figure 1 Shared memory multiprocessor interconnected via bus [1]
3
Figure 2 Scalable Shared Memory Multiprocessor [1]
1.2
Message Passing Multiprocessor Architecture
In a message passing system, processes communicate by sending and receiving messages
thought the network. To send a message, a processor executes a system call to request an
operating system to send the message to a destination process.
A common message passing system is a cluster network. A message passing architecture
diagram is also similar to that shown for NUMA in Figure 2; except that each processor
can only access its own memory and can send and receive data to and from other
processors.
1.3
Project Overview
Chapter 2 covers tools and concepts to model a message passing system. Chapter 3
describes simulation data collection and analysis, and Chapter 4 is the conclusion, and
future work. Appendix A presents the Simics script to start a 16-node MPI network
simulation. Appendix B includes an MPI program for Matrix Multiplication. Appendix C
4
presents the Simics script to add an L1 and L2 caches to simulated machines. Appendix
D presents the Python script to collect the processing time and network traffic data from
Simics. Appendix E presents the SSH-Agent script. Appendix F contains a step-by-step
User Guide to configure and simulate a message passing system model using Simics.
5
Chapter 2
MESSAGE PASSING SYSTEM MODELING
This chapter presents a description of the Simics simulation environment, MPI, and the
multithreaded message passing matrix multiplication program.
2.1
Simics Overview
Simics is a complete machine simulator that models all the hardware components found
in a typical computer system. It is used by software developers to simulate any target
hardware from a single processor to large and complex systems [2]. Simics facilitates
integration and testing environment for software by providing the same experience as a
real hardware system.
Simics is a user-friendly interface with many tools and options. Among the many
products and features of Simics are the craff utility, SimicsFS, Ethernet networking and
scripting with Python. These are the main Simics functionalities used in this project for
the simulation of a message passing multiprocessor system simulation. Each processor is
modeled as a stand-alone processing node with its own copy of the OS. The craff utility
allows users to create an operating system image from a simulated machine and use it to
simulate multiple identical nodes. This utility saves significant time by setting only one
target machine with all the software and configuration features, which is then replicated
6
to the remaining nodes. SimicsFS allows users to copy files from a host directory to a
simulated node. The Ethernet Networking provides network connectivity for a Simics
platform inside one Simics session. Scripting with Python is very simple and can be used
to access system configuration parameters, invoke command line functions, define hap
events and interface with Simics API functions. The primary use of hap and Simics API
functions is for collecting simulation data.
2.2
Message Passing Interface
Message Passing Interface (MPI) is a standard message library developed to create
practical, portable, efficient, and flexible message passing programs [4]. The MPI
standardization process commenced in 1992. A group of researchers from academia and
industry worked together in a standardization process exploiting the most advantageous
features of the existing message passing systems. The MPI standard consists of two
publications: MPI-1 (1994) and MPI-2 (1996). The MPI-2 is mainly additions and
extensions to MPI-1.
The MPI Standard includes point-to-point communication, collective operations, process
groups, communication contexts, process topologies, and interfaces supported in
FORTRAN, C and C++.
Processes/threads communicate by calling MPI library routines to send and receive
messages to other processes. All programs using MPI require the mpi.h header file to
7
make MPI library calls. The MPI includes over one hundred different functions. The first
MPI function that a MPI-base message passing program must call is MPI_INIT, which
initializes an MPI execution. The last function is MPI_FINALIZE which terminates the
MPI execution. Both functions are called once during a program execution. Table 1
illustrates the declaration and description of MPI_INIT and MPI_FINALIZE functions.
Table 1 MPI_Init and MPI_Finalize Functions
MPI_INIT (int *argc, char *argv[] )
First MPI function called in a program.
Some of the common arguments taken from the command-line are number of
processes, specified hosts or list of hosts, hostfile (text file with hosts specified),
directory of the program,
Initializes MPI variables and forms the MPI_COMM_WORLD communicator
Opens TCP connections
MPI_FINALIZE ()
Terminates MPI execution environment
Called last by all processes
Closes TCP Connections
Cleans up
The two basic concepts to program with MPI are groups and communicators. A group is
an ordered set of processes, where each process has its own rank number. “A
communicator determines the scope and the "communication universe" in which a pointto-point or collective operation is to operate. Each communicator is associated with a
group” [3].
MPI_COMM_WORLD is a communicator defined by MPI referring to all the processes.
Groups and communicators are dynamic objects that may get created and destroyed
during program execution. MPI provides flexibility to create groups and communicators
8
for applications that might require communications among selected subgroup of
processes. MPI_COMM_SIZE and MPI_COMM_RANK are the most commonly used
communication functions in an MPI program. MPI_COMMON_SIZE determines the size
of the group or number of the processes associated with a communicator.
MPI_COMM_RANK determines the rank of the calling process in the communicator.
The Matrix Multiplication MPI program uses the MPI_COMM_WORLD as the
communicator. Table 2 illustrates the declaration and description of MPI_COMM
functions.
Table 2 MPI_Comm Functions
MPI_COMM_SIZE(MPI_Comm comm, int *size)
Determines number of processes within a communicator.
In this study the MPI_Comm argument is MPI_COMM_WORLD.
MPI_COMM_RANK(MPI_Comm comm, int *rank)
Returns the process identifier for the process that invokes it.
Rank is integer between 0 and size-1.
In MPI, point-to-point communication is fundamental for sending and receiving
operations. MPI defines two models of communication blocking and non-blocking. The
non-blocking functions return immediately even if the communication is not finished yet,
while the blocking functions do not return until the communication is finished. Using
non-blocking functions allows computations and calculations to proceed simultaneously.
9
For this study, we use the asynchronous non-blocking MPI_ISend and MPI_Recv
function. Table 3 illustrates the declaration and description of MPI_ISend and MPI_Recv
functions.
Table 3 MPI Send and Receive Functions
MPI_Isend (void *buffer, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm, MPI_Request *request)
Sends a message.
An MPI Nonblocking call, where the computation can proceed immediately allowing
both communications and computations to proceed concurrently.
MPI supports messages with all the basic datatypes.
MPI_Recv (void *buffer, int count, MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Status *status)
Receives a message
The count argument indicates the maximum length of a message.
The tag argument must match between sender and receiver.
Many applications require a communication between two or more processes. MPI
includes collective communication operations that involve the participation of all
processes in a communicator. Broadcast is one of the most common collective operations
that is used in this study. Broadcast is defined as MPI_Bcast and is used for a process,
which is the root, to send a message to all the members of the communicator. Table 4
illustrates the declaration and description of MPI_Bcast function.
Table 4 MPI Broadcast Function
MPI_Bcast (void *buffer, int count, MPI_Datatype datatype, int master,
MPI_Comm comm)
Broadcasts a message from the process with rank "master" to all other processes of the
communicator.
10
The MPI Standard is a set of functions and capabilities that any implementation of the
message-passing library must follow. The two leading open source implementations of
MPI are MPICH2 and Open MPI. Both implementations are available for different
versions of Unix, Linux, Mac OS X and MS Windows.
2.2.1
MPICH2
MPICH2 is a broadly used MPI implementation developed at the Argonne National
Laboratory (ANL) and Mississippi State University (MSU). MPICH2 is a highperformance and widely portable implementation of the Message Passing Interface (MPI)
standard (both MPI-1 and MPI-2). The CH comes from “Chameleon”, the portability
layer used in the original MPICH. The founder of MPICH developed the Chameleon
parallel programming library.
MPICH2 uses an external process manager that spawns and manages parallel jobs. MPD
is the default process manager and is used to manage all MPI processes. This process
manager uses PMI (process management interface) to communicate with MPICH2. MPD
involves starting up an mpd daemon on each of the worker nodes. MPD used to be the
default process manager for MPICH2. Starting with version 1.3 Hydra, a more robust and
reliable process manager, is the default MPICH2 process manager.
11
2.2.2
Open MPI
Open MPI is evolved from the merger of three established MPI implementations:
FT_MPI, LA_MPI and LAM/MPI plus contributions of PACX-MPI. Open MPI is
developed using the best practices among them established MPI implementations. Open
MPI runs using Open Run-Time Environment ORTE. ORTE is open source software
developed to support distributed high-performance applications and transparent
scalability. ORTE starts MPI jobs and provides some status information to the upperlayer Open MPI [5].
The Open MPI project goal is to work with and for the supercomputer community to
support MPI implementation for a large number and variety of systems. A K computer is
a supercomputer produced by Fujitsu and is currently the world’s second fastest
supercomputer. It uses a Tofu-optimized MPI based on Open MPI.
MPICH2 and Open MPI are the most common MPI implementations used by
supercomputers. Open MPI does not require the usage of a process manager, which
makes the installation, configuration and execution a simpler process. Open MPI is the
MPI implementation used in this project.
12
2.3
MPI Overview
2.3.1
Beowulf Cluster and MPI Cluster
Beowulf Project started in 1994 at NASA's Goddard Space Flight Center. A result of this
research was the Beowulf Cluster system, a scalable combination of hardware and
software that provides a sophisticated and robust environment to support a wide range of
applications [6]. The name “Beowulf” comes from the mythical Old-English hero with
extraordinary strength who defeats Grendel, the green dragon. The motivation to develop
a cost-efficient solution makes a Beowulf Cluster attainable to anyone. The three required
components are a collection of stand-alone computers networked together, an open
source operating system such as Linux and a message passing interface or PVM Parallel
Virtual Machine implementation. Based on that, the components selected for this project
are: 4, 8 and 16 Pentium PCs with Fedora 5, TCP/IP network connectivity and Open MPI
implementation.
The Beowulf Cluster is known as “MPI Cluster” when MPI is used for communication
and coordination between the processing nodes.
2.3.2
MPI Network Simulation
One early consideration to make when setting an MPI Network is to determine the use of
a file system. The options are whether to setup a Network File System (NFS) or not. NFS
is a protocol that facilitates access to files in the network, as if they were local. With NFS
13
a folder containing the Open MPI program can be shared in the master node to all the
other slave nodes. NFS can become a bottleneck when the nodes all use the NFS shared
directory. NFS will not be used in this project; instead Open MPI is installed in the local
drive in each node.
A second consideration is setting up a secure SSH protocol in the master node. MPI uses
SSH to communicate among the nodes. Simics Tango targets are loaded with OpenSSH,
a widely used implementation of SSH protocol, configured with password protection.
Because Open MPI will rely on OpenSSH at the execution time, additional commands
will be run to ensure a connection without a password.
The last setting to be performed in this simulation is to define a username with the same
user ID to be created on each node with the same path directory to access common files.
Now that, all the required components have been introduced, the MPI matrix
multiplication program will be described next.
2.4
Simulation of Matrix Multiplication
In this project, Simics scripts are used to configure and simulate 4, 8 or 16-node message
passing system. Each node is configured as a complete single processor system. Each
node also includes a copy of executable matrix multiplication code. A file with all the
nodes hostname must be created.
14
When entering the execution command two arguments are passed: 1) the number of
processes “np” to specify how many processors to use and 2) a file that includes the
names of the processing nodes. However, in this project, the node names are not
referenced in the program explicitly; only the node ID (also called rank) as 0, 1, 2, etc.
are referenced.
One of the nodes is the master node, which coordinates the task of multiplying two
matrices A and B. The master partitions matrix A among the different (slave) nodes and
then broadcasts Matrix B to all the slaves. Each slave node multiplies its portion of
matrix A with matrix B and sends the results to the master, which combines the results to
produce the final product of A and B.
15
Chapter 3
SIMULATION RESULTS AND PERFORMANCE ANALYSIS
As was described in the previous chapter, for modeling a message passing system
simulation, Open MPI was installed and configured on a Simics target machine. This
chapter presents the performance simulation results and analysis of running a message
passing matrix multiplication program.
3.1
Simulation Parameters
The matrix multiplication program is executed in three simulated MP systems with 4, 8
and 16 processing nodes. The nodes are identical in terms of processor type, clock
frequency, and the size of main and cache memories. Table 5 displays the configuration
data of each node. The master and slave nodes are interconnected by Ethernet with the
MTU (Maximum Transmission Unit) set to the default value of 1500B.
The nodes are independent and each includes a copy of the test program. Seven different
matrix sizes were used in the simulation.
Table 5 Configuration Information
Nodes
Cores
MHz
Main
Memory
L1 Data
Cache
L1
Instruction
Cache
L2 Cache
4
1
2000
1 GB
32K
32 K
256 KB
8
16
1
1
2000
2000
1 GB
1 GB
32K
32K
32 K
32 K
256 KB
256 KB
16
3.2
Data Analysis
Figure 3 shows the average processing time per node using 100x100, 200x200, etc.,
matrices. As expected, the average processing time per nodes decreases as number of
nodes increases. Also, as expected, as the matrix size increases the average processing
time per node in each system also increases.
In the 4-node system, the average processing time increases linearly as the matrix size
increases. On the other hand, in the 8-node and 16-node systems, the increases of the
average processing time per node are not linear as matrix size increase. In the 8-node
system, when the matrix size is 800x800 the average processing time climbs from 34.23
to 98.80. In the case of the 16-node system, when the matrix is 1000x1000 the average
processing time per node climbs from 24.59 to 94.48. This jump in the average
processing times is due to the increased delay from the time the program starts running
until the first slave starts multiplying its portion of the matrix as illustrated in Figure 4. In
the 8-node and 16-node systems, the delay to start the first slave node jumps when the
matrix size is 800x800 and 1000x1000, respectively. One can conclude that the
communication delay time increases at a higher rate as the matrix size and number of
processing nodes increase.
Figure 3 Processing Time per node
17
Figure 4 Time before the start of 1st slave
18
Figure 5 shows the average of the total bytes communicated per node; as expected the larger the matrix sizes are the larger the
number of bytes transmitted. This increase is proportional to the number of elements in each matrix. For example, in the 16node system, the number of transmitted bytes for 500x500 matrix is 2,284,426 and for 1000x1000 matrix is 9,097,049, a ratio
of 3.98 equal approximately to the number of elements in each matrix.
19
Figure 5 Total Bytes per node
20
Figure 6 shows the average number of packets per nodes; as expected the larger the
matrix sizes the larger the number of packets incurred during the program execution. In
general as matrix size increases there are more packets per node when there are fewer
nodes. Each node must receive a bigger section of matrix A when there are fewer nodes.
Figure 7 through Figure 9 illustrate the processing time, number of bytes communicated,
number of packets of the 8-node and 16-node systems as compared with those of the 4node system. While the ratios of the number of bytes communicated and number of
packets between 8 vs. 4 and 16 vs. 4 remain the same as matrix size increases, the 16node system has the least processing time per node. However, the ratios of the 8 vs. 4 and
16 vs. 4 processing time per node decrease as matrix size became larger.
Figure 6 Number of Packets per node
21
Figure 7 Processing Time Ratio
22
Figure 8 Bytes Ratio
23
.
Figure 9 Number of Packets Ratio
24
25
Chapter 4
CONCLUSION
This project simulates a message passing multiprocessor system using Simics. Using an
MPI matrix multiplication program, the processing time and network traffic information
were collected to evaluate the performance in three separated systems: 4-node, 8-node
and 16-node. Several iterations of Simics simulations were performed to study the
performance by varying the matrix size. The results indicate that as the matrix size gets
larger and there are more processing nodes, there is a rapid increase in the processing
time per node. However, the average processing time per node is less when there are
more nodes.
This project serves as the base research for future projects. Further studies may include
performance analysis of a different problem. Other studies may include the simulation of
alternative interconnection networks in Simics. For example, this can be done with
multiple Ethernet connections per node to implement a Hypercube interconnection
network.
26
APPENDIX A. Simics Script to Run an 16-node MPI Network Simulation
if not defined create_network {$create_network = "yes"}
if not defined disk_image {$disk_image="tango-openmpi.craff"}
load-module std-components
load-module eth-links
$host_name = "master"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:31"
$ip_address = "10.10.0.13"
$host_name = "slave1"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:32"
$ip_address = "10.10.0.14"
$host_name = "slave2"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:33"
$ip_address = "10.10.0.15"
$host_name = "slave3"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:34"
$ip_address = "10.10.0.16"
$host_name = "slave4"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:35"
$ip_address = "10.10.0.17"
$host_name = "slave5"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:36"
$ip_address = "10.10.0.18"
$host_name = "slave6"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:37"
$ip_address = "10.10.0.19"
$host_name = "slave7"
run-command-file "%script%/tango-common.simics"
27
$mac_address = "10:10:10:10:10:38"
$ip_address = "10.10.0.20"
$host_name = "slave8"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:39"
$ip_address = "10.10.0.21"
$host_name = "slave9"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:40"
$ip_address = "10.10.0.22"
$host_name = "slave10"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:41"
$ip_address = "10.10.0.23"
$host_name = "slave11"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:42"
$ip_address = "10.10.0.24"
$host_name = "slave12"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:43"
$ip_address = "10.10.0.25"
$host_name = "slave13"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:44"
$ip_address = "10.10.0.26"
$host_name = "slave14"
run-command-file "%script%/tango-common.simics"
$mac_address = "10:10:10:10:10:45"
$ip_address = "10.10.0.27"
$host_name = "slave15"
run-command-file "%script%/tango-common.simics"
set-memory-limit 980
28
APPENDIX B. MPI Program for Matrix Multiplication
The Matrix Multiplication MPI program was found on the Internet in the following
website URL: http://www.daniweb.com/software-development/c/code/334470/matrixmultiplication-using-mpi-parallel-programming-approach. A request to Viraj Brian
Wijesuriya, the author of the code, was submitted asking authorization to use his code in
this study. Below are the screenshots of the email requesting and authorizing permission
to use the Matrix Multiplication MPI program.
Email Sent to Request Permission to Use Matrix Multiplication Program using MPI.
Email Received from Viraj Brian Wijesuriya granting authorization to use his program.
29
A Simics MAGIC(n) function has been added to the Matrix Multiplication Program to
insert a breakpoint to invoke a callback function to collect simulation data. MAGIC (1)
and MAGIC(2) are executed by the master node to dump start and end processing time
and to Start and Stop network traffic capture. MAGIC(3) and MAGIC(4) are executed by
each slaves to dump start and end processing time.
/***********************************************************************
* Matrix Multiplication Program using MPI.
* Viraj Brian Wijesuriya - University of Colombo School of Computing, Sri Lanka.
* Works with any type of two matrixes [A], [B] which could be multiplied to produce
* a matrix [c].
* Master process initializes the multiplication operands, distributes the multiplication
* operation to worker processes and reduces the worker results to construct the final
* output.
***********************************************************************/
#include<stdio.h>
#include<mpi.h>
#include <magic-instruction.h>
//part of Simics SW
#define NUM_ROWS_A 12
//rows of input [A]
#define NUM_COLUMNS_A 12
//columns of input [A]
#define NUM_ROWS_B 12
//rows of input [B]
#define NUM_COLUMNS_B 12
//columns of input [B]
#define MASTER_TO_SLAVE_TAG 1
//tag for messages sent from master to slaves
#define SLAVE_TO_MASTER_TAG 4
//tag for messages sent from slaves to master
void makeAB();
void printArray();
//makes the [A] and [B] matrixes
//print the content of output matrix [C];
int rank;
//process rank
int size;
//number of processes
int i, j, k;
//helper variables
double mat_a[NUM_ROWS_A][NUM_COLUMNS_A]; //declare input [A]
double mat_b[NUM_ROWS_B][NUM_COLUMNS_B]; //declare input [B]
double mat_result[NUM_ROWS_A][NUM_COLUMNS_B];//declare output [C]
double start_time;
//hold start time
double end_time;
// hold end time
int low_bound;
//low bound of the number of rows of [A] allocated to a slave
int upper_bound;
//upper bound of the number of rows of [A] allocated to a slave
int portion;
//portion of the number of rows of [A] allocated to a slave
MPI_Status status;
MPI_Request request;
int main(int argc, char *argv[])
// store status of an MPI_Recv
//capture request of an MPI_Isend
30
{
MPI_Init(&argc, &argv);
//initialize MPI operations
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
//get the rank
MPI_Comm_size(MPI_COMM_WORLD, &size);
//get number of processes
/* master initializes work*/
if (rank == 0)
{
MAGIC (1);
makeAB();
start_time = MPI_Wtime();
for (i = 1; i < size; i++) {
//for each slave other than the master
portion = (NUM_ROWS_A / (size - 1));
// calculate portion without master
low_bound = (i - 1) * portion;
if (((i + 1) == size) && ((NUM_ROWS_A % (size - 1)) != 0))
{
//if rows of [A] cannot be equally divided among slaves
upper_bound = NUM_ROWS_A;
//last slave gets all the remaining rows
} else {
//rows of [A] are equally divisable among slaves
upper_bound = low_bound + portion;
}
//send the low bound first without blocking, to the intended slave
MPI_Isend(&low_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG,
MPI_COMM_WORLD, &request);
//next send the upper bound without blocking, to the intended slave
MPI_Isend(&upper_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG + 1,
MPI_COMM_WORLD, &request);
//finally send the allocated row portion of [A] without blocking, to the intended slave
MPI_Isend(&mat_a[low_bound][0], (upper_bound - low_bound) *
NUM_COLUMNS_A, MPI_DOUBLE, i, MASTER_TO_SLAVE_TAG + 2,
MPI_COMM_WORLD, &request);
}
}
//broadcast [B] to all the slaves
MPI_Bcast(&mat_b, NUM_ROWS_B*NUM_COLUMNS_B, MPI_DOUBLE, 0,
MPI_COMM_WORLD);
/* work done by slaves*/
if (rank > 0)
{
MAGIC(3);
//receive low bound from the master
MPI_Recv(&low_bound, 1, MPI_INT, 0, MASTER_TO_SLAVE_TAG,
31
MPI_COMM_WORLD, &status);
//next receive upper bound from the master
MPI_Recv(&upper_bound, 1, MPI_INT, 0, MASTER_TO_SLAVE_TAG + 1,
MPI_COMM_WORLD, &status);
//finally receive row portion of [A] to be processed from the master
MPI_Recv(&mat_a[low_bound][0], (upper_bound - low_bound) *
NUM_COLUMNS_A, MPI_DOUBLE, 0, MASTER_TO_SLAVE_TAG + 2,
MPI_COMM_WORLD, &status);
for (i = low_bound; i < upper_bound; i++)
{
//iterate through a given set of rows of [A]
for (j = 0; j < NUM_COLUMNS_B; j++)
{
//iterate through columns of [B]
for (k = 0; k < NUM_ROWS_B; k++)
{
//iterate through rows of [B]
mat_result[i][j] += (mat_a[i][k] * mat_b[k][j]);
}
}
}
//send back the low bound first without blocking, to the master
MPI_Isend(&low_bound, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG,
MPI_COMM_WORLD, &request);
//send the upper bound next without blocking, to the master
MPI_Isend(&upper_bound, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG + 1,
MPI_COMM_WORLD, &request);
//finally send the processed portion of data without blocking, to the master
MPI_Isend(&mat_result[low_bound][0], (upper_bound - low_bound) *
NUM_COLUMNS_B, MPI_DOUBLE, 0, SLAVE_TO_MASTER_TAG + 2,
MPI_COMM_WORLD, &request);
MAGIC(4);
}
/* master gathers processed work*/
if (rank == 0)
{
for (i = 1; i < size; i++)
{
// untill all slaves have handed back the processed data
//receive low bound from a slave
MPI_Recv(&low_bound, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG,
MPI_COMM_WORLD, &status);
//receive upper bound from a slave
32
MPI_Recv(&upper_bound, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG + 1,
MPI_COMM_WORLD, &status);
//receive processed data from a slave
MPI_Recv(&mat_result[low_bound][0], (upper_bound - low_bound) *
NUM_COLUMNS_B, MPI_DOUBLE, i, SLAVE_TO_MASTER_TAG + 2,
MPI_COMM_WORLD, &status);
}
printArray();
end_time = MPI_Wtime();
printf("\nRunning Time = %f\n\n", end_time - start_time);
}
MPI_Finalize();
MAGIC(2);
return 0; }
//finalize MPI operations
void makeAB()
{
for (i = 0; i < NUM_ROWS_A; i++) {
for (j = 0; j < NUM_COLUMNS_A; j++) {
mat_a[i][j] = i + j; }
} for (i = 0; i < NUM_ROWS_B; i++) {
for (j = 0; j < NUM_COLUMNS_B; j++) {
mat_b[i][j] = i*j;
}
}
}
void printArray()
{
for (i = 0; i < NUM_ROWS_A; i++)
{
printf("\n");
for (j = 0; j < NUM_COLUMNS_B; j++)
printf("%8.2f ", mat_result[i][j]);
}
printf ("Done.\n");
end_time = MPI_Wtime();
printf("\nRunning Time = %f\n\n", end_time - start_time);
}
33
APPENDIX C. Simics Script to Add L1 and L2 Cache Memories
This script adds L1 and L2 cache memory to each simulated machine in a 4-node
network simulation. Each processor has a 32KB write-through L1 data cache, a 32KB L1
instruction cache and a 256KB L2 cache with write-back policy. Instruction and data
accesses are separated out by id-splitters and are sent to the respective caches. The
splitter allows the correctly aligned accesses to go through and splits the incorrectly
aligned ones into two accesses. The transaction staller (trans-staller) simulates main
memory latency [11].
##Add L1 and L2 caches to Master Node
## Transaction staller to represent memory latency. Stall instructions 239 cycles to
simulate memory latency
@master_staller = pre_conf_object("master_staller", "trans-staller", stall_time = 239)
##Latency of (L2 + RAM) in CPU cycles
## Master core
@master_cpu0 = conf.master.motherboard.processor0.core[0][0]
## L2 cache(l2c0) for cpu0: 256KB with write-back
@master_l2c0 = pre_conf_object("master_l2c0", "g-cache")
@master_l2c0.cpus = master_cpu0
@master_l2c0.config_line_number = 4096
@master_l2c0.config_line_size = 64 ##64 blocks. Implies 512 lines
@master_l2c0.config_assoc = 8
@master_l2c0.config_virtual_index = 0
@master_l2c0.config_virtual_tag = 0
@master_l2c0.config_write_back = 1
@master_l2c0.config_write_allocate = 1
@master_l2c0.config_replacement_policy = 'lru'
@master_l2c0.penalty_read =37 ##Stall penalty (in cycles) for any incoming read
transaction
@master_l2c0.penalty_write =37 ##Stall penalty (in cycles) for any incoming write
transaction
@master_l2c0.penalty_read_next =22 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
@master_l2c0.penalty_write_next =22 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@master_l2c0.timing_model = master_staller
##L1- Instruction Cache (ic0) : 32Kb
@master_ic0 = pre_conf_object("master_ic0", "g-cache")
@master_ic0.cpus = master_cpu0
34
@master_ic0.config_line_number = 512
@master_ic0.config_line_size = 64 ##64 blocks. Implies 512 lines
@master_ic0.config_assoc = 8
@master_ic0.config_virtual_index = 0
@master_ic0.config_virtual_tag = 0
@master_ic0.config_write_back = 0
@master_ic0.config_write_allocate = 0
@master_ic0.config_replacement_policy = 'lru'
@master_ic0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read
transaction
@master_ic0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write
transaction
@master_ic0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
@master_ic0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@master_ic0.timing_model = master_l2c0
# L1 - Data Cache (dc0) : 32KB Write Through
@master_dc0 = pre_conf_object("master_dc0", "g-cache")
@master_dc0.cpus = master_cpu0
@master_dc0.config_line_number = 512
@master_dc0.config_line_size = 64 ##64 blocks. Implies 512 lines
@master_dc0.config_assoc = 8
@master_dc0.config_virtual_index = 0
@master_dc0.config_virtual_tag = 0
@master_dc0.config_write_back = 0
@master_dc0.config_write_allocate = 0
@master_dc0.config_replacement_policy = 'lru'
@master_dc0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read
transaction
@master_dc0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write
transaction
@master_dc0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
@master_dc0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@master_dc0.timing_model = master_l2c0
# Transaction splitter for L1 instruction cache for master_cpu0
@master_ts_i0 = pre_conf_object("master_ts_i0", "trans-splitter")
@master_ts_i0.cache = master_ic0
@master_ts_i0.timing_model = master_ic0
@master_ts_i0.next_cache_line_size = 64
35
# transaction splitter for L1 data cache for master_cpu0
@master_ts_d0 = pre_conf_object("master_ts_d0", "trans-splitter")
@master_ts_d0.cache = master_dc0
@master_ts_d0.timing_model = master_dc0
@master_ts_d0.next_cache_line_size = 64
# ID splitter for L1 cache for master_cpu0
@master_id0 = pre_conf_object("master_id0", "id-splitter")
@master_id0.ibranch = master_ts_i0
@master_id0.ibranch = master_ts_d0
#Add Configuration
@SIM_add_configuration([master_staller, master_l2c0, master_ic0, master_dc0,
master_ts_i0, master_ts_d0, master_id0], None);
@master_cpu0.physical_memory.timing_model = conf.master_id0
#End of master
##Add L1 and L2 caches to slave1 Node
## transaction staller to represent memory latency. Stall instructions 239 cycles to
simulate memory latency
@slave1_staller = pre_conf_object("slave1_staller", "trans-staller", stall_time = 239)
##Latency of (L2 + RAM) in CPU cycles
## Slave1 core
@slave1_cpu0 = conf.slave1.motherboard.processor0.core[0][0]
## L2 cache(l2c0) for cpu0: 256KB with write-back
@slave1_l2c0 = pre_conf_object("slave1_l2c0", "g-cache")
@slave1_l2c0.cpus = slave1_cpu0
@slave1_l2c0.config_line_number = 4096
@slave1_l2c0.config_line_size = 64 ##64 blocks. Implies 512 lines
@slave1_l2c0.config_assoc = 8
@slave1_l2c0.config_virtual_index = 0
@slave1_l2c0.config_virtual_tag = 0
@slave1_l2c0.config_write_back = 1
@slave1_l2c0.config_write_allocate = 1
@slave1_l2c0.config_replacement_policy = 'lru'
@slave1_l2c0.penalty_read = 37 ##Stall penalty (in cycles) for any incoming read
transaction
@slave1_l2c0.penalty_write = 37 ##Stall penalty (in cycles) for any incoming write
transaction
36
@slave1_l2c0.penalty_read_next = 22 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
@slave1_l2c0.penalty_write_next = 22 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@slave1_l2c0.timing_model = slave1_staller
##L1- Instruction Cache (ic0) : 32Kb
@slave1_ic0 = pre_conf_object("slave1_ic0", "g-cache")
@slave1_ic0.cpus = slave1_cpu0
@slave1_ic0.config_line_number = 512
@slave1_ic0.config_line_size = 64 ##64 blocks. Implies 512 lines
@slave1_ic0.config_assoc = 8
@slave1_ic0.config_virtual_index = 0
@slave1_ic0.config_virtual_tag = 0
@slave1_ic0.config_write_back = 0
@slave1_ic0.config_write_allocate = 0
@slave1_ic0.config_replacement_policy = 'lru'
@slave1_ic0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read
transaction
@slave1_ic0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write
transaction
@slave1_ic0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
@slave1_ic0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@slave1_ic0.timing_model = slave1_l2c0
# L1 - Data Cache (dc0) : 32KB Write Through
@slave1_dc0 = pre_conf_object("slave1_dc0", "g-cache")
@slave1_dc0.cpus = slave1_cpu0
@slave1_dc0.config_line_number = 512
@slave1_dc0.config_line_size = 64 ##64 blocks. Implies 512 lines
@slave1_dc0.config_assoc = 8
@slave1_dc0.config_virtual_index = 0
@slave1_dc0.config_virtual_tag = 0
@slave1_dc0.config_write_back = 0
@slave1_dc0.config_write_allocate = 0
@slave1_dc0.config_replacement_policy = 'lru'
@slave1_dc0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read
transaction
@slave1_dc0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write
transaction
@slave1_dc0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
37
@slave1_dc0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@slave1_dc0.timing_model = slave1_l2c0
# Transaction splitter for L1 instruction cache for slave1_cpu0
@slave1_ts_i0 = pre_conf_object("slave1_ts_i0", "trans-splitter")
@slave1_ts_i0.cache = slave1_ic0
@slave1_ts_i0.timing_model = slave1_ic0
@slave1_ts_i0.next_cache_line_size = 64
# transaction splitter for L1 data cache for slave1_cpu0
@slave1_ts_d0 = pre_conf_object("slave1_ts_d0", "trans-splitter")
@slave1_ts_d0.cache = slave1_dc0
@slave1_ts_d0.timing_model = slave1_dc0
@slave1_ts_d0.next_cache_line_size = 64
# ID splitter for L1 cache for slave1_cpu0
@slave1_id0 = pre_conf_object("slave1_id0", "id-splitter")
@slave1_id0.ibranch = slave1_ts_i0
@slave1_id0.ibranch = slave1_ts_d0
#Add Configuration
@SIM_add_configuration([slave1_staller, slave1_l2c0, slave1_ic0, slave1_dc0,
slave1_ts_i0, slave1_ts_d0, slave1_id0], None);
@slave1_cpu0.physical_memory.timing_model = conf.slave1_id0
#End of slave1
##Add L1 and L2 caches to slave2 Node
## Transaction staller to represent memory latency. Stall instructions 239 cycles to
simulate memory latency
@slave2_staller = pre_conf_object("slave2_staller", "trans-staller", stall_time = 239)
##Latency of (L2 + RAM) in CPU cycles
## Slave2 core
@slave2_cpu0 = conf.slave2.motherboard.processor0.core[0][0]
## L2 cache(l2c0) for cpu0: 256KB with write-back
@slave2_l2c0 = pre_conf_object("slave2_l2c0", "g-cache")
@slave2_l2c0.cpus = slave2_cpu0
@slave2_l2c0.config_line_number = 4096
@slave2_l2c0.config_line_size = 64 ##64 blocks. Implies 512 lines
@slave2_l2c0.config_assoc = 8
@slave2_l2c0.config_virtual_index = 0
38
@slave2_l2c0.config_virtual_tag = 0
@slave2_l2c0.config_write_back = 1
@slave2_l2c0.config_write_allocate = 1
@slave2_l2c0.config_replacement_policy = 'lru'
@slave2_l2c0.penalty_read = 37 ##Stall penalty (in cycles) for any incoming read
transaction
@slave2_l2c0.penalty_write = 37 ##Stall penalty (in cycles) for any incoming write
transaction
@slave2_l2c0.penalty_read_next = 22 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
@slave2_l2c0.penalty_write_next = 22 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@slave2_l2c0.timing_model = slave2_staller
##L1- Instruction Cache (ic0) : 32Kb
@slave2_ic0 = pre_conf_object("slave2_ic0", "g-cache")
@slave2_ic0.cpus = slave2_cpu0
@slave2_ic0.config_line_number = 512
@slave2_ic0.config_line_size = 64 ##64 blocks. Implies 512 lines
@slave2_ic0.config_assoc = 8
@slave2_ic0.config_virtual_index = 0
@slave2_ic0.config_virtual_tag = 0
@slave2_ic0.config_write_back = 0
@slave2_ic0.config_write_allocate = 0
@slave2_ic0.config_replacement_policy = 'lru'
@slave2_ic0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read
transaction
@slave2_ic0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write
transaction
@slave2_ic0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
@slave2_ic0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@slave2_ic0.timing_model = slave2_l2c0
# L1 - Data Cache (dc0) : 32KB Write Through
@slave2_dc0 = pre_conf_object("slave2_dc0", "g-cache")
@slave2_dc0.cpus = slave2_cpu0
@slave2_dc0.config_line_number = 512
@slave2_dc0.config_line_size = 64 ##64 blocks. Implies 512 lines
@slave2_dc0.config_assoc = 8
@slave2_dc0.config_virtual_index = 0
@slave2_dc0.config_virtual_tag = 0
@slave2_dc0.config_write_back = 0
39
@slave2_dc0.config_write_allocate = 0
@slave2_dc0.config_replacement_policy = 'lru'
@slave2_dc0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read
transaction
@slave2_dc0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write
transaction
@slave2_dc0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
@slave2_dc0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@slave2_dc0.timing_model = slave2_l2c0
# Transaction splitter for L1 instruction cache for slave2_cpu0
@slave2_ts_i0 = pre_conf_object("slave2_ts_i0", "trans-splitter")
@slave2_ts_i0.cache = slave2_ic0
@slave2_ts_i0.timing_model = slave2_ic0
@slave2_ts_i0.next_cache_line_size = 64
# transaction splitter for L1 data cache for slave2_cpu0
@slave2_ts_d0 = pre_conf_object("slave2_ts_d0", "trans-splitter")
@slave2_ts_d0.cache = slave2_dc0
@slave2_ts_d0.timing_model = slave2_dc0
@slave2_ts_d0.next_cache_line_size = 64
# ID splitter for L1 cache for slave2_cpu0
@slave2_id0 = pre_conf_object("slave2_id0", "id-splitter")
@slave2_id0.ibranch = slave2_ts_i0
@slave2_id0.ibranch = slave2_ts_d0
#Add Configuration
@SIM_add_configuration([slave2_staller, slave2_l2c0, slave2_ic0, slave2_dc0,
slave2_ts_i0, slave2_ts_d0, slave2_id0], None);
@slave2_cpu0.physical_memory.timing_model = conf.slave2_id0
#End of slave2
##Add L1 and L2 caches to slave3 Node
## Transaction staller to represent memory latency. Stall instructions 239 cycles to
simulate memory latency
@slave3_staller = pre_conf_object("slave3_staller", "trans-staller", stall_time = 239)
##Latency of (L2 + RAM) in CPU cycles
## Slave3 core
@slave3_cpu0 = conf.slave3.motherboard.processor0.core[0][0]
40
## L2 cache(l2c0) for cpu0: 256KB with write-back
@slave3_l2c0 = pre_conf_object("slave3_l2c0", "g-cache")
@slave3_l2c0.cpus = slave3_cpu0
@slave3_l2c0.config_line_number = 4096
@slave3_l2c0.config_line_size = 64 ##64 blocks. Implies 512 lines
@slave3_l2c0.config_assoc = 8
@slave3_l2c0.config_virtual_index = 0
@slave3_l2c0.config_virtual_tag = 0
@slave3_l2c0.config_write_back = 1
@slave3_l2c0.config_write_allocate = 1
@slave3_l2c0.config_replacement_policy = 'lru'
@slave3_l2c0.penalty_read = 37 ##Stall penalty (in cycles) for any incoming read
transaction
@slave3_l2c0.penalty_write = 37 ##Stall penalty (in cycles) for any incoming write
transaction
@slave3_l2c0.penalty_read_next = 22 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
@slave3_l2c0.penalty_write_next = 22 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@slave3_l2c0.timing_model = slave3_staller
##L1- Instruction Cache (ic0) : 32Kb
@slave3_ic0 = pre_conf_object("slave3_ic0", "g-cache")
@slave3_ic0.cpus = slave3_cpu0
@slave3_ic0.config_line_number = 512
@slave3_ic0.config_line_size = 64 ##64 blocks. Implies 512 lines
@slave3_ic0.config_assoc = 8
@slave3_ic0.config_virtual_index = 0
@slave3_ic0.config_virtual_tag = 0
@slave3_ic0.config_write_back = 0
@slave3_ic0.config_write_allocate = 0
@slave3_ic0.config_replacement_policy = 'lru'
@slave3_ic0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read
transaction
@slave3_ic0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write
transaction
@slave3_ic0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
@slave3_ic0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@slave3_ic0.timing_model = slave3_l2c0
41
# L1 - Data Cache (dc0) : 32KB Write Through
@slave3_dc0 = pre_conf_object("slave3_dc0", "g-cache")
@slave3_dc0.cpus = slave3_cpu0
@slave3_dc0.config_line_number = 512
@slave3_dc0.config_line_size = 64 ##64 blocks. Implies 512 lines
@slave3_dc0.config_assoc = 8
@slave3_dc0.config_virtual_index = 0
@slave3_dc0.config_virtual_tag = 0
@slave3_dc0.config_write_back = 0
@slave3_dc0.config_write_allocate = 0
@slave3_dc0.config_replacement_policy = 'lru'
@slave3_dc0.penalty_read = 12 ##Stall penalty (in cycles) for any incoming read
transaction
@slave3_dc0.penalty_write = 12 ##Stall penalty (in cycles) for any incoming write
transaction
@slave3_dc0.penalty_read_next = 9 ##Stall penalty (in cycles) for a read transaction
issued by the cache to the next level cache. Rounding error, value should be 7.
@slave3_dc0.penalty_write_next = 9 ##Stall penalty (in cycles) for a write transactions
issued by the cache to the next level cache. Rounding error, value should be 7
@slave3_dc0.timing_model = slave3_l2c0
# Transaction splitter for L1 instruction cache for slave3_cpu0
@slave3_ts_i0 = pre_conf_object("slave3_ts_i0", "trans-splitter")
@slave3_ts_i0.cache = slave3_ic0
@slave3_ts_i0.timing_model = slave3_ic0
@slave3_ts_i0.next_cache_line_size = 64
# transaction splitter for L1 data cache for slave3_cpu0
@slave3_ts_d0 = pre_conf_object("slave3_ts_d0", "trans-splitter")
@slave3_ts_d0.cache = slave3_dc0
@slave3_ts_d0.timing_model = slave3_dc0
@slave3_ts_d0.next_cache_line_size = 64
# ID splitter for L1 cache for slave3_cpu0
@slave3_id0 = pre_conf_object("slave3_id0", "id-splitter")
@slave3_id0.ibranch = slave3_ts_i0
@slave3_id0.ibranch = slave3_ts_d0
#Add Configuration
@SIM_add_configuration([slave3_staller, slave3_l2c0, slave3_ic0, slave3_dc0,
slave3_ts_i0, slave3_ts_d0, slave3_id0], None);
@slave3_cpu0.physical_memory.timing_model = conf.slave3_id0
#End of slave3
42
APPENDIX D. Python Script to Collect Simulation Data
This script defines a hap function, which is called by the magic instruction included in the
matrix multiplication program. This script uses Simics API to get the CPU time and run
the command to start and stop capturing the network traffic.
Python script to collect processors and network traffic statistics (matrix_100.py)
from cli import *
from simics import *
def hap_callback(user_arg, cpu, arg):
if arg == 1:
print "cpu name: ", cpu.name
print "Start at= ", SIM_time(cpu)
SIM_run_alone(run_command, "ethernet_switch0.pcap-dump
matrix_100.txt")
if arg == 2:
print "cpu name: ", cpu.name
print "Start at= ", SIM_time(cpu)
if arg == 3:
print "cpu name: ", cpu.name
print "End at= ", SIM_time(cpu)
if arg == 4:
print "cpu name: ", cpu.name
print "End at= ", SIM_time(cpu)
SIM_run_alone(run_command, "ethernet_switch0.pcap-dump-stop")
SIM_hap_add_callback("Core_Magic_Instruction", hap_callback, None)
APPENDIX E. SSH-Agent script
The ssh-agent script was obtained from the following URL:
http://www.cygwin.com/ml/cygwin/2001-06/msg00537.html. This script is added to
the Linux shell startup file of the MPI user.
A request was submitted to Joseph Reagle, the author of the code, asking authorization to use his script to automate ssh-agent at login time. Below are the screenshots of the emails requesting and granting permission to use the ssh-agent script.
Email Sent to Request Permission to ssh-agent script.
Email Received from Joseph Reagle granting authorization to use his script.
The .bash_profile file contains the ssh-agent script, which is executed at login time.
In addition, the .bash_profile and .bashrc include the lines that add the Open MPI libraries and executables to the user's path.
.bash_profile File
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
# User specific environment and startup programs
SSH_ENV="$HOME/.ssh/environment"
function start_agent {
    echo "Initialising new SSH agent..."
    /usr/bin/ssh-agent | sed 's/^echo/#echo/' > "${SSH_ENV}"
    echo succeeded
    chmod 600 "${SSH_ENV}"
    . "${SSH_ENV}" > /dev/null
    /usr/bin/ssh-add;
}
# Source SSH settings, if applicable
if [ -f "${SSH_ENV}" ]; then
    . "${SSH_ENV}" > /dev/null
    # ps ${SSH_AGENT_PID} doesn't work under cygwin
    ps -ef | grep ${SSH_AGENT_PID} | grep ssh-agent$ > /dev/null || {
        start_agent;
    }
else
    start_agent;
fi
PATH=$PATH:$HOME/bin
export PATH=/home/mpiu/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/mpiu/openmpi/lib:$LD_LIBRARY_PATH
export PATH
.bashrc File
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
# User specific aliases and functions
export PATH=/home/mpiu/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/mpiu/openmpi/lib:$LD_LIBRARY_PATH
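The replaced startup files take effect the next time the mpiu user logs in (Section IV, step 15 of the user guide in Appendix F describes how they are copied onto the target). Assuming a standard bash setup, they can also be applied to the current shell without logging out:
mpiu@tango$ source ~/.bash_profile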
APPENDIX F. User Guide
MP simulation system using Simics User’s Guide
This user guide describes the steps to install Open MPI on a simulated Simics target
machine. The resulting installation is saved as a craff file and used to open several
simulated machines connected through a network inside one Simics session. The
guide has been prepared to avoid repeated steps by automating configuration and
settings that are used more than once.
Table of Contents
I. INSTALL SIMICS
II. SIMICS SUPPORT
III. NETWORK SIMULATION IN SIMICS
IV. REQUIRED COMPONENTS AND PREREQUISITES
V. OPEN MPI INSTALLATION AND CONFIGURATION
VI. CREATING A NEW CRAFF FILE
VII. STARTING MPI NETWORK SIMULATION WITH SIMICS SCRIPTS
VIII. RUNNING MPI PROGRAMS
I. INSTALL SIMICS
1. Download Simics files
Go to https://www.simics.net/pub/simics/4.6_wzl263/
You can go back to this link after your first installation to check for new versions and repeat the steps below. A newer version of Simics will be installed in a new, separate directory inside the Simics directory. You will need to update the Simics icon to access the newer version.
Download the following packages based on your operating system:
• Simics Base: simics-pkg-1000
  This is the base product package that contains Simics Hindsight, Simics Accelerator,
  Simics Ethernet Networking, Simics Analyzer, and other functionality. The other
  packages are optional add-on products.
• x86-440BX Target: simics-pkg-2018
  This package allows users to model various PC systems based on the Intel 440BX
  AGPset.
• Firststeps: simics-pkg-4005
  Installation of this package is recommended if you are just starting with Simics. It
  allows users to model various systems based on the Freescale MPC8641. Most of the
  examples in the Simics Hindsight and Simics Networking documentation, and other
  configurations, refer to the virtual systems from this package.
2. Simics Package Installation
Run the packages in the following order and enter their required “Decryption key”
1. simics-pkg-1000
2. simics-pkg-2018
3. simics-pkg-4005
II. SIMICS SUPPORT
• For documentation, use the Help option of the Simics Control window
• For support about Simics 4.0 (and later versions):
  https://www.simics.net/mwf/board_show.pl?bid=401
• For support related to licensing and installing licenses:
  https://www.simics.net/mwf/board_show.pl?bid=400
III. NETWORK SIMULATION IN SIMICS
Simics provides a variety of pre-built virtual platform models. One of the target machines available to academic users is Tango. Tango is one of the models included in the x86-440bx target group, which models various PCs with x86 or AMD64 processors based on the Intel 440BX AGPset [8]. The default Tango configuration is the Fedora 5 operating system, a single 2000 MHz Pentium 4 processor, 256 MB of memory, a 19 GB IDE disk, an IDE CD-ROM, and a DEC21143 Fast Ethernet controller with a direct interface to the PCI bus.
A single simulated Tango machine is used to install an MPI implementation and is configured to run MPI programs. We create a craff file from this first simulated machine, containing all the software, programs, and settings, and use it with a Simics script to open all the nodes of the simulated network. The disk image is therefore exactly the same for all nodes. The Simics script also indicates whether to open 4, 8, or 16 nodes by specifying their respective parameters, such as the MAC address, IP address, and hostname of each individual node.
Simics allows the interconnection of several simulated computers using Ethernet links
inside the same session. The Simics eth-links module provides an Ethernet switch
component. The Ethernet switch works at the frame level and functions like a switch
by learning what MAC addresses the different computers have. The create_network
parameter must be set to "yes" to allow the creation of an Ethernet link and to connect the primary adapter of each node to it. We can use a Simics script to set this parameter.
See Appendix A.
The following commands are used in the Simics script to create the network switch and the respective connections:
if not defined create_network {$create_network = "yes"}
load-module eth-links
Simics allows running network services using a service node. The Simics std-components module provides a service-node component. The service node provides a virtual network node that acts as a server for a number of TCP/IP-based protocols and as an IP router between simulated networks. One service node is used, which is connected to the Ethernet switch. The following command is used in the Simics script in Appendix A to load the std-components module.
load-module std-components
As described above, Simics provides all the required components to simulate an
entire virtual network. For more detailed information about the simulation components, the Configuration Browser and Object Browser tools can be used. These tools are accessible via the Tools menu of the Simics Control window.
IV. REQUIRED COMPONENTS AND PREREQUISITES
To successfully complete the steps in this guide, basic knowledge of Linux command-line usage, network protocols, Simics commands, MPI, and clustering concepts is recommended. By carefully following the step-by-step instructions in this guide you should not run into any problems in achieving a successful MPI cluster installation and setup.
In this documentation, interaction with Simics in the Command Line window and with the simulated target is presented in Consolas font. User input is presented in bold font. Comments are presented in italic font.
For this user guide you need the Simics Command Line. On your system, start Simics and select Tools > Command Line Window to open it.
1. Enabling Video Text Consoles
A Simics limitation is that a Windows host cannot run more than one graphic
console at a time. To run multiple machines in Windows we need to switch from
graphic console to text-console. Video-text consoles are enabled with the
text_console variable.
simics> $text_console = "yes"
2. Setting the Real-Time Clock
The pre-built simulated machines provided by Simics have the date set at 2008. To
update the date and time of the real-time clock at boot, we set the $rtc_time variable.
If we skip this step we will get an error during Open MPI installation specifying that
the configuration files created during the Open MPI configuration are older than the
binary files.
simics> $rtc_time = "2012-10-27 00:10:00 UTC"
3. Running a machine configuration script
All the Simics machine configuration scripts are located in the “targets” folder inside
the workspace. Simics can load a Simics script with the run-command-file command.
The following command is used to start a Tango machine.
simics> run-command-file targets/x86-440bx/tango-common.simics
4. Starting the simulated machine
Now you can start the simulation by clicking Run Forward, or by entering the
command “continue” or just “c” at the prompt.
simics> c
Figure F.1 shows the commands entered prior to starting a simulated machine.
Figure F.1 Required Simics Commands
5. Log in on the Target Machine
The target OS will be loaded on the target machine. It will take a few minutes, then
you will be presented with a login prompt.
tango login: root
Password:
simics
6. Enabling Real Time Mode
In some cases, simulated time may run faster than real time. This can happen after the OS is loaded and the machine is idle; if you then attempt to type the password, the machine may time out too quickly. To avoid having the virtual time progress too quickly, you can activate the real-time mode feature.
simics> enable-real-time-mode
The enable-real-time-mode command will prevent the virtual time from progressing
faster than the real time. Once you have your environment the way you want it, you
can turn off real-time-mode with disable-real-time-mode.
7. Creating MPI user
To run MPI programs, each machine must have a user with the same name and the
same home folder on all the machines. In this user guide, “mpiu” is the user name
given to the user.
Type the following commands to create a new user with a password. I chose to enter
“simics” as mpiu user password. UNIX will give you a warning prompt if your
password is weak, but you can ignore the message.
root@tango# useradd mpiu
root@tango# passwd mpiu
New UNIX password: simics    // you will be required to enter the password twice
Figure F.2 shows the commands entered to create the MPI user and set up its password.
Figure F.2 Creating the mpiu user
8. Getting the required files into the Simulated Machine
In order to install Open MPI and run MPI programs we need to transfer the Open
MPI installation file and all the source code required from the host machine to the
target machine.
When a target machine mounts the host machine, the mounting is done at the root
level. A good recommendation is to arrange the files to be mounted in the host
machine. A very accessible location to place the Open MPI binaries is the c:\ folder.
Also, it is recommended that all the source code be placed inside a folder.
Figure F.3 displays a screenshot of Windows Explorer showing the file arrangement
in the c:\ folder. Notice that a “programs” folder contains all the code to be mounted
in the target machine.
Figure F.3 Organization of the Open MPI binaries and source codes in Windows host machine
9. Mounting the host machine
SimicsFS allows users to access the file system of the host computer from the
simulated machine. SimicsFS is already installed on the simulated machines
distributed with Simics.
To be able to run the mount command the user must have administrative privileges, so logging in with the root account is required.
root@tango# mount /host
10. Log-in on the target machine with user “mpiu”
To avoid permission problems later when running Open MPI commands or accessing files, it is recommended to copy all the needed files and to perform the Open MPI installation after logging in as the MPI user (e.g., mpiu).
root@tango# su - mpiu
11. Creating new directories
It is recommended that two working directories be created: 1) “openmpi” directory
where Open MPI will be installed and 2) “programs” directory where working files
will be placed.
mpiu@tango$ mkdir openmpi
mpiu@tango$ mkdir programs
12. Copying files on MPI user’s home directory
We copy the Open MPI tar file directly to the mpiu user's home directory. We also copy the contents of the host machine's programs folder to the simulated machine's programs directory.
mpiu@tango$ cp /host/openmpi-1.2.9.tar /home/mpiu
mpiu@tango$ cp /host/programs/* /home/mpiu/programs/
13. Unmounting the host machine file system
We need to log in as the root user in order to unmount the host machine from the simulated target machine. Once logged in as root, we can enter the umount command.
mpiu@tango$ su - root
root@tango# umount /host
14. Setting up SSH for communication between nodes
MPI uses the Secure Shell (SSH) network protocol to send and receive data between machines. You must log in with the mpiu account to configure SSH.
root@tango# su - mpiu
A personal private/public key pair is generated using the ssh-keygen command. When prompted for the file in which to save the SSH key, press Enter to use the default location, and then enter your own passphrase. In this user guide we use "simics" as the passphrase. The "-t" option specifies the type of key to create; RSA keys are recommended, as they are considered more secure than DSA keys.
mpiu@tango$ ssh-keygen -t rsa
<takes few minutes>
Enter file in which to save the key (/home/mpiu/.ssh/id_rsa) <enter>
Enter passphrase (empty for no passphrase): simics
Enter same passphrase again: simics
Next we copy the key generated by the ssh-keygen command to the authorized_keys file inside the .ssh directory.
mpiu@tango$ cd .ssh
mpiu@tango$ cat id_rsa.pub >> authorized_keys
mpiu@tango$ cd ..
We also need to correct the file permissions to allow the user to connect remotely to
the other nodes
mpiu@tango$ chmod 700 ~/.ssh
mpiu@tango$ chmod 644 ~/.ssh/authorized_keys
Figure F.4 shows a screenshot of the simulated machine running all the commands to
configure SSH.
Figure F.4 Setting SSH in the simulated machine
The first time SSH is used to connect to a target machine the host authentication is
required. Because the simulated network consists of several nodes, performing the
host authentication on each node could be time consuming.
To avoid this step the SSH configuration file must be modified to set
StrictHostKeyChecking to no. In order to change this configuration we must log in as the root user.
mpiu@tango$ su - root
Password:
simics
Then we need to edit the SSH configuration file. Locate the StrictHostKeyChecking option, uncomment it, and set it to no. To edit the ssh_config file, use the command below.
root@tango# vi /etc/ssh/ssh_config
Figure F.5 shows a screenshot of the ssh_config file being edited to set
StrictHostKeyChecking to no
Figure F.5 Editing SSH Configuration File
15. Setting ssh-agent to run upon login
Because Open MPI will use SSH to connect to each of the machines and run MPI
programs, we need to ensure that the passphrase doesn’t have to be entered for each
connection. The ssh-agent program allows us to type the passphrase once, and after
that all the following SSH invocations will be automatically authenticated.
Appendix E presents a modified .bash_profile that includes the script to run ssh-agent automatically when you log in as the MPI user. It is recommended to copy and paste the files from Appendix E into separate files in the "programs" folder on the host machine. These startup files will be mounted into the target machine. Then you need to replace the original startup files ".bash_profile" and ".bashrc" with their respective modified files. Figure F.3 shows the startup files placed inside the programs folder on the host machine.
You can use the following commands to replace the startup files.
mpiu@tango$ cd programs
mpiu@tango$ cp .bash_profile /home/mpiu/.bash_profile
mpiu@tango$ cp .bashrc /home/mpiu/.bashrc
mpiu@tango$ cd ..
V. OPEN MPI INSTALLATION AND CONFIGURATION
Open MPI installation files can be downloaded directly from the Open MPI site at URL:
http://www.open-mpi.org/software/ompi/v1.6/. SimicsFS will be required to copy the
Open MPI download file from the host to the Tango target machine. SimicsFS is
already available in the tango craff file and it will allow you to mount the host into
the target machine and copy files from the host to target.
The OS version on the Tango target machine is Fedora 5, which is several years old, and the GNU Compiler Collection version is 2.96. The recommendation from the Open MPI forum is to upgrade the Linux OS to something more up to date rather than just upgrade the GCC version. Installing a new version of GCC may be more complicated because it can open an endless "can of worms" of library dependencies that are hard to resolve; on the other hand, upgrading the Linux version would require extra work in Simics to update the target OS image. Consequently, the approach taken to install Open MPI on the Tango target is to start with the latest Open MPI version (1.6.2) and work backwards through the release series to see which versions work. The installation attempts with v1.6.2 and v1.4.5 failed; however, version 1.2.9 of Open MPI installed successfully.
The installation process is similar to any package installation in Linux: download, extract, configure, and install. The main difference is the amount of time spent; configuration and installation take about three hours in Simics.
The final step needed to run an Open MPI program is to add the Open MPI executables and libraries to the MPI user's shell startup files. This step is very important because Open MPI must be able to find its executables in the MPI user's PATH on every node.
In this user guide, the host where the MPI program is invoked is called "master" and the rest of the nodes are identified as "slaves".
1. Installing Open MPI
The Open MPI installation consists of three steps: 1) unpack the tar file, 2) run the
provided configure script and 3) run the “make all install” command.
Enter the tar command to decompress the Open MPI file.
root@tango$ su - mpiu
mpiu@tango$ tar xf openmpi-1.2.9.tar
The tar command creates a directory with the same name as the tar file, into which all the installation files are decompressed. We need to change into the new directory in order to configure the Open MPI installation.
The configure script supports different command-line options. The "--prefix" option tells the Open MPI installer where to install the Open MPI library. In this user guide we install the Open MPI libraries under the "openmpi" directory that we created in Section IV, step 11.
mpiu@tango$ cd openmpi-1.2.9
mpiu@tango$ ./configure --prefix=/home/mpiu/openmpi
<...lots of output...> takes about 35 minutes
Figure F.6 is a screenshot of entering the tar and configure commands
Figure F.6 Decompressing and Configuring the Open MPI installation files
The last step to install the Open MPI libraries is to run the "make all install" command.
This step collects all the required executables and scripts in the bin subdirectory of
the directory specified by the prefix option in the configure command.
mpiu@tango$ make all install
<...lots of output...>
takes about 1hr 50 minutes
2. Adding Open MPI to user’s PATH
Open MPI requires that its executables be in the MPI user's PATH on every node on which you run an MPI program. Because we installed Open MPI with the prefix /home/mpiu/openmpi in step 1, the following lines should be in the mpiu user's PATH and LD_LIBRARY_PATH settings.
export PATH=/home/mpiu/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/mpiu/openmpi/lib:$LD_LIBRARY_PATH
You can use the vi editor as shown below to add the above two lines to both startup files.
mpiu@tango$ vi .bash_profile
mpiu@tango$ vi .bashrc
3. Moving the magic-instruction.h header file
The magic-instruction.h header file is found in the simics/src/include installation directory. This file was already placed in the "programs" folder that was copied into the target in Section IV, step 13. Magic instructions have to be compiled into the MPI program binaries, so this file must be moved into the openmpi/include directory, where the compiler can find it.
mpiu@tango$ cd programs
mpiu@tango$ mv magic-instruction.h /home/mpiu/openmpi/include/
4. Testing the Open MPI Installation
In order to run any MPI command, the binaries and libraries must be in the mpiu user's PATH, as set up in Section V, step 2. We need to log out and log in again for the new PATH to take effect. Figure F.7 shows the commands entered to verify that the installation finished successfully.
mpiu@tango$ su - mpiu
Password: simics
mpiu@tango$ which mpicc
mpiu@tango$ which mpirun
Figure F.7 Commands to Test Open MPI installation
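As an additional, optional check, the ompi_info utility installed together with Open MPI can be used to confirm the installed version and build configuration:
mpiu@tango$ ompi_info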
5. Compiling the MPI programs
In this step we compile all the MPI programs so that they are ready to execute when the 4-, 8-, or 16-node machines are created. It is recommended to have a copy of the program for each matrix size to avoid making changes on each target machine later. Figure F.8 shows the matrix multiplication MPI programs, one for each matrix size.
Figure F.8 MPI programs
The command used to compile is mpicc. Figure F.9 shows the mpicc command being entered to compile an MPI program.
mpiu@tango$ mpicc /home/mpiu/programs/matrix_100.c -o matrix1_100
Figure F.9 Compiling MPI program
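The same command is repeated for each matrix size. Assuming the source files follow the matrix_<size>.c naming shown in Figure F.8, the 500x500 version used later in this guide would, for example, be built with:
mpiu@tango$ mpicc /home/mpiu/programs/matrix_500.c -o matrix1_500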
Figure F.10 shows all the contents of the home directory of the mpiu user, the hidden
shell profile files, the executable MPI programs, the openmpi and programs
directories and the secure shell directory created after configuring SSH.
Figure F.10 Content of the mpiu user’s home directory
VI. CREATING A NEW CRAFF FILE
In Simics, images are read-only. This means that modifications in a target are not written to the image file. A way to save the changes to an image is to shut down the target machine and then use the save-persistent-state command. This will create a craff file with all the changes, which we will use to create a new craff file.
Most of the OS images provided by Simics are files in the “craff” format. The craff
utility is used to convert files to and from craff format, and to merge multiple craff
files into a single craff file. In this project, we are going to use the craff utility to
merge the original simulated target machine OS with the persistent state file that
contains the Open MPI installation and the necessary configuration to run MPI
programs.
The merged output file is used as the new OS image for all the nodes in the simulated
MPI Cluster Network. By using a new craff file we are only required to install and
configure one simulated machine, instead of repeating the entire configuration steps
in each individual node.
1. Saving a Persistent State with Open MPI installed
The first step to create a craff file is to shut down the target properly, using the appropriate commands for the target OS. By shutting down the system, all target changes are flushed to the simulated disk. Simics will stop after the target system is powered off. At this point the save-persistent-state command is used to save the state of the machine with the Open MPI installation and the settings previously performed.
In order to shut down the target machine, log in as the root user, because the mpiu user has not been granted administrative privileges. Figure F.11 shows the shutdown command.
mpiu@tango$ su - root
Password: simics
root@tango# shutdown -h 1
Figure F.11 Shutting down the target machine
“Save-persistent-state” command will dump the entire disk image of the target
machine to the host disk. You can run this command from the Simics Command Line window or from the Simics Control window.
simics> save-persistent-state <file name>
In the Simics Control window, go to File, select Save Persistent State, give the file a name, and exit Simics.
2. Using the craff utility to create a craff file
The craff utility is found inside the bin folder of each Simics workspace. You will need the craff program file, the original target machine craff file (downloaded from the Simics website), and a copy of the saved disk image from the target machine you shut down in the previous step.
Figure F.12 shows the required files being placed in the same directory prior to
running the craff utility.
Figure F.12 Files needed to create a new CRAFF file
On your Windows host, open a Windows command prompt, go to the folder where you placed the three files mentioned above, and execute the following command:
c:\path_to_your_directory> craff -o <new-file-name.craff> tango1-fedora5.craff tango.disk.hd_image.craff
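For example, to produce the tango-openmpi.craff image used in the remainder of this guide, the command would be:
c:\path_to_your_directory> craff -o tango-openmpi.craff tango1-fedora5.craff tango.disk.hd_image.craff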
Figure F.13 shows the craff utility command and its completion
Figure F.13 Running the CRAFF Utility
Once the new craff file is created, move it into the images folder inside the Simics installation path. Figure F.14 shows the new "tango-openmpi.craff" file inside the images folder.
Figure F.14 The new craff file inside the images folder
3. Using the new craff file
You can directly access this new craff file by entering in Simics Command Line the
following commands:
simics> $disk_image="tango-openmpi.craff"
simics> run-command-file targets/x86-440bx/tango-common.simics
simics> c
If you need to make modifications to this new craff file and create a second craff file
from this one, you can repeat the steps of this section.
VII. STARTING MPI NETWORK SIMULATION WITH SIMICS SCRIPTS
Simics provides scripting capabilities by using the Python language. All Simics
commands are implemented as Python functions. Target machines are configured using Python scripts.
In Simics there are two ways of scripting. One way is to write scripts that contain Simics commands, similar to typing commands at the command-line interface. The other way is to write scripts in the Python language. These two types of scripting can be combined, because Python instructions can be invoked from the command-line interface, and command-line instructions can be issued from Python.
All target machine setup scripts are located in the read-only Simics installation folder;
these scripts should not be modified. However, Simics allows users to add new
components and modify configuration parameters in scripts placed inside the “targets”
folder of the user workspace.
Appendix A contains a new machine script used for this project. This script changes
configuration settings and uses the new disk image we created in Section VI of this User
Guide. This section explains the configuration parameters used in that script and covers
how to run this script.
1. Using a new disk image
To use the new disk image that contains Open MPI installation and settings, we use
the command below indicating the name of the new craff file.
$disk_image="tango-openmpi.craff"
2. Changing the simulated machine parameters
Before running the simulation, assign to each target machine in the simulated network its respective hostname, MAC address, and IP address.
The following parameters are used to specify the individual settings.
$host_name   = "master"
$mac_address = "10:10:10:10:10:31"
$ip_address  = "10.10.0.13"
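Each node in the network needs its own values. As an illustration only (the actual addresses for all nodes are defined in the Appendix A script), a first slave node could be configured as:
$host_name   = "slave1"
$mac_address = "10:10:10:10:10:32"
$ip_address  = "10.10.0.14"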
3. Changing other configuration parameters
This step is optional. You can change default configuration values such as the memory size, clock frequency, or disk size. This can be done at a later point, but you will save time by doing it now if you decide to change them. You can enter the following parameters to accomplish these changes.
simics> $memory_megs = 1024
simics> $freq_mhz = 2800
The amount of memory is expressed in MB and the clock frequency in MHz.
4. Running the Simics simulated machine script
The tango-common.simics script defines a complete simulated machine. The common script calls three different scripts: 1) the "system.include" script to define the hardware, 2) the "setup.include" script to define the software configuration, and 3) the "eth-link.include" script to define the network settings.
run-command-file "%script%/tango-common.simics"
5. Setting the Memory Limit
Simics can run out of host memory if very big images are used, or if the software running on the simulated system is bigger than the host memory. To prevent these kinds of problems, Simics implements a global image memory limitation controlled by the set-memory-limit command [7].
Simics sets a default memory limit at startup based on the amount of memory and
number of processors of the host machine. Entered without an argument, the set-memory-limit command shows the current limit on your host machine. Figure F.15 shows the set-memory-limit command input and its output.
simics> set-memory-limit
Figure F.15 Set-memory-limit command
To prevent the simulation from crashing, it is recommended that you check the amount of memory available and always set a memory-limit value lower than the default. In the Simics script from Appendix A, the memory limit is set to 980 MB. You should change this parameter based on your host memory. The amount of memory is specified as follows.
set-memory-limit 980
6. Starting up a 16-node MPI Cluster Network in Simics
To start a 16-node networked simulation you can copy and paste the script from Appendix A. Otherwise, you can modify the script and change the number of nodes as desired.
It is recommended that the Simics script be placed inside the “targets” folder of the
workspace being used.
7. Log in into each node
In Section VI, step 1, the simulated machine was shut down in order to create the craff file. When we now start the simulation, the OS is booted on each node. After the operating system has loaded, log in to each simulated machine as root first, and then switch to the mpiu user on each node.
login: root
Password: <simics>
root@master# su - mpiu
A way to establish SSH connections without re-entering the passphrase is to use ssh-agent, a program that remembers the passphrase while you are logged in as a specific user. To ensure that SSH does not ask for the passphrase when running the MPI programs, it is suggested that ssh-agent be used while logged in as mpiu. The following commands need to be entered each time you log in as the mpiu user.
mpiu@master$ eval `ssh-agent`
mpiu@master$ ssh-add ~/.ssh/id_rsa
The best way to start ssh-agent is to add the above commands to the mpiu user's .bash_profile. In this way, all programs started in the mpiu user's login shell will see
the environment variables, and be able to locate ssh-agent and query it for keys as
needed.
Appendix E contains the .bash_profile file including the ssh-agent script. Instructions to load this file into the target machine were given earlier (Section V, step 2). When the su - mpiu command is entered, the ssh-agent code in the mpiu user's startup file runs, and you are asked for the passphrase set up in Section IV, step 14. Figure F.16 shows ssh-agent running after logging in as the mpiu user.
Figure F.16 Logging into the simulated machine as mpiu user
8. Saving a Checkpoint
Instead of booting the nodes each time we run the script and repeating the steps in this section, we can use the checkpointing feature.
A checkpoint contains the entire state of the system. Simics can load the checkpoint and resume at the exact place where the simulation was stopped when the checkpoint was saved.
To save a checkpoint, stop the simulation, click on the Save Checkpoint icon, and give it a name. A checkpoint directory will be created containing multiple configuration files.
9. Adding L1 and L2 Cache Memory to Simulated target machines
The suggested way to add memory caches to Simics simulated machines is to use a checkpoint of a fully booted and configured machine; in addition, booting a simulation with the caches already added takes a significant amount of time.
Appendix C includes a Simics script to add an L1 and an L2 memory cache to the
simulated target machines. The script simulates a system with separate instruction and data caches at level 1, backed by a level 2 cache, with a memory latency of 239 cycles. The values for the cache memory size, cache line size, number of blocks, and read and write penalty cycles have been taken from the "Performance Analysis of a Hardware Queue in Simics" project prepared by Mukta Siddharth Jain in Summer 2012 [17].
To add the memory caches, open the checkpoint saved in the previous step. Before starting the simulation, click on File, select Append from Script, browse to the Simics script, and select it. Then start the simulation. You can verify that the memory caches have been added by looking at the Object Browser tool or by running the following commands for each simulated target machine.
simics> master_l2c0.status
simics> master_l2c0.statistics
simics> master_dc0.statistics
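The same checks can be run for the other nodes by substituting the node prefix used in the Appendix C script, for example:
simics> slave3_l2c0.statistics
simics> slave3_dc0.statistics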
VIII. RUNNING MPI PROGRAMS
1. Create a host file
To let Open MPI know on which machines to run MPI programs, a file with the machine names (a hostfile) must be created. You can use the vi editor to create the file and add the hostnames.
mpiu@master$ vi nodes
Figure F.17 shows the “cat nodes” command to display the content of the hostfile
Figure F.17 The hostfile indicates which hosts will run MPI programs
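As an illustration, for the 4-node network the hostfile simply lists the hostnames assigned in Section VII, one per line:
master
slave1
slave2
slave3
For the 8- and 16-node networks, the remaining slave hostnames are added in the same way.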
2. Collecting Simics Statistics
To collect data, run a Python file prior to starting the program execution. From the File menu in the Simics Control window, click on "Run Python File" to run a Python script in the current session.
Appendix D contains a Python script that uses the Simics API to define a callback that is triggered by Core_Magic_Instruction haps; it collects the CPU processing time and starts and stops the network traffic capture. Four magic instruction calls have been added to the matrix multiplication code to mark where the master and slave nodes start and finish their program execution tasks.
Load the Python script before starting the execution of the MPI program to be able to capture the CPU time and network traffic.
The CPU processing time will be displayed on the Simics command line. See Figure F.18.
Figure F.18 Simics output data
3. Running MPI programs
On the master node, type the following command to run an MPI program. Figure F.19 shows the mpirun command being entered.
mpiu@master$ mpirun -np 16 -hostfile nodes matrix1_500
Figure F.19 Running an MPI program with 16 processes
To assign processes to the nodes in a round-robin fashion until the processes are exhausted, the "--bynode" option can be added to the mpirun command. See Figure F.20.
mpiu@master$ mpirun -np 16 -hostfile nodes --bynode matrix1_500
Figure F.20 Running an MPI program using –bynode option
APPENDIX G. Simulation Data
4cpu MPI Network Simulation Data
Matrix: 100x100
Node     MPI_Start     End Computation  MPI_End       Computation Time  MPI_Time
Master   722.365702    -                766.083543    -                 43.71784104
Slave3   746.419807    746.475810       767.082269    0.05600260        20.66246190
Slave1   751.385091    751.440992       767.081550    0.05590115        15.69645875
Slave2   752.385222    752.441118       767.081832    0.05589621        14.69661042

Matrix: 200x200
Node     MPI_Start     End Computation  MPI_End       Computation Time  MPI_Time
Master   2378.286509   -                2474.091269   -                 95.80475925
Slave3   2414.397635   2418.888566      2475.090210   4.49093051        60.69257428
Slave2   2415.370481   2427.850499      2475.090070   12.48001824       59.71958884
Slave1   2416.370238   2428.861047      2475.089694   12.49080936       58.71945600

Matrix: 400x400
Node     MPI_Start     End Computation  MPI_End       Computation Time  MPI_Time
Master   1489.546897   -                1686.406596   -                 196.85969970
Slave1   1590.701789   1616.960821      1687.405984   26.25903226       96.70419566
Slave2   1595.709734   1617.970315      1687.405984   22.26058160       91.69625050
Slave3   1598.709066   1620.963406      1687.406260   22.25433983       88.69719363

Matrix: 500x500
Node     MPI_Start     End Computation  MPI_End       Computation Time  MPI_Time
Master   862.173132    -                1116.978143   -                 254.80501086
Slave1   1005.441582   1033.636250      1117.974979   28.19466734       112.53339692
Slave2   1030.452400   1058.638103      1117.975812   28.18570287       87.52341195
Slave3   1039.453641   1067.639563      1117.975760   28.18592278       78.52211974

Matrix: 600x600
Node     MPI_Start     End Computation  MPI_End       Computation Time  MPI_Time
Master   1533.081170   -                1972.819810   -                 439.73863987
Slave1   1792.468321   1870.902267      1973.818846   78.43394612       181.35052495
Slave2   1797.479235   1837.878533      1973.819214   40.39929791       176.33997826
Slave3   1800.480287   1840.880464      1973.819494   40.40017697       173.33920792

Matrix: 800x800
Node     MPI_Start     End Computation  MPI_End       Computation Time  MPI_Time
Master   1778.569250   -                2273.258913   -                 494.68966245
Slave1   2062.089247   2130.028067      2274.258198   67.93881903       212.16895092
Slave2   2065.096370   2133.378945      2274.258586   68.28257501       209.16221590
Slave3   2068.091930   2138.029637      2274.258911   69.93770745       206.16698079

Matrix: 1000x1000
Node     MPI_Start     End Computation  MPI_End       Computation Time  MPI_Time
Master   3125.391112   -                3792.837398   -                 667.44628658
Slave1   3489.433002   3590.165339      3793.835218   100.73233659      304.40221575
Slave2   3492.447349   3609.212790      3793.836341   116.76544179      301.38899256
Slave3   3495.442630   3610.223270      3793.836196   114.78063929      298.39356539
8cpu MPI Network Simulation Data
Matrix: 100x100
Node     MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master   1229.67042716   -                1280.35195825   -                 50.68153109
Slave7   1258.68981003   1258.71646696    1281.35212409   0.02665693        22.66231406
Slave2   1260.77170613   1260.79805126    1281.35024746   0.02634513        20.57854133
Slave1   1261.69382450   1261.72032408    1281.34990063   0.02649958        19.65607613
Slave4   1267.77225009   1267.79865797    1281.35081107   0.02640788        13.57856098
Slave6   1267.80813873   1267.83451904    1281.35130992   0.02638031        13.54317119
Slave3   1268.69427626   1268.72075278    1281.35048816   0.02647652        12.65621190
Slave5   1268.73028337   1268.75665251    1281.35133556   0.02636914        12.62105219

Matrix: 200x200
Node     MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master   1626.29713568   -                1700.53169460   -                 74.23455892
Slave1   1672.36681399   1672.59335311    1701.52885618   0.22653912        29.16204219
Slave2   1671.36703746   1671.59338341    1701.52931159   0.22634595        30.16227413
Slave3   1683.35151857   1683.57789223    1701.52967234   0.22637366        18.17815377
Slave4   1682.35178777   1682.58220647    1701.53018634   0.23041870        19.17839857
Slave5   1683.38737713   1683.61851696    1701.53007522   0.23113983        18.14269809
Slave6   1682.38766433   1682.61424883    1701.53104728   0.22658450        19.14338295
Slave7   1677.34618447   1677.57204794    1701.53139983   0.22586347        24.18521536

Matrix: 400x400
Node     MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master   871.91066166    -                1070.10285269   -                 198.19219103
Slave7   937.03865279    942.88445734     1071.10191833   5.84580455        134.06326554
Slave2   939.05092996    956.89924832     1071.10004529   17.84831836       132.04911533
Slave5   939.05759105    954.89942531     1071.10119812   15.84183426       132.04360707
Slave1   940.05103014    957.89987632     1071.09969739   17.84884617       131.04866725
Slave6   940.05757316    955.89884380     1071.10154341   15.84127064       131.04397026
Slave3   942.04219551    957.88374382     1071.10039223   15.84154831       129.05819672
Slave4   943.04201990    958.88396380     1071.10072383   15.84194390       128.05870393

Matrix: 500x500
Node     MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master   1719.73229724   -                1952.07326800   -                 232.34097076
Slave7   1794.95598571   1804.55938086    1953.07216773   9.60339515        158.11618202
Slave2   1796.96934522   1822.58502584    1953.07030933   25.61568062       156.10096411
Slave6   1797.95127388   1817.55571333    1953.07217221   19.60443945       155.12089833
Slave1   1797.95470917   1823.56646722    1953.06980916   25.61175805       155.11509999
Slave3   1799.94924857   1819.55218409    1953.07077363   19.60293552       153.12152506
Slave5   1800.93787157   1820.54165492    1953.07117095   19.60378335       152.13329938
Slave4   1800.94680752   1820.55046424    1953.07106586   19.60365672       152.12425834

Matrix: 600x600
Node     MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master   3123.34844119   -                3397.21104070   -                 273.86259951
Slave7   3208.59624951   3221.07851834    3398.20994055   12.48226883       189.61369104
Slave2   3212.58802805   3247.07913131    3398.20807244   34.49110326       185.62004439
Slave1   3213.58839260   3248.07189130    3398.20761698   34.48349870       184.61922438
Slave6   3213.58925662   3238.06224079    3398.20956417   24.47298417       184.62030755
Slave4   3213.59907751   3238.06807530    3398.20875652   24.46899779       184.60967901
Slave5   3214.58960817   3239.06457682    3398.20908689   24.47496865       183.61947872
Slave3   3214.59065264   3239.07085680    3398.20841173   24.48020416       183.61775909

Matrix: 800x800
Node     MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master   284.25757219    -                1074.64012311   -                 790.38255092
Slave1   801.93260784    876.65523592     1075.63734757   74.72262808       273.70473973
Slave2   826.94300260    863.63877207     1075.63742559   36.69576947       248.69442299
Slave3   845.95492160    882.81892302     1075.63799198   36.86400143       229.68307039
Slave4   854.96371434    891.66248462     1075.63783111   36.69877028       220.67411677
Slave5   863.95408786    900.73314571     1075.63858965   36.77905786       211.68450179
Slave6   870.96506228    907.87970827     1075.64687475   36.91464599       204.68181247
Slave7   873.96771140    910.78251641     1075.64746584   36.81480501       201.67975444

Matrix: 1000x1000
Node     MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master   1522.91702460   -                2700.00905902   -                 1177.09203442
Slave1   2336.94109024   2442.54765154    2701.00549542   105.60656130      364.06440518
Slave2   2359.94818475   2417.52660733    2701.00568689   57.57842258       341.05750214
Slave3   2394.96030872   2454.51962727    2701.00570714   59.55931855       306.04539842
Slave4   2411.97553305   2469.53793908    2701.00671076   57.56240603       289.03117771
Slave5   2430.98142645   2488.53526851    2701.00719029   57.55384206       270.02576384
Slave6   2439.98709711   2497.55376958    2701.00731751   57.56667247       261.02022040
Slave7   2444.98853204   2504.54846994    2701.00798103   59.55993790       256.01944899
16cpu MPI Network Simulation Data
Matrix: 100x100
Node      MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master    1993.58129168   -                2059.32665899   -                 65.74536731
Slave2    2026.72543563   2026.75168710    2060.32033448   0.02625147        33.59489885
Slave1    2027.64725516   2027.67361931    2060.31981172   0.02636415        32.67255656
Slave15   2031.64490296   2031.67160120    2060.32516928   0.02669824        28.68026632
Slave4    2033.72933284   2033.75552625    2060.32068622   0.02619341        26.59135338
Slave3    2034.65115793   2034.67754021    2060.32099322   0.02638228        25.66983529
Slave6    2039.73346672   2039.75995439    2060.32167049   0.02648767        20.58820377
Slave5    2040.65528163   2040.68164647    2060.32140158   0.02636484        19.66611995
Slave8    2040.72675336   2040.75321101    2060.32231414   0.02645765        19.59556078
Slave12   2040.76566321   2040.79192126    2060.32399194   0.02625805        19.55832873
Slave7    2041.64871663   2041.67496412    2060.32193627   0.02624749        18.67321964
Slave11   2041.68769368   2041.71407727    2060.32372970   0.02638359        18.63603602
Slave10   2046.73361315   2046.75989281    2060.32315891   0.02627966        13.58954576
Slave14   2046.77024893   2046.79652894    2060.32465165   0.02628001        13.55440272
Slave9    2047.65574569   2047.68213582    2060.32320862   0.02639013        12.66746293
Slave13   2047.69219107   2047.71860572    2060.32487018   0.02641465        12.63267911

Matrix: 200x200
Node      MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master    3268.08385629   -                3353.22010746   -                 85.13625117
Slave2    3313.19804672   3313.32585749    3354.21475174   0.12781077        41.01670502
Slave1    3314.19161181   3314.32037508    3354.21387719   0.12876327        40.02226538
Slave4    3324.18439300   3324.31313181    3354.21556876   0.12873881        30.03117576
Slave3    3325.18392085   3325.31685032    3354.21542403   0.13292947        29.03150318
Slave15   3326.17686595   3326.31164634    3354.21910366   0.13478039        28.04223771
Slave6    3326.20413662   3326.33584576    3354.21580929   0.13170914        28.01167267
Slave5    3327.20410713   3327.33245527    3354.21547729   0.12834814        27.01137016
Slave8    3331.18546993   3331.31426741    3354.21681386   0.12879748        23.03134393
Slave12   3331.22084630   3331.34848416    3354.21902169   0.12763786        22.99817539
Slave7    3332.18526357   3332.31902701    3354.21654440   0.13376344        22.03128083
Slave11   3332.22040924   3332.34901736    3354.21783930   0.12860812        21.99743006
Slave10   3337.19866067   3337.33340938    3354.21760452   0.13474871        17.01894385
Slave14   3337.22516527   3337.35897701    3354.26858100   0.13381174        17.04341573
Slave9    3338.19849702   3338.33143329    3354.21750009   0.13293627        16.01900307
Slave13   3338.22499663   3338.35316491    3354.21911963   0.12816828        15.99412300

Matrix: 400x400
Node      MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master    1669.96760274   -                1906.29251744   -                 236.32491470
Slave1    1713.10472271   1722.24884089    1907.26392218   9.14411818        194.15919947
Slave2    1722.12646982   1733.27656146    1907.26421054   11.15009164       185.13774072
Slave3    1731.13064021   1742.27521609    1907.26436907   11.14457588       176.13372886
Slave4    1740.13596937   1751.28567462    1907.26516969   11.14970525       167.12920032
Slave5    1749.14096623   1760.28568895    1907.26528472   11.14472272       158.12431849
Slave6    1758.14673722   1769.29819731    1907.26584324   11.15146009       149.11910602
Slave7    1765.26355876   1776.40793429    1907.26660398   11.14437553       142.00304522
Slave8    1776.15687086   1787.30687274    1907.26657200   11.15000188       131.10970114
Slave9    1785.15863479   1796.30358620    1907.26674351   11.14495141       122.10810872
Slave10   1794.16359585   1805.31369708    1907.26708432   11.15010123       113.10348847
Slave11   1803.16815393   1814.31288272    1907.26768403   11.14472879       104.09953010
Slave12   1812.17588832   1832.32146277    1907.26792267   20.14557445       95.09203435
Slave13   1821.17720943   1832.32146277    1907.26893542   11.14425334       86.09172599
Slave14   1830.17268992   1841.32786388    1907.26916532   11.15517396       77.09647540
Slave15   1831.18145679   1832.33624252    1906.30553260   1.15478573        75.12407581

Matrix: 500x500
Node      MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master    2231.67023532   -                2492.36249418   -                 260.69225886
Slave1    2284.88848109   2296.74258867    2493.35673763   11.85410758       208.46825654
Slave2    2291.90391551   2303.75805757    2493.35749369   11.85414206       201.45357818
Slave3    2300.90452053   2312.75872769    2493.35783404   11.85420716       192.45331351
Slave4    2309.91079733   2321.76539874    2493.35824820   11.85460141       183.44745087
Slave5    2314.93637724   2326.79081288    2493.35880853   11.85443564       178.42243129
Slave6    2325.92303026   2337.77682365    2493.35871283   11.85379339       167.43568257
Slave7    2332.91889714   2344.77278215    2493.35896358   11.85388501       160.44006644
Slave8    2343.92107338   2355.77498362    2493.35997552   11.85391024       149.43890214
Slave9    2350.93480421   2362.78990935    2493.35987774   11.85510514       142.42507353
Slave10   2359.93753845   2371.79206804    2493.36033007   11.85452959       133.42279162
Slave11   2368.94149554   2380.79551390    2493.36087274   11.85401836       124.41937720
Slave12   2377.94327473   2389.79703215    2493.36110547   11.85375742       115.41783074
Slave13   2384.94717535   2396.80154269    2493.36172079   11.85436734       108.41454544
Slave14   2393.93566879   2405.79032685    2493.36199705   11.85465806       99.42632826
Slave15   2394.93668118   2408.78922192    2493.36249887   13.85254074       98.42581769

Matrix: 600x600
Node      MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master    2321.48705911   -                2630.98981395   -                 309.50275484
Slave1    2382.80632360   2391.69300389    2631.98380440   8.88668029        249.17748080
Slave2    2395.78427512   2412.66287500    2631.98439996   16.87859988       236.20012484
Slave3    2406.79969996   2423.67547568    2631.98491515   16.87577572       225.18521519
Slave4    2417.79990693   2434.67689186    2631.98507321   16.87698493       214.18516628
Slave5    2428.80836619   2445.69091754    2631.98570722   16.88255135       203.17734103
Slave6    2439.81930768   2456.69751829    2631.98621068   16.87821061       192.16690300
Slave7    2452.81946747   2469.69696428    2631.98633674   16.87749681       179.16686927
Slave8    2461.82359410   2478.70001509    2631.98647162   16.87642099       170.16287752
Slave9    2470.86287964   2487.73904893    2631.98670718   16.87616929       161.12382754
Slave10   2485.84165891   2502.72455471    2631.98752783   16.88289580       146.14586892
Slave11   2496.84633005   2513.72216210    2631.98784259   16.87583205       135.14151254
Slave12   2507.85812801   2524.73543172    2631.98810105   16.87730371       124.12997304
Slave13   2518.94034520   2535.82257599    2631.98868509   16.88223079       113.04833989
Slave14   2529.91874286   2546.79605027    2631.98892925   16.87730741       102.07018639
Slave15   2530.91914360   2547.80207608    2631.98930952   16.88293248       101.07016592

Matrix: 800x800
Node      MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master    1764.57617370   -                2158.02067293   -                 393.44449923
Slave1    1854.07358432   1882.14500514    2159.01369455   28.07142082       304.94011023
Slave2    1865.07911709   1888.44694720    2159.01447136   23.36783011       293.93535427
Slave3    1876.08806460   1899.79853059    2159.01431210   23.71046599       282.92624750
Slave4    1887.09279447   1911.56699445    2159.01480510   24.47419998       271.92201063
Slave5    1898.09497187   1921.49781456    2159.01532997   23.40284269       260.92035810
Slave6    1909.09998482   1932.94678633    2159.01581722   23.84680151       249.91583240
Slave7    1918.11403623   1942.21379390    2159.01617790   24.09975767       240.90214167
Slave8    1931.12276182   1956.58445661    2159.01685689   25.46169479       227.89409507
Slave9    1938.15990048   1961.65359396    2159.01701940   23.49369348       220.85711892
Slave10   1953.13520228   1977.63628562    2159.01731192   24.50108334       205.88210964
Slave11   1962.11486616   1985.99358173    2159.01804465   23.87871557       196.90317849
Slave12   1975.15379495   1998.67850110    2159.01833244   23.52470615       183.86453749
Slave13   1982.18493073   2005.85221011    2159.01836916   23.66727938       176.83343843
Slave14   1997.15278888   2021.03744474    2159.01890707   23.88465586       161.86611819
Slave15   1998.15525624   2021.66745458    2159.01944119   23.51219834       160.86418495

Matrix: 1000x1000
Node      MPI_Start       End Computation  MPI_End         Computation Time  MPI_Time
Master    3275.68051529   -                4787.31447475   -                 1511.63395946
Slave1    4395.80169101   4440.97843196    4788.30805880   45.17674095       392.50636779
Slave2    4410.80356001   4445.97720510    4788.30900741   35.17364509       377.50544740
Slave3    4421.80934211   4456.97700815    4788.30921830   35.16766604       366.49987619
Slave4    4445.97720510   4485.99339561    4788.30985496   40.01619051       342.33264986
Slave5    4463.82758659   4499.00344678    4788.31030996   35.17586019       324.48272337
Slave6    4472.82831643   4508.00208222    4788.31071203   35.17376579       315.48239560
Slave7    4487.83763063   4523.00864127    4788.31111848   35.17101064       300.47348785
Slave8    4504.84396442   4540.01671372    4788.31134976   35.17274930       283.46738534
Slave9    4521.84643371   4557.01450955    4788.31160182   35.16807584       266.46516811
Slave10   4536.85516959   4572.02277033    4788.31197106   35.16760074       251.45680147
Slave11   4543.85991717   4579.05846781    4788.31261940   35.19855064       244.45270223
Slave12   4552.86620094   4588.03586814    4788.31291667   35.16966720       235.44671573
Slave13   4567.87296207   4603.04761829    4788.31361662   35.17465622       220.44065455
Slave14   4584.87956604   4620.04697404    4788.31353722   35.16740800       203.43397118
Slave15   4591.88133015   4627.06450868    4788.31351373   35.18317853       196.43218358
Table 6 Processing Time and Network Traffic Data Collected
Processing time:
Matrix     Nodes  cpu_time (sec)  Processing time per node (sec)  Avg. computation time (sec)
100x100    4      43.72           10.93                           0.06
100x100    8      50.68           6.34                            0.03
100x100    16     65.75           4.11                            0.03
200x200    4      95.80           23.95                           9.82
200x200    8      74.23           9.28                            0.23
200x200    16     85.14           5.32                            0.13
400x400    4      196.86          49.21                           23.59
400x400    8      198.19          24.77                           14.99
400x400    16     234.35          14.65                           11.01
500x500    4      254.81          63.70                           28.19
500x500    8      232.34          29.04                           19.89
500x500    16     260.69          16.29                           10.95
600x600    4      439.74          109.93                          53.08
600x600    8      273.86          34.23                           25.62
600x600    16     309.50          19.34                           16.35
800x800    4      494.69          123.67                          68.72
800x800    8      790.38          98.80                           42.21
800x800    16     393.44          24.59                           24.19
1000x1000  4      667.45          166.86                          110.76
1000x1000  8      1,177.09        147.14                          25.62
1000x1000  16     1,511.63        94.48                           24.15

Network traffic:
Matrix     Nodes  Total bytes   Bytes per node  No. of packets  Packets per node  Time first/last packet (sec)  Avg. packets/sec  Avg. packet size  Avg. bytes/sec  Avg. Mbit/sec
100x100    4      453,918       113,480         641             160               42.68                         15                708               10,636          0.09
100x100    8      834,269       104,284         1,256           157               48.64                         26                664               17,152          0.14
100x100    16     1,713,104     107,069         2,910           182               64.70                         45                589               26,478          0.21
200x200    4      1,756,748     439,187         1,860           465               94.75                         20                944               18,541          0.15
200x200    8      3,209,844     401,231         3,492           437               74.18                         47                919               43,270          0.35
200x200    16     6,139,355     383,710         6,974           436               84.50                         83                880               72,657          0.58
400x400    4      6,800,694     1,700,174       5,487           1,372             195.79                        28                1,239             34,735          0.28
400x400    8      12,306,152    1,538,269       10,534          1,317             197.12                        53                1,168             62,431          0.50
400x400    16     23,753,325    1,484,583       20,661          1,291             233.23                        89                1,150             101,846         0.82
500x500    4      10,597,448    2,649,362       8,118           2,030             253.64                        32                1,305             41,781          0.33
500x500    8      19,196,017    2,399,502       15,493          1,937             230.18                        67                1,239             83,396          0.67
500x500    16     36,550,814    2,284,426       30,445          1,903             259.53                        117               1,201             140,837         1.13
600x600    4      15,219,443    3,804,861       11,457          2,864             438.55                        26                1,328             34,704          0.28
600x600    8      27,651,090    3,456,386       21,911          2,739             272.71                        80                1,262             101,394         0.81
600x600    16     52,020,906    3,251,307       42,209          2,638             308.28                        137               1,232             168,745         1.35
800x800    4      27,045,266    6,761,317       19,897          4,974             493.43                        40                1,359             54,811          0.44
800x800    8      48,799,556    6,099,945       36,856          4,607             789.02                        47                1,324             61,848          0.50
800x800    16     92,816,860    5,801,054       73,645          4,603             392.08                        188               1,260             236,728         1.89
1000x1000  4      42,197,191    10,549,298      30,574          7,644             665.91                        46                1,380             63,368          0.51
1000x1000  8      76,366,413    9,545,802       56,899          7,112             1,175.55                      48                1,342             64,962          0.52
1000x1000  16     145,552,789   9,097,049       114,989         7,187             1,510.35                      76                1,266             96,370          0.77
Table 7 Processing Time, Total Bytes and Number of Packets Ratios
Matrix     Ratio     Processing Time Ratio  Bytes Ratio  No. of Packets Ratio
100x100    8 vs. 4   0.5796                 0.9190       0.9797
100x100    16 vs. 4  0.3760                 0.9435       1.1349
200x200    8 vs. 4   0.3874                 0.9136       0.9387
200x200    16 vs. 4  0.2222                 0.8737       0.9374
400x400    8 vs. 4   0.5034                 0.9048       0.9599
400x400    16 vs. 4  0.2976                 0.8732       0.9414
500x500    8 vs. 4   0.4559                 0.9057       0.9542
500x500    16 vs. 4  0.2557                 0.8622       0.9375
600x600    8 vs. 4   0.3114                 0.9084       0.9562
600x600    16 vs. 4  0.1760                 0.8545       0.9210
800x800    8 vs. 4   0.7988                 0.9021       0.9261
800x800    16 vs. 4  0.1988                 0.8580       0.9253
1000x1000  8 vs. 4   0.8818                 0.9049       0.9305
1000x1000  16 vs. 4  0.5662                 0.8623       0.9403
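The ratios in Table 7 are computed from the per-node values in Table 6. For example, for the 100x100 matrix the processing time per node drops from 10.93 s on 4 nodes to 6.34 s on 8 nodes, a processing time ratio of 6.34/10.93 ≈ 0.58, while the bytes per node only drop from 113,480 to 104,284, a ratio of about 0.92; the packet ratios are formed from the per-node packet counts in the same way.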
Table 8 Time before the start of 1st slave
Matrix     100x100  200x200  400x400  500x500  600x600  800x800  1000x1000
4-node     24.05    36.11    101.15   143.27   259.39   283.52   364.04
8-node     29.02    46.07    65.13    75.22    85.25    517.68   814.02
16-node    33.14    45.11    43.14    53.22    61.32    89.50    1120.12
BIBLIOGRAPHY
[1] David E. Culler and Jaswinder Pal Singh, "Parallel Computer Architecture: A
Hardware/Software Approach," Morgan Kaufmann Publishers, 1999.
[2] Wind River Simics, URL: http://www.simics.net.
[3] MPI Group Management & Communicator, URL:
http://static.msi.umn.edu/tutorial/scicomp/general/MPI/communicator.html
[4] Message Passing Interface (MPI): Overview and Goals, URL:
www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report1.1/node2.htm#Node2
[5] FAQ: General information about the Open MPI Project
Section: 3. What are the goals of the Open MPI Project?
http://www.open-mpi.org/faq/?category=general
[6] An Assessment of Beowulf-class Computing for NASA Requirements: Initial
Findings from the First NASA Workshop on Beowulf-class Clustered Computing.
[7] Wind River Simics, “Hindsight User Guide.pdf”, Simics version 4.6, Revision
4076, Date 2012-10-11, pp. 207, URL: http://www.simics.net.
[8] Message Passing Interface (MPI), URL: https://computing.llnl.gov/tutorials/mpi/
[9] Wind River Simics, “Target Guide x86.pdf”, Simics version 4.6, Revision 4071,
Date 2012-09-06, pp. 9, URL: http://www.simics.net.
[10] Simics Forum, URL: https://www.simics.net/mwf/forum_show.pl
[11] Wind River Simics, “Ethernet Networking User Guide.pdf”, Simics version 4.6,
Revision 4076, Date 2012-10-11, pp. 207, URL: http://www.simics.net.
[12] K computer. Specifications: Network, URL:
http://en.wikipedia.org/wiki/K_computer
[13] MPICH2 Frequently Asked Questions, URL:
http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions
[14] Setting up a Beowulf Cluster Using Open MPI on Linux, URL:
http://techtinkering.com/2009/12/02/setting-up-a-beowulf-cluster-using-openmpi-on-linux/
[15] Considerations in Specifying Beowulf Clusters, URL:
http://h18002.www1.hp.com/alphaserver/download/Beowulf_Clusters.PDF
[16] FAQ: What kinds of systems / networks / run-time environments does Open MPI
support? Section 4: 4. What run-time environments does Open MPI support?
http://www.open-mpi.org/faq/?category=supported-systems
[17] Jain, Mukta Siddharth, “Performance analysis of a hardware queue in Simics”
URL: http://csus-dspace.calstate.edu/xmlui/handle/10211.9/1857
[18] MPI example programs, URL:
http://users.abo.fi/Mats.Aspnas/PP2010/examples/MPI/