Lecture #10 slides

Introduction to Parallel Processing
Lecture #10
24/12/2001
Home Assignment #3
• Can be submitted until Thursday, 27/12/2001
Final Projects
• Groups 1-10 are asked to prepare their presentations for the class in two weeks.
• Please send the presentation files in PowerPoint format before the lecture, or bring a burned CD-ROM to class.
Quizzes
• Grading of the quizzes will be completed by Friday.
• The results will be published in the next class.
Lecture Topics
• Today’s topics:
– Shared Memory
– Cilk, OpenMP
– MPI – Derived Data Types
– How to Build a Beowulf
Shared Memory
• Go to the PDF presentation:
Chapter 8 from Wilkinson & Allen’s book,
“Programming with Shared Memory”
Summary
• Process creation
• The thread concept
• Pthread routines (a minimal sketch follows below)
• How data can be created as shared
• Condition Variables
• Dependency analysis: Bernstein’s conditions
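The chapter covers these topics in depth; as a quick refresher only, here is a minimal Pthreads sketch (not taken from the slides) showing thread creation, a mutex-protected shared variable, and a condition variable:

#include <pthread.h>
#include <stdio.h>

/* Shared data protected by a mutex; the condition variable signals "done". */
static int counter = 0;
static int done = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    counter += 1;               /* update shared data inside the critical section */
    done = 1;
    pthread_cond_signal(&cond); /* wake the waiting thread */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);

    pthread_mutex_lock(&lock);
    while (!done)                       /* wait until the worker signals */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    printf("counter = %d\n", counter);
    return 0;
}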
Cilk
http://supertech.lcs.mit.edu/cilk
Cilk
• A language for multithreaded parallel programming based on ANSI C.
• Cilk is designed as a general-purpose parallel programming language.
• Cilk is especially effective for exploiting dynamic, highly asynchronous parallelism.
A serial C program to compute the nth Fibonacci number.
A parallel Cilk program to compute the nth Fibonacci number.
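The slide images are not reproduced here; the two programs are essentially the textbook Cilk example, so a minimal sketch of both versions looks like this:

/* Serial C version */
int fib(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

/* Parallel Cilk version (Cilk-5 keywords: cilk, spawn, sync) */
cilk int fib(int n)
{
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);   /* run the two recursive calls in parallel */
        y = spawn fib(n - 2);
        sync;                   /* wait for both before using x and y */
        return x + y;
    }
}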
Cilk - continued
• Compiling:
$ cilk -O2 fib.cilk -o fib
• Executing:
$ fib --nproc 4 30
OpenMP
Next 5 slides taken from the SC99 tutorial given by:
Tim Mattson, Intel Corporation, and
Rudolf Eigenmann, Purdue University
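The tutorial slides themselves are not reproduced here; purely as an illustration of the programming style they cover (not taken from the tutorial), a minimal OpenMP parallel loop in C looks like this:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i, n = 100;
    double a[100], sum = 0.0;

    /* each thread works on a chunk of the loop; sum is combined with a reduction */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %f (computed by up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}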
Further Reading
High-Performance Computing
Part III
Shared Memory Parallel
Processors
Back to MPI
Collective Communication
• Broadcast
• Reduce
• Gather
• Allgather
• Scatter
There are more collective communication commands…
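As a reminder of how these calls look in code, here is a minimal sketch (not from the slides) that broadcasts a value from rank 0 and then sums one value per process back onto rank 0:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, n, local, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    n = (rank == 0) ? 100 : 0;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* every process now has n = 100 */

    local = rank;                                   /* each process contributes its rank */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("n = %d, sum of ranks = %d\n", n, sum);

    MPI_Finalize();
    return 0;
}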
Advanced Topics in MPI
• MPI – Derived Data Types
• MPI-2 – Parallel I/O
User Defined Types
• Besides the predefined types, the user can create new datatypes.
• Compact pack/unpack.
Predefined Types
MPI datatype          C datatype
MPI_DOUBLE            double
MPI_FLOAT             float
MPI_INT               signed int
MPI_LONG              signed long int
MPI_LONG_DOUBLE       long double
MPI_LONG_LONG_INT     signed long long int
MPI_SHORT             signed short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_LONG     unsigned long int
MPI_UNSIGNED_SHORT    unsigned short int
MPI_BYTE              (no direct C equivalent)
Motivation
• What if you want to specify:
– non-contiguous data of a single type?
– contiguous data of mixed types?
– non-contiguous data of mixed types?
Derived datatypes save memory, are faster and more portable, and are more elegant than packing data by hand.
3 Steps
1. Construct the new datatype using appropriate MPI
routines:
MPI_Type_contiguous, MPI_Type_vector,
MPI_Type_struct, MPI_Type_indexed,
MPI_Type_hvector, MPI_Type_hindexed
2. Commit the new datatype
MPI_Type_commit
3. Use the new datatype in sends/receives, etc.
Use
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank;
  MPI_Status status;
  struct { int x; int y; int z; } point;
  MPI_Datatype ptype;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Type_contiguous(3, MPI_INT, &ptype);   /* 3 consecutive ints form one "point" */
  MPI_Type_commit(&ptype);
  if (rank == 3) {
    point.x = 15; point.y = 23; point.z = 6;
    MPI_Send(&point, 1, ptype, 1, 52, MPI_COMM_WORLD);
  }
  else if (rank == 1) {
    MPI_Recv(&point, 1, ptype, 3, 52, MPI_COMM_WORLD, &status);
    printf("P:%d received coords are (%d,%d,%d)\n",
           rank, point.x, point.y, point.z);
  }
  MPI_Finalize();
  return 0;
}
User Defined Types
• MPI_TYPE_STRUCT
• MPI_TYPE_CONTIGUOUS
• MPI_TYPE_VECTOR
• MPI_TYPE_HVECTOR
• MPI_TYPE_INDEXED
• MPI_TYPE_HINDEXED
MPI_TYPE_STRUCT
is the most general way to construct an MPI
derived type because it allows the length,
location, and type of each component to be
specified independently.
int MPI_Type_struct(int count, int *array_of_blocklengths,
                    MPI_Aint *array_of_displacements,
                    MPI_Datatype *array_of_types,
                    MPI_Datatype *newtype)
Struct Datatype Example
count = 2
array_of_blocklengths[0] = 1    array_of_types[0] = MPI_INT
array_of_blocklengths[1] = 3    array_of_types[1] = MPI_DOUBLE
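The slide does not show the displacements; as a sketch only, building such a type for a hypothetical struct holding one int followed by three doubles (displacements obtained with MPI_Address, in the MPI-1 style used throughout this lecture) could look like this:

#include <mpi.h>

int main(int argc, char *argv[]) {
  struct { int n; double d[3]; } particle;     /* hypothetical struct: 1 int + 3 doubles */
  MPI_Datatype particletype;
  int blocklengths[2] = { 1, 3 };
  MPI_Datatype types[2] = { MPI_INT, MPI_DOUBLE };
  MPI_Aint displacements[2], base;

  MPI_Init(&argc, &argv);

  /* displacements are the byte offsets of each block from the start of the struct */
  MPI_Address(&particle, &base);
  MPI_Address(&particle.n, &displacements[0]);
  MPI_Address(&particle.d, &displacements[1]);
  displacements[0] -= base;
  displacements[1] -= base;

  MPI_Type_struct(2, blocklengths, displacements, types, &particletype);
  MPI_Type_commit(&particletype);
  /* particletype can now be used in MPI_Send / MPI_Recv with count = 1 */

  MPI_Finalize();
  return 0;
}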
MPI_TYPE_CONTIGUOUS
is the simplest of these, describing a contiguous sequence of values in memory.
For example,
MPI_Type_contiguous(2, MPI_DOUBLE, &MPI_2D_POINT);
MPI_Type_contiguous(3, MPI_DOUBLE, &MPI_3D_POINT);

int MPI_Type_contiguous(int count, MPI_Datatype oldtype,
                        MPI_Datatype *newtype)
MPI_TYPE_CONTIGUOUS
creates new type indicators MPI_2D_POINT and
MPI_3D_POINT. These type indicators allow you to
treat consecutive pairs of doubles as point
coordinates in a 2-dimensional space and
sequences of three doubles as point coordinates in
a 3-dimensional space.
MPI_TYPE_VECTOR
describes several such sequences evenly spaced
but not consecutive in memory.
MPI_TYPE_HVECTOR is similar to
MPI_TYPE_VECTOR except that the distance
between successive blocks is specified in bytes
rather than elements.
MPI_TYPE_INDEXED describes sequences that
may vary both in length and in spacing.
MPI_TYPE_VECTOR
int MPI_Type_vector(int count, int blocklength, int stride,
                    MPI_Datatype oldtype, MPI_Datatype *newtype)
count = 2, blocklength = 3, stride = 5
Example program:
#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[]) {
  int rank, i, j;
  MPI_Status status;
  double x[4][8];
  MPI_Datatype coltype;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* 4 blocks of 1 double, 8 doubles apart: one column of the 4x8 array */
  MPI_Type_vector(4, 1, 8, MPI_DOUBLE, &coltype);
  MPI_Type_commit(&coltype);
  if (rank == 3) {
    for (i = 0; i < 4; ++i)
      for (j = 0; j < 8; ++j)
        x[i][j] = pow(10.0, i + 1) + j;
    MPI_Send(&x[0][7], 1, coltype, 1, 52, MPI_COMM_WORLD);
  }
  else if (rank == 1) {
    MPI_Recv(&x[0][2], 1, coltype, 3, 52, MPI_COMM_WORLD, &status);
    for (i = 0; i < 4; ++i)
      printf("P:%d my x[%d][2]=%lf\n", rank, i, x[i][2]);
  }
  MPI_Finalize();
  return 0;
}
Output:
P:1 my x[0][2]=17.000000
P:1 my x[1][2]=107.000000
P:1 my x[2][2]=1007.000000
P:1 my x[3][2]=10007.000000
Committing a datatype
int MPI_Type_commit (MPI_Datatype *datatype)
Obtaining Information About Derived Types
• MPI_TYPE_LB and MPI_TYPE_UB can provide the lower and upper bounds of the type.
• MPI_TYPE_EXTENT can provide the extent of the type. In most cases, this is the amount of memory a value of the type will occupy.
• MPI_TYPE_SIZE can provide the size of the type in a message. If the type is scattered in memory, this may be significantly smaller than the extent of the type (see the sketch below).
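For instance, for the same column type built with MPI_Type_vector(4, 1, 8, MPI_DOUBLE, &coltype) in the example program above, a small sketch querying both values (MPI-1 calls, matching the rest of the lecture; byte counts assume 8-byte doubles) behaves roughly as follows:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  MPI_Datatype coltype;
  MPI_Aint extent;
  int size;

  MPI_Init(&argc, &argv);
  MPI_Type_vector(4, 1, 8, MPI_DOUBLE, &coltype);   /* same column type as above */
  MPI_Type_commit(&coltype);

  MPI_Type_extent(coltype, &extent);  /* (3*8 + 1) = 25 doubles spanned in memory = 200 bytes */
  MPI_Type_size(coltype, &size);      /* only 4 doubles = 32 bytes travel in a message */
  printf("extent = %ld bytes, size = %d bytes\n", (long)extent, size);

  MPI_Finalize();
  return 0;
}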
MPI_TYPE_EXTENT
MPI_Type_extent(MPI_Datatype datatype, MPI_Aint *extent)
Correction: deprecated. Use MPI_Type_get_extent instead!
Ref: Ian Foster’s book “DBPP”
MPI-2
MPI-2 is a set of extensions to the MPI standard.
It was finalized by the MPI Forum in June, 1997.
MPI-2
• New Datatype Manipulation Functions
• Info Object
• New Error Handlers
• Establishing/Releasing Communications
• Extended Collective Operations
• Thread Support
• Fault Tolerance
MPI-2 Parallel I/O
• Motivation:
– The ability to parallelize I/O can offer significant performance improvements.
– User-level checkpointing is contained within the program itself.
Parallel I/O
• MPI-2 supports both blocking and nonblocking I/O.
• MPI-2 supports both collective and non-collective I/O (a minimal sketch follows below).
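These topics are not developed further in this course (see the slide below), but purely for orientation, a minimal blocking, non-collective sketch (the file name and offsets are illustrative only) might look like:

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, buf[4] = { 0, 1, 2, 3 };
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every process opens the same file and writes its own block at its own offset */
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(buf), buf, 4, MPI_INT, &status);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}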
Complementary Filetypes
Simple File Scatter/Gather Problem
MPI-2 Parallel I/O
• Related topics that will not be covered in the current course:
• MPI-2 file structure
• Initializing MPI-2 File I/O
• Defining a View
• Data Access - Reading Data
• Data Access - Writing Data
• Closing MPI-2 file I/O
How to Build a Beowulf
What is a Beowulf?
• A new strategy in High-Performance Computing (HPC) that exploits mass-market technology to overcome the oppressive costs in time and money of supercomputing.
What is a Beowulf?
A collection of personal computers interconnected by widely available networking technology, running one of several open-source Unix-like operating systems.
• COTS – Commodity off-the-shelf components
• Interconnection networks: LAN/SAN
Price/Performance
How to Run Applications Faster
There are 3 ways to improve performance:
1. Work Harder
2. Work Smarter
3. Get Help
Computer Analogy:
1. Use faster hardware, e.g. reduce the time per instruction (clock cycle).
2. Optimized algorithms and techniques.
3. Use multiple computers to solve the problem, i.e. increase the number of instructions executed per clock cycle.
Motivation for using Clusters
• The communications bandwidth
between workstations is increasing as
new networking technologies and
protocols are implemented in LANs and
WANs.
• Workstation clusters are easier to
integrate into existing networks than
special parallel computers.
Beowulf-class Systems
A New Paradigm for the Business of Computing
• Brings high end computing to broad ranged problems
– new markets
• Order of magnitude Price-Performance advantage
• Commodity enabled
– no long development lead times
• Low vulnerability to vendor-specific decisions
– companies are ephemeral; Beowulfs are forever
• Rapid response technology tracking
• Just-in-place user-driven configuration
– requirement responsive
• Industry-wide, non-proprietary software environment
Beowulf Project - A Brief History
• Started in late 1993
• NASA Goddard Space Flight Center
– NASA JPL, Caltech, academic and industrial collaborators
• Sponsored by NASA HPCC Program
• Applications: single user science station
– data intensive
– low cost
• General focus:
– single user (dedicated) science and engineering applications
– system scalability
– Ethernet drivers for Linux
Beowulf System at JPL (Hyglac)
• 16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128 Mbyte memory, Fast Ethernet card.
• Connected using 100Base-T network, through a 16-way crossbar switch.
• Theoretical peak performance: 3.2 GFlop/s.
• Achieved sustained performance: 1.26 GFlop/s.
Cluster Computing - Research Projects (partial list)
• Beowulf (CalTech and NASA) - USA
• Condor - University of Wisconsin-Madison, USA
• HPVM (High Performance Virtual Machine) - UIUC & now UCSB, USA
• MOSIX - Hebrew University of Jerusalem, Israel
• MPI (MPI Forum; MPICH is one of the popular implementations)
• NOW (Network of Workstations) - Berkeley, USA
• NIMROD - Monash University, Australia
• NetSolve - University of Tennessee, USA
• PBS (Portable Batch System) - NASA Ames and LLNL, USA
• PVM - Oak Ridge National Lab./UTK/Emory, USA
Motivation for using Clusters
• Surveys show utilisation of CPU cycles of
desktop workstations is typically <10%.
• Performance of workstations and PCs is
rapidly improving
• As performance grows, percent utilisation
will decrease even further!
• Organisations are reluctant to buy large
supercomputers, due to the large expense and
short useful life span.
Motivation for using Clusters
• The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems.
• Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms.
• Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of the system!!!
Original Food Chain Picture
1984 Computer Food Chain:
Mainframe, Mini Computer, Vector Supercomputer, Workstation, PC
1994 Computer Food Chain:
Mini Computer (hitting wall soon), Workstation, Mainframe (future is bleak), Vector Supercomputer, MPP, PC
Computer Food Chain (Now and Future)
Terms in the diagram: Parallel Computing, Cluster Computing, MetaComputing, Pile of PCs, NOW/COW, Beowulf, NT-PC Cluster, Tightly Coupled, Vector, WS Farms/cycle harvesting, DASHMEM-NUMA, PC Clusters: small, medium, large…
Computing Elements
Layers in the diagram: Applications, Threads Interface, Operating System, Micro kernel, Hardware (a multi-processor computing system of processors running processes and threads).
Networking
• Topology
• Hardware
• Cost
• Performance
Cluster Building Blocks
Channel Bonding
Myrinet
• Myrinet 2000 switch
• Myrinet 2000 NIC
• Example: 320-host Clos topology of 16-port switches (5 groups of 64 hosts). (From Myricom)
Myrinet
•Full-duplex 2+2 Gigabit/second data rate links, switch ports, and
interface ports.
•Flow control, error control, and "heartbeat" continuity monitoring
on every link.
•Low-latency, cut-through, crossbar switches, with monitoring for
high-availability applications.
•Switch networks that can scale to tens of thousands of hosts, and
that can also provide alternative communication paths between hosts.
•Host interfaces that execute a control program to interact directly
with host processes ("OS bypass") for low-latency communication,
and directly with the network to send, receive, and buffer packets.
Myrinet
• Sustained one-way data rate for large messages: 1.92 Gbit/s
• Latency for short messages: ~9 µs
Gigabit Ethernet
Cajun 550
Cajun P882
Switches by 3COM and Avaya
Cajun M770
Network Topology
Topology of the Velocity+ Cluster at CTC
Software: all this list for free!
• Compilers: FORTRAN, C/C++
• Java: JDK from Sun, IBM and others
• Scripting: Perl, Python, awk…
• Editors: vi, (x)emacs, kedit, gedit…
• Scientific writing: LaTeX, Ghostview…
• Plotting: gnuplot
• Image processing: xview,
• …and much more!!!
Building a Parallel Cluster
• 32 top-of-the-line processors
• A fast communication network
Hardware
Dual P4 2GHz
How much does it cost us?
• A dual Pentium 4 machine with 2 GB of fast RDRAM memory: $3,000
(1 GB memory/CPU)
• Operating system: $0 (Linux)
How much does it cost us?
• PCI64B @ 133 MHz, Myrinet2000 NIC with 2 MB memory: $1,195
• Myrinet-2000 fiber cables, 3 m long: $110
• 16-port switch with fiber ports: $5,625
How much does it cost us?
• KVM: 16-port, ~$1,000
• Avocent (Cybex) using cat5 IP over Ethernet
How much does it cost us?
• Computers: $3,000*16 = $48,000
• Network cards: ($1,195+$110)*16 = $20,880
• Switch: $5,625
• KVM: $1,000
• Monitor + miscellaneous: $500
• Total: $76,005
• Theoretical peak computing power:
• 2 GFLOPS * 32 = 64 GFLOPS
• $76,000 / 64 ≈ $1,187 per GFLOPS
Less than $1.2 per MFLOPS!!!
What else is needed?
• Space!, air conditioning (cooling), a backup power system (UPS).
• It is convenient for one of the nodes to serve as a file server (NFS or another file-sharing system).
• User management with a tool such as NIS.
• Connection to an external network: one of the nodes performs routing from the internal IP address space to the external one.
• A monitoring tool such as bWatch.
Installing the System
• First, a single machine can be installed.
• The remaining machines can then be installed by cloning the first machine’s hard disk (for example, with software such as Ghost).
Installing a Software Package XXX (e.g. MPI)
• Download xxx.tar.gz
• Uncompress: gzip -d xxx.tar.gz
• Untar: tar xvf xxx.tar
• Prepare the makefile: ./configure
• Build: make
Parallel programming also requires…
• “rlogin” must be allowed (xinetd: disable=no)
• Create a “.rhosts” file
• Parallel administration tools: “brsh”, “prsh” and self-made scripts.
References
• Beowulf: http://www.beowulf.org
• Computer Architecture:
http://www.cs.wisc.edu/~arch/www/
Next Week
• Additional topics in MPI
• Grid Computing
• Parallel computations in scientific problems
• Summary
Please start working on the projects!
The presentations begin in two weeks!