Introduction to Parallel Processing
Lecture No. 10, 24/12/2001

Home Assignment No. 3
• Can be submitted until Thursday, 27/12/2001.

Final Projects
• Groups 1-10 are asked to prepare their presentations for the class in two weeks.
• Please send the presentation files in PowerPoint format before the lecture, or come to class with a burned CD-ROM.

Quizzes
• Grading of the quizzes will be completed by Friday.
• The results will be announced in the next class.

Today's Topics
– Shared Memory
– Cilk, OpenMP
– MPI – Derived Data Types
– How to Build a Beowulf

Shared Memory
• Go to the PDF presentation: Chapter 8, "Programming with Shared Memory", from Wilkinson & Allen's book.

Summary
• Process creation
• The thread concept
• Pthread routines
• How data can be created as shared
• Condition variables
• Dependency analysis: Bernstein's conditions

Cilk
http://supertech.lcs.mit.edu/cilk
• A language for multithreaded parallel programming, based on ANSI C.
• Cilk is designed as a general-purpose parallel programming language.
• Cilk is especially effective for exploiting dynamic, highly asynchronous parallelism.
• The original slides show a serial C program and a parallel Cilk program that compute the nth Fibonacci number; a sketch of both follows below.

Cilk - continued
• Compiling:
  $ cilk -O2 fib.cilk -o fib
• Executing:
  $ fib --nproc 4 30
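The two Fibonacci listings appear on the slides as figures and are not reproduced in the text; the following is a sketch of the classic versions from the Cilk documentation (the argument handling in main is illustrative).

fib.c - the serial C version:

    #include <stdio.h>
    #include <stdlib.h>

    /* Plain recursive computation of the nth Fibonacci number. */
    int fib(int n)
    {
        if (n < 2) return n;
        else {
            int x = fib(n - 1);
            int y = fib(n - 2);
            return (x + y);
        }
    }

    int main(int argc, char *argv[])
    {
        int n = (argc > 1) ? atoi(argv[1]) : 30;
        printf("fib(%d) = %d\n", n, fib(n));
        return 0;
    }

fib.cilk - the parallel Cilk version: the two recursive calls are spawned as independent threads, and sync waits for both before their results are combined:

    #include <stdio.h>
    #include <stdlib.h>

    cilk int fib(int n)
    {
        if (n < 2) return n;
        else {
            int x, y;
            x = spawn fib(n - 1);   /* runs in parallel with... */
            y = spawn fib(n - 2);   /* ...this spawned call */
            sync;                   /* wait for both children */
            return (x + y);
        }
    }

    cilk int main(int argc, char *argv[])
    {
        int result;
        int n = (argc > 1) ? atoi(argv[1]) : 30;
        result = spawn fib(n);
        sync;
        printf("fib(%d) = %d\n", n, result);
        return 0;
    }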
OpenMP
• The next 5 slides are taken from the SC99 tutorial given by Tim Mattson (Intel Corporation) and Rudolf Eigenmann (Purdue University).

Further Reading
• High-Performance Computing, Part III: Shared Memory Parallel Processors.

Back to MPI

Collective Communication
• Broadcast
• Reduce
• Gather
• Allgather
• Scatter
• There are more collective communication commands… (a usage sketch of two of them follows below)
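The collective operations appear on the slides as diagrams only; as a minimal sketch (standard MPI C bindings), here is how a broadcast and a reduction look in code. Rank 0 broadcasts a value to every process, and the ranks are then summed back onto rank 0:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, n = 0, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Broadcast: the root (rank 0) sends n to all processes. */
        if (rank == 0) n = 100;
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Reduce: combine each process's rank with MPI_SUM onto rank 0. */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("n=%d, sum of ranks=%d\n", n, sum);

        MPI_Finalize();
        return 0;
    }

MPI_Gather, MPI_Allgather and MPI_Scatter follow the same calling pattern, with send and receive buffers whose sizes depend on the number of processes.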
Advanced Topics in MPI
• MPI – Derived Data Types
• MPI-2 – Parallel I/O

User Defined Types
• Besides the predefined types, the user can create new types.
• A compact alternative to pack/unpack.

Predefined Types
MPI_DOUBLE          double
MPI_FLOAT           float
MPI_INT             signed int
MPI_LONG            signed long int
MPI_LONG_DOUBLE     long double
MPI_LONG_LONG_INT   signed long long int
MPI_SHORT           signed short int
MPI_UNSIGNED        unsigned int
MPI_UNSIGNED_CHAR   unsigned char
MPI_UNSIGNED_LONG   unsigned long int
MPI_UNSIGNED_SHORT  unsigned short int
MPI_BYTE            (raw bytes)

Motivation
• What if you want to specify:
  – non-contiguous data of a single type?
  – contiguous data of mixed types?
  – non-contiguous data of mixed types?
• Derived datatypes save memory, are faster, more portable, and elegant.

3 Steps
1. Construct the new datatype using the appropriate MPI routines: MPI_Type_contiguous, MPI_Type_vector, MPI_Type_struct, MPI_Type_indexed, MPI_Type_hvector, MPI_Type_hindexed.
2. Commit the new datatype: MPI_Type_commit.
3. Use the new datatype in sends/receives, etc.

Use

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        MPI_Status status;
        struct {
            int x;
            int y;
            int z;
        } point;
        MPI_Datatype ptype;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Three consecutive ints form one "point" value. */
        MPI_Type_contiguous(3, MPI_INT, &ptype);
        MPI_Type_commit(&ptype);

        if (rank == 3) {
            point.x = 15; point.y = 23; point.z = 6;
            MPI_Send(&point, 1, ptype, 1, 52, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&point, 1, ptype, 3, 52, MPI_COMM_WORLD, &status);
            printf("P:%d received coords are (%d,%d,%d)\n",
                   rank, point.x, point.y, point.z);
        }

        MPI_Finalize();
        return 0;
    }

User Defined Types
• MPI_TYPE_STRUCT
• MPI_TYPE_CONTIGUOUS
• MPI_TYPE_VECTOR
• MPI_TYPE_HVECTOR
• MPI_TYPE_INDEXED
• MPI_TYPE_HINDEXED

MPI_TYPE_STRUCT is the most general way to construct an MPI derived type, because it allows the length, location, and type of each component to be specified independently.

int MPI_Type_struct(int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype)

Struct Datatype Example
count = 2
array_of_blocklengths[0] = 1    array_of_types[0] = MPI_INT
array_of_blocklengths[1] = 3    array_of_types[1] = MPI_DOUBLE
(a code sketch of this example follows below)
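The slide gives only the construction parameters, not code. A minimal sketch of building this example type follows; the C struct layout and the use of offsetof for the displacements are illustrative assumptions:

    #include <stddef.h>   /* offsetof */
    #include <mpi.h>

    /* Hypothetical C struct matching the example:
     * one int followed by three doubles. */
    struct particle {
        int    id;
        double coords[3];
    };

    MPI_Datatype make_particle_type(void)
    {
        MPI_Datatype newtype;
        int          blocklengths[2]  = { 1, 3 };
        MPI_Datatype types[2]         = { MPI_INT, MPI_DOUBLE };
        MPI_Aint     displacements[2] = {
            offsetof(struct particle, id),
            offsetof(struct particle, coords)
        };

        MPI_Type_struct(2, blocklengths, displacements, types, &newtype);
        MPI_Type_commit(&newtype);
        return newtype;   /* ready for use in MPI_Send/MPI_Recv */
    }

(MPI-1 era codes often compute the displacements with MPI_Address instead of offsetof; either way, each displacement is the byte offset of its block from the start of the structure.)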
MPI_TYPE_CONTIGUOUS is the simplest of these, describing a contiguous sequence of values in memory. For example:

    MPI_Type_contiguous(2, MPI_DOUBLE, &MPI_2D_POINT);
    MPI_Type_contiguous(3, MPI_DOUBLE, &MPI_3D_POINT);

int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_TYPE_CONTIGUOUS creates the new type indicators MPI_2D_POINT and MPI_3D_POINT. These type indicators allow you to treat consecutive pairs of doubles as point coordinates in a 2-dimensional space, and sequences of three doubles as point coordinates in a 3-dimensional space.

MPI_TYPE_VECTOR describes several such sequences, evenly spaced but not consecutive in memory.

MPI_TYPE_HVECTOR is similar to MPI_TYPE_VECTOR, except that the distance between successive blocks is specified in bytes rather than in elements.

MPI_TYPE_INDEXED describes sequences that may vary both in length and in spacing.

MPI_TYPE_VECTOR

int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)

Example: count = 2, blocklength = 3, stride = 5.

Example program:

    #include <stdio.h>
    #include <math.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, i, j;
        MPI_Status status;
        double x[4][8];
        MPI_Datatype coltype;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One column of a 4x8 array: 4 blocks of 1 double, stride 8. */
        MPI_Type_vector(4, 1, 8, MPI_DOUBLE, &coltype);
        MPI_Type_commit(&coltype);

        if (rank == 3) {
            for (i = 0; i < 4; ++i)
                for (j = 0; j < 8; ++j)
                    x[i][j] = pow(10.0, i + 1) + j;
            MPI_Send(&x[0][7], 1, coltype, 1, 52, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x[0][2], 1, coltype, 3, 52, MPI_COMM_WORLD, &status);
            for (i = 0; i < 4; ++i)
                printf("P:%d my x[%d][2]=%lf\n", rank, i, x[i][2]);
        }

        MPI_Finalize();
        return 0;
    }

The output:

    P:1 my x[0][2]=17.000000
    P:1 my x[1][2]=107.000000
    P:1 my x[2][2]=1007.000000
    P:1 my x[3][2]=10007.000000

Committing a datatype

int MPI_Type_commit(MPI_Datatype *datatype)

Obtaining Information About Derived Types
• MPI_TYPE_LB and MPI_TYPE_UB can provide the lower and upper bounds of the type.
• MPI_TYPE_EXTENT can provide the extent of the type. In most cases, this is the amount of memory a value of the type will occupy.
• MPI_TYPE_SIZE can provide the size of the type in a message. If the type is scattered in memory, this may be significantly smaller than the extent of the type.

MPI_TYPE_EXTENT

MPI_Type_extent(MPI_Datatype datatype, MPI_Aint *extent)

Correction: deprecated. Use MPI_Type_get_extent instead!
Ref: Ian Foster's book "DBPP".

MPI-2
• MPI-2 is a set of extensions to the MPI standard. It was finalized by the MPI Forum in June 1997.

MPI-2
• New Datatype Manipulation Functions
• Info Object
• New Error Handlers
• Establishing/Releasing Communications
• Extended Collective Operations
• Thread Support
• Fault Tolerance

MPI-2 Parallel I/O
• Motivation:
  – The ability to parallelize I/O can offer significant performance improvements.
  – User-level checkpointing is contained within the program itself.

Parallel I/O
• MPI-2 supports both blocking and nonblocking I/O.
• MPI-2 supports both collective and non-collective I/O.

Complementary Filetypes

Simple File Scatter/Gather Problem

MPI-2 Parallel I/O
• Related topics that will not be taught in the current course (a small taste of the interface is sketched below):
  – MPI-2 file structure
  – Initializing MPI-2 file I/O
  – Defining a view
  – Data access - reading data
  – Data access - writing data
  – Closing MPI-2 file I/O
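Since the file I/O calls themselves are not covered in the course, the following is only a flavor-of-the-API sketch, assuming the MPI-2 C bindings: every rank opens one shared file and writes its own block at a rank-dependent offset (the filename and block size are illustrative).

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int        rank, i, buf[100];
        MPI_File   fh;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 100; ++i)
            buf[i] = rank;    /* this rank's data */

        /* All ranks open the same file together. */
        MPI_File_open(MPI_COMM_WORLD, "data.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* Each rank writes its 100-int block at its own offset,
         * so the writes proceed in parallel without overlapping. */
        MPI_File_write_at(fh, (MPI_Offset)rank * 100 * sizeof(int),
                          buf, 100, MPI_INT, &status);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }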
How to Build a Beowulf

What is a Beowulf?
• A new strategy in High-Performance Computing (HPC) that exploits mass-market technology to overcome the oppressive costs in time and money of supercomputing.

What is a Beowulf?
• A collection of personal computers interconnected by widely available networking technology, running one of several open-source Unix-like operating systems.
• COTS – commodity-off-the-shelf components
• Interconnection networks: LAN/SAN

Price/Performance

How to Run an Application Faster
• There are 3 ways to improve performance:
  1. Work harder
  2. Work smarter
  3. Get help
• Computer analogy:
  1. Use faster hardware, e.g. reduce the time per instruction (clock cycle).
  2. Use optimized algorithms and techniques.
  3. Use multiple computers to solve the problem, i.e. increase the number of instructions executed per clock cycle.

Motivation for Using Clusters
• The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs.
• Workstation clusters are easier to integrate into existing networks than special parallel computers.

Beowulf-class Systems: A New Paradigm for the Business of Computing
• Brings high-end computing to broad-ranged problems – new markets
• Order-of-magnitude price/performance advantage
• Commodity enabled – no long development lead times
• Low vulnerability to vendor-specific decisions – companies are ephemeral; Beowulfs are forever
• Rapid-response technology tracking
• Just-in-place user-driven configuration – requirement responsive
• Industry-wide, non-proprietary software environment

Beowulf Project - A Brief History
• Started in late 1993
• NASA Goddard Space Flight Center
  – NASA JPL, Caltech, academic and industrial collaborators
• Sponsored by the NASA HPCC Program
• Applications: single-user science station
  – data intensive
  – low cost
• General focus:
  – single-user (dedicated) science and engineering applications
  – system scalability
  – Ethernet drivers for Linux

Beowulf System at JPL (Hyglac)
• 16 Pentium Pro PCs, each with a 2.5 GByte disk, 128 MByte memory, and a Fast Ethernet card.
• Connected using a 100Base-T network, through a 16-way crossbar switch.
• Theoretical peak performance: 3.2 GFlop/s.
• Achieved sustained performance: 1.26 GFlop/s.

Cluster Computing - Research Projects (partial list)
• Beowulf (CalTech and NASA) - USA
• Condor - University of Wisconsin-Madison, USA
• HPVM (High Performance Virtual Machine) - UIUC, now UCSB, USA
• MOSIX - Hebrew University of Jerusalem, Israel
• MPI (MPI Forum; MPICH is one of the popular implementations)
• NOW (Network of Workstations) - Berkeley, USA
• NIMROD - Monash University, Australia
• NetSolve - University of Tennessee, USA
• PBS (Portable Batch System) - NASA Ames and LLNL, USA
• PVM - Oak Ridge National Lab./UTK/Emory, USA

Motivation for Using Clusters
• Surveys show that utilisation of the CPU cycles of desktop workstations is typically below 10%.
• The performance of workstations and PCs is rapidly improving.
• As performance grows, percent utilisation will decrease even further!
• Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.

Motivation for Using Clusters
• The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems.
• Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms.
• Using clusters of workstations as a distributed compute resource gives very cost-effective incremental growth of the system!

Original Food Chain Picture
[slides: the computer "food chain" cartoons]
• 1984 Computer Food Chain: Mainframe, Mini Computer, Vector Supercomputer, Workstation, PC.
• 1994 Computer Food Chain: Mini Computer (hitting wall soon), Workstation, Mainframe (future is bleak), Vector Supercomputer, MPP, PC.
• Computer Food Chain (Now and Future): Parallel Computing – Cluster Computing, MetaComputing; Pile of PCs – NOW/COW, Beowulf, NT-PC Cluster; Tightly Coupled Vector; WS Farms/cycle harvesting; DASHMEM-NUMA; PC clusters: small, medium, large…

Computing Elements
[slide: layered view of a multi-processor computing system – applications, threads interface, operating system / microkernel, and the hardware level of processes, threads, and processors]

Networking
• Topology
• Hardware
• Cost
• Performance

Cluster Building Blocks

Channel Bonding

Myrinet
• Myrinet 2000 switch, Myrinet 2000 NIC
• Example: a 320-host Clos topology built from 16-port switches (5 x 64 hosts). (From Myricom)

Myrinet
• Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports.
• Flow control, error control, and "heartbeat" continuity monitoring on every link.
• Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications.
• Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts.
• Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets.

Myrinet
• Sustained one-way data rate for large messages: 1.92 Gbit/s.
• Latency for short messages: 9 microseconds.

Gigabit Ethernet
• Switches by 3COM and Avaya: Cajun 550, Cajun P882, Cajun M770.

Network Topology
[three slides of topology diagrams, including the topology of the Velocity+ cluster at CTC]

Software: all of this list for free!
• Compilers: FORTRAN, C/C++
• Java: JDK from Sun, IBM and others
• Scripting: Perl, Python, awk…
• Editors: vi, (x)emacs, kedit, gedit…
• Scientific writing: LaTeX, Ghostview…
• Plotting: gnuplot
• Image processing: xview, …
• and much more!!!

Building a Parallel Cluster
• 32 top-of-the-line processors
• A fast communication network

Hardware
• Dual P4 2GHz, 1GB memory/CPU

How much does it cost us?
• Dual Pentium 4 computer with 2GB of fast RDRAM memory: $3,000
• Operating system (Linux): $0

How much does it cost us?
• Myrinet2000 NIC (PCI64B @ 133MHz, with 2M memory): $1,195
• Myrinet-2000 fiber cable, 3m long: $110
• 16-port switch with fiber ports: $5,625

How much does it cost us?
• KVM switch, 16 ports: ~$1,000
• Avocent (Cybex), using cat5, IP over Ethernet

How much does it cost us?
• Computers: $3,000 * 16 = $48,000
• Network cards and cables: ($1,195 + $110) * 16 = $20,880
• Communication switch: $5,625
• KVM: $1,000
• Monitor + miscellaneous: $500
• Total: $76,005

Theoretical peak computing power:
• 2 GFlop/s * 32 processors = 64 GFlop/s
• $76,000 / 64 = $1,187 per GFlop/s
• Less than $1.2 per MFlop/s!!!

What else is needed?
• Space! Air conditioning (cooling), and a backup power system (UPS).
• It is convenient for one of the stations to serve as a file server (NFS or another file-sharing system).
• User management, with a tool such as NIS.
• A link to an external network: one of the stations routes between the internal and the external IP address spaces.
• Monitoring tools, such as bWatch.

Installing the System
• First, install a single computer.
• The remaining computers can then be installed by duplicating the first computer's hard disk (for example, with a tool such as Ghost).

Installing Software (e.g. MPI)
• Download xxx.tar.gz
• Uncompress: gzip -d xxx.tar.gz
• Untar: tar xvf xxx.tar
• Prepare the makefile: ./configure
• Build: make (using the generated Makefile)

Parallel Programming Requires…
• "rlogin" must be allowed (xinetd: disable=no)
• Create a ".rhosts" file
• Parallel administration tools: "brsh", "prsh", and self-made scripts

References
• Beowulf: http://www.beowulf.org
• Computer Architecture: http://www.cs.wisc.edu/~arch/www/

Next Week
• More topics in MPI
• Grid Computing
• Parallel computing in scientific problems
• Summary
• Please start working on the projects! The presentations begin in two weeks!