Concurrent computing, lecture 2

Some performance issues

Previously we said that it is important that the individual nodes be fast (i.e., powerful). What is the potential for increased computational speed with cluster computing as opposed to a single computer?

Observe:
- a parallel program is composed of some number of processes, say N
- in a message passing computer, some time will have to be spent in processes communicating with each other

tcomp = computation time
tcomm = communication time

fine granularity - each process executes a few instructions before communicating
coarse granularity - each process executes many instructions before communicating

The computation/communication ratio = tcomp/tcomm is one measure of granularity.

One measure of the relative performance between a multiprocessor system and a single processor system is the speedup factor, S(n):

S(n) = (execution time on one processor) / (execution time on n processors) = t1/tn

For example, suppose my application runs in one hour on my PC. Suppose that it runs in 10 minutes on my 8-node cluster. Then I would say that the speedup is S(8) = 60 min / 10 min = 6.

The calculation can be done
- experimentally (using actual measurement)
- theoretically, by counting computation steps

Use the best sequential algorithm/program for the single processor case.
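The speedup factor and the computation/communication ratio defined above are simple quotients, so they can be sketched in a few lines of C. This is only an illustration: the function names speedup and comp_comm_ratio are made up for these notes, and the times can be in any unit as long as both arguments use the same one.

```c
#include <assert.h>

/* Hypothetical helper (name invented for these notes):
 * S(n) = t1 / tn, where t1 is the run time on one processor and
 * tn is the run time on n processors, both in the same unit. */
double speedup(double t1, double tn)
{
    assert(t1 > 0.0 && tn > 0.0);   /* times must be positive */
    return t1 / tn;
}

/* Hypothetical helper: granularity measured as tcomp / tcomm.
 * A large ratio suggests coarse granularity, a small one fine granularity. */
double comp_comm_ratio(double tcomp, double tcomm)
{
    assert(tcomp >= 0.0 && tcomm > 0.0);
    return tcomp / tcomm;
}
```

For the 8-node example in the text, speedup(60.0, 10.0) evaluates to 6. With measured times, t1 should come from the best sequential program, as noted above.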
linear speedup: S(n) = n
superlinear speedup: S(n) > n (can happen when the application fits into cache on more processors)

Usually, S(n) < n because:
1) there are some parts of the program that cannot be parallelized, during which only one or a few processors are busy (e.g., setup time, input time)
2) extra computation may be needed in the parallel version, for example to calculate local values
3) communication takes time that is not spent processing

Example parallel program: master/slave (in time order)
1. master reads setup data (serial part)
2. master bcasts to slaves
3. slaves compute, then send results to master (parallel part)
4. master gets results and outputs results (serial part)

ts = time to do serial part
tp = time to do parallel part if it were done by one process

Amdahl's Law helps us calculate the effect of the serial part.

Let f = ts/(ts+tp) = fraction that is serial

S(n) = (ts + tp) / (ts + tp/n)          /* Amdahl's Law */
     = (f + (1-f)) / (f + (1-f)/n)
     = 1 / (f + (1-f)/n)

Let n go to infinity: S(n) approaches 1/f

==> high performance computing is only possible on parallel computers where the individual nodes have some level of "high performance" !!

For example, if 5% of the application is serial, then max S(n) = 1/.05 = 20.
On n=10 processors, S(10) = 1/(.05 + .95/10) = 6.9

Efficiency, E = execution time with one processor / (execution time with n processors * number of processors)

E = t1/(tn*n) = S(n)/n * 100%

In my example with a 1 hour application, E = 6/8 = 75% (not that great)

A system/algorithm is scalable if, when you increase the problem size, you can increase the system size and still compute with "good" efficiency.

Gustafson observed that the serial part may not change in size as the problem gets bigger.
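The Amdahl's Law and efficiency arithmetic above can be checked with a minimal C sketch. The function names (amdahl_speedup, amdahl_limit, efficiency) are invented for illustration, and efficiency is returned as a fraction between 0 and 1 rather than a percentage.

```c
#include <assert.h>

/* Hypothetical helper: Amdahl's Law, S(n) = 1 / (f + (1 - f)/n),
 * where f is the serial fraction ts/(ts+tp) and n the processor count. */
double amdahl_speedup(double f, int n)
{
    assert(f >= 0.0 && f <= 1.0 && n >= 1);
    return 1.0 / (f + (1.0 - f) / (double)n);
}

/* Hypothetical helper: the limit of S(n) as n goes to infinity, 1/f.
 * Requires f > 0 (a fully parallel program has no finite bound here). */
double amdahl_limit(double f)
{
    assert(f > 0.0 && f <= 1.0);
    return 1.0 / f;
}

/* Hypothetical helper: efficiency E = S(n)/n, as a fraction (0.75 = 75%). */
double efficiency(double s, int n)
{
    assert(n >= 1);
    return s / (double)n;
}
```

With f = 0.05, amdahl_speedup(0.05, 10) comes out at about 6.9 and amdahl_limit(0.05) at 20, matching the worked numbers; efficiency(6.0, 8) gives 0.75 for the one-hour example.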
Under Gustafson's assumption that the serial part stays fixed as the problem grows, rewrite the speedup as

S(n) = (f + n*(1-f)) / (f + (1-f)) = f + n*(1-f)

If 5% of the program is not parallelizable when run on a single computer, then S(n) = .05 + n*(.95); for 10 processors, S(10) = 9.55 (a more optimistic prediction than Amdahl's 6.9)

Try this on your own:
- Suppose that I get a speedup of 8 when I run my application on 10 processors. According to Amdahl's Law, what portion is serial? What is the speedup on 20 processors? What is the best speedup that I could hope for?
- Suppose that 4% of my application is serial. What is my predicted speedup according to Gustafson's Law on 5 processors?

How to Build A Beowulf Cluster
- Some commodity cluster nodes
- An OS
- A network
- A mechanism for passing messages between application processes

Dedicated cluster with a master node

[Figure: Dedicated Cluster. Labels: User, External network, 2nd Ethernet interface, Master node, Uplink, Switch, Compute nodes]

What is in a commodity cluster node? Basically, it is a computer!
- Processor
- Cache
- Memory
- Disk controller
- Disks
- Motherboard
- Bus
- Network Interface Controller (NIC): Gigabit Ethernet, Fast Ethernet
- Power Supply

You can build it from parts, or you can get it all in one package from Dell, Gateway, …

Software: Linux or Windows?

OS and software building blocks

Linux - a free, open source OS, developed as a project by Linus Torvalds, that has progressed to a robust, multitasking operating system.
We use the distribution by RedHat on our machines, but many exist.
The Linux kernel is really Linux. A distribution includes the kernel plus windowing and tools for configuration.

Review of some basic OS concepts:
- process state diagram
- processes run, wait to be scheduled, block for I/O
- a scheduling discipline is used to schedule the next process

Process state diagram (transitions):
  Ready   -> Running : scheduler/dispatch
  Running -> Ready   : time out/context switch
  Running -> Wait    : I/O request
  Wait    -> Ready   : I/O completion

Processes can be created, suspended, killed.

Try these in Linux:
  ps
  ps aux
  vmstat
  top
  pstree

Unix/Linux fork() is used to create processes.

    /* Example of use of fork system call */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid;

        pid = fork();
        if (pid < 0) {              /* error occurred */
            fprintf(stderr, "Fork failed!\n");
            exit(EXIT_FAILURE);
        }
        else if (pid == 0) {        /* child process: fork() returned 0 */
            printf("I am the child, pid=%d\n", (int)getpid());
            fflush(stdout);         /* flush before exec replaces this program */
            execlp("/bin/ps", "ps", (char *)NULL);
            perror("execlp");       /* reached only if exec fails */
            exit(EXIT_FAILURE);
        }
        else {                      /* parent process: fork() returned the child's pid */
            printf("I am the parent, pid=%d, child pid=%d\n",
                   (int)getpid(), (int)pid);
            wait(NULL);             /* wait for the child to finish */
            printf("Child complete!\n");
            exit(0);
        }
    }

Quick network tutorial, http://csce.uark.edu/~aapon/courses/concurrent/notes/shortnetworktutorial.ppt

Protocol layers (top to bottom):
  Application
  TCP
  IP
  Data link layer (Ethernet CSMA/CD, …)
  Physical layer (Manchester encoding, …)

App: Uses MPI to solve my parallel programming problem.

MPI: Message Passing Interface - a messaging protocol, sometimes built over TCP/IP, that understands things like integers and endian conversion, and has special routines for scientific programming.

TCP: Transmission Control Protocol - reliable, in-order, connected point-to-point delivery of a data stream across a network.
A typical client:
  open connection to an IP address
  send/receive until done
  close connection
TCP doesn't recognize data types, byte ordering of numeric values, or end of messages.

IP: Internet Protocol - unreliable, connectionless (datagram) delivery of packets.
Every machine has an IP address, and most have an IP name. A Domain Name System (DNS) server can convert from one to the other.
IP is a protocol to send IP packets from one IP address to another.
IP has routing protocols to make sure the packets get to the right place.

data link layer: error-free delivery of a frame on the same network. In Fast Ethernet, the maximum frame size is about 1500 bytes. The transmission rate is 100 Mbps. Special networks built for clusters may transmit at over 10 Gbps.

physical layer: delivery of raw bits over a communication channel
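The data link figures above (about 1500 bytes per frame at 100 Mbps) give a rough feel for the cost of communication. The C sketch below estimates how long one frame takes to put on the wire and how many frames a message needs. It is a back-of-the-envelope model only: the function names are invented for these notes, and it ignores headers, preamble, inter-frame gaps, propagation delay, and all protocol overhead.

```c
#include <assert.h>

/* Hypothetical helper: seconds needed to transmit frame_bytes at a
 * link rate of rate_bps bits per second (transmission time only). */
double frame_tx_seconds(double frame_bytes, double rate_bps)
{
    assert(frame_bytes >= 0.0 && rate_bps > 0.0);
    return frame_bytes * 8.0 / rate_bps;
}

/* Hypothetical helper: number of frames needed to carry message_bytes
 * when each frame holds at most max_frame_bytes (rounds up). */
long frames_needed(long message_bytes, long max_frame_bytes)
{
    assert(message_bytes >= 0 && max_frame_bytes > 0);
    return (message_bytes + max_frame_bytes - 1) / max_frame_bytes;
}
```

Under these assumptions, frame_tx_seconds(1500.0, 100e6) is about 120 microseconds on Fast Ethernet, and the same frame at 10 Gbps takes about 1.2 microseconds. A 4000-byte message needs frames_needed(4000, 1500) = 3 frames.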