Administration Tools for Managing a Large Scale Linux Cluster
CRC, KEK, Japan
S. Kawabata, A. Manabe (atsushi.manabe@kek.jp)
ACAT2002, 2002/6/26

Linux PC clusters in KEK
- PC Cluster 1: Pentium III Xeon 500 MHz, 144 CPUs (36 nodes)
- PC Cluster 2: Pentium III 800 MHz, 80 CPUs (40 nodes)
- PC Cluster 3 (Belle): Pentium III Xeon 700 MHz, 320 CPUs (80 nodes)
- PC Cluster 4 (neutron simulation): Fujitsu TS225, 50 nodes; Pentium III 1 GHz x 2 CPUs, 512 MB memory, 31 GB disk, 100BaseTX x 2, RS232C x 2; 1U rack-mount model with remote BIOS setting and remote reset/power-off
- PC Cluster 5 (Belle): 1U servers, Pentium III 1.2 GHz, 256 CPUs (128 nodes)
- PC Cluster 6: 3U blade server, LP Pentium III 700 MHz, 40 CPUs (40 nodes)

PC clusters
- More than 400 nodes (>800 CPUs) of Linux PC clusters have already been installed; only the middle-sized and larger clusters are counted.
- A major experiment group (Belle) plans to install several hundred blade-server nodes this year.
- All PC clusters are managed by the individual user groups themselves.

Center machine (KEK CRC)
- The machines currently in the KEK Computer Center (CRC) are UNIX (Solaris, AIX) servers.
- We plan to have a Linux computing cluster of more than 1000 nodes in the near future (~2004).
- It will be installed under a ~4-year rental contract (with a hardware update every 2 years?).

Center machine
- The system will be shared among many user groups (not dedicated to a single group).
- Their demand for CPU power varies from month to month (high demand before an international conference, and so on).
- Of course, we use a load-balancing batch system.
- Big groups use their own software frameworks; their jobs run only under restricted versions of the OS (Linux), middleware and configuration.

R&D system
- Frequent changes of the system configuration and CPU partitioning.
- To manage a PC cluster of this size under such user requests, we need sophisticated administration tools.

Necessary admin tools
- System (SW) installation / update
- Configuration
- Status monitoring / system health check
- Command execution

Installation tools
Two types of `installation tool':
- Disk cloning
- Application package installer (the system/kernel is just another application in this sense)

Installation tool (cloning)
- Install the system and applications on a `master host'.
- Copy the disk partition image to the nodes.

Installation tool (package installer)
[Diagram: clients send requests to a package server; the server returns the image and control data, drawing on a package information DB and a package archive.]

Remote installation via the network
Cloning disk images:
- SystemImager (VA): http://systemimager.sourceforge.net/
- CATS-i (Soongsil Univ.)
- CloneIt: http://www.ferzkopp.net/Software/CloneIt/
- Commercial: ImageCast, Ghost, ...
Package/application installation:
- Kickstart + rpm (RedHat)
- LUI (IBM): http://oss.software.ibm.com/developerworks/projects/lui
- Lucie (Titech): http://matsu-www.is.titech.ac.jp/~takamiya/lucie/
- LCFGng, Arusha
Most of these are public-domain software.

Dolly+
We developed `dolly+', an `image cloning via network' installer. Why another one?
- We may install/update frequently (according to user needs), on 100-1000 nodes simultaneously.
- Making packages for our own software is tedious.
- Traditional server/client-type software suffers from a server bottleneck.
- Multicast copy of a ~GB image seems unstable (and there is no free software for it?).

(Few) server / (many) client model vs. multicasting
Server/client model:
- The server can be a daemon process (you do not need to start it by hand).
- Performance does not scale with the number of nodes: server bottleneck and network congestion.
Multicasting or broadcasting:
- No server bottleneck.
- Gets the maximum performance out of a network whose switch fabrics support multicasting.
- A node failure does not affect the whole process very much, so it can be robust, although the failed node needs a re-transfer.
- Speed is governed by the slowest node, as in a RING topology.
- Uses UDP rather than TCP, so the application must take care of transfer reliability.

Dolly and dolly+
Dolly:
- A Linux application to copy/clone files and/or disk images among many PCs over a network.
- Originally developed by the CoPs project at ETH (Switzerland); open software.
Dolly+ features:
- Transfers/copies sequential files (no 2 GB size limit) and/or normal files over a TCP/IP network, optionally decompressing and untarring on the fly.
- Virtual RING connection topology to cope with the server bottleneck problem.
- Pipelining and multi-threading for speed-up.
- A fail-recovery mechanism for robust operation.

Dolly: virtual ring topology
[Diagram: the master (the host holding the original image) and the node PCs are physically connected through network hubs and switches; dolly overlays a logical ring on this physical connection.]
- The physical network connection can be whatever you like.
- Logically, dolly chains the nodes into a ring, specified in dolly's config file, and passes the data node by node, bucket-relay style.
- Although each transfer is only between two adjacent nodes, this exploits the maximum performance of a switching network with full-duplex ports.
- Good for a network complex made of many switches.

Cascade topology
- The server bottleneck can be overcome.
- It cannot reach the maximum network performance, but it is better than a many-clients-to-one-server topology.
- Weak against node failure: a failure spreads down the cascade and is difficult to recover from.

Pipelining and multi-threading
[Diagram: the file is split into 4 MB chunks (BOF, 1, 2, 3, ..., EOF) which flow from the server over the network through node 1, node 2, ... to the next node, with 3 threads working in parallel on each node.]
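To make the pipelining and multi-threading idea above more concrete, the following is a minimal sketch of a bucket-relay node: one thread receives 4 MB chunks from the upstream host, one writes them to disk, and one forwards them to the downstream node, all connected by small queues. It is only an illustration of the scheme, not the dolly+ code; the port number, output path and the way the end of the stream is signalled are assumptions made for the example.

    # Minimal sketch of a dolly+-style bucket-relay node (NOT the real dolly+ code).
    # Assumptions: chunks arrive on PORT from the upstream node, are written to
    # OUT_PATH, and are forwarded to DOWNSTREAM (None on the last node of the ring).
    import socket, threading, queue

    PORT = 9098               # assumed port number
    CHUNK = 4 * 1024 * 1024   # 4 MB chunks, as in the slides
    DOWNSTREAM = None         # e.g. ("n002.kek.jp", PORT); None = last node of the ring
    OUT_PATH = "/tmp/image"   # assumed output file

    def receiver(q_disk, q_net):
        srv = socket.socket()
        srv.bind(("", PORT)); srv.listen(1)
        conn, _ = srv.accept()
        while True:
            buf = b""
            while len(buf) < CHUNK:
                part = conn.recv(CHUNK - len(buf))
                if not part:
                    break
                buf += part
            q_disk.put(buf); q_net.put(buf)   # hand the chunk to the other threads
            if len(buf) < CHUNK:              # a short chunk marks the end of the stream
                break

    def writer(q_disk):
        with open(OUT_PATH, "wb") as f:
            while True:
                buf = q_disk.get()
                f.write(buf)
                if len(buf) < CHUNK:
                    break

    def sender(q_net):
        if DOWNSTREAM is None:                # last node: nothing to forward
            while len(q_net.get()) == CHUNK:
                pass
            return
        s = socket.create_connection(DOWNSTREAM)
        while True:
            buf = q_net.get()
            s.sendall(buf)
            if len(buf) < CHUNK:
                break

    q_disk, q_net = queue.Queue(4), queue.Queue(4)  # small queues keep the pipeline tight
    threads = [threading.Thread(target=receiver, args=(q_disk, q_net)),
               threading.Thread(target=writer, args=(q_disk,)),
               threading.Thread(target=sender, args=(q_net,))]
    for t in threads: t.start()
    for t in threads: t.join()

Because receiving, writing and sending overlap, each node streams data onward while it is still being written, which is why the ring adds little per-node latency.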
Performance of dolly+
[Chart: elapsed time (min) for cloning vs. number of nodes (1 to 500), measured on Fujitsu TS225 nodes (Pentium III 1 GHz x 2, SCSI disk, 512 MB memory, 100BaseT network) with a 4 MB chunk size and ~10 MB/s transfer speed; curves for a total 4 GB and a total 2 GB disk image.]
Less than 5 minutes is expected for 100 nodes.

Dolly+ transfer speed: scalability with image size
Hardware (server and nodes): Pentium III 1 GHz x 2, IDE-ATA/100 disk, 100BASE-TX network, 256 MB memory.
[Chart: transferred bytes (MB) vs. elapsed time (sec), including setup, for 1 server with 1, 2, 7 and 10 nodes.]
- 1 server, 1 node: 230 sec, 8.2 MB/s
- 1 server, 2 nodes: 252 sec, 7.4 MB/s x 2
- 1 server, 7 nodes: 266 sec, 7.0 MB/s x 7
- 1 server, 10 nodes: 260 sec, 7.2 MB/s x 10

Fail-recovery mechanism
- A single node failure could be a "show stopper" in a RING (series connection) topology.
- Dolly+ provides an automatic `short cut' mechanism against node trouble.
- When a node fails, the node upstream of it detects the failure through a sending timeout.
- The upstream node then negotiates with the node downstream of the failed one to reconnect and re-transfer the file chunk.
- The RING topology makes this easy to implement.
[Diagram: the failed node is detected by a timeout and bypassed with a short cut, so the ring continues.]

Re-transfer in short-cutting
[Diagram: as in the pipelining figure, 4 MB chunks flow from the server through node 1, node 2, ...; when a node is short-cut, the affected chunk is re-sent to the next node.]
This works even with sequential files.
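The sending half of this short-cut idea could look roughly like the sketch below: when forwarding a chunk to the next node times out, the sender reconnects to the node after it in its ring list and re-sends the same chunk. The ring list, port, timeout value and error handling are assumptions for illustration; the real dolly+ negotiation between the upstream and downstream nodes is more involved.

    # Sketch of the `short cut' idea on the sending side (assumed names and protocol,
    # not the actual dolly+ implementation).
    import socket

    RING = ["n002.kek.jp", "n003.kek.jp", "n004.kek.jp"]  # downstream nodes in ring order (assumed)
    PORT = 9098      # assumed port
    TIMEOUT = 30.0   # seconds before a node is considered dead (assumed)

    def connect_next(start_index):
        """Try the downstream nodes in ring order until one accepts the connection."""
        for i in range(start_index, len(RING)):
            try:
                s = socket.create_connection((RING[i], PORT), timeout=TIMEOUT)
                s.settimeout(TIMEOUT)
                return s, i
            except OSError:
                continue        # short cut: skip the unreachable node, try the next one
        return None, len(RING)  # no downstream node left (end of the ring)

    def relay(chunks):
        """Send each chunk downstream; on a timeout, reconnect further down and re-send it."""
        sock, idx = connect_next(0)
        for chunk in chunks:
            while sock is not None:          # sock is None only when no node is left
                try:
                    sock.sendall(chunk)
                    break                    # chunk delivered, move on to the next one
                except OSError:              # timeout or broken connection
                    sock.close()
                    sock, idx = connect_next(idx + 1)  # bypass the failed node, re-send the chunk

On the receiving side, a node would symmetrically accept a new upstream connection and continue from the re-sent chunk, which is what makes the re-transfer work even for a sequential file.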
Dolly+: how do you start it on Linux?
Server side (the host that has the original files):
    % dollyS [-v] -f config_file
Node side:
    % dollyC [-v]
Config file example:
    iofiles 3
    /dev/hda1 > /tmp/dev/hda1
    /data/file.gz >> /data/file
    boot.tar.Z >> /boot
    server n000.kek.jp
    firstclient n001.kek.jp
    lastclient n020.kek.jp
    client 20
    n001
    n002
    :
    n020
    endconfig
Here `iofiles' gives the number of files to transfer, `server' the master host name, `client' the number of client nodes followed by their names, and `endconfig' the end code. On each transfer line, the left side of `>' is the input file on the server and the right side is the output file on the clients; `>' means dolly+ does not modify the image, while `>>' tells dolly+ to cook the file (decompress, untar, ...) according to its file name.

How does dolly+ clone the system after booting?
- Nodes broadcast over the LAN in search of an installation server (Pre-eXecution Environment, PXE).
- The PXE/DHCP server responds to the nodes with their IP addresses and the kernel download server.
- The kernel and a RAM disk image are multicast-TFTP'ed to the nodes, and the kernel starts.
- The kernel hands off to an installation script which runs a disk tool and dolly+ (the scripts and applications are contained in the RAM disk image).

How does dolly+ start after rebooting?
- The code partitions the hard drive, creates the file systems and starts the dolly+ client on the node.
- You start the dolly+ master on the master host to begin the disk-cloning process.
- The code then configures node-specific information such as the host name and IP address from the DHCP information.
- The node is then ready to boot from its own hard drive for the first time.
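As an illustration of what such an installation script might do, here is a compressed, hypothetical sketch: it partitions an assumed disk /dev/hda with sfdisk, creates a file system, waits for the image via the dolly+ client, and writes the DHCP-derived host name into the cloned system. The partition layout, tool choices and file locations are assumptions, not the actual KEK scripts.

    # Hypothetical post-boot installation script (a sketch, not the KEK production code).
    # Assumptions: /dev/hda is the target disk, sfdisk/mkfs/dollyC are available in the
    # RAM disk image, and the host name comes from the DHCP-assigned DNS entry.
    import socket, subprocess

    DISK = "/dev/hda"  # assumed target disk

    def run(cmd, **kw):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True, **kw)

    # 1. Partition the disk (a single Linux partition spanning the disk, as an example).
    run(["sfdisk", DISK], input=b",,L\n")

    # 2. Create the file system.
    run(["mkfs.ext2", DISK + "1"])

    # 3. Receive the disk image over the ring (the dolly+ client waits for the master).
    run(["dollyC"])

    # 4. Configure node-specific information from what DHCP handed us.
    hostname = socket.getfqdn()            # name matching the DHCP/DNS entry
    run(["mount", DISK + "1", "/mnt"])
    with open("/mnt/etc/hostname", "w") as f:   # actual location depends on the distribution
        f.write(hostname + "\n")
    run(["umount", "/mnt"])

    # 5. Reboot: the node now boots from its own disk for the first time.
    run(["reboot"])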
PXE trouble
By the way, we sometimes suffered PXE mtftp transfer failures when more than about 20 nodes booted simultaneously. If you have the same trouble, please mail me. We have started rewriting the mtftp client code of the RedHat Linux PXE server.

Configuration

(Sub)system configuration
- Linux (Unix) has a lot of configuration files for configuring its sub-systems; with 1000 nodes you have to manage (many) x 1000 config files.
- Three types of solution to manage them:
  1. A centralized information service server (like NIS); this needs support from the sub-system (nsswitch).
  2. Automatic remote editing of the raw config files (like cfengine); each node's files must be cared for separately.
  3. Programming the whole configuration from source code, the new proposal described next.

Configuration: a new proposal from computer science
- Program (configure) the whole system from a source code in an object-oriented way.
- Systematic and uniform configuration.
- Source reuse (inheritance) as much as possible; templates can be overridden for another site's configuration.
- Examples: Arusha (http://ark.sourceforge.net), LCFGng (http://www.lcfg.org)

LCFGng (Univ. Edinburgh)
[Diagram: new configuration source is compiled on a server; nodes are notified, fetch the new profile, execute the resulting configuration files and control commands, and acknowledge.]
Good points:
- The author says it works on ~1000 nodes.
- Fully automatic: you just edit the source code and compile it on one host.
- Differences between sub-systems are hidden from the user (administrator), or rather moved into `components' (DB -> actual config file).

LCFGng (continued)
- The configuration language is too primitive: Hostname.Component.Parameter Value.
- There are not many components, so you must write your own component scripts for each sub-system yourself; it is far easier to write the config file itself than to write a component.
- The timing at which a configuration change becomes active cannot be controlled.

Status monitoring
- System state monitoring (CPU/memory/disk/network utilization): Ganglia*1, plantir*2
- (Sub-)system service sanity checks: Pikt*3, Pica*4, cfengine
*1 http://ganglia.sourceforge.net
*2 http://www.netsonde.com
*3 http://pikt.org
*4 http://pica.sourceforge.net/wtf.html

Ganglia (Univ. California)
- gmond (on each node): all nodes `multicast' their system status information to one another, so each node holds the current status of all nodes; this gives good redundancy and robustness. The authors declare that it works on ~1000 nodes.
- Meta-daemon (web server): stores the volatile gmond data in a round-robin database and presents an XML image of the activity of all nodes.
- Web interface.

plantir (network adaptation)
- Quick understanding of the system status from one web page.

Remote execution
- Administrators sometimes need to issue a command to all (or part of the) nodes urgently.
- Remote execution can be done with rsh/ssh/pikt/cfengine/SUT (mpich)*/gexec/...
- The important points are:
  - making it easy to see the execution result (failure or success) at a glance;
  - parallel execution across the nodes; otherwise, if each node takes 1 second, 1000 nodes take 1000 seconds.
*) Scalable Unix Tools for clusters: http://www-unix.mcs.anl.gov/sut/

WANI
- A web-based remote command executor.
- Easy to select the nodes concerned.
- Easy to specify a script, or to type in command lines, to be executed on the nodes.
- Issues the commands to the nodes in parallel.
- Collects the results, with error/failure detection.
- Currently the software is a prototype built from combinations of existing protocols and tools. (Anyway, it works!)

WANI is implemented on the `Webmin' GUI
[Screenshot: the start page, with node selection and command input.]

Command execution results
[Screenshot: results from 200 nodes on one page, one cell per host name; clicking a cell switches to another page showing that node's stdout and stderr output.]

Error detection
- The frame color of each cell shows progress: white = initial, yellow = command started, black = finished.
- The background color flags problems, detected by four checks: (1) the exit code, (2) a case-insensitive grep for the words "fail"/"error", (3) a check against the sys_errlist[] (perror) message list, (4) a check against the output of `strings /bin/sh`.
[Diagram: the web browser talks to the Webmin server; the command is sent through the PIKT server (piktc) to piktc_svc on the node hosts for execution; the results come back through lpr/lpd, where a print filter acts as the error detector and produces the error-marked result pages shown in the browser.]

Summary
I reviewed administration tools that can be used for a ~1000-node Linux PC cluster.
- Installation: dolly+ can install, update or switch more than 100 hosts very quickly.
- Configuration managers are not mature yet, but we can expect a lot from DataGrid research.
- Status monitoring: several good packages already exist, at the cost of extra daemons and network traffic.
- Remote command execution: having `the result at a glance' is important for quick iteration, and parallel execution is important.
Some programs and links are, or will be, at http://corvus.kek.jp/~manabe
Thank you for listening.

Synchronization time with rsync
[Chart: elapsed time (sec) vs. aggregate size of modified files (MB) for rsync over a tree of 4096 directories and 43,680 files of ~20 kB each (1.06 GB total); measurements for a total of 1 GB (~20 kB/file) and 2 GB (~50 kB/file); fitted with y = sum_n a_n x^n, a0 ~ 0.87, a1 ~ 0.42, |r| ~ 0.997.]
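To make the `parallel execution' and `result at a glance' points concrete, the sketch below runs one command on many nodes in parallel over ssh and applies checks in the same spirit as WANI (exit code plus a case-insensitive search for "fail"/"error" words). The host names, ssh options and thread count are assumptions, and this is not the WANI implementation itself.

    # Minimal parallel remote execution with WANI-style result checks (a sketch,
    # not WANI itself).  Assumes password-less ssh to the listed hosts.
    import re, subprocess
    from concurrent.futures import ThreadPoolExecutor

    HOSTS = ["n%03d.kek.jp" % i for i in range(1, 21)]   # assumed node names
    COMMAND = "uptime"                                   # example command

    def run_on(host):
        p = subprocess.run(["ssh", "-o", "ConnectTimeout=5", host, COMMAND],
                           capture_output=True, text=True)
        out = p.stdout + p.stderr
        # Same spirit as WANI's checks: non-zero exit code or "fail"/"error" words.
        bad = p.returncode != 0 or re.search(r"fail|error", out, re.IGNORECASE)
        return host, ("NG" if bad else "OK"), out.strip()

    with ThreadPoolExecutor(max_workers=20) as pool:     # parallel, not one node at a time
        for host, status, out in pool.map(run_on, HOSTS):
            print("%-14s %s  %s" % (host, status, out.splitlines()[0] if out else ""))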