Administration Tools for Managing a Large Scale Linux Cluster
CRC, KEK, Japan
S. Kawabata, A. Manabe (atsushi.manabe@kek.jp)
ACAT2002, 2002/6/26

Linux PC clusters in KEK
- PC Cluster 1: Pentium III Xeon 500 MHz, 144 CPUs (36 nodes)
- PC Cluster 2: Pentium III 800 MHz, 80 CPUs (40 nodes)
- PC Cluster 3 (Belle): Pentium III Xeon 700 MHz, 320 CPUs (80 nodes)
- PC Cluster 4 (neutron simulation): Fujitsu TS225, 50 nodes; Pentium III 1 GHz x 2 CPUs, 512 MB memory, 31 GB disk, 100BaseTX x 2, RS232C x 2; 1U rack-mount model with remote BIOS setting and remote reset/power-off
- PC Cluster 5 (Belle): 1U servers, Pentium III 1.2 GHz, 256 CPUs (128 nodes)
- PC Cluster 6: 3U blade server, LP Pentium III 700 MHz, 40 CPUs (40 nodes)

PC clusters
- More than 400 nodes (>800 CPUs) of Linux PC clusters have already been installed; only the middle-sized and larger clusters are counted.
- A major experiment group (Belle) plans to install several hundred blade-server nodes this year.
- All PC clusters are managed by the individual user groups themselves.

Center machine (KEK CRC)
- The machines currently in the KEK Computer Center (CRC) are UNIX (Solaris, AIX) servers.
- We plan to have a Linux computing cluster of more than 1000 nodes in the near future (~2004).
- It will be installed under a ~4-year rental contract (with a hardware update every 2 years?).

Center machine
- The system will be shared among many user groups (not dedicated to a single group).
- Their demand for CPU power varies from month to month (high demand before an international conference, and so on).
- Of course, we use a load-balancing batch system.
- Big groups use their own software frameworks; their jobs run only under restricted versions of the OS (Linux), middleware and configuration.

R&D system
- Frequent changes of the system configuration and CPU partitioning.
- To manage a PC cluster of this size under such user requests, we need sophisticated administration tools.

Necessary admin tools
- System (SW) installation / update
- Configuration
- Status monitoring / system health check
- Command execution

Installation tools
Two types of `installation tool':
- Disk cloning
- Application package installer (the system/kernel is just another application in this sense)

Installation tool (cloning)
- Install the system and applications on a `master host'.
- Copy the disk partition image to the nodes.

Installation tool (package installer)
[Diagram: clients send requests to a package server; the server returns the image and control data, drawing on a package information DB and a package archive.]

Remote installation via the network
Cloning disk images:
- SystemImager (VA): http://systemimager.sourceforge.net/
- CATS-i (Soongsil Univ.)
- CloneIt: http://www.ferzkopp.net/Software/CloneIt/
- Commercial: ImageCast, Ghost, ...
Package/application installation:
- Kickstart + rpm (RedHat)
- LUI (IBM): http://oss.software.ibm.com/developerworks/projects/lui
- Lucie (Titech): http://matsu-www.is.titech.ac.jp/~takamiya/lucie/
- LCFGng, Arusha
Most of these are public-domain software.

Dolly+
We developed `dolly+', an `image cloning via network' installer. Why another one?
- We may install/update frequently (according to user needs), on 100-1000 nodes simultaneously.
- Making packages for our own software is tedious.
- Traditional server/client-type software suffers from a server bottleneck.
- Multicast copy of a ~GB image seems unstable (and there is no free software for it?).

(Few) server / (many) client model vs. multicasting
Server/client model:
- The server can be a daemon process (you do not need to start it by hand).
- Performance does not scale with the number of nodes: server bottleneck and network congestion.
Multicasting or broadcasting:
- No server bottleneck.
- Gets the maximum performance out of a network whose switch fabrics support multicasting.
- A node failure does not affect the whole process very much, so it can be robust, although the failed node needs a re-transfer.
- Speed is governed by the slowest node, as in a RING topology.
- Uses UDP rather than TCP, so the application must take care of transfer reliability.

Dolly and dolly+
Dolly:
- A Linux application to copy/clone files and/or disk images among many PCs over a network.
- Originally developed by the CoPs project at ETH (Switzerland); open software.
Dolly+ features:
- Transfers/copies sequential files (no 2 GB size limit) and/or normal files over a TCP/IP network, optionally decompressing and untarring on the fly.
- Virtual RING connection topology to cope with the server bottleneck problem.
- Pipelining and multi-threading for speed-up.
- A fail-recovery mechanism for robust operation.

Dolly: virtual ring topology
[Diagram: the master (the host holding the original image) and the node PCs are physically connected through network hubs and switches; dolly overlays a logical ring on this physical connection.]
- The physical network connection can be whatever you like.
- Logically, dolly chains the nodes into a ring, specified in dolly's config file, and passes the data node by node, bucket-relay style.
- Although each transfer is only between two adjacent nodes, this exploits the maximum performance of a switching network with full-duplex ports.
- Good for a network complex made of many switches.

Cascade topology
- The server bottleneck can be overcome.
- It cannot reach the maximum network performance, but it is better than a many-clients-to-one-server topology.
- Weak against node failure: a failure spreads down the cascade and is difficult to recover from.

Pipelining and multi-threading
[Diagram: the file is split into 4 MB chunks (BOF, 1, 2, 3, ..., EOF) which flow from the server over the network through node 1, node 2, ... to the next node, with 3 threads working in parallel on each node.]
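To make the pipelining and multi-threading idea above more concrete, the following is a minimal sketch of a bucket-relay node: one thread receives 4 MB chunks from the upstream host, one writes them to disk, and one forwards them to the downstream node, all connected by small queues. It is only an illustration of the scheme, not the dolly+ code; the port number, output path and the way the end of the stream is signalled are assumptions made for the example.

    # Minimal sketch of a dolly+-style bucket-relay node (NOT the real dolly+ code).
    # Assumptions: chunks arrive on PORT from the upstream node, are written to
    # OUT_PATH, and are forwarded to DOWNSTREAM (None on the last node of the ring).
    import socket, threading, queue

    PORT = 9098               # assumed port number
    CHUNK = 4 * 1024 * 1024   # 4 MB chunks, as in the slides
    DOWNSTREAM = None         # e.g. ("n002.kek.jp", PORT); None = last node of the ring
    OUT_PATH = "/tmp/image"   # assumed output file

    def receiver(q_disk, q_net):
        srv = socket.socket()
        srv.bind(("", PORT)); srv.listen(1)
        conn, _ = srv.accept()
        while True:
            buf = b""
            while len(buf) < CHUNK:
                part = conn.recv(CHUNK - len(buf))
                if not part:
                    break
                buf += part
            q_disk.put(buf); q_net.put(buf)   # hand the chunk to the other threads
            if len(buf) < CHUNK:              # a short chunk marks the end of the stream
                break

    def writer(q_disk):
        with open(OUT_PATH, "wb") as f:
            while True:
                buf = q_disk.get()
                f.write(buf)
                if len(buf) < CHUNK:
                    break

    def sender(q_net):
        if DOWNSTREAM is None:                # last node: nothing to forward
            while len(q_net.get()) == CHUNK:
                pass
            return
        s = socket.create_connection(DOWNSTREAM)
        while True:
            buf = q_net.get()
            s.sendall(buf)
            if len(buf) < CHUNK:
                break

    q_disk, q_net = queue.Queue(4), queue.Queue(4)  # small queues keep the pipeline tight
    threads = [threading.Thread(target=receiver, args=(q_disk, q_net)),
               threading.Thread(target=writer, args=(q_disk,)),
               threading.Thread(target=sender, args=(q_net,))]
    for t in threads: t.start()
    for t in threads: t.join()

Because receiving, writing and sending overlap, each node streams data onward while it is still being written, which is why the ring adds little per-node latency.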
Performance of dolly+
[Chart: elapsed time (min) for cloning vs. number of nodes (1 to 500), measured on Fujitsu TS225 nodes (Pentium III 1 GHz x 2, SCSI disk, 512 MB memory, 100BaseT network) with a 4 MB chunk size and ~10 MB/s transfer speed; curves for a total 4 GB and a total 2 GB disk image.]
Less than 5 minutes is expected for 100 nodes.

Dolly+ transfer speed: scalability with image size
Hardware (server and nodes): Pentium III 1 GHz x 2, IDE-ATA/100 disk, 100BASE-TX network, 256 MB memory.
[Chart: transferred bytes (MB) vs. elapsed time (sec), including setup, for 1 server with 1, 2, 7 and 10 nodes.]
- 1 server, 1 node: 230 sec, 8.2 MB/s
- 1 server, 2 nodes: 252 sec, 7.4 MB/s x 2
- 1 server, 7 nodes: 266 sec, 7.0 MB/s x 7
- 1 server, 10 nodes: 260 sec, 7.2 MB/s x 10

Fail-recovery mechanism
- A single node failure could be a "show stopper" in a RING (series connection) topology.
- Dolly+ provides an automatic `short cut' mechanism against node trouble.
- When a node fails, the node upstream of it detects the failure through a sending timeout.
- The upstream node then negotiates with the node downstream of the failed one to reconnect and re-transfer the file chunk.
- The RING topology makes this easy to implement.
[Diagram: the failed node is detected by a timeout and bypassed with a short cut, so the ring continues.]

Re-transfer in short-cutting
[Diagram: as in the pipelining figure, 4 MB chunks flow from the server through node 1, node 2, ...; when a node is short-cut, the affected chunk is re-sent to the next node.]
This works even with sequential files.
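The sending half of this short-cut idea could look roughly like the sketch below: when forwarding a chunk to the next node times out, the sender reconnects to the node after it in its ring list and re-sends the same chunk. The ring list, port, timeout value and error handling are assumptions for illustration; the real dolly+ negotiation between the upstream and downstream nodes is more involved.

    # Sketch of the `short cut' idea on the sending side (assumed names and protocol,
    # not the actual dolly+ implementation).
    import socket

    RING = ["n002.kek.jp", "n003.kek.jp", "n004.kek.jp"]  # downstream nodes in ring order (assumed)
    PORT = 9098      # assumed port
    TIMEOUT = 30.0   # seconds before a node is considered dead (assumed)

    def connect_next(start_index):
        """Try the downstream nodes in ring order until one accepts the connection."""
        for i in range(start_index, len(RING)):
            try:
                s = socket.create_connection((RING[i], PORT), timeout=TIMEOUT)
                s.settimeout(TIMEOUT)
                return s, i
            except OSError:
                continue        # short cut: skip the unreachable node, try the next one
        return None, len(RING)  # no downstream node left (end of the ring)

    def relay(chunks):
        """Send each chunk downstream; on a timeout, reconnect further down and re-send it."""
        sock, idx = connect_next(0)
        for chunk in chunks:
            while sock is not None:          # sock is None only when no node is left
                try:
                    sock.sendall(chunk)
                    break                    # chunk delivered, move on to the next one
                except OSError:              # timeout or broken connection
                    sock.close()
                    sock, idx = connect_next(idx + 1)  # bypass the failed node, re-send the chunk

On the receiving side, a node would symmetrically accept a new upstream connection and continue from the re-sent chunk, which is what makes the re-transfer work even for a sequential file.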
Dolly+: how do you start it on Linux?
Server side (the host that has the original files):
    % dollyS [-v] -f config_file
Node side:
    % dollyC [-v]
Config file example:
    iofiles 3
    /dev/hda1 > /tmp/dev/hda1
    /data/file.gz >> /data/file
    boot.tar.Z >> /boot
    server n000.kek.jp
    firstclient n001.kek.jp
    lastclient n020.kek.jp
    client 20
    n001
    n002
    :
    n020
    endconfig
Here `iofiles' gives the number of files to transfer, `server' the master host name, `client' the number of client nodes followed by their names, and `endconfig' the end code. On each transfer line, the left side of `>' is the input file on the server and the right side is the output file on the clients; `>' means dolly+ does not modify the image, while `>>' tells dolly+ to cook the file (decompress, untar, ...) according to its file name.

How does dolly+ clone the system after booting?
- Nodes broadcast over the LAN in search of an installation server (Pre-eXecution Environment, PXE).
- The PXE/DHCP server responds to the nodes with their IP addresses and the kernel download server.
- The kernel and a RAM disk image are multicast-TFTP'ed to the nodes, and the kernel starts.
- The kernel hands off to an installation script which runs a disk tool and dolly+ (the scripts and applications are contained in the RAM disk image).

How does dolly+ start after rebooting?
- The code partitions the hard drive, creates the file systems and starts the dolly+ client on the node.
- You start the dolly+ master on the master host to begin the disk-cloning process.
- The code then configures node-specific information such as the host name and IP address from the DHCP information.
- The node is then ready to boot from its own hard drive for the first time.
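As an illustration of what such an installation script might do, here is a compressed, hypothetical sketch: it partitions an assumed disk /dev/hda with sfdisk, creates a file system, waits for the image via the dolly+ client, and writes the DHCP-derived host name into the cloned system. The partition layout, tool choices and file locations are assumptions, not the actual KEK scripts.

    # Hypothetical post-boot installation script (a sketch, not the KEK production code).
    # Assumptions: /dev/hda is the target disk, sfdisk/mkfs/dollyC are available in the
    # RAM disk image, and the host name comes from the DHCP-assigned DNS entry.
    import socket, subprocess

    DISK = "/dev/hda"  # assumed target disk

    def run(cmd, **kw):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True, **kw)

    # 1. Partition the disk (a single Linux partition spanning the disk, as an example).
    run(["sfdisk", DISK], input=b",,L\n")

    # 2. Create the file system.
    run(["mkfs.ext2", DISK + "1"])

    # 3. Receive the disk image over the ring (the dolly+ client waits for the master).
    run(["dollyC"])

    # 4. Configure node-specific information from what DHCP handed us.
    hostname = socket.getfqdn()            # name matching the DHCP/DNS entry
    run(["mount", DISK + "1", "/mnt"])
    with open("/mnt/etc/hostname", "w") as f:   # actual location depends on the distribution
        f.write(hostname + "\n")
    run(["umount", "/mnt"])

    # 5. Reboot: the node now boots from its own disk for the first time.
    run(["reboot"])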
PXE trouble
By the way, we sometimes suffered PXE mtftp transfer failures when more than about 20 nodes booted simultaneously. If you have the same trouble, please mail me. We have started rewriting the mtftp client code of the RedHat Linux PXE server.

Configuration

(Sub)system configuration
- Linux (Unix) has a lot of configuration files for configuring its sub-systems; with 1000 nodes you have to manage (many) x 1000 config files.
- Three types of solution to manage them:
  1. A centralized information service server (like NIS); this needs support from the sub-system (nsswitch).
  2. Automatic remote editing of the raw config files (like cfengine); each node's files must be cared for separately.
  3. Programming the whole configuration from source code, the new proposal described next.

Configuration: a new proposal from computer science
- Program (configure) the whole system from a source code in an object-oriented way.
- Systematic and uniform configuration.
- Source reuse (inheritance) as much as possible; templates can be overridden for another site's configuration.
- Examples: Arusha (http://ark.sourceforge.net), LCFGng (http://www.lcfg.org)

LCFGng (Univ. Edinburgh)
[Diagram: new configuration source is compiled on a server; nodes are notified, fetch the new profile, execute the resulting configuration files and control commands, and acknowledge.]
Good points:
- The author says it works on ~1000 nodes.
- Fully automatic: you just edit the source code and compile it on one host.
- Differences between sub-systems are hidden from the user (administrator), or rather moved into `components' (DB -> actual config file).

LCFGng (continued)
- The configuration language is too primitive: Hostname.Component.Parameter Value.
- There are not many components, so you must write your own component scripts for each sub-system yourself; it is far easier to write the config file itself than to write a component.
- The timing at which a configuration change becomes active cannot be controlled.

Status monitoring
- System state monitoring (CPU/memory/disk/network utilization): Ganglia*1, plantir*2
- (Sub-)system service sanity checks: Pikt*3, Pica*4, cfengine
*1 http://ganglia.sourceforge.net
*2 http://www.netsonde.com
*3 http://pikt.org
*4 http://pica.sourceforge.net/wtf.html

Ganglia (Univ. California)
- gmond (on each node): all nodes `multicast' their system status information to one another, so each node holds the current status of all nodes; this gives good redundancy and robustness. The authors declare that it works on ~1000 nodes.
- Meta-daemon (web server): stores the volatile gmond data in a round-robin database and presents an XML image of the activity of all nodes.
- Web interface.

plantir (network adaptation)
- Quick understanding of the system status from one web page.

Remote execution
- Administrators sometimes need to issue a command to all (or part of the) nodes urgently.
- Remote execution can be done with rsh/ssh/pikt/cfengine/SUT (mpich)*/gexec/...
- The important points are:
  - making it easy to see the execution result (failure or success) at a glance;
  - parallel execution across the nodes; otherwise, if each node takes 1 second, 1000 nodes take 1000 seconds.
*) Scalable Unix Tools for clusters: http://www-unix.mcs.anl.gov/sut/

WANI
- A web-based remote command executor.
- Easy to select the nodes concerned.
- Easy to specify a script, or to type in command lines, to be executed on the nodes.
- Issues the commands to the nodes in parallel.
- Collects the results, with error/failure detection.
- Currently the software is a prototype built from combinations of existing protocols and tools. (Anyway, it works!)

WANI is implemented on the `Webmin' GUI
[Screenshot: the start page, with node selection and command input.]

Command execution results
[Screenshot: results from 200 nodes on one page, one cell per host name; clicking a cell switches to another page showing that node's stdout and stderr output.]

Error detection
- The frame color of each cell shows progress: white = initial, yellow = command started, black = finished.
- The background color flags problems, detected by four checks: (1) the exit code, (2) a case-insensitive grep for the words "fail"/"error", (3) a check against the sys_errlist[] (perror) message list, (4) a check against the output of `strings /bin/sh`.
[Diagram: the web browser talks to the Webmin server; the command is sent through the PIKT server (piktc) to piktc_svc on the node hosts for execution; the results come back through lpr/lpd, where a print filter acts as the error detector and produces the error-marked result pages shown in the browser.]

Summary
I reviewed administration tools that can be used for a ~1000-node Linux PC cluster.
- Installation: dolly+ can install, update or switch more than 100 hosts very quickly.
- Configuration managers are not mature yet, but we can expect a lot from DataGrid research.
- Status monitoring: several good packages already exist, at the cost of extra daemons and network traffic.
- Remote command execution: having `the result at a glance' is important for quick iteration, and parallel execution is important.
Some programs and links are, or will be, at http://corvus.kek.jp/~manabe
Thank you for listening.

Synchronization time with rsync
[Chart: elapsed time (sec) vs. aggregate size of modified files (MB) for rsync over a tree of 4096 directories and 43,680 files of ~20 kB each (1.06 GB total); measurements for a total of 1 GB (~20 kB/file) and 2 GB (~50 kB/file); fitted with y = sum_n a_n x^n, a0 ~ 0.87, a1 ~ 0.42, |r| ~ 0.997.]
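To make the `parallel execution' and `result at a glance' points concrete, the sketch below runs one command on many nodes in parallel over ssh and applies checks in the same spirit as WANI (exit code plus a case-insensitive search for "fail"/"error" words). The host names, ssh options and thread count are assumptions, and this is not the WANI implementation itself.

    # Minimal parallel remote execution with WANI-style result checks (a sketch,
    # not WANI itself).  Assumes password-less ssh to the listed hosts.
    import re, subprocess
    from concurrent.futures import ThreadPoolExecutor

    HOSTS = ["n%03d.kek.jp" % i for i in range(1, 21)]   # assumed node names
    COMMAND = "uptime"                                   # example command

    def run_on(host):
        p = subprocess.run(["ssh", "-o", "ConnectTimeout=5", host, COMMAND],
                           capture_output=True, text=True)
        out = p.stdout + p.stderr
        # Same spirit as WANI's checks: non-zero exit code or "fail"/"error" words.
        bad = p.returncode != 0 or re.search(r"fail|error", out, re.IGNORECASE)
        return host, ("NG" if bad else "OK"), out.strip()

    with ThreadPoolExecutor(max_workers=20) as pool:     # parallel, not one node at a time
        for host, status, out in pool.map(run_on, HOSTS):
            print("%-14s %s  %s" % (host, status, out.splitlines()[0] if out else ""))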