Administration Tools for Managing Large-Scale Linux Clusters
CRC, KEK, Japan
S. Kawabata, A. Manabe
atsushi.manabe@kek.jp
Linux PC Clusters in KEK
PC Cluster 1: Pentium III Xeon 500 MHz, 144 CPUs (36 nodes)
PC Cluster 2: Pentium III 800 MHz, 80 CPUs (40 nodes)
PC Cluster 3 (Belle): Pentium III Xeon 700 MHz, 320 CPUs (80 nodes)
PC Cluster 4 (neutron simulation): Fujitsu TS225, 50 nodes
  Pentium III 1 GHz x 2 CPUs, 512 MB memory, 31 GB disk
  100Base-TX x 2, RS-232C x 2, 1U rack-mount model
  remote BIOS setting, remote reset/power-off
PC Cluster 5 (Belle): 1U servers, Pentium III 1.2 GHz, 256 CPUs (128 nodes)
PC Cluster 6: 3U blade server, LP Pentium III 700 MHz, 40 CPUs (40 nodes)
PC clusters
More than 400 nodes (>800 CPUs) of Linux PC clusters have already been installed.
Only middle-sized and larger PC clusters are counted.
A major experiment group (Belle) plans to install several hundred nodes of blade servers this year.
All PC clusters are managed by the individual user groups themselves.
Center Machine (KEK CRC)
Currently, the machines in the KEK Computer Center (CRC) are UNIX (Solaris, AIX) servers.
We plan to have a Linux computing cluster of more than 1000 nodes in the near future (~2004).
It will be installed under a `~4-year rental' contract (hardware update every 2 years?).
Center Machine
The system will be shared among many user groups (not dedicated to only one group).
Their demand for CPU power varies from month to month (high demand before an international conference, and so on).
Of course, we use a load-balancing batch system.
Big groups use their own software frameworks; their jobs run only under certain restricted versions of the OS (Linux), middleware, and configuration.
R&D system
Frequent changes of system configuration / CPU partitioning.
To manage a PC cluster of this size under such user requests, we need sophisticated administration tools.
Necessary admin. tools
System (SW) installation/update
Configuration
Status monitoring / system health check
Command execution
Installation tool
Installation tool
Two types of `installation tool':
Disk cloning
Application package installer
(the system (kernel) is an application in this sense)
Installation tool (cloning)
Image cloning:
Install the system/applications on a `master host'.
Copy the disk partition image to the nodes (sketched below).
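
To make the idea concrete, here is a minimal sketch of naive one-to-one partition cloning with standard tools (dd and netcat), not dolly+ itself; the host name, port, and device are illustrative only.

  # on the receiving node: listen and write the incoming image to disk
  % nc -l -p 9000 | dd of=/dev/hda1 bs=4M

  # on the master host: send the partition image to that node
  % dd if=/dev/hda1 bs=4M | nc node01 9000

Copying this way to N nodes, one by one or N times in parallel, is exactly what creates the server bottleneck discussed later.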
Installation tool (package installer)
(Diagram: clients send a request to the package server; the server consults a package information DB and delivers the package image and control data from the package archive.)
Remote Installation via NW
Cloning disk image:
SystemImager (VA) http://systemimager.sourceforge.net/
CATS-i (Soongsil Univ.)
CloneIt http://www.ferzkopp.net/Software/CloneIt/
Commercial: ImageCast, Ghost, ...
Packages/applications installation:
Kickstart + rpm (Red Hat)
LUI (IBM) http://oss.software.ibm.com/developerworks/projects/lui
Lucie (TiTech) http://matsu-www.is.titech.ac.jp/~takamiya/lucie/
LCFGng, Arusha
(All but the commercial ones are public domain software.)
Dolly+
We developed an `image cloning via network' installer, `dolly+'.
WHY ANOTHER?
We install/update possibly frequently (according to user needs), on 100-1000 nodes simultaneously.
Making packages for our own software is tedious.
Traditional server/client-type software suffers from a server bottleneck.
Multicast copying of a ~GB image seems unstable. (No free software?)
(Few) Server - (Many) Client model
+ The server can be a daemon process (you don't need to start it by hand).
- Performance does not scale with the number of nodes: server bottleneck, network congestion.
Multicasting or Broadcasting
+ No server bottleneck.
+ Gets the maximum performance out of a network whose switch fabrics support multicasting.
+ A single node failure does not affect the whole process very much, so it can be robust.
- A failed node needs a re-transfer, so speed is governed by the slowest node, as in a RING topology.
- Uses UDP rather than TCP, so the application must take care of transfer reliability.
Dolly and Dolly+
Dolly
A Linux application to copy/clone files and/or disk images among many PCs through a network.
Originally developed by the CoPs project at ETH (Switzerland); open software.
Dolly+ features
Transfers sequential files (no 2 GB size limitation) and/or normal files (optional: decompress and untar on the fly) via a TCP/IP network.
Virtual RING connection topology to cope with the server bottleneck problem.
Pipelining and multi-threading for speed-up.
Fail-recovery mechanism for robust operation.
Dolly: Virtual Ring Topology
(Diagram: master = host having the original image; node PCs connected through network hub/switches; physical vs. logical (virtual) connections.)
The physical network connection can be whatever you like.
Logically, Dolly makes a ring chain of the nodes, specified in Dolly's config file, and sends the data node by node, bucket-relay style (see the sketch below).
Although each transfer is only between two adjacent nodes, this exploits the maximum performance of a switching network with full-duplex ports.
Good for a network complex of many switches.
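
What each intermediate node does can be sketched with standard tools (nc, tee, and dd, under bash); this only illustrates the bucket-relay idea, not dolly+ itself, and the port and downstream host name n002 are made up.

  # receive the image from the upstream node, write it to the local
  # disk, and simultaneously forward it to the next node in the ring
  % nc -l -p 9000 | tee >(nc n002 9000) | dd of=/dev/hda1 bs=4M

Because every node both consumes and forwards the stream, all full-duplex switch ports work at once instead of only the server's single uplink.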
Cascade Topology
The server bottleneck can be overcome.
Cannot reach the maximum network performance, but better than a many-clients-to-one-server topology.
Weak against a node failure: a failure spreads down the cascade as well, and is difficult to recover from.
PIPELINING & multi-threading
(Diagram: the file is cut into 4 MB chunks numbered from BOF to EOF; on each node 3 threads run in parallel, so that receiving a chunk from the network, writing a chunk to disk, and sending a chunk on to the next node all overlap as the chunks flow from the server through node 1, node 2, and so on.)
Performance of dolly+
(Plot: elapsed time (min) for cloning vs. number of nodes, from 1 to 500 hosts, for total 2 GB and 4 GB disk image cloning; measured points plus the expected curve. Measured on Fujitsu TS225: Pentium III 1 GHz x 2, SCSI disk, 512 MB memory, 100Base-T network; 4 MB chunk size, ~10 MB/s transfer speed.)
Less than 5 min for 100 nodes (expected)!
Dolly+ transfer speed scalability with size of image
(Plot: transferred bytes (MB) vs. elapsed time (sec) for one server feeding 1, 2, 7, and 10 nodes, including setup time. PC hardware spec, server & nodes: 1 GHz Pentium III x 2, IDE-ATA/100 disk, 100BASE-TX network, 256 MB memory.)

Configuration          Elapsed time   Speed
1 server - 1 node      230 sec        8.2 MB/s
1 server - 2 nodes     252 sec        7.4 MB/s x 2
1 server - 7 nodes     266 sec        7.0 MB/s x 7
1 server - 10 nodes    260 sec        7.2 MB/s x 10
Fail recovery mechanism
A single node failure could be a "show stopper" in a RING (= series connection) topology.
Dolly+ provides an automatic `short cut' mechanism against node trouble.
When a node fails, the upstream node detects it through a send timeout.
The upstream node then negotiates with the downstream node to reconnect and re-transfer the file chunk.
The RING topology makes this easy to implement.
(Diagram: on timeout, the failed node is short-cut out of the ring.)
Re-transfer in short cutting
(Diagram: the same pipelined flow of 4 MB chunks as before; after a short cut, the chunk that was in flight to the failed node is re-sent to the next surviving node.)
This works even with a sequential file.
Dolly+: how you start it on Linux
Server side (which has the original file):
% dollyS [-v] -f config_file
Node side:
% dollyC [-v]
Config file example:

  iofiles 3
  /dev/hda1 > /tmp/dev/hda1
  /data/file.gz >> /data/file
  boot.tar.Z >> /boot
  server n000.kek.jp
  firstclient n001.kek.jp
  lastclient n020.kek.jp
  client 20
  n001
  n002
  :
  n020
  endconfig

Here `iofiles' gives the number of files to transfer, `server' names the master, `client' gives the number of client nodes followed by their names, and `endconfig' closes the file. The left side of `>' is the input file on the server; the right side is the output file on the clients. `>' means dolly+ does not modify the image; `>>' indicates that dolly+ should cook the file (decompress, untar, ...) according to its file name.
How does dolly+ clone the system after booting?
Nodes broadcast over the LAN in search of an installation server (Pre-eXecution Environment, PXE).
The PXE/DHCP server responds to the nodes with their IP addresses and the kernel download server.
The kernel and a `ram disk image' are multicast-TFTP'ed to the nodes, and the kernel starts.
The kernel hands off to an installation script which runs a disk tool and `dolly+'.
(The scripts and applications are in the ram disk image.)
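
For reference, the DHCP side of such a PXE setup usually boils down to a few lines like the following (ISC dhcpd syntax); the addresses and the boot-loader file name are illustrative assumptions, not our actual configuration.

  subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.100 192.168.1.200;
    next-server 192.168.1.1;      # TFTP server holding kernel/ramdisk
    filename "pxelinux.0";        # boot loader the PXE client downloads
  }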
How does dolly+ start after rebooting?
The code partitions the hard drive, creates file systems, and starts the `dolly+' client on the node.
You start the `dolly+' master on the master host to begin the disk cloning process.
The code then configures node-unique information such as the host name and IP address from the DHCP information.
The node is then ready to boot from its hard drive for the first time.
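
A hypothetical sketch of what such a ramdisk installation script does; the partition map and device names are made up for illustration.

  #!/bin/sh
  sfdisk /dev/hda < partition.map   # partition the hard drive
  mkfs -t ext2 /dev/hda1            # create file systems
  mkswap /dev/hda2
  dollyC                            # wait for the image from the ring master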
PXE Trouble
BY THE WAY, we sometimes suffered PXE mtftp transfer failures when more than 20 nodes booted simultaneously.
If you have the same trouble, please mail me.
We have started rewriting the mtftp client code of the Red Hat Linux PXE server.
Configuration
(Sub)system Configuration
Linux (UNIX) has a lot of configuration files for configuring its sub-systems. With 1000 nodes, you have to manage (many) x 1000 config files.
To manage them, there are three types of solution:
A centralized information service server (like NIS).
Needs support from each sub-system (nsswitch).
Automatic remote editing of the raw config files (like cfengine; see the sketch below).
Must take care of each node's files separately.
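
A naive shell rendering of "automatic remote editing of raw config files", the kind of thing cfengine automates; the node list, file, and server line are illustrative only.

  # append an NTP server line on every node unless it is already there
  for n in $(cat nodelist); do
    ssh "$n" 'grep -q "^server ntp.kek.jp" /etc/ntp.conf \
              || echo "server ntp.kek.jp" >> /etc/ntp.conf'
  done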
Configuration -- a new proposal from CS
Program (configure) the whole system from a source code in an OO way:
Systematic and uniform configuration.
Source reuse (inheritance) as much as possible.
Templates.
Overriding of another site's configuration.
Arusha (http://ark.sourceforge.net)
LCFGng (http://www.lcfg.org)
LCFGng (Univ. of Edinburgh)
(Diagram: the administrator compiles a new configuration profile on the server; the server notifies the nodes; each node fetches its new profile, acknowledges, and then generates configuration files and executes control commands.)
LCFGng
Good things
The author says that it works on ~1000 nodes.
Fully automatic (you just edit the source code and compile it on one host).
Differences between sub-systems are hidden from the user (administrator), or rather moved into `components' (DB -> actual config file).
LCFGng
The configuration language is too primitive:
Hostname.Component.Parameter Value
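
For illustration, a couple of hypothetical resource lines in that form (the component names and values are invented, not real LCFG components):

  n001.auth.users      manabe
  n001.dns.servers     192.168.1.1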
There are not so many components, so you must write your own component scripts for each sub-system yourself;
it is far easier to write the config file itself than to write a component.
The activation timing of a config change cannot be controlled.
Status monitoring
Status Monitoring
System state monitoring (CPU/memory/disk/network utilization):
Ganglia*1, Palantir*2
(Sub-)system service sanity check:
PIKT*3, PICA*4, cfengine
*1 http://ganglia.sourceforge.net  *2 http://www.netsonde.com
*3 http://pikt.org  *4 http://pica.sourceforge.net/wtf.html
Ganglia (Univ. of California)
gmond (on each node):
All nodes `multicast' their system status information to each other, so every node holds the current status of all nodes -> good redundancy, robust.
Declared to work on ~1000 nodes.
Meta-daemon (on the web server):
Stores the volatile gmond data in a round-robin DB and presents an XML image of all nodes' activity.
Web interface.
Palantir (network adaptation)
Quick understanding of the system status from one web page.
Remote Execution
Remote execution
An administrator sometimes needs to issue a command to all (or part of the) nodes urgently.
Remote execution can be done with rsh/ssh/PIKT/cfengine/SUT (mpich)*/gexec, ...
The points are (see the sketch below):
Make it easy to see the execution result (failure or success) at a glance.
Execute in parallel among the nodes; otherwise, if it takes 1 sec per node, that is 1000 sec for 1000 nodes.
*) Scalable Unix Tools for clusters: http://www-unix.mcs.anl.gov/sut/
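
A minimal sketch of parallel remote execution with per-node result collection, using plain ssh rather than WANI; the node names are illustrative.

  #!/bin/sh
  # run the given command on every node concurrently, capture each
  # node's output, and report OK/FAIL from the ssh exit status
  for n in n001 n002 n003; do        # ... up to n020
    ( ssh -o ConnectTimeout=5 "$n" "$@" > "/tmp/out.$n" 2>&1 \
        && echo "$n: OK" || echo "$n: FAIL" ) &
  done
  wait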
WANI
A web-based remote command executor.
Easy to select the nodes concerned.
Easy to specify a script, or to type in command lines, to execute on the nodes.
Issues the commands to the nodes in parallel.
Collects the results, with error/failure detection.
Currently the software is a prototype built from combinations of existing protocols and tools. (Anyway, it works!)
WANI is implemented on the `Webmin' GUI
(Screenshot: the start page, with command input and node selection.)
Command execution result
(Screenshot: switching to the result page; host names with results from 200 nodes on one page.)
Error detection
The frame color represents the state: white = initial, yellow = command started, black = finished.
The background color encodes error detection, by four checks:
1. Exit code.
2. `grep -i` for "fail/error" words.
3. Check against the sys_errlist[] (perror) message list.
4. Check against `strings /bin/sh` output.
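
For instance, check 2 amounts to something like the following, applied to each node's captured output (the file name is illustrative):

  % grep -qi -e fail -e error /tmp/out.n001 && echo "n001: possible failure"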
(Screenshot: clicking a node's cell opens its stdout and stderr output pages.)
(Diagram: WANI prototype data flow. The web browser talks to the Webmin server; the command is issued through the PIKT server (piktc) to piktc_svc on the node hosts; the results come back through lpr/lpd, whose print_filter acts as the error detector and produces the error-marked result pages.)
Summary
I reviewed admin tools which can be used for a ~1000-node Linux PC cluster.
Installation: dolly+.
Installs/updates/switches hosts on >100 nodes very quickly.
Configuration managers:
Not matured yet, but we can expect a lot from DataGrid research.
Status monitoring:
Several good pieces of software already exist, at the cost of extra daemons and network traffic.
Summary
Remote command execution:
`Results at a glance' is important for quick iteration.
Parallel execution is important.
Some programs and links are / will be at http://corvus.kek.jp/~manabe
Thank you for listening.
Synchronizing time by rsync
(Plot: elapsed time (sec) vs. aggregate size of modified files (MB), for a tree with 4096 directories, 43680 files of ~20 kB each, 1.06 GB total, and a second data set of 2 GB total with ~50 kB/file; fitted curve y = Σ aₙxⁿ with a0 = 8.68524263e-01, a1 = 4.24465056e-01, 2.04576224e+00, |r| = 9.97385098e-01.)