Liverpool HEP - Site Report
June 2008
Robert Fay, John Bland
Staff Status
One member of staff left in the past year:
• Paul Trepka, left March 2008
Two full-time HEP system administrators
• John Bland, Robert Fay
One full-time Grid administrator currently being hired
• Closing date for applications was Friday 13th; 15 applications received
One part-time hardware technician
• Dave Muskett
Current Hardware
Desktops
• ~100 Desktops: Scientific Linux 4.3, Windows XP
• Minimum spec of 2GHz x86, 1GB RAM + TFT Monitor
Laptops
• ~60 Laptops: Mixed architectures, specs and OSes.
Batch Farm
• Software repository (0.7TB), storage (1.3TB)
• Old ‘batch’ queue has 10 SL3 dual 800MHz P3s with 1GB RAM
• ‘medium’, ‘short’ queues consist of 40 SL4 MAP-2 nodes (3GHz P4s)
• 5 interactive nodes (dual Xeon 2.4GHz)
• Using Torque/PBS
• Used for general analysis jobs
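To illustrate how jobs reach these queues, here is a minimal Python sketch wrapping Torque/PBS qsub (the job script name is hypothetical; the queue name is the 'medium' queue above):

import subprocess

def submit_job(script, queue="medium", walltime="01:00:00"):
    # Submit a job script to a Torque/PBS queue; qsub prints the job ID on success
    result = subprocess.run(
        ["qsub", "-q", queue, "-l", "walltime=" + walltime, script],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(submit_job("analysis_job.sh"))  # hypothetical analysis job script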
Current Hardware – continued
Matrix
• 1 dual 2.40GHz Xeon, 1GB RAM
• 6TB RAID array
• Used for CDF batch analysis and data storage
HEP Servers
• 4 core servers
• User file store + bulk storage via NFS (Samba front end for Windows)
• Web (Apache), email (Sendmail) and database (MySQL)
• User authentication via NIS (+Samba for Windows)
• Dual Xeon 2.40GHz shell server and ssh server
• Core servers have a failover spare
Current Hardware – continued
LCG Servers
• CE, SE upgraded to new hardware:
• CE now 8-core Xeon 2GHz, 8GB RAM
• SE now 4-core Xeon 2.33GHz, 8GB RAM, RAID10 array
• CE, SE, UI all SL4, gLite 3.1
• MON still SL3, gLite 3.0
• BDII SL4, gLite 3.0
Current Hardware – continued
MAP2 Cluster
• 24 rack (960 node) (Dell PowerEdge 650) cluster
• 4 racks (280 nodes) shared with other departments
• Each node has 3GHz P4, 1GB RAM, 120GB local storage
• 19 racks (680 nodes) primarily for LCG jobs (5 racks currently allocated for local ATLAS/T2K/Cockcroft batch processing)
• 1 rack (40 nodes) for general purpose local batch processing
• Front end machines for ATLAS, T2K, Cockcroft
• Each rack has two 24 port gigabit switches
• All racks connected into VLANs via Force10 managed switch
Storage
RAID
• All file stores use at least RAID5; newer servers use RAID6.
• All RAID arrays use 3ware 7xxx/9xxx controllers on Scientific Linux 4.3.
• Arrays monitored with 3ware 3DM2 software.
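Alongside 3DM2, array state can also be polled from scripts with the 3ware tw_cli utility; a minimal sketch, assuming controller /c0 and the usual unit-line layout of tw_cli output:

import subprocess

def degraded_units(controller="/c0"):
    # Parse `tw_cli /c0 show`; unit lines look roughly like: u0 RAID-6 OK ...
    out = subprocess.run(["tw_cli", controller, "show"],
                         capture_output=True, text=True, check=True).stdout
    bad = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].startswith("u"):
            if fields[2] not in ("OK", "VERIFYING", "INITIALIZING"):
                bad.append((fields[0], fields[2]))
    return bad

for unit, state in degraded_units():
    print("WARNING: unit %s is %s" % (unit, state))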
File stores
• New User and critical software store, RAID6+HS, 2.25TB
• ~10TB of general purpose ‘hepstores’ for bulk storage
• 1.4TB batchstore + 0.7TB batchsoft for the batch farm cluster
• 1.4TB hepdata for backups
• 37TB RAID6 for LCG storage element
Storage (continued)
3ware Problems!
• 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
• 3w-9xxx: scsi0: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
• 3w-9xxx: scsi0: AEN: ERROR: (0x04:0x005F): Cache synchronization failed; some data lost:unit=0.
• Leads to total loss of data access until the system is rebooted.
• Sometimes leads to data corruption at array level.
• Seen under iozone load, normal production load, and due to drive failure.
• Anyone else seen this?
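As a stopgap while this is investigated, the syslog can be watched for these 3w-9xxx signatures so a node is flagged before data access is lost; a minimal sketch (the log path is the Scientific Linux default, the patterns are taken from the messages above):

import re

# Failure signatures taken from the kernel messages above
PATTERNS = [
    re.compile(r"3w-9xxx:.*Character ioctl.*timed out"),
    re.compile(r"3w-9xxx:.*Microcontroller not ready"),
    re.compile(r"3w-9xxx:.*Cache synchronization failed"),
]

def scan(path="/var/log/messages"):
    # Return syslog lines matching any known 3ware failure signature
    with open(path) as log:
        return [l.rstrip() for l in log if any(p.search(l) for p in PATTERNS)]

for hit in scan():
    print("3ware ALERT:", hit)  # hand off to email/Nagios as appropriate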
Network
Topology
[Diagram: core Force10 gigabit switch connecting MAP2 (2Gb), LCG servers, offices and other servers on separate VLANs (1Gb links); 2Gb uplink through the firewall to the WAN]
Network (continued)
Core Force10 E600 managed switch.
• Now have 450 gigabit ports (240 at line rate)
• Used as central departmental switch, using VLANs
• Increased bandwidth to the WAN to 2-3Gbit/s using link aggregation
• Increased departmental backbone to 2Gbit/s
• Added departmental firewall/gateway
• Network intrusion monitoring with snort (see the sketch after this list)
• Most office PCs and laptops are on an internal private network
• Building network infrastructure is creaking: needs rewiring, and old cheap hubs and switches need replacing
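For the snort monitoring above, one lightweight option is to follow snort's plain-text alert file and forward new entries; a minimal sketch, assuming snort's common default alert path:

import time

def follow(path="/var/log/snort/alert"):
    # Yield lines appended to the snort alert file, tail -f style
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line.rstrip()
            else:
                time.sleep(1.0)

for alert in follow():
    print("snort:", alert)  # e.g. forward to email or Nagios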
Security & Monitoring
Security
• Logwatch (looking to develop filters to reduce ‘noise’; see the sketch below)
• University firewall + local firewall + network monitoring (snort)
• Secure server room with swipe card access
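The kind of noise filter being considered for Logwatch could be prototyped as a pre-filter that drops known-benign lines before they reach the report; a minimal sketch (the ignore patterns are illustrative assumptions, not site policy):

import re
import sys

# Known-benign patterns to suppress (illustrative examples only)
IGNORE = [
    re.compile(r"pam_unix\(cron:session\): session (opened|closed)"),
    re.compile(r"connection from .* closed normally"),
]

for line in sys.stdin:
    if not any(p.search(line) for p in IGNORE):
        sys.stdout.write(line)  # only unexpected lines reach the report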
Monitoring
• Core network traffic usage monitored with ntop and cacti (all traffic to be monitored after the network upgrade)
• Use sysstat on core servers for recording system statistics
• Rolling out system monitoring on all servers and worker nodes, using SNMP, Ganglia, Cacti and Nagios
• Hardware temperature monitors on water-cooled racks, to be supplemented by software monitoring on nodes via SNMP. Still investigating other environment monitoring solutions.
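For the planned per-node temperature readings over SNMP, a minimal polling sketch using the Net-SNMP snmpget command (node names, community string, threshold and the lm_sensors OID are all assumptions):

import subprocess

NODES = ["node001", "node002"]  # hypothetical worker node names
TEMP_OID = ".1.3.6.1.4.1.2021.13.16.2.1.3.1"  # UCD lmSensors temperature (assumed)
THRESHOLD = 60.0  # degrees C (assumed limit)

def read_temp(host, community="public"):
    # -Ovq prints just the value; lmSensors reports millidegrees C
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Ovq", host, TEMP_OID],
        capture_output=True, text=True, check=True).stdout.strip()
    return float(out) / 1000.0

for node in NODES:
    t = read_temp(node)
    if t > THRESHOLD:
        print("%s: %.1fC exceeds %.1fC" % (node, t, THRESHOLD))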
System Management
• Puppet used for configuration management
• Dotproject used for general helpdesk
• RT integrated with Nagios for system management
• Nagios automatically creates/updates tickets on acknowledgement
• Each RT ticket serves as a record for an individual system
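One common way to wire Nagios to RT is a notification command that mails the RT queue address, since RT opens and updates tickets from email; a minimal sketch (addresses, queue and SMTP host are assumptions, not the site's actual setup):

import smtplib
from email.message import EmailMessage

def notify_rt(host, service, state, info,
              rt_addr="systems-rt@example.ac.uk",  # hypothetical RT queue address
              sender="nagios@example.ac.uk"):
    # RT creates a ticket from new mail and threads replies onto the same ticket
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = rt_addr
    msg["Subject"] = "%s/%s is %s" % (host, service, state)
    msg.set_content(info)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

# Nagios would call this from a notification command with macro values:
notify_rt("gridce", "load", "CRITICAL", "CRITICAL - load average: 35.0")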
Plans
Additional storage for the Grid
• GridPP3 funded
• Will be approx. 60TB (?)
• May switch from dCache to DPM
Upgrades to local batch farm
• Plans to purchase several multi-core (most likely 8-core) nodes
Collaboration with local Computing Services Department
• Share of their newly commissioned multi-core cluster available