Sean Brisbane
Particle Physics Systems Administrator
Room 661
Tel: 73389
s.brisbane1@physics.ox.ac.uk
Graduate Lectures, 14th October 2014
Strategy
Local Cluster Overview
Connecting to it
Grid Cluster
Computer Rooms
How to get help
Particle Physics Strategy
The Server / Desktop Divide
[Diagram: server-side systems (virtual machine hosts, general purpose Unix servers, Linux file servers, Linux worker nodes, group DAQ systems, web server, NIS server, torque server) and desktop systems (Windows 7 PCs, Ubuntu PCs, Linux desktops).]
Approximately 200 desktop PCs, using Exceed, PuTTY or ssh/X windows to access the PP Linux systems.
Storage system:
[Table: recommended storage for each type of client, served by the Windows server, the central Linux fileserver and the PP fileserver.]
Windows clients: Windows storage on the "H:\" drive (also visible as "Y:\home"); PP storage under Y:/LinuxUsers/pplinux/data/home; central Linux home under Y:/LinuxUsers/home/particle.
Ubuntu desktops: the /home folder (central Linux home areas such as /network/home/particle and /physics/home).
PP Linux systems: /home and /data (PP Linux folders /data/home and /data/experiment).
Unix Team (Room 661):
Pete Gronbech - Senior Systems Manager and GridPP Project Manager
Ewan MacMahon – Grid Systems Administrator
Kashif Mohammad – Grid and Local Support
Sean Brisbane – Local Server and User Support
General purpose interactive Linux systems are provided for code development, short tests and access to Linux based office applications. These are accessed remotely.
Batch queues are provided for longer and more intensive jobs (a minimal job submission sketch follows below). They are provisioned to meet peak demand and to give a fast turnaround for final analysis.
Systems run Scientific Linux (SL), a free Red Hat Enterprise based distribution.
The Grid & CERN have migrated to SL6. The majority of the local cluster is also on SL6, but some legacy SL5 systems are provided for those that need them.
We will be able to offer you the most help running your code on the newer SL6, although some experimental software frameworks still require SL5.
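As a rough illustration of the batch workflow (the requested walltime, the executable name and the data path are assumptions, not site defaults), a Torque job script and its submission from an interactive node might look like this:

#!/bin/bash
#PBS -l nodes=1:ppn=1          # request a single core
#PBS -l walltime=04:00:00      # requested run time (assumed value)
cd $PBS_O_WORKDIR              # start in the directory the job was submitted from
./my_analysis /data/mygroup/input.root   # hypothetical executable and data path

qsub myjob.sh     # submit from an interactive node
qstat -u $USER    # check the status of your jobs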
Particle Physics Local Batch cluster
Oxford's Tier 2 Grid cluster
PP Linux Batch Farm – Scientific Linux 6
Users log in to the interactive nodes pplxint8 and pplxint9. The home directories and all the data disks (the /home area and /data/group) are shared across the cluster and are visible on the interactive machines and on all the batch system worker nodes.
Approximately 300 cores (430 including JAI/LWFA), each with 4 GB of RAM.
The /home area is where you should keep your important text files such as source code, papers and your thesis.
The /data/ area is where you should put your big, reproducible input and output data.
[Diagram: batch farm worker nodes]
jailxwn01 and jailxwn02: 64 AMD cores each.
pplxwn59 and pplxwn60: 16 Intel cores each.
pplxwn38, pplxwn41 and similar nodes: 16 Intel 2650 cores each.
pplxwn31, pplxwn32 and similar nodes: 12 Intel 5650 cores each.
pplxwn15 and pplxwn16: 8 Intel 5420 cores each.
Interactive login nodes: pplxint8 and pplxint9.
PP Linux Batch Farm – Scientific Linux 5
Legacy SL5 jobs are supported by a smaller selection of worker nodes: currently eight servers with 16 cores each and 4 GB of RAM per core.
All of your files are available from both SL5 and SL6, but the software environment differs, so code compiled for one operating system may not run on the other.
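To check which release a given node is running before compiling or submitting, a simple check that works on both is:

cat /etc/redhat-release    # e.g. "Scientific Linux release 6.x" on the SL6 nodes, "release 5.x" on the SL5 nodes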
[Diagram: SL5 worker nodes]
pplxwn23 to pplxwn30: 16 AMD 6128 cores each.
Interactive login nodes: pplxint5 and pplxint6.
PP Linux Batch Farm – Data Storage
[Diagram: NFS file servers (pplxfsn) of 40 TB, 40 TB, 30 TB and 19 TB, holding the data areas and the home areas.]
NFS is used to export data to the smaller experimental groups, where the partition size is less than the total size of a server.
The data areas are too big to be backed up. The servers have dual redundant PSUs, RAID 6 and run on uninterruptible power supplies. This safeguards against hardware failures, but does not help if you delete files.
The home areas are backed up nightly by two different systems: the Oxford ITS HFS service and a local backup system. If you delete a file, tell us as soon as you can, along with when you deleted it and its full name.
The latest nightly backup of any lost or deleted files from your home directory is available at the read-only location /data/homebackup/{username}.
The home areas have quotas; if you require more space, ask us.
Store your thesis on /home, NOT on /data.
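To see how much space you are using, something along these lines should work (quota reporting over NFS depends on the server setup, so treat the first command as a sketch; du always works):

quota -s                      # summary of your home-area quota, if the fileserver reports it
du -sh ~                      # total size of your home directory
du -sh /data/mygroup/$USER    # hypothetical per-user data directory; adjust to your group's layout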
Particle Physics Computing
The Lustre file system is used to group multiple file servers together to provide extremely large, continuous storage areas.
[Diagram: SL5 and SL6 nodes (pplxint5, pplxint8) mounting the Lustre object storage servers OSS01–OSS04 (18 TB, 18 TB, 44 TB and 44 TB).]

df -h /data/atlas
Filesystem             Size  Used Avail Use% Mounted on
/lustre/atlas25/atlas  366T  199T  150T  58% /data/atlas

df -h /data/lhcb
Filesystem             Size  Used Avail Use% Mounted on
/lhcb25                118T   79T   34T  71% /data/lhcb25
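Lustre also has its own reporting tools; assuming the standard Lustre client utilities (lfs) are available on the interactive nodes, for example:

lfs df -h /data/atlas             # usage broken down by object storage target
lfs quota -u $USER /data/atlas    # your own usage, if per-user quotas are enabled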
Use a strong password, not one open to dictionary attack!
fred123 – no good
Uaspnotda!09 – much better
It is more convenient to use ssh with a passphrased key stored on your desktop; once set up, you no longer type your password for every connection (see the sketch below).
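On a Linux desktop, for example, the key pair can be created and installed roughly as follows (the key type and the full hostname are assumptions; Windows users should use PuTTYgen and Pageant as shown in the backup slides):

ssh-keygen -t rsa -b 4096                         # choose a strong passphrase when prompted
ssh-copy-id username@pplxint8.physics.ox.ac.uk    # appends your public key to ~/.ssh/authorized_keys
ssh username@pplxint8.physics.ox.ac.uk            # now asks for the key passphrase rather than your password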
Question: how many of you are using Windows, and how many Linux, on the desktop?
Demo:
1. Plain ssh terminal connection, from 'outside of physics'.
2. ssh with X windows tunnelled to passive Exceed, from the office (no password).
3. ssh, X windows tunnel, passive Exceed, KDE session.
4. Passwordless access from 'outside physics' – see the backup slides.
http://www2.physics.ox.ac.uk/it-services/ppunix/ppunix-cluster
http://www.howtoforge.com/ssh_key_based_logins_putty
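For reference, the basic commands behind items 1 and 2 are roughly as follows (the username and full hostname are placeholders):

ssh username@pplxint8.physics.ox.ac.uk       # 1. plain terminal connection
ssh -X username@pplxint8.physics.ox.ac.uk    # 2./3. with X11 forwarding for graphical programs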
Oxford
RAL PPD
Cambridge
Birmingham
Bristol
Sussex
JET at Culham
Current capacity
Compute Servers
Twin and twin squared nodes
– 1770 CPU cores
Storage
Total of ~1300TB
The servers have between 12 and 36 disks; the more recent ones are 4 TB each. These use hardware RAID and UPS to provide resilience.
Apply for your grid certificate at http://www.ngs.ac.uk/ukca (this uses a JAVA based CERT WIZARD).
You must remember to use the same PC to request and retrieve the Grid Certificate.
Then send an email along these lines:
To: help@it.ox.ac.uk
Dear Stuart Robeson and Jackie Hewitt,
Please arrange to approve my grid certificate request.
Thanks.
# Convert the downloaded grid certificate (mycert.p12) for use with the grid tools, under ~/.globus:
chmod 700 .globus
cd .globus
openssl pkcs12 -in ../mycert.p12 -clcerts -nokeys -out usercert.pem    # extract the certificate
openssl pkcs12 -in ../mycert.p12 -nocerts -out userkey.pem             # extract the private key
chmod 400 userkey.pem      # the key must be readable only by you
chmod 444 usercert.pem
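To sanity-check the converted certificate, a standard openssl query (not specific to the grid tools) is:

openssl x509 -in ~/.globus/usercert.pem -noout -subject -dates    # shows who the certificate identifies and when it expires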
Now join a Virtual Organisation (VO), such as "Atlas", so that:
you are allowed to submit jobs using the infrastructure of the experiment;
you can access data for the experiment.
Speak to your colleagues on the experiment about this. It is a different process for every experiment!
Your grid certificate identifies you to the grid as an individual user, but it's not enough on its own to allow you to run jobs; you also need to join a Virtual Organisation (VO).
These are essentially just user groups, typically one per experiment, and individual grid sites can choose to support (or not) work by users of a particular VO.
Most sites support the four LHC VOs; fewer support the smaller experiments.
The sign-up procedures vary from VO to VO: UK ones typically require a manual approval step, and LHC ones require an active CERN account.
For anyone that's interested in using the grid, but is not working on an experiment with an existing VO, we have a local VO we can use to get you started.
Test your grid certificate:
> voms-proxy-init --voms lhcb.cern.ch
Enter GRID pass phrase:
Your identity: /C=UK/O=eScience/OU=Oxford/L=OeSC/CN=j bloggs
Creating temporary proxy ........................ Done
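Once the proxy exists you can inspect it with the companion tool from the same VOMS client suite, for example:

voms-proxy-info --all    # shows the remaining proxy lifetime and your VO attributes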
Consult the documentation provided by your experiment for ‘their’ way to submit and manage grid jobs
The new computer room at Begbroke Science Park, built jointly for the Oxford Super Computer and the Physics department, provides space for 55 computer racks of 11 kW each, 22 of which will be for Physics. Up to a third of these can be used for the Tier 2 centre. This £1.5M project was funded by SRIF and a contribution of ~£200K from Oxford Physics.
The room was ready in December 2007, and the Oxford Tier 2 Grid cluster was moved there during spring 2008. All new Physics high performance clusters will be installed here.
Completely separate from Begbroke Science Park, a local Physics department infrastructure computer room with 100 kW of cooling and >200 kW of power has been built, funded by ~£150K of Oxford Physics money. It was completed in September 2007.
This allowed local computer rooms to be refurbished as offices again, and racks that were in unsuitable locations to be re-housed.
Cold aisle containment
Oxford Advanced Research Computing
A shared cluster of CPU nodes, "just" like the local cluster here.
GPU nodes:
– faster for 'fitting', toy studies and MC generation
– but only if the code is written in a way that supports them.
Moderate disk space allowance per experiment (<5 TB).
http://www.arc.ox.ac.uk/content/getting-started
Emerald: a huge farm of GPUs.
http://www.cfi.ses.ac.uk/emerald/
Both need a separate account and project.
Come and talk to us in Room 661.
Now for some more details on using the clusters.
Help Pages
http://www.physics.ox.ac.uk/it/unix/default.htm
http://www2.physics.ox.ac.uk/research/particlephysics/particle-physics-computer-support
ARC
http://www.arc.ox.ac.uk/content/getting-started
pp_unix_admin@physics.ox.ac.uk
Use PuTTYgen to create an ssh key on Windows (previous slide, point #4):
- Enter a strong passphrase.
- Save the private part of the key to a subdirectory of your local drive.
- Paste the public part of the key into ~/.ssh/authorized_keys on pplxint.
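If you prefer to add the key by hand on the Linux side, the public key text from PuTTYgen can be appended like this (file permissions matter, or ssh will refuse to use the file):

mkdir -p ~/.ssh && chmod 700 ~/.ssh
cat >> ~/.ssh/authorized_keys      # paste the public key text, then press Ctrl-D
chmod 600 ~/.ssh/authorized_keys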
Run Pageant once after login: right-click on the Pageant icon in the system tray and choose "Add Key" to load your private (Windows) ssh key.