Dteam Meeting 210910
Pete Gronbech (Minutes)
Sam Skipsy
Duncan Rand
Stuart Purdie
Kashif Mohammad
Wahid Bhimji
Graeme Stewart
Rich Hellier
Stuart Wakefield
Derek Ross
Alessandra Forti
Jeremy Coles
Mingchao Ma
Gareth Smith
David Colling
Atlas:
GS: an interesting week.
The cloud was offline this morning while RAL rerouted the firewall; the LFC was going to be inaccessible for T2s, but the downtime was very short.
There was a problem with file access at RAL over the weekend: a disk server was accidentally put out of production on Castor.
Security: many sites were offline over the weekend; half the grid was down.
Many T2s were turned off as they did not have very good downtime information published in the GOCDB (e.g. Oxford and many others). Manchester came back and were switched on. PG apologised that they had only marked one CE offline and not all of them.
QM and ? also came back over the weekend.
If pilots were not working, the site was turned off in panda.
A RALPP disk server went down and corrupted a lot of files. Lots of DPDs for Higgs were lost.
CB to provide a list of lost files. It is a hard problem to deal with; Glasgow have suffered similar problems before. Need to decide whether it is worth checksumming the files or just declaring them all lost.
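If checksumming is attempted, something along these lines could serve as a starting point. This is only a minimal sketch, assuming an Adler32 comparison against a catalogue dump in a hypothetical one-per-line '<path> <adler32>' format; it is not a procedure agreed at the meeting.

#!/usr/bin/env python3
# Sketch only: verify Adler32 checksums of files on a suspect disk server
# against the values recorded in a catalogue dump.  The dump format
# ("<path> <adler32hex>" per line) and the file locations are assumptions
# made purely for illustration.
import sys
import zlib

def adler32_of(path, blocksize=1024 * 1024):
    """Return the Adler32 checksum of a file as an 8-digit hex string."""
    checksum = 1  # Adler32 seed value
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            checksum = zlib.adler32(block, checksum)
    return "%08x" % (checksum & 0xFFFFFFFF)

def main(dump_file):
    bad = []
    with open(dump_file) as dump:
        for line in dump:
            path, expected = line.split()
            try:
                actual = adler32_of(path)
            except OSError as err:
                bad.append((path, "unreadable: %s" % err))
                continue
            if int(actual, 16) != int(expected, 16):
                bad.append((path, "checksum mismatch"))
    for path, reason in bad:
        print(path, reason)
    print("%d suspect files" % len(bad))

if __name__ == "__main__":
    main(sys.argv[1])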
CMS
The kernel problem took a lot of sites out; it will take the rest of the week to get them upgraded.
Not a major problem for CMS.
LHCb
http://hepwww.rl.ac.uk/nraja/UKUploadProblems/index.html
1. Problem with running jobs at RAL - they have a high failure rate
primarily because of throughput issues with disk servers at the SE.
2. UK Upload problems at Brunel.
Stuart Purdie is only seeing transient problems.
LHCb merge jobs cause high demand on the disk servers at the T1.
There was some issue with VOMS, and T2K had some problems.
GS: a lot of sites are blacklisted, so check in panda.
DR: Atlas production ran over the weekend; there were a lot of failures from transfers to RAL.
GS: they did have 48 hours to transfer before timing out. Can you have a look to see whether there are enough slots in the FTS channel?
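For reference, the current channel settings (including the limit on concurrent files) can be dumped with the gLite FTS client tools. The snippet below is only an illustrative wrapper: the channel name and service endpoint are placeholders, and the -s option and endpoint path should be checked against the local FTS client documentation.

#!/usr/bin/env python3
# Sketch only: print the configuration of an FTS channel so the number of
# concurrent transfer slots can be checked.  Assumes the gLite FTS client
# tools are installed and a valid grid proxy exists.  Both values below are
# placeholders, not values taken from the meeting.
import subprocess

FTS_SERVICE = "https://fts.example.ac.uk:8443/glite-data-transfer-fts/services/ChannelManagement"  # placeholder endpoint
CHANNEL = "STAR-RALLCG2"  # placeholder channel name

# glite-transfer-channel-list prints the channel settings, including the
# limits on concurrent files and streams per file.
subprocess.check_call(["glite-transfer-channel-list", "-s", FTS_SERVICE, CHANNEL])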
Experiment resource usage:
Nothing coming up for Atlas or CMS.
ROC Update:
Kashif gave the handover report. The main issue was sites being down for security, but a lot were not marked down in the GOCDB, which caused alarms to be generated.
CERN problems solved. ??? Kashif
GOCDB entries: what to do if you only shut down some queues? Panda will put the site offline if all CEs are in downtime.
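A quick way for a site to see exactly what GOCDB is publishing for it is the public GOCDB programmatic interface. The sketch below is illustrative only: the endpoint, query parameters and XML element names are assumptions to be checked against the GOCDB PI documentation, and the site name is just an example.

#!/usr/bin/env python3
# Sketch only: list the services of a site that GOCDB currently shows as
# being in downtime.  URL, parameters and XML field names are assumptions.
import urllib.request
import xml.etree.ElementTree as ET

GOCDB_PI = "https://goc.egi.eu/gocdbpi/public/"
SITE = "UKI-SOUTHGRID-OX-HEP"  # example site name

url = "%s?method=get_downtime&topentity=%s&ongoing_only=yes" % (GOCDB_PI, SITE)
with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

for downtime in tree.getroot().findall("DOWNTIME"):
    host = downtime.findtext("HOSTNAME", default="?")
    service = downtime.findtext("SERVICE_TYPE", default="?")
    severity = downtime.findtext("SEVERITY", default="?")
    print(host, service, severity)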
RAL T1: the site has upgraded the firmware on the firewall, and this was completed within the window of the scheduled downtime. The only problem faced was while starting FTS, so it took another hour to get FTS started.
CMS disk server failure this morning.
The LHCb Castor instance is scheduled for the 2.1.9 update, starting next Monday for 3 days.
Patching: waiting for the formal patch to come out; will then update UIs etc. The batch system will be done in a couple of parts. Site security have set a two-week deadline to get it patched.
Waiting for an Oracle update for the backend nodes; this will require some downtime, and Castor, FTS and 3D will be affected. Probably next week.
EGI SA1
gLite 3.2 etc., and is there any gLite 3.1 service we cannot upgrade?
BDII problem on SL5 partially solved.
VOMS and GridSite have been patched.
Top level BDII problem is fixed by the upgrade.
Glasgow have not yet done theirs due to new kit arriving.
Ticket status
***************
https://gus.fzk.de/download/escalationreports/roc/html/20100920_EscalationReport_ROCs.html
56316 - NGS-RAL not in CERN BDII. On hold (with J Churchill).
58733 - RAL-PP: no space left. Biomed. On hold.
59397 - RAL. LHCb. Perceived some slowness in SARA-RAL replication.
We did not have time to go over the tickets in detail.
Security Update:
Latest update: RH have released the patch; SL are waiting for it. We hope the SL patch will be available very quickly (<24 hours).
CentOS have released a test patch, with the official patch due soon.
Once all vendors have released the patch it is very likely that the 7-day upgrade policy will be enforced.
If a site is not updated it will be banned.
From Pakiti we can see that some sites are still vulnerable, but Pakiti was showing false positives: Brunel, Birmingham, ECDF, Glasgow, Cambridge, UCL, Oxford, ... Pakiti cannot tell whether you have patched your kernel.
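Since Pakiti cannot see whether the running kernel is the patched one, a quick local sanity check is to compare the running kernel release with the first fixed release quoted in the vendor advisory. This is a rough sketch only; the FIXED string is a placeholder to be replaced with the release from the advisory, and the numeric comparison is deliberately crude.

#!/usr/bin/env python3
# Sketch only: compare the running kernel release against a known-fixed
# release.  FIXED is a placeholder; the comparison simply compares the
# numeric components in order, which is good enough for a quick check.
import platform
import re

FIXED = "2.6.18-194.11.3.el5"  # placeholder: first patched kernel release

def as_tuple(release):
    # '2.6.18-194.11.3.el5' -> (2, 6, 18, 194, 11, 3, 5)
    return tuple(int(x) for x in re.findall(r"\d+", release))

running = platform.release()
if as_tuple(running) >= as_tuple(FIXED):
    print("running kernel %s looks patched (>= %s)" % (running, FIXED))
else:
    print("running kernel %s is older than %s - reboot into the new kernel" % (running, FIXED))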
RAL T1
Cambridge is running the CERN kernel, as are most sites.
Bham is not.
Experiment reps tell the users that there is a security incident and therefore to only report exceptional errors; people are generally very understanding.
JC: We did not put anything out on gridpp users; should we have done so?
AF: we should use Facebook, as the NGS do (http://www.facebook.com/group.php?gid=7070195774&ref=ts).
JC: How is the NGS responding? So far Oxford have responded: they are running SL4, on which the exploit does not work, so they have kept it up and running.
Other sites did not respond.
It is quite difficult to get a response from the other sites in general.
The affiliate sites are basically the GridPP sites, so they do not have to report twice; it is generally the same people behind both systems. Glasgow is different as they do have separate clusters; David Martin is the NGS contact.
AOB
Any feedback from the EGI Technical Forum (http://www.egi.eu/EGITF2010)?
It did not answer JC's questions about operations.
It is struggling to make progress. Volunteers are wanted to help with documentation, which needs to be updated from EGEE to EGI.
Mark will help from Glasgow.
Where to get packages from: EGI, CERN etc.; there are about 20 repositories, all marked current.
Stuart Purdie says they are not yet ready for production use; at some stage they would like everybody to use the EGI repository.
The security training session was well attended. The ARC CE and UNICORE CE talks were good. There was a security open forum as well, with the EMI security group and EGI.
Actions:
T2Cs to review the security doc. To be closed.
Reviewers for the web page and wiki.
The resilience page needs contributions.
DN list.
ROC designation.
Migration strategy for the ROC: JC thought he had made progress, but the NGS have been asked to delay until next year. Andy Richards is leaving the NGS.
BNL VOMS certs accepted for Atlas users. Close.
Post links from MM's hepsysman talk on the wiki (Mingchao).
CMS conferences page? Atlas have a useful conference talks web page (http://atlas-speakers-committee.web.cern.ch/atlas-speakers-committee/ConfTalks2010.html), which can be referred to in order to guess when increased usage may be required. Stuart Purdie's script to scrape this page is: http://svr001.gla.scotgrid.ac.uk/cgi-bin/atlas.py
CMS don't have the same sort of page.
No response from LHCb.
NDGF fixed. It took 29 days to identify a faulty line card in the CERN OPN router. Atlas have requested a post-mortem as it took far too long.
WMS monitoring at Glasgow, RAL and IC.