Dteam Meeting 210910

Present: Pete Gronbech (minutes), Sam Skipsey, Duncan Rand, Stuart Purdie, Kashif Mohammad, Wahid Bhimji, Graeme Stewart, Rich Hellier, Stuart Wakefield, Derek Ross, Alessandra Forti, Jeremy Coles, Mingchao Ma, Gareth Smith, David Colling

ATLAS (GS): An interesting week. The cloud was offline this morning while RAL rerouted the firewall; the LFC was expected to be inaccessible to T2s, but the downtime was very short. There was a problem with file access at RAL over the weekend: a disk server was accidentally put out of production in Castor.

Security: many sites were offline over the weekend; half the grid was down. Many T2s were turned offline because they did not have good downtime information published in the GOCDB (e.g. Oxford and many others). Manchester came back and were switched on. PG apologised that they had only marked one CE offline and not all of them. QM and ? also came back over the weekend. If pilots were not working, the site was turned off in Panda.

A RALPP disk server went down and corrupted a lot of files; many DPDs for Higgs were lost. CB to provide a list of lost files. It is a hard problem to deal with: Glasgow have suffered similar problems before, and the decision is whether it is worth checksumming the files or simply declaring them all lost.

CMS: The kernel problem took a lot of sites out; it will take the rest of the week to get them upgraded. Not a major problem for CMS.

LHCb: http://hepwww.rl.ac.uk/nraja/UKUploadProblems/index.html
1. Problems with running jobs at RAL: a high failure rate, primarily because of throughput issues with disk servers at the SE.
2. UK upload problems at Brunel; Stuart Purdie is only seeing transient problems.
LHCb merge jobs cause high demand on the disk servers at the T1. There was some issue with VOMS, and T2K had some problems.

GS: a lot of sites are blacklisted, so check in Panda. DR: during the ATLAS production that ran over the weekend there were a lot of failures from transfers to RAL. GS: they did have 48 hours to transfer before timing out; can you have a look to see whether there are enough slots in the FTS channel?
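On the question of whether the corrupted files are worth checksumming: the usual approach on the grid is to recompute each file's Adler32 checksum on disk and compare it with the value recorded in the catalogue. A minimal sketch of that check, where the catalogue lookup is stood in for by a plain dict (the real catalogue interface at a site will differ):

```python
import zlib

def adler32_of_file(path, chunk_size=1 << 20):
    """Stream a file through zlib's Adler32, returning the usual 8-hex-digit form."""
    checksum = 1  # Adler32 starts at 1, not 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
    return "%08x" % (checksum & 0xFFFFFFFF)

def find_corrupted(expected):
    """expected: {path: catalogue adler32 string}.
    Returns the paths whose on-disk checksum differs, or which cannot be read."""
    bad = []
    for path, catalogue_value in expected.items():
        try:
            on_disk = adler32_of_file(path)
        except OSError:
            bad.append(path)  # unreadable counts as lost
            continue
        if on_disk != catalogue_value.lower():
            bad.append(path)
    return bad
```

The cost is one full read of every candidate file, so the practical decision is whether re-reading a whole disk server's contents is cheaper than simply declaring the files lost and re-replicating them.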
Experiment resource usage: nothing coming up for ATLAS or CMS.

ROC update: Kashif gave a handover report. The main issue was sites being down for security, but many were not marked down in the GOCDB, so alarms were generated. The CERN problems are solved. Kashif raised the question of what to do with the GOCDB entry if you only shut down some queues; Panda will only put the site offline if all CEs are in downtime.

RAL T1: The site has upgraded the firmware on the firewall, completed within the scheduled downtime window; the only problem was restarting FTS, which took another hour. There was a CMS disk server failure this morning. The LHCb Castor instance is scheduled for the 2.1.9 update starting next Monday, for three days. Patching: waiting for the formal patch to come out, then UIs etc. will be updated; the batch system will be done in a couple of parts. Site security have set a two-week deadline to get everything patched. Waiting for an Oracle update for the backend nodes, which will require some downtime; Castor, FTS and 3D will be affected, probably next week.

EGI SA1: gLite 3.2 etc., and is there any gLite 3.1 service we cannot upgrade? The BDII problem on SL5 is partially solved. VOMS and the grid site have been patched; the top-level BDII problem is fixed by the upgrade. Glasgow have not yet done theirs due to new kit arriving.

Ticket status:
https://gus.fzk.de/download/escalationreports/roc/html/20100920_EscalationReport_ROCs.html
- 56316 - NGS-RAL not in the CERN BDII. On hold (with J Churchill).
- 58733 - RAL-PP no space left. Biomed. On hold.
- 59397 - RAL. LHCb perceived some slowness in SARA-RAL replication.
We did not have time to go over the tickets in detail.

Security update: Red Hat have released the patch and SL are waiting for it; we hope the SL patch will be available very quickly (<24 hours). CentOS have released a test patch, with the official patch due soon. Once all vendors have released the patch it is very likely that the seven-day upgrade policy will be enforced: if a site is not updated it will be banned.
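The Panda behaviour described above (a site only goes offline when every one of its CEs is in downtime) is worth keeping in mind when declaring partial downtimes in the GOCDB. A toy sketch of that rule, with a made-up service-record format purely for illustration (Panda's real implementation obviously differs):

```python
def site_blacklisted(services):
    """services: list of (service_type, in_downtime) tuples for one site.
    The site is treated as down only when *all* of its CEs are in downtime;
    marking just one of several CEs down leaves the site online in Panda."""
    ces = [down for (stype, down) in services if stype == "CE"]
    return bool(ces) and all(ces)
```

So a site like Oxford that marks one CE offline but not the others stays online as far as the pilot system is concerned; to take the whole site out, every CE has to be in the declared downtime.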
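Ahead of any enforcement deadline, sites can verify locally whether the kernel they are actually running is at or above the fixed release, which is just a version comparison against `uname -r`. A sketch of that comparison for Red Hat style release strings; the version numbers in the usage below are illustrative examples, not authoritative for this incident:

```python
import re
import subprocess

def kernel_tuple(release):
    """Split a Red Hat style kernel release, e.g. '2.6.18-194.11.3.el5',
    into a tuple of integers for comparison; the non-numeric tail
    (el5, arch suffix) is dropped."""
    return tuple(int(n) for n in re.findall(r"\d+", release.split(".el")[0]))

def running_kernel_at_least(fixed):
    """True if the currently booted kernel is at or above the given release."""
    running = subprocess.check_output(["uname", "-r"]).decode().strip()
    return kernel_tuple(running) >= kernel_tuple(fixed)
```

For example, `kernel_tuple("2.6.18-194.11.1.el5") < kernel_tuple("2.6.18-194.11.3.el5")`, so a node booted on the older build would count as unpatched even if a newer kernel package is installed but not yet running.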
From Pakiti we can see that some sites are still vulnerable (Brunel, Birmingham, ECDF, Glasgow, Cambridge, UCL, Oxford, ...), but Pakiti was showing false positives, and it cannot tell whether you have patched your running kernel. RAL T1 and Cambridge are running the CERN kernel, as are most sites; Birmingham is not. Experiment reps should tell the users that there is a security incident and therefore to report only exceptional errors; people are generally very understanding. JC: we did not put anything out on gridpp-users, should we have done so? AF: we should use Facebook, as the NGS do (http://www.facebook.com/group.php?gid=7070195774&ref=ts).

JC: how is the NGS responding? So far Oxford have responded: they are running SL4, where the exploit does not work, so they have kept the service up and running. Other sites did not respond; it is quite difficult to get a response from the other sites in general. The affiliate sites are basically the GridPP sites, so they do not have to report twice; it is generally the same people behind both systems. Glasgow is different as they do have separate clusters; David Martin is the NGS contact.

AOB: Any feedback from the EGI Technical Forum (http://www.egi.eu/EGITF2010)? It did not answer JC's questions about operations and is struggling to make progress. Volunteers are sought to help with documentation, which needs updating from EGEE to EGI; Mark will help from Glasgow. On where to get packages from (EGI, CERN, etc.): there are about 20 repositories, all marked current. Stuart Purdie says they are not yet ready for production use; at some stage they would like everybody to use the EGI repository. The security training session was well attended; the ARC CE and UNICORE CE talks were good, and there was a security open forum as well, covering the EMI security group and EGI.

Actions:
- T2Cs to review the security doc - to be closed.
- Reviewers for the web page; the wiki resilience page needs contributions.
- DN list.
- ROC designation. Migration strategy for the ROC: JC thought he had made progress, but the NGS have been asked to delay until next year. Andy Richards is leaving the NGS.
- BNL VOMS certs accepted for ATLAS users.
- Close: post links from Mingchao Ma's (MM's) hepsysman talk on the wiki.
- CMS conferences page? ATLAS have a useful conference talks web page (http://atlas-speakerscommittee.web.cern.ch/atlas-speakers-committee/ConfTalks2010.html) which can be consulted to guess when increased use may be required. Stuart Purdie's script to scrape this page is at http://svr001.gla.scotgrid.ac.uk/cgi-bin/atlas.py. CMS do not have the same sort of page, and there was no response from LHCb.
- NDGF fixed: it took 29 days to identify a faulty line card in a CERN OPN router. ATLAS have requested a post-mortem as it took far too long.
- WMS monitoring at Glasgow, RAL and IC.
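The scraping task described above (pulling conference names and dates off the speakers-committee page so that load spikes can be anticipated) can be sketched with the standard-library HTML parser. This is not Stuart Purdie's actual script, and the table layout in the sample is invented for illustration, not the page's real markup:

```python
from html.parser import HTMLParser

class TalkTableParser(HTMLParser):
    """Collect the text of each HTML table row as a list of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None and data.strip():
            self._row.append(data.strip())

# Invented sample markup standing in for the conference talks page.
sample = ("<table><tr><th>Conference</th><th>Dates</th></tr>"
          "<tr><td>ICHEP</td><td>22-28 Jul 2010</td></tr></table>")
parser = TalkTableParser()
parser.feed(sample)
# parser.rows -> [['Conference', 'Dates'], ['ICHEP', '22-28 Jul 2010']]
```

A real version would fetch the page over HTTP and then filter the date column for upcoming conferences; the parsing step is the part sketched here.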