UCL Computational Resource Allocation Group (CRAG)
MEETING MINUTES
8th February 2013

In Attendance:
1. Prof Nik Kaltsoyannis (Chair) - Molecular Quantum Dynamics and Electronic Structure
2. Dr Nicholas Achilleos - Astrophysics and Remote Sensing
3. Dr Vincent Plagnol - Next Generation Sequencing
4. Prof Dario Alfe - Thomas Young Centre (Materials Science)
5. Dr Ben Waugh - High Energy Physics
6. Dr Bruno Silva - Research Computing Platforms Team Leader (Service Lead), ISD
7. Jo Lampard - Senior Research IT Services Facilitator, ISD
8. Thomas Jones - Research Platforms Team Leader (Infrastructure Lead), ISD

Apologies:
1. Prof Andrew Philips - Epidemiology
2. Dr Andrew Martin - Bioinformatics and Computational Biology
3. Clare Gryce - Head of Research Computing and Facilitating Services, ISD

Note: Minutes below provide a high level summary of decisions taken and actions assigned by the Group.

1. Approval of Minutes of last meeting on 11th January 2013
The Group approved the Minutes of the January 11th 2013 meeting.

2. Update on status of current Actions
The list of current Actions (below) was updated, and new Actions arising were added.

3. Any requests for additional resources: (a) scratch quota requests, (b) any other request.
No requests for additional resources had been received.

4. Review of any Centre for Innovation (CfI) access requests (Chair) (Doc: CfI_Access_Application_v0 6)
The following applications were submitted and approved:
Francesco Gervasio
The following application was discussed:
Dimitry Kuzmin
DA pointed out that the information provided was insufficient to work out the resources the applicant was planning to use. It was agreed that BCS should seek clarification concerning the scientific benefit to his work and information about typical job sizes (see new Action 111).

5. Legion usage report for January. Report available http://feynman.ritsisd.ucl.ac.uk:8888
BCS presented the Legion usage statistics for January 2013.
NK noted that the issue concerning the user uccaeak had already been dealt with. He also noted that the Maths Consortium had an extremely long wait time for their jobs but their CPU time was not visible on the graph, and wondered whether there was any further information available. BCS replied that the jobs concerned had definitely run, but that they had probably failed immediately, so the CPU time would be too small to be visible. The jobs could have been submitted in error, for example with the wrong amount of resource. Jobs are not automatically rejected if they do not fit any of the available queues, and each queue will not report on why it rejects a particular job because this would massively increase the size of the logs.

DA pointed out that HECToR does provide this type of information; BCS explained that each individual queue looks at submitted jobs to see whether the job is acceptable to it or not; TJ said that although it would be possible to write a new rule for SGE, it would take some effort, and as the decision to continue using SGE has not yet been taken, this would not be a good use of effort.

DA asked whether there was any data on how many users actually submit jobs that never run. BCS replied that the last time he had checked, there were 5 jobs in an error state. He pointed out that if job submission parameters cannot be met by any of the queues, the job will be rejected but the user will not get any information as to why.

NK had observed that the type Y node statistics had shown long wait times for the ENGFEAandCFD Consortium, and asked BCS if he could explain this. BCS replied that once again, it could be that the resources the jobs demanded took some time to obtain, then the jobs immediately failed with an error.
DA pointed out that there was a very long CPU time shown for Astro on the type Z nodes; BCS explained that jobs requiring the full 48 GB of memory will take priority on those nodes (type Z have 12 cores per node with 4 GB memory per core), with everything else that runs there being opportunistic. There is no way of determining wait times as they are not specific to the job class; they also depend on the other queues.

NK asked BCS to investigate these anomalies, and also to produce a scrollable graph as some of the information was not visible (see new Action 112).

6. Role of Consortia leaders and Account Application Policies and Processes discussion (Chair) (Doc: RC_Consortia_and_account_application_process_v1.0.pdf)
CG had circulated a draft document with the proposed changes to the CRAG Terms of Reference for Research Computing Consortia and Account Application Policies and Processes; this was discussed and further amendments were made, including BW’s suggestion of asking for the contact details of either the PI or supervisor (if a PhD student) on the application form. The PI/supervisor can then be copied in to the approval email when the account is set up (with an indication that it is for information only and does not require them to take any action).

Regarding the decisions required and points for further discussion, NK asked whether the Research Computing team required any guidance on the design/implementation of the new online account application process; BCS replied that a project needed to be set up to identify the necessary steps and infrastructure, but that this would be internal work unless a constraint was found that needed to be put forward to the CRAG.

JL to circulate the amended document as soon as possible to allow the Group to consider if any further changes are desired in advance of the next meeting.

7. AOB
BCS informed the group that the consortium leader for NGS would be stepping down; VP had already been asked if he would take over, and had given his agreement. He had also taken over ownership of the mailing list from Francesco; TJ to ensure it is searchable if it is a mailman list, and put up a link on the web pages (see new Action 113).

BCS mentioned that the new Legion scratch area is now about 20% full; he asked whether the Group felt that quotas should be enforced, and suggested doubling the existing default allocation of 200 GB, with requests for > 1 TB still being referred to the CRAG. NK and the Group agreed that this was fine (see new Action 114).

BCS said that the RC team had received a request from a Research Assistant for a Legion account, and wondered whether there should be any constraint on this since RAs are not usually directly involved in this type of research. NK pointed out that in practice, they can be; this situation would be covered by the proposed amendments to the application process as their PI would be informed that they had applied for an account.

BCS also raised the issue of a user who had requested a temporary increase to their job priority on Legion, in order that their research paper could be submitted in time for an upcoming publication deadline (in 2 weeks). Since this was for only 4 small jobs, it had been granted on this occasion. NK replied that as a general rule, he could not see any reason for special concessions; publication deadlines are known well in advance, and it could lead to abuse of the system if such requests became commonplace.

8. Next Meeting Date and Agenda
15th March 2013 from 1pm - 3pm, South Wing G14 Committee Room, Main UCL Campus, Gower Street, London WC1E 6BT

Agenda (Items) for the next meeting:
Standing items:
1. Approval of Minutes of last meeting
2. Update on status of current Actions
3. Any requests for additional resources: (a) scratch quota requests, (b) any other request.
4. Review of any Centre for Innovation (CfI) access requests (Chair)
5. Reporting data/stats to be made available to users
6. Proposal for a policy allowing Teaching and Learning usage of Legion by UG students and taught PG students - discussion (Bruno Silva)

LIST OF CURRENT ACTIONS
Shaded (closed/completed) items will be deleted in the next version.
Each entry below gives the Action, its Owner and its Status.

73. Devise proactive strategy to inform users of the availability of Condor.
Owner: BCS
(25/11/11): Email consortium leaders and speak to members of the Genomics community (Jacky Pallas perhaps) regarding Condor to move this forward.
(27/01/2012): Pending completion of on-going works to improve usability of Condor service.
(24/02/2012): IN PROGRESS. Usability works still in progress; OK to liaise with ISD Datacentre Services to create shared storage for users to install applications (e.g. R & Python) via Research Computing.
(30/03/2012): IN PROGRESS. Technical issues regarding mounting of shared storage currently undergoing testing.
(24/05/2012): IN PROGRESS. Technical issues regarding mounting of shared storage currently undergoing testing.
(26/06/2012): IN PROGRESS. Shared storage areas have been mounted on Condor machines, applications require placement.
(25/07/12): Can now mount and start moving users. Going to make Condor usable from Legion itself, and should be available by next year. Will be a hybrid system.
(10/09/2012): Modification to Condor successful. Users will be contacted to move their web jobs for Condor.
(12/10/2012): We have found unexpected problems mounting the K drive - we are investigating what the problem is. Nevertheless, Condor is usable.
(9/11/2012): Further testing of K drive to be undertaken and results to be reported at next meeting.
(14/12/2012): After further testing, the problem appears to be with either DNS or firewall; some nodes in the Bartlett were removed as job submission was not possible. A temporary solution has been implemented (not using institutional filestore).
(11/1/2013): Test launched using RC bespoke file store to remove bad connections across UCL and nodes removed from list. Resume communications with users.
(8/2/2013): Issues with storage identified; some machines had to be added to server access list while those which would never work (around 200, leaving more than 1,000 available) were removed. The problem was not random so now considered resolved. Users still need to be able to mount the drive which allows them access to all tools; Owain Kenway is working on this for the first group of users and will contact them. There are plans to scrap Condor and have similar workloads on Legion; TJ’s team have created a virtual machine “Legion node” on the new desktop. Update to be given at next meeting.

76. Peter Harrison request for additional Unity resource
Owner: BCS
(27/01/2012): To follow up with requestors regarding additional information about check-pointing and problem decomposition.
(24/02/2012): Update from DG: Peter Harrison has been working closely with RC team and adapting code where possible. CRAG agreed to extend 72hr wall-clock to 10 days on Unity. CLOSED, to be REVISITED at a later date.
(24/05/2012): DG to contact the user regarding progress of his work.
(26/06/2012): Re-assigned to BCS.
(25/07/2012): BCS to contact Peter Harrison.
(10/09/2012): PH requests continued use of Unity and will provide review for next CRAG meeting.
(12/10/2012): CRAG approved Peter Harrison’s indefinite request with proviso that Legion is cited in research papers. BCS to provide update to CRAG in 3 months.
(14/12/2012): ONGOING - update next month
(11/1/2013): ONGOING - update next month
(8/2/2013): Re-visit in 3 months.

91. Establish policy for requesting Priority CP hours
(10/09/2012): CG to circulate draft policy paper to CRAG members and inform Serge Guillas that his request is under review. All CRAG members to report back on implementation of Priority Queue.
(12/10/2012): TJ to investigate implementation of Priority Access using ‘Projects’ method as discussed and agreed by group. Gold Accounting Software - RC to investigate by further testing. NA to provide local Miracle users for testing.
(9/11/2012): Still pending. Meanwhile, TJ to set up priority access for Miracle jobs as previously agreed using same set up on Miracle as for Harvest project (Serge Guillas).
(14/12/2012): Done for Miracle jobs; TJ to present Gold accounting software information at the next CRAG.
(14/12/2012): Gold accounting software installed. Client and lustre upgrade still pending. TJ to report back at next CRAG.
(8/2/2013): TJ’s team are testing, looking into SGE and thinking about how to implement it. TJ to write up and report at next CRAG.
Owner: TJ

96. Record of CfI applications
Owner: CG
(9/11/2012): It was agreed that a spreadsheet record of all CfI requests, including reasons for rejection where appropriate, should be maintained.
(11/1/2013): CG to maintain list of usage and report to CRAG every three months.
(8/2/2013): ONGOING - update next month

98. Publicising use of Legion@UCL and new scheduler information graph
Owner: BCS
(14/12/2012): CG to publicise the location on the website which advises how to acknowledge use of Legion (Legion@UCL) and the new graph of scheduler information, by emailing the research-computing list.
(11/1/2013): BCS to inform users of availability of Legion stats and web location advising how to acknowledge use of Legion (Legion@UCL) and the new graph of scheduler information.
(8/2/2013): Graph will be published every month for a rolling 12 month period. CLOSED

101. CfI usage statistics
Owner: CG
(14/12/2012): CG to clarify whether STFC’s share of Emerald is included in the under-utilisation figures.
(11/1/2013): ONGOING
(8/2/2013): ONGOING

103. CfI IRIDIS job classes
Owner: CG
(14/12/2012): CG to obtain data on IRIDIS job class distribution.
(11/1/2013): CG to present job clusters from CfI.
(8/2/2013): ONGOING

105. Legion Teaching and Learning use
Owner: BCS
(14/12/2012): BCS to bring a proposal for a Legion Teaching & Learning policy for discussion at the next meeting.
(11/1/2013): ONGOING
(8/2/2013): BCS has put forward a bid for funding; if successful, this might mean purchasing new equipment or re-purposing some of the older racks from Wolfson House, the latter solution being preferred if the Legion upgrade goes ahead. The exact number of nodes is to be discussed, but would give a sizeable cluster. ONGOING

106. Request for Additional Resources
Owner: BCS
(11/1/2013): BCS to liaise with Isaac Sugden for further information as to why he requires use of Unity.
(8/2/2013): Isaac Sugden replied to the request for information a few days after the last meeting, clarifying that he needed the resource for a particular calculation which just required extra time. This was approved by NK. CLOSED

107. Legion usage report for December.
Owner: BCS
(11/1/2013): BCS to further investigate long wait times for users uccaeak and uccaalo. To add consortia to stats graph: Built Environment, Medical Imaging.
(8/2/2013): The long wait times for uccaalo did not recur last month. uccaeak’s jobs require 10 GB per process, requiring a particular cluster of nodes hence incurring longer wait times; BCS explained that if the full memory on a node is required, the entire CU needs to be cleared of other jobs first. Working across nodes is not possible if it is an MPI job (Infiniband has a higher contention CU to CU). This is assuming that he is not using the superqueue. Now that the 7 day queue has been removed, the maximum wait time will be 2 days. BCS expects to see this happening every month - the user is not complaining, but he will continue to proactively monitor. CLOSED. (Action reduced to adding Consortia to stats graph). BCS to amend stats graph so that zero-utilising Consortia are still shown.
ONGOING

108. Consolidated metrics and information for reporting and users
Owner: BCS
(11/1/2013): BCS to add to monthly report, the users listed per consortium.
(8/2/2013): DONE. However, although all of the active users should be shown, the graph is sorted by run time and it is truncated due to the size. Since it might be useful to see all of the information (in the case of a job which waited for a long time then finished running after just a second or so) BCS will make the graph scrollable (see new Action 112). CLOSED

109. Role of Consortia leaders.
Owner: CG, JL
(11/1/2013): CG to amalgamate CRAG Terms of Reference for Research Computing Consortia with the Legion ReApplication Document and present draft at next meeting.
(8/2/2013): On the agenda for this meeting and was discussed. Amendments to be circulated as soon as possible and taken to the next meeting. ONGOING

110. Consortia mailing list
Owner: TJ and BCS
(11/1/2013): BCS to advertise Consortia mailing lists on the Research Computing webpage with an archive and search facility for consortium members.
(8/2/2013): The archives are not searchable at the moment; TJ to request that the lists are transferred to Mailman which will allow this. It may not be possible to have a single cross-searchable archive for all the lists. TJ will find out if non-members can have read-only access. BCS will add links for each archive to the web pages if necessary. ONGOING

111. CfI EMERALD access request by Dimitry Kuzmin
Owner: BCS, Group
(8/2/2013): BCS to seek clarification concerning the scientific benefit to his work and typical job sizes; response to be circulated to the group for approval so that access can be granted before the next meeting.

112. Consolidated metrics and information for reporting and users
Owner: BCS
(8/2/2013): BCS to investigate anomalies in wait/CPU time, and also to produce a scrollable graph as some of the information is not visible.

113. Change of NGS Consortium Leader
Owner: TJ
(8/2/2013): TJ to look at the NGS mailman list and queues for VP (if necessary).

114. Legion scratch quotas
Owner: BCS
(8/2/2013): BCS to implement increased Legion scratch quotas of double the existing default allocation (200 GB).