UCL Computational Resource Allocation Group (CRAG)
MEETING MINUTES
8th February 2013
In Attendance:
1. Prof Nik Kaltsoyannis (Chair) - Molecular Quantum Dynamics and Electronic
Structure
2. Dr Nicholas Achilleos - Astrophysics and Remote Sensing
3. Dr Vincent Plagnol – Next Generation Sequencing
4. Prof Dario Alfe - Thomas Young Centre (Materials Science)
5. Dr Ben Waugh - High Energy Physics
6. Dr Bruno Silva - Research Computing Platforms Team Leader (Service Lead), ISD
7. Jo Lampard - Senior Research IT Services Facilitator, ISD
8. Thomas Jones - Research Platforms Team Leader (Infrastructure Lead), ISD
Apologies:
1. Prof Andrew Philips – Epidemiology
2. Dr Andrew Martin - Bioinformatics and Computational Biology
3. Clare Gryce - Head of Research Computing and Facilitating Services, ISD
Note: Minutes below provide a high level summary of decisions taken and actions
assigned by the Group.
1. Approval of Minutes of last meeting on 11th January 2013
The Group approved the Minutes of the January 11th 2013 meeting.
2. Update on status of current Actions
The list of current Actions (below) was updated, and new Actions arising were
added.
3. Any requests for additional resources: (a) scratch quota requests, (b) any
other request.
No requests for additional resources had been received.
4. Review of any Centre for Innovation (CfI) access requests (Chair)
(Doc: CfI_Access_Application_v0.6)
The following applications were submitted and approved:
- Francesco Gervasio
The following application was discussed:
- Dimitry Kuzmin
DA pointed out that the information provided was insufficient to work out the
resources the applicant was planning to use. It was agreed that BCS should seek
clarification concerning the scientific benefit to his work and information about typical
job sizes (see new Action 111).
5. Legion usage report for January. Report available http://feynman.ritsisd.ucl.ac.uk:8888
BCS presented the Legion usage statistics for January 2013. NK noted that the
issue concerning the user uccaeak had already been dealt with. He also noted that
the Maths Consortium had an extremely long wait time for their jobs but their CPU
time was not visible on the graph, and wondered whether there was any further
information available. BCS replied that the jobs concerned had definitely run, but
that they had probably failed immediately so CPU time would not be visible as it
would be very small. The jobs could have been submitted in error, for example with
the wrong amount of resource. Jobs are not automatically rejected if they do not fit any of the available queues, and individual queues do not report why they reject a particular job because doing so would massively increase the size of the logs. DA pointed out that HECToR does provide this type of information; BCS explained that each individual queue inspects submitted jobs to decide whether a job is acceptable to it or not. TJ said that although it would be possible to write a new rule for SGE, it would take some effort, and as the decision to continue using SGE has not yet been taken, this would not be a good use of effort. DA asked whether there was any data
on how many users actually submit jobs that never run. BCS replied that the last
time he had checked, there were 5 jobs in an error state. He pointed out that if job
submission parameters cannot be met by any of the queues, the job will be rejected
but the user will not get any information as to why. NK had observed that the type Y
node statistics had shown long wait times for the ENGFEAandCFD Consortium, and
asked BCS if he could explain this. BCS replied that once again, it could be that the
resources the jobs demanded took some time to obtain, then the jobs immediately
failed with an error. DA pointed out that there was a very long CPU time shown for
Astro on the type Z nodes; BCS explained that jobs requiring the full 48Gb of
memory will take priority on those nodes (type Z have 12 cores per node with 4Gb
memory per core), with everything else that runs there being opportunistic. There is no way of determining wait times, as they are not specific to the job class and also depend on the other queues.
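For illustration only, the minimal Python sketch below mirrors the scheduling behaviour described above, using entirely hypothetical queue names and limits (these are assumptions, not Legion's actual SGE configuration): a job is queued if its resource request fits some queue, and otherwise rejected without any per-queue reason being recorded.

    # Hypothetical queue limits for illustration only (not Legion's real SGE set-up):
    # cores per node, memory per core in Gb, and maximum wall-clock hours.
    QUEUES = {
        "type_Y": {"cores": 12, "mem_per_core_gb": 2, "max_hours": 48},
        "type_Z": {"cores": 12, "mem_per_core_gb": 4, "max_hours": 48},  # 12 x 4Gb = 48Gb per node
    }

    def fits(limits, cores, mem_per_core_gb, hours):
        """Return True if the request fits within a single queue's limits."""
        return (cores <= limits["cores"]
                and mem_per_core_gb <= limits["mem_per_core_gb"]
                and hours <= limits["max_hours"])

    def submit(cores, mem_per_core_gb, hours):
        """Queue the job on the first queue that can run it; otherwise reject it.
        As described in the minutes, no per-queue reason is logged on rejection."""
        for name, limits in QUEUES.items():
            if fits(limits, cores, mem_per_core_gb, hours):
                return "queued on " + name
        return "rejected (no reason recorded)"

    print(submit(cores=12, mem_per_core_gb=4, hours=24))   # fits type_Z: queued
    print(submit(cores=12, mem_per_core_gb=10, hours=24))  # fits nothing: rejected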
NK asked BCS to investigate these anomalies, and also to produce a scrollable
graph as some of the information was not visible (see new Action 112).
6. Role of Consortia leaders and Account Application Policies and Processes discussion (Chair)
(Doc: RC_Consortia_and_account_application_process_v1.0.pdf)
CG had circulated a draft document with the proposed changes to the CRAG Terms
of Reference for Research Computing Consortia and Account Application Policies
and Processes; this was discussed and further amendments were made, including
BW’s suggestion of asking for the contact details of either the PI or supervisor (if a
PhD student) on the application form. The PI/supervisor can then be copied in to the
approval email when the account is set up (with an indication that it is for information only and does not require them to take any action). Regarding the decisions required
and points for further discussion, NK asked whether the Research Computing team
required any guidance on the design/implementation of the new online account
application process; BCS replied that a project needed to be set up to identify the
necessary steps and infrastructure, but that this would be internal work unless a
constraint was found that needed to be put forward to the CRAG. JL to circulate the
amended document as soon as possible to allow the Group to consider if any further
changes are desired in advance of the next meeting.
7. AOB
BCS informed the group that the consortium leader for NGS would be stepping
down; VP had already been asked if he would take over, and had given his
agreement. He had also taken over ownership of the mailing list from Francesco; TJ
to ensure it is searchable if it is a mailman list, and put up a link on the web pages
(see new Action 113).
BCS mentioned that the new Legion scratch area is now about 20% full; he asked whether the Group felt that quotas should be enforced, and suggested doubling the existing default allocation of 200Gb, with requests for > 1Tb still being referred to the CRAG.
NK and the Group agreed that this was fine (see new Action 114).
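As a purely illustrative sketch of the agreed policy in Python (the 400Gb figure is simply the doubled default from above; how requests between the new default and 1Tb are routed is an assumption added for illustration, not something stated in the minutes):

    # Minimal sketch of the scratch quota policy agreed above; the routing of
    # requests between the new default and 1Tb is an assumption for illustration.
    DEFAULT_QUOTA_GB = 400     # existing 200Gb default, doubled (see new Action 114)
    CRAG_REFERRAL_GB = 1000    # requests above 1Tb are still referred to the CRAG

    def route_quota_request(requested_gb):
        """Decide who handles a scratch quota request of the given size."""
        if requested_gb <= DEFAULT_QUOTA_GB:
            return "covered by the default allocation"
        if requested_gb <= CRAG_REFERRAL_GB:
            return "handled by Research Computing"  # assumed, not stated in the minutes
        return "referred to the CRAG"

    print(route_quota_request(300))    # covered by the default allocation
    print(route_quota_request(800))    # handled by Research Computing (assumed)
    print(route_quota_request(2000))   # referred to the CRAG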
BCS said that the RC team had received a request from a Research Assistant for a
Legion account, and wondered whether there should be any constraint on this since
RAs are not usually directly involved in this type of research. NK pointed out that in practice they can be; this situation would be covered by the proposed amendments to the application process, as their PI would be informed that they had applied for an
account.
BCS also raised the issue of a user who had requested a temporary increase to their
job priority on Legion, in order that their research paper could be submitted in time for an upcoming publication deadline (two weeks away). Since this was for only 4 small jobs,
it had been granted on this occasion. NK replied that as a general rule, he could not
see any reason for special concessions; publication deadlines are known well in
advance, and it could lead to abuse of the system if such requests became
commonplace.
8. Next Meeting Date and Agenda

15th March 2013 from 1pm – 3pm, South Wing G14 Committee Room, Main UCL
Campus, Gower Street, London WC1E 6BT
Agenda (Items) for the next meeting:
Standing items:
1. Approval of Minutes of last meeting
2. Update on status of current Actions
3. Any requests for additional resources: (a) scratch quota requests, (b) any other
request.
4. Review of any Centre for Innovation (CfI) access requests (Chair)
5. Reporting data/stats to be made available to users
6. Proposal for a policy allowing Teaching and Learning usage of Legion by UG
students and taught PG students – discussion (Bruno Silva)
LIST OF CURRENT ACTIONS
Shaded (closed/completed) items will be deleted in the next version.
Action 73: Devise proactive strategy to inform users of the availability of Condor. (Owner: BCS)
(25/11/11): Email consortium leaders and speak to members of the Genomics community (Jacky Pallas perhaps) regarding Condor to move this forward.
(27/01/2012): Pending completion of ongoing work to improve usability of the Condor service.
(24/02/2012): IN PROGRESS. Usability work still in progress; OK to liaise with ISD Datacentre Services to create shared storage for users to install applications (e.g. R and Python) via Research Computing.
(30/03/2012): IN PROGRESS. Technical issues regarding mounting of shared storage currently undergoing testing.
(24/05/2012): IN PROGRESS. Technical issues regarding mounting of shared storage currently undergoing testing.
(26/06/2012): IN PROGRESS. Shared storage areas have been mounted on Condor machines; applications require placement.
(25/07/12): Can now mount and start moving users. Going to make Condor usable from Legion itself; should be available by next year. Will be a hybrid system.
(10/09/2012): Modification to Condor successful. Users will be contacted to move their web jobs for Condor.
(12/10/2012): We have found unexpected problems mounting the K drive and are investigating the cause. Nevertheless, Condor is usable.
(9/11/2012): Further testing of the K drive to be undertaken and results to be reported at the next meeting.
(14/12/2012): After further testing, the problem appears to be with either DNS or the firewall; some nodes in the Bartlett were removed as job submission was not possible. A temporary solution has been implemented (not using institutional filestore).
(11/1/2013): Test launched using RC bespoke file store to remove bad connections across UCL, and nodes removed from the list. Resume communications with users.
(8/2/2013): Issues with storage identified; some machines had to be added to the server access list, while those which would never work (around 200, leaving more than 1,000 available) were removed. The problem was not random, so it is now considered resolved. Users still need to be able to mount the drive which gives them access to all tools; Owain Kenway is working on this for the first group of users and will contact them. There are plans to scrap Condor and run similar workloads on Legion; TJ's team have created a virtual machine "Legion node" on the new desktop. Update to be given at next meeting.
Action 76: Peter Harrison request for additional Unity resource. (Owner: BCS)
(27/01/2012): To follow up with requestors regarding additional information about check-pointing and problem decomposition.
(24/02/2012): Update from DG: Peter Harrison has been working closely with the RC team and adapting code where possible. CRAG agreed to extend the 72hr wall-clock to 10 days on Unity. CLOSED, to be REVISITED at a later date.
(24/05/2012): DG to contact the user regarding progress of his work.
(26/06/2012): Re-assigned to BCS.
(25/07/2012): BCS to contact Peter Harrison.
(10/09/2012): PH requests continued use of Unity and will provide a review for the next CRAG meeting.
(12/10/2012): CRAG approved Peter Harrison's indefinite request with the proviso that Legion is cited in research papers. BCS to provide an update to CRAG in 3 months.
(14/12/2012): ONGOING – update next month.
(11/1/2013): ONGOING – update next month.
(8/2/2013): Re-visit in 3 months.
Action 91: Establish policy for requesting Priority CP hours. (Owner: TJ)
(10/09/2012): CG to circulate draft policy paper to CRAG members and inform Serge Guillas that his request is under review. All CRAG members to report back on implementation of the Priority Queue.
(12/10/2012): TJ to investigate implementation of Priority Access using the 'Projects' method as discussed and agreed by the group. Gold Accounting Software – RC to investigate by further testing. NA to provide local Miracle users for testing.
(9/11/2012): Still pending. Meanwhile, TJ to set up priority access for Miracle jobs as previously agreed, using the same set-up on Miracle as for the Harvest project (Serge Guillas).
(14/12/2012): Done for Miracle jobs; TJ to present Gold accounting software information at the next CRAG.
(14/12/2012): Gold accounting software installed. Client and Lustre upgrade still pending. TJ to report back at next CRAG.
(8/2/2013): TJ's team are testing, looking into SGE and thinking about how to implement it. TJ to write up and report at next CRAG.
Action 96: Record of CfI applications. (Owner: CG)
(9/11/2012): It was agreed that a spreadsheet record of all CfI requests, including reasons for rejection where appropriate, should be maintained.
(11/1/2013): CG to maintain list of usage and report to CRAG every three months.
(8/2/2013): ONGOING – update next month.
Action 98: Publicising use of Legion@UCL and new scheduler information graph. (Owner: BCS)
(14/12/2012): CG to publicise the location on the website which advises how to acknowledge use of Legion (Legion@UCL) and the new graph of scheduler information, by emailing the research-computing list.
(11/1/2013): BCS to inform users of availability of Legion stats and the web location advising how to acknowledge use of Legion (Legion@UCL) and the new graph of scheduler information.
(8/2/2013): Graph will be published every month for a rolling 12-month period. CLOSED
Action 101: CfI usage statistics. (Owner: CG)
(14/12/2012): CG to clarify whether STFC's share of Emerald is included in the under-utilisation figures.
(11/1/2013): ONGOING
(8/2/2013): ONGOING
Action 103: CfI IRIDIS job classes. (Owner: CG)
(14/12/2012): CG to obtain data on IRIDIS job class distribution.
(11/1/2013): CG to present job clusters from CfI.
(8/2/2013): ONGOING
Action 105: Legion Teaching and Learning use. (Owner: BCS)
(14/12/2012): BCS to bring a proposal for a Legion Teaching & Learning policy for discussion at the next meeting.
(11/1/2013): ONGOING
(8/2/2013): BCS has put forward a bid for funding; if successful, this might mean purchasing new equipment or re-purposing some of the older racks from Wolfson House, the latter solution being preferred if the Legion upgrade goes ahead. The exact number of nodes is to be discussed, but it would give a sizeable cluster. ONGOING
Action 106: Request for Additional Resources. (Owner: BCS)
(11/1/2013): BCS to liaise with Isaac Sugden for further information as to why he requires use of Unity.
(8/2/2013): Isaac Sugden replied to the request for information a few days after the last meeting, clarifying that he needed the resource for a particular calculation which just required extra time. This was approved by NK. CLOSED
Action 107: Legion usage report for December. (Owner: BCS)
(11/1/2013): BCS to further investigate long wait times for users:
- uccaeak
- uccaalo
To add consortia to stats graph:
- Built Environment
- Medical Imaging
(8/2/2013): The long wait times for uccaalo did not recur last month. uccaeak's jobs require 10Gb per process, requiring a particular cluster of nodes and hence incurring longer wait times; BCS explained that if the full memory on a node is required, the entire CU needs to be cleared of other jobs first. Working across nodes is not possible if it is an MPI job (Infiniband has higher contention CU to CU). This assumes he is not using the superqueue. Now that the 7-day queue has been removed, the maximum wait time will be 2 days. BCS expects to see this happening every month; the user is not complaining, but he will continue to monitor proactively. CLOSED. (Action reduced to adding consortia to stats graph.)
BCS to amend stats graph so that zero-utilising consortia are still shown. ONGOING
Action 108: Consolidated metrics and information for reporting and users. (Owner: BCS)
(11/1/2013): BCS to add the users listed per consortium to the monthly report.
(8/2/2013): DONE. However, although all of the active users should be shown, the graph is sorted by run time and is truncated due to its size. Since it might be useful to see all of the information (for example, a job which waited for a long time and then finished after just a second or so), BCS will make the graph scrollable (see new Action 112). CLOSED
Action 109: Role of Consortia leaders. (Owner: CG, JL)
(11/1/2013): CG to amalgamate CRAG Terms of Reference for Research Computing Consortia with the Legion Re-application Document and present a draft at the next meeting.
(8/2/2013): On the agenda for this meeting and discussed. Amendments to be circulated as soon as possible and taken to the next meeting. ONGOING
Action 110: Consortia mailing list. (Owner: TJ and BCS)
(11/1/2013): BCS to advertise Consortia mailing lists on the Research Computing webpage with an archive and search facility for consortium members.
(8/2/2013): The archives are not searchable at the moment; TJ to request that the lists are transferred to Mailman, which will allow this. It may not be possible to have a single cross-searchable archive for all the lists. TJ will find out if non-members can have read-only access. BCS will add links for each archive to the web pages if necessary. ONGOING
Action 111: CfI EMERALD access request by Dimitry Kuzmin. (Owner: BCS, Group)
(8/2/2013): BCS to seek clarification concerning the scientific benefit to his work and typical job sizes; the response to be circulated to the group for approval so that access can be granted before the next meeting.
Action 112: Consolidated metrics and information for reporting and users. (Owner: BCS)
(8/2/2013): BCS to investigate anomalies in wait/CPU time, and also to produce a scrollable graph as some of the information is not visible.
Action 113: Change of NGS Consortium Leader. (Owner: TJ)
(8/2/2013): TJ to look at the NGS Mailman list and queues for VP (if necessary).
Action 114: Legion scratch quotas. (Owner: BCS)
(8/2/2013): BCS to implement increased Legion scratch quotas of double the existing default allocation (200Gb).