Perf WS day 1 summary

Today’s Session
- What do applications want?
- Current tools: How to tell if the Grid is up?
- Current tools: How to tell if my job/file transfer has failed?
- What kind of control is needed?

Grid status and failure detection
- What does “up” mean for you? Today? Tomorrow?
- How often does this need to be checked?
- What are the important issues here?

What does “up” mean for you? Today? Tomorrow?
- Load data is one thing – what about queues?
  - Q – how portable is this?
  - GridICE can do queues, Inca on NGS can do this (PBS), (MDS gathers this)
  - OK
- What if this data is old? How old is too old today/tomorrow?
- What about monitoring real jobs, not test jobs?
  - Will people have to instrument their code? Paradyn as an option?
  - What about putting code in the application?
  - Birger says that’s the job of the queueing system, not the application
  - Might be done in Condor – but that limits you to the Condor standard universe ONLY
  - This is also application specific
- At some point – too slow == failure
  - How can this be known? Application specific?

Lack of resource discovery
- RG has limited resources that are checked manually
- FG always has same services in same places
- Not a short term need

Is grid up
- RG – short term need
- RG – transfer data, run job, get data back
- FG – is service up, MyProxy running, auth mgr running (not in next 6 months)
- LCG – already has this
- NGS uses GITS tests (Globus job submission, small file transfers)
  - Run every 4 hours – new site must have all green for 7 days to be admitted
  - Now running as part of NGS Inca deployment

To test that things are up…
- Could a scaled-down version of the application be created?
- Maybe – but how do you test for the stupid electrician problem?
- What about something small that would touch all the bits of the normal application?
- Test app could be used for failure detection
- Test app could be used for training!

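
The NGS admission rule noted under "Is grid up" (tests run every 4 hours, a new site needs all green for 7 days) could be checked along the lines below. This is only a sketch, not NGS, GITS, or Inca code: the (timestamp, passed) result format, the name site_admissible, the 6-hour slack between runs, and the choice to treat a missing run the same as a failed run are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

TEST_INTERVAL = timedelta(hours=4)    # from the notes: tests run every 4 hours
ADMISSION_WINDOW = timedelta(days=7)  # from the notes: all green for 7 days


def site_admissible(results, now):
    """Return True if the site was green for the whole admission window.

    results: iterable of (timestamp, passed) pairs for one site
             (hypothetical format, not an actual Inca/GITS record).
    Assumption: a missing run counts the same as a failed run.
    """
    window_start = now - ADMISSION_WINDOW
    recent = sorted((t, ok) for t, ok in results if window_start <= t <= now)

    # No data at all, or any red result in the window, blocks admission.
    if not recent or not all(ok for _, ok in recent):
        return False

    # Allow some scheduling jitter, but reject gaps that imply a skipped run.
    slack = TEST_INTERVAL * 1.5  # assumed tolerance of 6 hours
    previous = window_start
    for t, _ in recent:
        if t - previous > slack:
            return False
        previous = t

    # The most recent run must also be fresh enough.
    return now - previous <= slack
```

Called with one site's test history and the current time, site_admissible(history, datetime.now()) returns False for a site with a red result, a gap in its runs, or less than 7 days of history. Whether a gap should count against a site is exactly the "how old is too old" question raised above, so the slack value is the part most likely to need changing.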
Test Suite
- Set of these smaller tests to see how far you can get
- What about a WebMD for job failures?
  - Identify common problems
  - Ask question, run a reporter
- What about individual node failures?
- Scalability issues

How do I stop having to ask if the Grid is up?
- Users want jobs to just run
- Is this really an admin problem?
- What about false positives/negatives?
- Do we need an eBay to rate sites?
- Can we make the administrator's job easier?

Day 2 Questions to discuss
1. Do the apps people think they can use some of the tools, and if so, how?
2. What do the tool people think the apps should use? Is there a tool they’ve missed out on?
- What tools do the apps people think they'll use? (look into)
- From what the tools people have seen, are there any "low hanging fruits" for new tools?
- How do we bridge the gap between the requirements of the apps people and what the tools people are delivering? (How can we generalize from this meeting?)
- How can we get app folks and tool developers to collaborate more closely?
- Where are the 'new' areas in tool development that we might want to support?
- What short term and long term tooling (proposals) can we propose from this meeting?
6. What do people want to see delivered from the meeting?
   a. the report
   b. specific funding for this subject area
   c. collaborative projects etc.
7. What do we do about next year's meeting?