Perf WS day 1 summary

Today’s Session

What do applications want?

Current tools: How to tell if the Grid is up?

Current tools: How to tell if my job/file transfer has failed? What kind of control is needed?
Grid status and failure detection

What does “up” mean for you?

Today? Tomorrow?

How often does this need to be checked?

What are the important issues here?
What does “up” mean for you?
Today? Tomorrow?

Load data is one thing – what about queues?


Q – how portable is this?


GridICE can do queues; Inca on NGS can do this (PBS); MDS gathers this
Ok
What if this data is old?

How old is too old today/tomorrow?
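The "how old is too old" question can be framed as a simple age check against a threshold. The sketch below is only an illustration: the function name `is_stale` and the idea of a single `max_age` threshold are assumptions, not something proposed in the session, and the right threshold would depend on the data source.

```python
from datetime import datetime, timedelta

def is_stale(collected_at, now, max_age):
    """Return True if a monitoring record is older than max_age.

    collected_at and now are datetime objects; max_age is a timedelta.
    The threshold is an assumption and would differ per data source
    (for example, load data versus queue data).
    """
    return now - collected_at > max_age

# Usage: a reading taken 10 minutes ago, with a hypothetical 5-minute limit.
now = datetime(2024, 1, 1, 12, 0)
reading_time = now - timedelta(minutes=10)
print(is_stale(reading_time, now, timedelta(minutes=5)))
```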
What about the issue of monitoring real jobs, not test jobs?

Will people have to instrument their code?

Paradyn as an option?

What about putting code in the application?



Birger says that’s the job of the queueing system, not the application
Might be done in Condor – but that limits you to the Condor standard universe ONLY
This is also application specific

At some point – too slow == failure


How can this be known? Is it application specific?
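One way to picture "too slow == failure" is a threshold on elapsed time relative to an application-supplied expected runtime. This is a minimal sketch under stated assumptions: the function `classify_job`, the `slow_factor` parameter, and its default of 3.0 are all hypothetical, and the expected runtime would have to come from the application itself.

```python
def classify_job(elapsed_s, expected_s, slow_factor=3.0):
    """Classify a job as 'ok', 'slow', or 'failed' from its elapsed time.

    expected_s is the application-specific expected runtime in seconds.
    slow_factor is a hypothetical multiplier: beyond expected_s * slow_factor
    the job is treated as failed rather than merely slow.
    """
    if expected_s <= 0:
        raise ValueError("expected_s must be positive")
    if elapsed_s > expected_s * slow_factor:
        return "failed"
    if elapsed_s > expected_s:
        return "slow"
    return "ok"

# Usage: a job expected to take 100 s that has been running for 150 s.
print(classify_job(150, 100))
```

The point of the sketch is that the cut-off cannot be chosen by the monitoring tool alone: both inputs are properties of the application.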
Lack of resource discovery

RG has limited resources that are checked
manually

FG always has same services in same places

Not a short term need
Is the Grid up?

RG – short term need

RG - Transfer data, run job, get data back



FG – is service up, MyProxy running, auth mgr running (not in next 6 months)
LCG – already has this
NGS uses GITS tests (Globus job submission, small file transfers)


Run every 4 hours – new site must have all green for
7 days to be admitted
Now running as part of NGS Inca deployment
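The admission rule described above (tests every 4 hours, all green for 7 days) can be written down as a small check over a site's test history. This is a sketch only: the function name `is_admissible`, the `(timestamp, passed)` record shape, and the requirement that no runs are missing from the window are assumptions, not a description of how NGS or Inca implements it.

```python
from datetime import datetime, timedelta

TEST_INTERVAL = timedelta(hours=4)    # tests run every 4 hours
ADMISSION_WINDOW = timedelta(days=7)  # all green for 7 days

def is_admissible(results, now):
    """Return True if every test in the last 7 days passed.

    results is a list of (timestamp, passed) tuples (a hypothetical shape).
    A site with fewer runs than expected in the window is rejected, on the
    assumption that missing results should not count as green.
    """
    window_start = now - ADMISSION_WINDOW
    recent = [passed for when, passed in results if window_start < when <= now]
    expected_runs = ADMISSION_WINDOW // TEST_INTERVAL  # 42 runs in 7 days
    return len(recent) >= expected_runs and all(recent)

# Usage: a full week of passing results, one every 4 hours.
now = datetime(2024, 1, 8, 0, 0)
history = [(now - i * TEST_INTERVAL, True) for i in range(42)]
print(is_admissible(history, now))
```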
To test that things are up…

Could a scaled-down version of the application be created?




Maybe – but how do you test for the stupid
electrician problem?
What about something small that would touch all the bits of the normal application?
A test app might be used for failure detection
A test app could be used for training!
Test Suite



Set of these smaller tests to see how far
you can get
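A suite that shows "how far you can get" might run small checks in order and stop at the first one that fails. The sketch below assumes each stage is a plain callable returning True or False; the function name `run_stages` and the stand-in checks are hypothetical, with stage names borrowed from the "transfer data, run job, get data back" sequence mentioned earlier.

```python
def run_stages(stages):
    """Run (name, check) pairs in order and stop at the first failure.

    Returns (passed_names, failed_name). failed_name is None when every
    stage passes. A check that raises an exception counts as a failure.
    """
    passed = []
    for name, check in stages:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            return passed, name
        passed.append(name)
    return passed, None

# Usage with stand-in checks; real checks would contact the Grid services.
stages = [
    ("transfer data", lambda: True),
    ("run job", lambda: False),
    ("get data back", lambda: True),
]
print(run_stages(stages))
```

Reporting the last stage reached, rather than a single pass/fail, is what lets the same suite serve for failure detection and for training.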
What about WebMD for job failures

Identify common problems

Ask question, run a reporter
What about individual node failures?

Scalability issues
How do I stop having to ask if the Grid is up?

Users want jobs to just run

Is this really an admin problem?

What about false positives/negatives?

Do we need eBay-style ratings for sites?

Can we make the administrator job easier?
Day 2
Questions to discuss
1. Do the apps people think they can use some of the tools? If so, how?

2. What do the tools people think the apps should use? Is there a tool they’ve missed out on?

3. What tools do the apps people think they'll use? (look into)
4. From what the tools people have seen, are there any "low-hanging fruits" for new tools?
5. How do we bridge the gap between the requirements of the apps people and what the tools people are delivering? (How can we generalize from this meeting?)
How can we get app folks and tool developers to collaborate more closely?
Where are the 'new' areas in tool development that we might want to support?

What short-term and long-term tooling (proposals) can we propose from this meeting?
6. What do people want to see delivered from the meeting?

a. the report

b. specific funding for this subject area

c. collaborative projects etc.
7. What do we do about next year's meeting?