ATLAS Databases
2nd September 2008 - Richard Hawkings / Paul Laycock

Conditions data handling in FDR2c
 Tag hierarchies set up (largely by Paul) and communicated in advance
 No real problems uploading data to the correct tag
 Calibration experts starting to deal with ‘real’ IOVs (data valid only for the calibration period) - see the PyCool sketch after this list
 New POOL file registration scripts worked fine
 Calibration users need to be in AFS group atlcond:poolcond
 Consider doing calibration uploads from a ‘calibration’ account, not personal ones?
 No instances of data in COOL without a corresponding POOL file upload (or with a wrong one)
 No use of run-signoff database pages yet
 System was not yet ready and integrated (holidays; too busy with other things)
 But only one set of runs, and all calibrations were ‘accepted’ - no real test
 Handling of detector status information works technically
 Merging and transfer to LBSUMM folder (for ESD/AOD) still done by hand
 Limited mapping of DQ histograms to status flags restricts usefulness
 Need to make sure this improves for real data
 Need to clarify how detector status flags are dealt with in ES1, ES2 processing
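To make the ‘real’ IOV and tag-upload workflow above concrete, here is a minimal PyCool sketch of writing one calibration payload with a run-based IOV and tagging it. It assumes a standalone SQLite file so it can run anywhere; the folder path, payload field and leaf tag name are invented for illustration, not the real FDR2c folders, and real uploads go through the registration scripts against the production Oracle server.

from PyCool import cool

# Standalone SQLite file so the sketch is self-contained; real uploads
# target the production Oracle server via the registration scripts
dbSvc = cool.DatabaseSvcFactory.databaseService()
db = dbSvc.createDatabase('sqlite://;schema=demo_cond.db;dbname=DEMO')

# One float per channel; MULTI_VERSION so the folder can carry tags
spec = cool.RecordSpecification()
spec.extend('constant', cool.StorageType.Float)
fspec = cool.FolderSpecification(cool.FolderVersioning.MULTI_VERSION, spec)
folder = db.createFolder('/DEMO/Calib', fspec, 'demo calib folder', True)

# ATLAS encodes (run, lumiblock) in the 63-bit IOV key as run << 32 | LB,
# so a 'real' IOV covers exactly the calibration period, not IOVmin..max
run = 90210
since = run << 32
until = (run + 1) << 32

payload = cool.Record(spec)
payload['constant'] = 1.23
folder.storeObject(since, until, payload, 0)   # channel 0, into HEAD

# Tag the HEAD under the leaf tag agreed in advance in the tag hierarchy
folder.tagCurrentHead('DEMO-Calib-001', 'demo calibration tag')
db.closeDatabase()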
Conditions DB access problems
 Big problems with Tier-0 conditions DB access on Thursday night / Friday morning
 Combination of several factors
 2 of 4 Oracle server nodes got into trouble and restarted
 Kernel patch being applied this week, some interdependencies not fully understood yet
 Server full of ‘stuck’ connections which were never released or cleaned up - deadlock
 Very high load due to FDR2 bulk reprocessing and cosmics reprocessing going on in parallel, plus FCT, ATN, RTT, TCT tests, plus user jobs
 All jobs accessing Oracle directly, no use of SQLite replicas at present
 Replica only useful once the run is ended online - applicable to ES2, bulk reco only
 Vulnerability in that ALL Athena jobs accessing Oracle use the same reader account
 Limit of 800 concurrent sessions, now changed to 4 x 800
 Each Athena job holds O(10) connections in parallel (one per subdetector schema) until the end of the first event - typically for 5 minutes or so. Vulnerable to ‘deadlock’
 Further actions being pursued
 Deploy SQLite replica for bulk processing (but not for cosmics / express stream) - see the connection-policy sketch after this list
 Use a dedicated COOL reader account for Tier-0 jobs to guarantee a quota of connections
 Reduce connection load from Athena jobs (short/long term actions)
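As a hedged sketch of the replica-first policy in the bullets above: prefer a local SQLite replica where a job is allowed to use one (bulk/ES2 processing of ended runs), otherwise fall back to Oracle, ideally through a dedicated Tier-0 reader account with its own session quota. The paths, schema names and the helper function itself are illustrative assumptions, not the real ATLAS configuration (which is steered through CORAL database lookup files).

import os

def conditions_connect_string(schema, run_ended):
    # Hypothetical location of a locally deployed SQLite replica
    replica = '/cond/replicas/%s.db' % schema
    if run_ended and os.path.exists(replica):
        # Replica consumes no Oracle session at all
        return 'sqlite://;schema=%s;dbname=COMP200' % replica
    # Direct Oracle access; a dedicated Tier-0 reader account would
    # ring-fence these sessions from the shared 4 x 800 limit
    return 'oracle://ATLAS_COOLPROD/%s' % schema

# Express-stream (ES1) jobs run while the run is still ongoing, so they
# must read Oracle; bulk reco of an ended run can use the replica
print(conditions_connect_string('ATLAS_COOLONL_INDET', run_ended=True))
print(conditions_connect_string('ATLAS_COOLONL_INDET', run_ended=False))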
Next steps - discussion needed
 Work on conditions DB access problems
 Deployment of SQLite replicas to be used where possible
 Start to set up tag hierarchies for first data
 Separate top-level tags to be used by HLT, monitoring, Tier-0, reprocessing - see the tag-relation sketch after this list
 Define calibration loop model for first data
 Cosmics processing has no calibration loop, and several ‘express’ streams
 Same plan for single-beam running, or move to a ‘calibration loop’?
 A 24-hour calibration loop might be needed for code fixes even if no prompt calibration can be done yet; there might be multiple processings at Tier-0
 What to do for first collisions
 Sign-off tool and Tier-0/conditions integration to support all this...?
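To illustrate the separate top-level tags idea, here is a hedged PyCool sketch continuing the earlier example (same invented SQLite file and folder): the single leaf tag is related upwards into independent parent tags, one per client, so each can later be repointed to a better calibration without touching the others. All tag names are invented; in ATLAS the relations continue up the folderset tree to the global top-level tags.

from PyCool import cool

dbSvc = cool.DatabaseSvcFactory.databaseService()
db = dbSvc.openDatabase('sqlite://;schema=demo_cond.db;dbname=DEMO', False)

folder = db.getFolder('/DEMO/Calib')
for top in ('DEMO-HLT-001', 'DEMO-MONI-001', 'DEMO-T0-001', 'DEMO-REPRO-001'):
    # Parent tag lives in the parent folderset /DEMO; it is created
    # implicitly by the relation if it does not yet exist
    folder.createTagRelation(top, 'DEMO-Calib-001')
db.closeDatabase()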