DSS Data & Storage Services
New tape server software
Status and plans
CASTOR face-to-face workshop
22-23 September 2014
Eric Cano on behalf of CERN IT-DSS group
CERN IT Department, CH-1211 Genève 23, Switzerland
www.cern.ch/it

Overview
• Features for first release
• New tape server architecture
  – Control and reporting flows
  – Memory management and data flow
  – Error handling
  – Main process and sessions
  – Stuck session and recovery
• Development methodologies and QA
• What changes in practice?
• What is still missing?
• Logical Block Protection investigation
• Release plans and potential new features

Features for first release
• Continuation of the push to replace legacy tape software
  – Started with creation of tape gateway and bridge
  – VMGR+VDQM will be next
• Drop-in replacement
  – Tapeserverd consolidated in a single daemon
  – Replaces the previous stack:
    • taped & satellites + rtcpd + tapebridged
• Identical outside protocols (almost)
  – Stager / Cli client (readtp is unchanged)
  – VMGR/VDQM
  – tpstat/tpconfig
  – New labelling command (castor-tape-label)
• Keep what works:
  – One process per session (pid listed in tpstat, as before)
• Better logs
• Latency shadowing (no impact of slow DB)
• Empty mount protection
• Result from big teamwork since last meeting:
  – E. Cano, S. Murray, V. Kotlyar, D. Kruse, D. Come

New tape server architecture
• Pipelined: based on FIFOs and threads/thread pools
  – Always fast to post to FIFO
    • Push data blocks, reports, requests for more work
  – Each FIFO output is served by one thread(pool)
    • Simple loop: pop, use/serve the data/request, repeat
  – All latencies are shadowed in the various threads
  – Keep the instruction pipeline non-empty with task prefetch
  – N-way parallel disk access (as before)
  – All reporting is asynchronous
• Tape thread is the central element that we want to keep busy at full speed

Migration session overview
[Diagram: components and labels as shown on the slide]
• Migration Mount Manager (main thread)
  – Instantiate memory manager, injector, packer, disk and tape thread
  – Give initial kick to task injector
  – Wait for completion
• Task Injector (1 thread)
  – Get more work from tape gateway, create and push tasks
  – Triggered by "Request for more" (request more on threshold)
• Disk Read Thread Pool (n threads), fed by a task queue
  – Pop, execute, delete
  – Disk Read Task: get free blocks, read data from disk, push full data block
• Tape Write Single Thread (1 thread), fed by a task queue
  – Pop, execute, delete
  – Tape Write Task: pop block, write to tape, (flush,) report result, return free block
• Memory manager (1 thread): free blocks
• Data FIFO: data blocks
• Report Packer (1 thread): pack information and send bulk report on flush/end session
• Global Status Reporter (1 thread): pack information for tapeserverd and …
• Client queue

Recall session overview
[Diagram: components and labels as shown on the slide]
• Recall Mount Manager (main thread)
  – Instantiate memory manager, injector, packer, disk and tape thread
  – Give initial kick to task injector
  – Wait for completion
• Task Injector (1 thread)
  – Get more work from tape gateway, create and push tasks
  – Triggered by "Request for more" (request more on threshold)
• Tape Read Single Thread (1 thread), fed by a task queue
  – Pop, execute, delete
  – Tape Read Task: pull free blocks, read data from tape, push full data block
• Disk Write Thread Pool (n threads), fed by a task queue
  – Pop, execute, delete
  – Disk Write Task: pop block, write to disk, report result, return free block
• Memory manager (no thread): free blocks
• Data FIFO: data blocks
• Report Packer (1 thread): pack information and send bulk report on threshold/end session
  – Individual file reports, flush reports, end of session report
• Global Status Reporter (1 thread): pack information for tapeserverd and …

Control flow
• Task injector
  – Initially called synchronously (empty mount detection)
  – Triggered by requests for more work (stored in a FIFO)
  – Gets more work from client
  – Creates and injects tasks
    • Tasks are created, linked to each other (reader/writer couple) and injected into the tape and disk thread FIFOs
• Disk thread pool
  – Pops disk tasks, executes them, deletes them and moves to the next
• Tape thread
  – Same as disk after initializing the session
    • Mounting
    • Tape identification
    • Positioning for writing
    • … and unmounting in the end
• The reader thread(pool) requests more work
  – Based on task FIFO content thresholds
  – Always asks for n files or m bytes (whichever comes first, configurable)
  – Asks again when half of that is still available in the task FIFO
  – Asks again one last time when the task FIFO becomes empty (last call)

Reporting flow
• Reports to client (file related)
  – Posted to a FIFO
  – Packed and transmitted in a separate thread
    • Sent on flush in migrations
    • Sent on thresholds in recalls
  – End of session also follows this path
• Reports to parent process (tape/drive related)
  – Posted to a FIFO
  – Transmitted asynchronously by a separate thread
  – Parent process keeps track of the session's status and informs the VDQM and VMGR

Memory management and data flow
• Same as before: circulate a fixed number of memory blocks (size and count configurable)
• Errors can be piggybacked on data blocks
  – Writer side always does the reporting, even for read errors
• Central memory manager
  – Migration: actively pushes blocks for each tape write task
    • Disk read task pulls blocks from there
    • Returns the block with data in a second FIFO
    • Data gets written to tape by the tape write task
  – Recalls: passive container
    • Tape read task pulls memory blocks as needed
    • Pushes them to the disk write tasks (in FIFOs)
    • Disk write tasks push the data to the disk server
  – Memory blocks get recycled to the memory manager after writing to disk or tape

Error handling
• Reporting
  – Errors get logged when they happen
  – If an error happens in the reader, it gets propagated to the writer through the data path
  – The writer propagates the error to the client
• Session behaviour on error
  – Recalls: carry on for stager, halt on error for readtp
    • absolute positioning by blockId (stager)
    • relative positioning by fSeq (readtp)
  – Migrations: any error ends the session

Main process and sessions
• The session is forked by the parent process
  – Parent process keeps track of sessions and drive statuses in a drive catalogue
  – Answers VDQM requests
  – Filters input requests based on drive state
  – Manages the configuration files
• The child session reports tape related status to the parent process
  – mounts, unmounts
  – amount of data transferred, for the watchdog
• The parent process informs the VMGR and VDQM on behalf of the child session
  – Client library completely rewritten
• Forking is actually done by a utility sub-process (forker)
  – No actual forking from the multithreaded parent process
• Process inventory:
  – 1 parent process + 1 fork helper process
  – N session processes (at most 1 per drive)

ZeroMQ+Protocol buffers
• The parent/session process communication is a no-risk protocol
  – Both ends get released/deployed together
  – Can be changed at any time
• Opportunity to experiment with new serialization methodologies
  – Need to replace umbrello
• This gave good results
  – Protocol buffers provide robust serialization with little development effort
  – ZMQ handles many communication scenarios
• Still in finalization (issues in the watchdog communication)

Stuck sessions and recovery
• Stuck sessions do happen
  – RFIO problems suspected
• Currently handled by a script
  – Log file based. No movement for a set time => kill
  – Problematic with unusually big files
• Watchdog will get more internal data
  – Too much to be logged
  – If data stops flowing for a given time => kill
• Clean-up process launched automatically when session killed
• No clean-up after session failure
  – a non-stuck session failed to do its own clean-up
  – => drive down

Development methodologies and QA
• Full C++, maintainable software
  – Object encapsulation for separately manageable units
    • Easy unit testing
  – Exception handling simplifies error reporting a lot
  – RAII (destructors) simplifies resource management
  – Cleaner drive specifics implementation through inheritance
    • Easy to add new models
• Hardcoding-free SCSI and tape format layers
  – Naming conventions matching the SCSI documentation
  – String error reporting for all SCSI errors
  – Very similar approach with the AUL tape format
• Unit testing
  – Allows running various scenarios systematically
    • On RPM build
    • Migrations, recalls, good day, bad day, full tape
    • Using fake objects for drive, client interface
    • Easier debugging when problems can be reproduced in unit test context
  – Run tests standalone + through valgrind and helgrind
    • Automatic detection of memory leaks and race conditions
• Completely brought to the CASTOR tree
• Automated system testing would be a nice addition to this setup

What changes in practice?
• The new logs
  – Convergence with the rest of CASTOR logs
  – Single line at completion of tape thread
    • Summarises the session for tape log
  – More detailed timings
    • Will make it easier to pinpoint performance bottlenecks
  – New log parsing required
    • Should be greatly simplified as all relevant information is on a single line
• A single daemon
• Configuration not radically changed

What is still missing?
• Support for Oracle libraries
• The parent process's watchdog for transfer sessions
  – Will move stuck transfer detection from operators' scripts to internal (with better precision)
• File transfer protocol switching
  – Add local file support
    • reliance on RFIO removed
  – Add Xroot support
    • switched on by configuration
    • instead of RFIO
    • Diskserver >= 2.1.14-15 required (for stat call)
  – Add Ceph support
    • Disk path based switch, automatic
• Fine tuning of logs for operations
• Document the latest developments

Release and deployment
• Data transfers are being validated now on IBM drives
• Oracle drives will follow with mount support
• Some previously mentioned features missing
• Target date for a tapeserverd-only 2.1.15 CASTOR release: end of November
• Production deployment ~January
• Compatible with current 2.1.14 stagers
• 2.1.14-15 on disk server will be needed for using Xroot
• 2.1.14 is the end of the road for rtcpd/taped

Logical block protection
• Tests of the tape drive feature have been done by F. Nikolaidis, J. Leduc and K. Ha
• Adds a 4 byte checksum to tape blocks
• Protects the data block during the transfer from computer memory to tape drive
• 2 checksum algorithms in use today:
  – Reed-Solomon
  – CRC32-C
• Reed-Solomon requires 2 threads to match drive throughput
• CRC32-C can fit in a single thread
  – CRC32-C is available on most recent drives

Next tape developments
• Tapeserverd
  – Logical block protection integration
  – Support for pre-emption of session
• VDQM/VMGR
  – Merge of the two in a single tape resource manager
    • Simplify interface
    • Asymmetric drive support
    • Improve scheduling (atomic tape-in-drive semantics for migrations)
      – Today, the chosen tape might not have compatible drives available, leading to migration delays
    • Remove need for manual synchronization
    • Consider pre-emptive scheduling
      – Max out the system with background tasks (repack, verify)
      – Interrupt and make space for user sessions when they come
      – Allow over quota for users when free drives exist
      – Leading to 100% utilisation of the drives
      – Facilitates tape server upgrades
  – Integrate the authentication part for tape (from Cupv)

Conclusion
• Tape server stack has been re-written and consolidated
  – New features already provide improvements
    • Empty mount protection for both read and write
    • Full request and report latency shadowing
    • Better timing monitoring is already in place
  – Major clean-up will allow easier development and maintenance
• More new features coming
  – Xroot/Ceph support
  – Logical block protection
  – Session pre-emption
• End of the road for rtcpd/taped
  – Will be dropped from 2.1.15 as soon as we are happy with tapeserverd in production
• More tape software consolidation around the corner
  – VDQM/VMGR
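
Appendix sketch: migration data flow

The migration data flow described under "Memory management and data flow" (a fixed set of memory blocks, a memory manager that pushes free blocks to each task, disk read tasks that return full blocks through a second FIFO, and a single tape writer that recycles the blocks) can be illustrated with a small model. This is only a sketch under stated assumptions: it is written in Python for brevity, whereas the tape server itself is C++; every name in it (Task, memory_manager, disk_read_worker, tape_write_thread, migrate, BLOCK_SIZE, BLOCK_COUNT) is hypothetical; and it leaves out the task injector, the tape gateway, reporting and error handling.

```python
import queue
import threading

BLOCK_SIZE = 4    # bytes per memory block (tiny on purpose; hypothetical value)
BLOCK_COUNT = 3   # fixed number of circulating memory blocks (hypothetical value)


class Task:
    """One file to migrate: a linked disk-read / tape-write task pair."""

    def __init__(self, data):
        self.data = data
        # Ceiling division; an empty file still travels through one (empty) block.
        self.blocks_needed = max(1, -(-len(data) // BLOCK_SIZE))
        self.free_fifo = queue.Queue()   # memory manager -> disk read task
        self.data_fifo = queue.Queue()   # disk read task -> tape write task


def memory_manager(tasks, free_pool):
    """Push free blocks to the tasks in tape order, so that a later file
    cannot starve an earlier one of memory."""
    for task in tasks:
        for _ in range(task.blocks_needed):
            task.free_fifo.put(free_pool.get())


def disk_read_worker(disk_tasks):
    """Thread pool member: pop a task, execute it, move to the next."""
    while True:
        task = disk_tasks.get()
        if task is None:                 # sentinel: no more work for this worker
            return
        for i in range(task.blocks_needed):
            block = task.free_fifo.get()
            chunk = task.data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            block[:len(chunk)] = chunk   # "read from disk" into the block
            task.data_fifo.put((block, len(chunk)))


def tape_write_thread(tasks, free_pool, tape):
    """Single tape thread: write files in order, then recycle each block."""
    for task in tasks:
        for _ in range(task.blocks_needed):
            block, used = task.data_fifo.get()
            tape.extend(block[:used])    # "write to tape"
            free_pool.put(block)         # return the block to the memory manager


def migrate(files, disk_threads=2):
    """Run one toy migration session; returns (tape content, free block count)."""
    tasks = [Task(data) for data in files]
    free_pool = queue.Queue()
    for _ in range(BLOCK_COUNT):
        free_pool.put(bytearray(BLOCK_SIZE))
    disk_tasks = queue.Queue()
    for task in tasks:
        disk_tasks.put(task)
    for _ in range(disk_threads):
        disk_tasks.put(None)
    tape = bytearray()
    threads = [
        threading.Thread(target=memory_manager, args=(tasks, free_pool)),
        threading.Thread(target=tape_write_thread, args=(tasks, free_pool, tape)),
    ]
    threads += [threading.Thread(target=disk_read_worker, args=(disk_tasks,))
                for _ in range(disk_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return bytes(tape), free_pool.qsize()
```

Calling migrate([b"abcdefghij", b"XYZ"], disk_threads=2) returns the files concatenated in tape order together with the number of blocks back in the free pool. Handing out blocks per task, in tape order, is what lets several disk readers run in parallel without a later file holding all the memory while the tape thread waits on an earlier one.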