The European X-Ray Laser Project Redundant EPICS IOCs Matthias Clausen Gongfa Liu Bernd Schoeneburg (DESY) Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 XFEL X-Ray Free-Electron Laser The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser Agenda • • • • • High Availability Choose the ‘right’ approach Design Implementation Functionality – Redundancy Monitor Task – Continuous Control Executive – SNL Executive • Redundancy management – Diagnostic – SNL Debugger • Outlook Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 2 The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser High Availability Example: File-Server • Main Service: NFS File Server (1) – must be cluster service • Adding Services: Archiver (2) – as cluster service? – as ‘managed’ service! • Adding Services: LDAP-HA Archiver (2) Archiver (2) (1) – must be cluster service right choice for a file server? Keep your eye on the main service. Do not allow other services to interfere with the main service. Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 3 The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser High Availability: How to implement it? • Increase Availability by Following Mill Specs ? • Redundant Components ! Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 4 The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser High Availability: Why? In our case: • 24/7 Cryogenic operations for more then one year of operation without any interruption Necessary for: • FLASH Cryogenic Plant Will be converted from (redundant) commercial to redundant EPICS next year.(1/3 of the system) • XFEL Cryogenic Plant Will be converted from (redundant) commercial to redundant EPICS in 2010.(remaining part) • XFEL Cryogenic (and possibly Utility) Controls in the XFEL Tunnel Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 5 The European X-Ray Laser Project When using redundant IOCs? In applications, where high availability is needed and the failure of an IOC can cause a long plant breakdown. If you have to be able to maintain the system during operation. Like exchanging a power supply, or loading new software versions. If a risk of failure is increased; e.g. in areas, where ionizing radiation can be present. Design goal: The redundant IOC pair must be more reliable than a standalone system! Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 XFEL X-Ray Free-Electron Laser The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser Project Schedule: • Design Phase (June 2005) • Identifying the main components: – Redundancy Monitor Task – Continuous Control Executive – SNL Executive • Implementation Phase (March-September 2006) – Redundancy Monitor Task: Industry – Continuous Control Executive: Bob and his brother – SNL Executive: DESY with SLAC support • Testing (2006-2007) • Porting RMT and CCE to other OS (cooperation with KEK) • Testing (2007) • Porting to uTCA System (planned) • Production: Middle of 2008 Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 7 The European X-Ray Laser Project Layout: RMT, CC-Exec, SNL-Exec LAN The Redundancy Monitor Task (RMT) is an implementation Independent from EPICS Redundancy Monitor Process Control Update Process Q U E RMT CCE SNL Contr ol Exec SNL Exec Scan Task SNL Prog Scan Task SNL Prog : : : : Scan Task I/O Driver Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 SNL Prog I/O Driver I/O Driver Driver SNL Update Process Q U E XFEL X-Ray Free-Electron Laser The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser Basic Layout Eternet private Eternet RMT controller (IOC) A RMT controller (IOC) B The RMT is a process which handles all redundancy issues within the IOC redundant pair Two IOC form a redundant pair One controller is active (Master state) The other IOC keeps synchronized with the Master Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 9 The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser Ethernet Connectivity •check global Ethernet Global services: Time Alarming ..... CA Clients •check public Ethernet •check private Ethernet •RMT communication •synchronization of data •CA broadcast Eternet •Master replies only private Eternet RMT controller (IOC) A Master RMT controller (IOC) B redundant pair Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 10 The European X-Ray Laser Project RMT: the Redundancy Supervisor XML Diagnose/Control IF Monitor XML Task Shell Commands rmtConfig rmtStart rmtStop rmtShutdown rmtReport ... rmtConfig rmtStart rmtStop rmtShutdown rmtReport ... External RMT Functions Redundancy Monitor Task Checks communication over public ethernet by connects to time or ftp server Main Thread Status Machine Ethernet Controller TCP/IP Communication to Partner RMT TCP/IP Communication to Partner RMT Task A Control Table TCP/IP Controller Public Ethernet Task B Driver IF TCP/IP Controller Private Ethernet Driver IF Driver n Process internal Thread Process realized by T-H Process realized by Desy Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 Config File PRR Controller Driver IF SNL Exec PRR Controller PRR Controller Driver IF Control Exec PRR Controller Driver IF CA Server Watch Dog XFEL X-Ray Free-Electron Laser The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser How the RMT works • Control the processes of interest for redundancy Init • Processes register themselves to be controlled • Communicate with the RMT in the other IOC • Set the IOC in the Master- or Slave-state (manage switch-over) • Monitor network connections (slide before) Master Slave CMD to fail Master fail Slave test RMT State Machine (sinplified) RMT In a redundant IOC PRR-processes register by calling a RMT function and wait for a start command. register start stop sync read status process Some processes need to be synchronized with their partner process in the other IOC. Synchronization over the private Ethernet is controlled by the RMT. Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser When to fail over? From the Preamble of the design Document: “Any redundant implementation must make the system more reliable than the non redundant one. Precaution must be taken especially for the detection of errors which shall initiate the failover. This operation should only be activated if there is no doubt that keeping the actual mastership definitely causes more damage to the controlled system than an automatic failover.” Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 13 The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser EPICS Specific Parts LAN RMT is an implementation Independent from EPICS CCE: The Continuous Control Executive permanently collects changes on the master to update the client SNL Executive: Permanently collects states and values from SNL programs on the master. Sending changes to the slave. Redundancy Monitor Process Control Update Process Q U E RMT CCE Contr ol Exec SNL Exec Scan Task SNL Prog Scan Task SNL Prog : : : : Scan Task I/O Driver Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 SNL SNL Prog I/O Driver I/O Driver Driver SNL Update Process Q U E The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser Remote Diagnostic Client XML Request files can be passed though the RMT to any registered underlying process. The final destination of the Message will generate an answer. SNL Executive: In this special case the remote diagnostic protocol is used to debug the SLS programs actually Executive running on the remote IOC. Redundancy Monitor Process Control Update Process Q U E RMT CCE Contr ol Exec SNL Exec Scan Task SNL Prog Scan Task SNL Prog : : : : Scan Task I/O Driver Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 SNL SNL Prog I/O Driver I/O Driver Driver SNL Update Process Q U E The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser Status • Implementing support for redundant systems is quite a challenging task. • The current implementation is improving its maturity by continuous testing. (Gongfa Liu – (Hefei-China) DESY) • The RMT and CCE code has bee ported to Linux by Artem Kazakov (KEK) to run redundant (soft)-IOCs on Linux. (see TPPA31) • The ported RMT has been used to implement redundant CA-Gateways. (Artem Kazakov) (An option for load balancing is also available) Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 16 The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser Outlook • RMT can be used independent from EPICS to implement redundant applications. • The SNL debugging features will be improved. (Joint development of SLAC and DESY) • First production of a redundant IOC is foreseen for middle of 2008 • An implementation based/ running on uTCA is desired. Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007 17 The European X-Ray Laser Project XFEL X-Ray Free-Electron Laser Test System The test system consists of two Compact PCI CPUs in separated crates with redundant power supplies each. Thank you! Matthias Clausen, Gongfa Liu, Bernd Schoeneburg (DESY), ICALEPCS, 2007