Wojciech Sliwinski, BE-CO-IN
for the Middleware team: Felix Ehm, Kris Kostro, Joel Lauener, Radoslaw Orecki, Ilia Yastrebov, [Andrzej Dworak]
Special thanks to: Vito Baggiolini and Pierre Charrue

Agenda
• Context & Motivation for Renovation
• Middleware Review process
• Technical evaluation of the transport layer
• Changes in the MW Architecture in LS1
• MW Upgrade milestones in 2013
• Risk assessment and mitigation
• Conclusions

25th April 2013 Wojciech Sliwinski, Middleware Renovation

Agenda: Context & Motivation for Renovation

MW Mandate & Scope
• Standard set of MW solutions
• Centrally managed services
• Track & optimize runtime parameters
• Well defined feedback channel for users
• Provide support & follow-up issues
[Diagram: Control System, made of GUI Applications, Control Logic and Middleware]
Scope:
• CERN Accelerator Complex
• Operational 24*7*365
• Must be Reliable & High Quality
• 73’000 HW devices, 3’150 servers
• In all Eqp. groups (4 dpts: BE, EN, GS, TE)

CMW in the Controls System
[Figure: layered diagram of the controls system.
 Tiers: presentation tier, middle tier, resource tier.
 Networks: general purpose network, CERN Gigabit Ethernet technical network, WorldFIP segment (1, 2.5 MBits/sec), Profibus, FIP/IO, optical fibers, direct I/O.
 Hosts: operator consoles, fixed displays, file servers, application servers, SCADA servers, timing generation, RT Lynx/OS VME front ends, WorldFIP front ends, PLCs.
 CMW components: CMW client (C++/Java) for JAPC GUIs, LabView, RADE; CMW client (Java) for JAPC Logging, LSA, InCA, SIS; CMW client/server (C++/Java) for Proxy, DIP, AlarmMon, AQ; CMW server (C++) for FESA, FGC, GM; CMW server (C++) for PVSS (Cryo, Vacuum); JMS client (Java) for GUIs and for servers (Logging, InCA, SIS).
 Equipment: beam position monitors, beam loss monitors, beam interlocks, RF systems, quench protection agents, power converters, function generators, cryo temperature sensors.
 The tiers are linked by TCP/IP communication services.]
[Figure, continued: actuators and sensors (cryogenics, vacuum, etc.); LHC machine]

Motivations for MW Renovation
Current CORBA-based CMW-RDA
• Integrated in the Control system
• Used to operate all CERN accelerators
• Provides widely accepted Device/Property model
• > 10 years old
Why review & upgrade MW?
• CORBA was chosen 15 years ago
  ○ Technical limitations of CORBA-based transport
• Functional limitations of the current CMW-RDA
• Codebase with long history: difficult to maintain, needs architecture review
• Major issue of long-term support & future evolution
• Evolution of technology over last 10 years: HW, OS, middleware, 3rd party libraries
• Human factor: less & less CORBA expertise on the market

Technical limitations of CORBA transport
• Became legacy, not actively supported: maintenance issue
  ○ Shrinking community, slow response time
  ○ omniORB (C++): 1 developer/maintainer, last release mid-2011
  ○ JacORB (Java): few developers, small community
• Major technical limitations
  ○ Lack of fully asynchronous processing channel
  ○ Blocking communication: infamous JacORB blocking issue
  ○ Lack of low-level control of IO resources (sockets, request queues)
• Development issues
  ○ Difficult to extend the wire protocol: backward compatibility issue
  ○ Complex, error prone API
  ○ Heavy in memory usage

Summary: Why change CORBA?
• CORBA was chosen 15 years ago
• Not actively maintained: big risk for the MW project
• Better solutions exist on the market
• Invest in future solution rather than maintaining old one

Functional limitations of CMW-RDA
• Several pending operational issues
  ○ Difficult (or hardly possible) to resolve with current library
• Any major change very difficult to introduce
  ○ Technical Stops & Xmas breaks too short for massive deployment
  ○ High risk: major impact on front-end frameworks and applications
• No protection against ’slow/bad’ client applications
  ○ Misbehaving application may destabilise front-end server
  ○ Affects reliability of the subscription channel
  ○ Workaround: introduction of Proxy
• Poor scalability when many clients subscribed
  ○ Stability issues observed when >200 clients subscribed (even for Proxy)
  ○ Threading model doesn’t scale well with many clients
• Missing support for priority clients (e.g. SIS, PM, InCA, Logging)
  ○ Non-critical clients (e.g. GUIs) have the same communication priority
• + others …

Summary: Why change CMW-RDA?
• With current CORBA-based middleware we can’t solve the pending operational issues
• We can’t provide better scalability & reliability
• CMW-RDA is difficult to evolve & extend

Agenda: Middleware Review process

Middleware Renovation process
MW Renovation = MW Review + MW Upgrade
• MW Review aims to provide the most appropriate technical solution satisfying the user requirements
• MW Upgrade establishes the plan & strategy for introduction of the new MW
• Objective: LS1 is the unique opportunity for the major MW upgrade
Middleware Review Process
• Gathering of users' feedback and requirements (2010-11)
• Review of communication and serialization libraries (2011-12)
• Prototyping using selected communication products (2012)
• Design & impl.
  of new RDA3: Data, Client & Server (2012-13)
• Testing & validation of core MW infrastructure (summer’13)
• Upgrade of all dependent MW libraries & services (2013-14)
  ○ JAPC, Directory Service, Proxy, DIP Gateway

Review of users requirements
2010-11: series of interviews with major users
• Lars Jensen, Stephen Jackson (BI)
• Andy Butterworth, Frode Weierud, Roman Sorokoletov (RF)
• Brice Copy, Clara Gaspar (DIP, DIM)
• Frederic Bernard, Herve Milcent, Alexander Egorov (PVSS)
• Alexey Dubrovskiy (CTF), Kris Kostro (DIP gateways)
• Marine Gourber-Pace, Nicolas Hoibian (Logging)
• Nicolas De Metz-Noblat (Front-Ends), Alastair Bland (Infrastructure)
• Michel Arruat (FESA), Stephen Page (FGC)
• Niall Stapley, Mark Buttner, Marek Misiowiec (LASER & DIAMON)
• Nicolas Magnin, Christophe Chanavat (ABT)
• Stephane Deghaye, Jakub Wozniak (InCA, SIS)
• Vito Baggiolini, Roman Gorbonosov (JAPC & DA systems)
+ regular feedback from OP
+ internal team input
http://wikis/display/MW/Interviews+with+Experts

New RDA3: Accepted requirements
[Legend: New requirement]
General
• Java & C++ API, Win (64-bit) & Linux (SLC5 32-bit & SLC6 64-bit)
• Accelerator Device Model (i.e. Device/Property)
• Get, Set, Async-Get, Async-Set, Subscribe
• Early detection of communication failures
• Improve error reporting in all the layers: client, server, gateways
• Admin interface & runtime diagnostics & statistics
Data support
• Data object: primitives, n-dim arrays, data structures
Subscription mechanism
• Subscription behaviour the same regardless of the condition of the server (active, down)
• Several client subscription policies (default: continuous)
• Provide subscription notification ordering
• First-Update enforced via CMW on server-side
  ○ Provide callback to front-end framework for the server-side Get
• Drop support for on-change flag
• Standardise use of subscription filters and update flags (e.g.
  immediate update)
• Add header for acquired Data: common metadata (e.g. acq. stamp, cycle name)
• All loss of data (dropped updates) must be notified to clients

New RDA3: Accepted requirements
[Legend: New requirement]
Client side
• RDA3 client API connects with both: RDA2 (old) & RDA3 (new) servers
• Efficient mechanism for: connection, disconnection & reconnection
• Must be able to recover from any interruption of communication with the server
  ○ Server restarts, IP address change, rename/move of a device to another server
• Improved semantics of Array Calls, i.e. handling of individual parameters
• Enhanced diagnostics & collection of statistics
Server side
• Policies for discarding notifications, i.e. deal with overflows and ’bad clients’
  ○ Instrument with counters & timings allowing to diagnose the notifications delivery
• Prioritisation of Get/Set requests for high-priority clients
• Server-side subscription tree fully managed by CMW
  ○ Server does not need to manage client subscriptions any more
• Manage the client connections, e.g. forced disconnect of a client
• Client lifetime callbacks (i.e. connected, disconnected)

New RDA3: Accepted requirements
[Legend: New requirement]
Server side (cont.)
• Client discovery for the diagnostics purposes (i.e.
  connected clients with payload)
• Enhanced diagnostics & collection of statistics
Ongoing discussions (not accepted yet)
• Prioritisation of subscription notifications for high-priority clients
Technical notes
• Invest in asynchronous & non-blocking communication
• Prefer 0-copy & lock-free data structures, message queues
http://wikis/display/MW/Design+of+New+RDA

New RDA3: Summary of requirements
Unchanged
• Device/Property model
• Set of basic operations (Get, Set, Subscribe)
Fixes & improvements
• Subscription mechanism
• Connection management
• Diagnostics & statistics
New functionality
• Policies for subscription management (client & server)
• Client priorities
• Server-side subscription tree
• Extended Data support
• Standardise First-Update concept

Agenda: Technical evaluation of the transport layer

Middleware transport requirements
Desirable
• Lightweight
• Friendly API, documentation
Mandatory
• Request/reply & pub/sub patterns
• Asynchronous
• Performance & Scalability
• Stability, Maturity & Longevity
• Active community
Fundamental
• Open source license
• C++/Java
• Linux/Windows
• Over TCP/IP LAN

Evaluation process -> our criteria
Appearance
• Creators: specification, documentation
• Users: forums, bug reports
• Internet
Simple usage
• Download: licensing
• Compile: Linux & gcc
• Run examples
Testing
• Communication patterns
• Performance
• Exceptional situations
• QoS
• Configuration
CRITERIA
• API, look & feel, documentation
• Resources, binary size, memory
• Community, maturity
• Communications patterns
• QoS
• Performance
(Andrzej Dworak, ICALEPCS 2011)

Evaluated middleware products
All opinions are based only on our knowledge and evaluation.
Each of the products, depending on the requirements, may constitute a good solution.
• CoreDX, OpenAMQ, RTI DDS, Qpid, ZeroMQ, OpenSpliceDDS, RabbitMQ, YAMI, Ice, omniORB, JacORB, MQTT RSMB, Thrift, Mosquitto
(Andrzej Dworak, ICALEPCS 2011)

Products comparison (according to the criteria)
Criteria: Sync, async & msg patterns; QoS; Dependencies & memory footprint; Performance; Look & feel, API, docs; Community & maturity
[Table: the per-criterion marks were graphical and are not recoverable; only the total scores survive]
Product   Score
ZeroMQ    6
Ice       5
YAMI4     4
RTI       3
Qpid      3
CORBA     2
Thrift    2
(Andrzej Dworak, ICALEPCS 2011)

Conclusions
• Several good middleware solutions available
• The choice is dictated by the most critical requirements
• Not easy: performance matters but also ease of use, community, …
• Prototyping was done with the most promising candidates: ZeroMQ, Ice & YAMI
• Finally we decided to choose ZeroMQ (http://www.zeromq.org/)
  ○ Asynchronous & non-blocking communication
  ○ 0-copy & lock-free data structures, message queues
  ○ Nice API, good documentation & active community

New RDA3 Java: Sync Get round-trip time
[Figure: Sync Get round-trip (1kB message payload). y-axis: Round-trip (ms), 0 to 18; x-axis: Number of clients, 0 to 1000; series: max, average]
Test setup: 1kB message payload, cs-ccr-* machines, 1 server host & 10 client hosts

New RDA3 Java: subscription notification latency
[Figure: Subscription notification latency (1kB message payload). y-axis: Latency (ms), 0 to 250; x-axis: Number of clients, 0 to 1000; series: min, max, average]
Test setup: 1kB message payload, cs-ccr-* machines, 1 server host & 10 client hosts

New RDA3 Java: subscription notification latency
[Figure: Subscription notification latency (a closer look). y-axis: Latency (ms), 0 to 6; x-axis ticks: 0 to 200; series: min, max, average]
(x-axis: Number of clients)
Test setup: 1kB message payload, cs-ccr-* machines, 1 server host & 10 client hosts

Agenda: Changes in the MW Architecture in LS1

Current MW Architecture
[Figure: layered architecture diagram.
 User written: Java Control Programs; VB, Excel, LabView; C++ Programs.
 Clients: JAPC API; Passerelle C++; RDA Client API (C++/Java) with Device/Property Model.
 Middleware: CMW Infrastructure over CORBA-IIOP.
 Central services: Administration console; Configuration Database (CCDB); Directory Service; RBAC A1 Service.
 Servers: RDA Server API (C++/Java) with Device/Property Model; CMW integration in Virtual Devices (Java), FESA Server, FGC Server, PS-GM Server, PVSS Gateway and more servers.
 Physical Devices (BI, BT, CRYO, COLL, QPS, PC, RF, VAC, …)]

Changes in MW Architecture in LS1
[Figure: the same layered diagram, with the CMW Infrastructure transport shown as ZeroMQ instead of CORBA-IIOP and a marker "Upgrade in LS1"]

Agenda: MW Upgrade milestones in 2013

MW Upgrade Milestones in 2013
Milestone                                   Completed by
RDA3 Java (client/server) (alpha)           June’13
RDA3 C++ server (alpha)                     July’13
RDA3 integration with: FESA, FGC, PVSS      July-Oct’13
RDA3 C++/Java (client/server) validated     September’13
New JAPC release with RDA3 Java             September’13
New FESA3.2 release with RDA3               December’13
[Timeline: RDA3 C++ (July’13); integration with FESA, FGC, PVSS (July-Oct’13); RDA3 validated and new JAPC (September’13); new FESA3.2 (December’13); tests with eqp. (Winter’13/14); end of LS1 (August’14)]
End-of-Life for RDA2: LS2

MW Upgrade strategy in LS1 and towards LS2
• No BIG-BANG migration but gradual
• Backward compatible (connection-wise) new RDA3 client library
  ○ New RDA3 clients can communicate with RDA2 & RDA3 servers
• FESA3 will exist with both: old RDA2 (FESA3.1) and new RDA3 (FESA3.2)
• Client apps will migrate during LS1
• FEC developers should migrate to FESA3.2 ASAP
[Diagram labels: Old JAPC, Old RDA2 client; New JAPC, New RDA3 client; RDA2 RDA3 Gateway ("Only for justified, exceptional cases"); Old RDA2 server (FESA2.10); Old RDA2 server (FESA3.1); New RDA3 server (FESA3.2)]

LS1: Changes in JAPC
• New major JAPC version upgrade for RDA3 (September’13)
  ○ Public API backward compatible
  ○ Possible API extensions, but always compatible
  ○ Announcement via accsoft-java-announce list
• Required Actions for JAPC Users
  ○ Update JAPC jars (via CommonBuild)
  ○ Re-release your product (via CommonBuild)
  ○ New JAPC will support communication with RDA2 & RDA3 servers

LS1: Changes in RDA
• New major version: RDA3 (June’13, alpha version)
  ○ Public API NOT backward compatible
  ○ New protocol, new architecture, new design
  ○ Same Device/Property model & Get/Set/Subscribe calls
  ○ Announcement via cmw-news & accsoft-java-announce lists
• Required Actions for RDA Users
  ○ For Java: Use new version of JAPC (API unchanged)
  ○ For Java: New JAPC will support communication with RDA2 & RDA3 servers
  ○ For C++: Upgrade user code to new RDA3 API
  ○ For C++: RDA3 will support communication with RDA2 & RDA3 servers
• Consequences if NO Action (staying with old RDA2)
  ○ NOT possible to communicate with new RDA3 servers (FESA3, FGC, etc.)

Agenda: Risk assessment and mitigation

Risk assessment and mitigation
Risks                                          Mitigation
Wrong product developed (wrong requirements)   Early and continuous involvement of clients & experts
Product is (too) late                          Careful planning and follow-up; fall-back to less ambitious goals
Product has bugs or incompatibilities          Early, continuous testing (unit and functional tests)
Bugs affect operations                         Gradual migration; fast deployment of bugfixes

Risk: Wrong product developed (wrong requirements)
Mitigation: Early and continuous involvement of clients & experts
• We involved clients and experts since 2010
  ○ Requirements review with all major clients
  ○ Technical discussions with eqp.
    experts
• Iterative development involving the Review team
  ○ Design meetings (API and internals) since January 2013
  ○ Alpha versions will be available for feedback and validation several months before the final release
  ○ Feedback is continuously integrated in development (= iterative)

Risk: Product is (too) late
Mitigation: Careful planning and follow-up; fall-back to less ambitious goals
• Planning prepared and followed by the MW team
  ○ Taking into account needs and priorities of other CO projects and clients
• Regular follow-up
  ○ In CO internally by TEC coordinator
  ○ In informal meetings with the MW experts (as done so far)
• Fall-back to less ambitious goals
  ○ Plan priorities of functionality
  ○ Drop (postpone) work with lower priority

Risk: Product has bugs or incompatibilities
Mitigation: Early, continuous testing (unit, functional & integration tests)
• Unit tests to assess quality inside the MW project
  ○ Required dev. phase in the MW team
• Functionality tests in CO Testbed
  ○ Functionality of CMW only
• Integration tests to check interoperability
  ○ Integration with FESA in CO Testbed
  ○ Integration with FGC in FGC Lab

Risk: Bugs affect operations
Mitigation: Gradual Migration (1)
• No BIG-BANG migration but gradual
• Backward compatible (connection-wise) new RDA3 client library
  ○ New RDA3 clients can talk to old RDA2 servers
• FESA3 will exist with both: old RDA2 and new RDA3
[Diagram labels: Old JAPC, Old RDA2 client; New JAPC, New RDA3 client; Old RDA2 server (FESA2); Old RDA2 server (FESA3); New RDA3 server (FESA3)]

Risk: Bugs affect operations
Mitigation: Gradual Migration (2)
• Deploy first on systems controlled by the MW team
  ○ E.g.
    Proxies, Gateways
  ○ Gain experience and confidence
• Start deployment with less critical systems first

Risk: Bugs affect operations
Mitigation: Fast deployment of bugfixes
• If (in spite of all) something goes wrong in operations
  ○ Fast reaction from the MW team
  ○ In CO, we will study the need and mechanisms to quickly upgrade also servers

Conclusions
• We have to replace CORBA with a new solution
• We collected updated users requirements
• MW upgrade will be performed during LS1
• Interoperability between RDA2 and RDA3
• Gradual control system migration until LS2
• End-of-Life for RDA2: LS2
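
The interoperability rules stated in the migration slides (new RDA3 clients reach both RDA2 and RDA3 servers; clients staying on RDA2 cannot reach RDA3 servers) can be summarised as a small lookup. This is an illustrative sketch only, not RDA code: the names Protocol, CLIENT_SUPPORT and can_communicate are hypothetical, and the via_gateway flag is an assumption based on one reading of the "RDA2 RDA3 Gateway" box in the strategy diagram, namely that it lets an old RDA2 client reach an RDA3 server in justified, exceptional cases.

```python
from enum import Enum


class Protocol(Enum):
    """Middleware generation spoken by a client library or a server."""
    RDA2 = "RDA2"  # old CORBA-based CMW-RDA
    RDA3 = "RDA3"  # new ZeroMQ-based RDA


# Server generations each client generation can connect to, as stated
# on the migration slides (hypothetical table, for illustration only).
CLIENT_SUPPORT = {
    Protocol.RDA2: {Protocol.RDA2},
    Protocol.RDA3: {Protocol.RDA2, Protocol.RDA3},
}


def can_communicate(client, server, via_gateway=False):
    """Return True if a client of one generation can reach a server of another.

    via_gateway models the "RDA2 RDA3 Gateway" of the strategy diagram.
    Its exact role is an assumption made for this sketch.
    """
    if server in CLIENT_SUPPORT[client]:
        return True
    # Assumed exceptional path: old client -> gateway -> new server.
    return via_gateway and client is Protocol.RDA2 and server is Protocol.RDA3


if __name__ == "__main__":
    for client in Protocol:
        for server in Protocol:
            ok = can_communicate(client, server)
            print(f"{client.value} client -> {server.value} server: {ok}")
```

Under these assumptions, the only combination that fails without a gateway is an old RDA2 client talking to a new RDA3 server, which is why client applications are asked to move to the new JAPC or the RDA3 client API during LS1.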