Middleware renovation

Wojciech Sliwinski BE-CO-IN
for the Middleware team:
Felix Ehm, Kris Kostro, Joel Lauener,
Radoslaw Orecki, Ilia Yastrebov, [Andrzej Dworak]
Special thanks to: Vito Baggiolini and Pierre Charrue
Agenda

Context & Motivation for Renovation

Middleware Review process

Technical evaluation of the transport layer

Changes in the MW Architecture in LS1

MW Upgrade milestones in 2013

Risk assessment and mitigation

Conclusions
25th April 2013
Wojciech Sliwinski, Middleware Renovation
2
Agenda
Context & Motivation for Renovation
MW Mandate & Scope

Standard set of MW solutions
Centrally managed services
Track & optimize runtime parameters
Well defined feedback channel for users
Provide support & follow-up issues

[Diagram: Control System layers: GUI Applications, Control Logic, Middleware]

Scope: CERN Accelerator Complex
 Operational 24*7*365
 Must be Reliable & High Quality
 73’000 HW devices, 3’150 servers
 In all Eqp. groups (4 dpts: BE, EN, GS, TE)
CMW in the Controls System
[Diagram: three-tier controls architecture and where CMW is used]
Presentation tier (operator consoles, fixed displays)
 CMW client (C++/Java), JAPC: GUIs, LabView, RADE
 JMS client (Java): GUIs
Middle tier (application servers, SCADA servers, file servers)
 CMW client (Java), JAPC: Logging, LSA, InCA, SIS
 CMW client/server (C++/Java): Proxy, DIP, AlarmMon, AQ
 JMS client (Java): servers for Logging, InCA, SIS
Resource tier (VME front ends, RT Lynx/OS, PLCs, WorldFIP front ends, timing generation)
 CMW server (C++): FESA, FGC, GM
 CMW server (C++): PVSS (Cryo, Vacuum)
Networks: general purpose network; CERN Gigabit Ethernet technical network (TCP/IP communication services); WorldFIP segment (1, 2.5 Mbit/s), PROFIBUS, FIP/IO, optical fibers, direct I/O
Equipment (LHC machine): beam position monitors, beam loss monitors, beam interlocks, RF systems, etc.; quench protection agents, power converter function generators, cryo temperature sensors; actuators and sensors for cryogenics, vacuum, etc.
Motivations for MW Renovation

Current CORBA-based CMW-RDA
 Integrated in the Control system
 Used to operate all CERN accelerators
 Provides widely accepted Device/Property model
 > 10 years old

Why review & upgrade MW?
 CORBA was chosen 15 years ago
 Technical limitations of CORBA-based transport
 Functional limitations of the current CMW-RDA
 Codebase with long history -> difficult to maintain, needs architecture review
 Major issue of long-term support & future evolution
 Evolution of technology over last 10 years: HW, OS, middleware, 3rd party libraries
 Human factor -> less & less CORBA expertise on the market
Technical limitations of CORBA transport

Became legacy, not actively supported -> maintenance issue
 Shrinking community, slow response time
 omniORB (C++) – 1 developer/maintainer, last release mid-2011
 JacORB (Java) – few developers, small community

Major technical limitations
 Lack of fully asynchronous processing channel
 Blocking communication -> infamous JacORB blocking issue
 Lack of low-level control of IO resources (sockets, request queues)

Development issues
 Difficult to extend the wire protocol -> backward compatibility issue
 Complex, error-prone API
 Heavy in memory usage
Summary: Why change CORBA?
CORBA was chosen 15 years ago
 Not actively maintained -> big risk for the MW project
 Better solutions exist on the market
 Invest in future solution rather than maintaining old one

Functional limitations of CMW-RDA

Several pending operational issues
 Difficult (or hardly possible) to resolve with current library
 Any major change very difficult to introduce
○ Technical Stops & Xmas breaks too short for massive deployment
○ High risk -> major impact on front-end frameworks and applications

No protection against ’slow/bad’ client applications
 Misbehaving application may destabilise front-end server
 Affects reliability of the subscription channel
 Workaround: introduction of Proxy

Poor scalability when many clients subscribed
 Stability issues observed when >200 clients subscribed (even for Proxy)
 Threading model doesn’t scale well with many clients

Missing support for priority clients (e.g. SIS, PM, InCA, Logging)
 Non-critical clients (e.g. GUIs) have the same communication priority

+ others …
Summary: Why change CMW-RDA?
With current CORBA-based middleware we can’t solve the pending operational issues
 We can’t provide better scalability & reliability
 CMW-RDA is difficult to evolve & extend

Agenda
Middleware Review process
Middleware Renovation process

MW Renovation = MW Review + MW Upgrade
 MW Review aims to provide the most appropriate technical solution satisfying the user requirements
 MW Upgrade establishes the plan & strategy for introduction of the new MW
 Objective: LS1 is the unique opportunity for the major MW upgrade

Middleware Review Process
 Gathering of user feedback and requirements (2010-11)
 Review of communication and serialization libraries (2011-12)
 Prototyping using selected communication products (2012)
 Design & impl. of new RDA3: Data, Client & Server (2012-13)
 Testing & validation of core MW infrastructure (summer’13)
 Upgrade of all dependent MW libraries & services (2013-14)
○ JAPC, Directory Service, Proxy, DIP Gateway
Review of user requirements

2010-11 – series of interviews with major users
 Lars Jensen, Stephen Jackson (BI)
 Andy Butterworth, Frode Weierud, Roman Sorokoletov (RF)
 Brice Copy, Clara Gaspar (DIP, DIM)
 Frederic Bernard, Herve Milcent, Alexander Egorov (PVSS)
 Alexey Dubrovskiy (CTF), Kris Kostro (DIP gateways)
 Marine Gourber-Pace, Nicolas Hoibian (Logging)
 Nicolas De Metz-Noblat (Front-Ends), Alastair Bland (Infrastructure)
 Michel Arruat (FESA), Stephen Page (FGC)
 Niall Stapley, Mark Buttner, Marek Misiowiec (LASER & DIAMON)
 Nicolas Magnin, Christophe Chanavat (ABT)
 Stephane Deghaye, Jakub Wozniak (InCA, SIS)
 Vito Baggiolini, Roman Gorbonosov (JAPC & DA systems)
 + regular feedback from OP
 + internal team input

http://wikis/display/MW/Interviews+with+Experts
New RDA3: Accepted requirements

New requirement

General
 Java & C++ API, Win (64-bit) & Linux (SLC5 32-bit & SLC6 64-bit)
 Accelerator Device Model (i.e. Device/Property)
 Get, Set, Async-Get, Async-Set, Subscribe
 Early detection of communication failures
 Improve error reporting in all the layers: client, server, gateways
 Admin interface & runtime diagnostics & statistics

Data support
 Data object: primitives, n-dim arrays, data structures

Subscription mechanism
 Subscription behaviour the same regardless of the server’s condition (active, down)
 Several client subscription policies (default: continuous)
 Provide subscription notification ordering
 First-Update enforced via CMW on server-side
○ Provide callback to front-end framework for the server-side Get
 Drop support for on-change flag
 Standardise use of subscription filters and update flags (e.g. immediate update)
 Add header for acquired Data -> common metadata (e.g. acq. stamp, cycle name)
 All loss of data (dropped updates) must be notified to clients
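
The subscription requirements above (First-Update enforced by the middleware through a server-side Get callback, and a common metadata header on acquired data) can be illustrated with a minimal sketch. This is not the RDA3 API: the class names, field names and the callback signature below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class AcquiredData:
    """Payload plus a common metadata header (field names are assumptions)."""
    value: Any
    acq_stamp_ns: int      # acquisition stamp
    cycle_name: str        # cycle name
    first_update: bool = False


Listener = Callable[[AcquiredData], None]


class SubscriptionTable:
    """Toy subscription table owned by the middleware, keyed by (device, property)."""

    def __init__(self, server_get: Callable[[str, str], AcquiredData]):
        # Callback into the front-end framework, used for the server-side Get.
        self._server_get = server_get
        self._listeners: Dict[Tuple[str, str], List[Listener]] = {}

    def subscribe(self, device: str, prop: str, listener: Listener) -> None:
        self._listeners.setdefault((device, prop), []).append(listener)
        # First-Update: the middleware itself performs a Get and delivers
        # the result to the new subscriber, flagged as the first update.
        first = self._server_get(device, prop)
        first.first_update = True
        listener(first)

    def notify(self, device: str, prop: str, data: AcquiredData) -> None:
        # Deliver a regular update to every subscriber of this property.
        for listener in self._listeners.get((device, prop), []):
            listener(data)
```

In this sketch the server code never tracks subscribers: it only provides the Get callback and calls `notify` when new data is acquired.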
New RDA3: Accepted requirements

New requirement
Client side
 RDA3 client API connects with both: RDA2 (old) & RDA3 (new) servers
 Efficient mechanism for: connection, disconnection & reconnection
 Must be able to recover from any interruption of communication with the server
○ Server restarts, IP address change, rename/move of a device to another server
 Improved semantics of Array Calls, i.e. handling of individual parameters
 Enhanced diagnostics & collection of statistics

Server side
 Policies for discarding notifications, i.e. deal with overflows and ‘bad clients’
○ Instrument with counters & timings to diagnose notification delivery
 Prioritisation of Get/Set requests for high-priority clients
 Server-side subscription tree fully managed by CMW
○ Server does not need to manage client subscriptions any more
 Manage the client connections, e.g. forced disconnect of a client
 Client lifetime callbacks (i.e. connected, disconnected)
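
One possible discard policy for slow clients can be sketched as a bounded per-client queue that drops the oldest pending update on overflow, counts the drops, and reports the loss to the client with the next delivered update. This is only an assumption about how such a policy might look, not the RDA3 design; the class and attribute names are hypothetical.

```python
from collections import deque


class ClientNotificationQueue:
    """Bounded per-client queue with a drop-oldest overflow policy."""

    def __init__(self, capacity: int):
        self._queue = deque()
        self._capacity = capacity
        self.dropped_total = 0            # diagnostic counter for the server admin
        self._dropped_since_last_poll = 0

    def publish(self, update):
        """Called by the server; never blocks, even if the client is slow."""
        if len(self._queue) >= self._capacity:
            self._queue.popleft()         # discard the oldest pending update
            self.dropped_total += 1
            self._dropped_since_last_poll += 1
        self._queue.append(update)

    def poll(self):
        """Return (update, dropped): dropped counts updates lost since the last poll."""
        if not self._queue:
            return None, 0
        update = self._queue.popleft()
        dropped = self._dropped_since_last_poll
        self._dropped_since_last_poll = 0
        return update, dropped


# A slow client with room for 2 pending updates misses 3 of 5 publications.
q = ClientNotificationQueue(capacity=2)
for n in range(1, 6):
    q.publish(n)
print(q.poll())   # (4, 3)
print(q.poll())   # (5, 0)
```

The server-side producer is never held up by a misbehaving client, and the client can tell from the second tuple element that data was lost.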
New RDA3: Accepted requirements
New requirement

Server side (cont.)
 Client discovery for the diagnostics purposes (i.e. connected clients with payload)
 Enhanced diagnostics & collection of statistics

Ongoing discussions (not accepted yet)
 Prioritisation of subscription notifications for high-priority clients

Technical notes
 Invest in asynchronous & non-blocking communication
 Prefer 0-copy & lock-free data structures, message queues

http://wikis/display/MW/Design+of+New+RDA
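
To illustrate what asynchronous, non-blocking Get calls with early failure detection could look like from the client side, here is a minimal sketch using Python's standard `asyncio`. It does not use RDA3 or ZeroMQ; `async_get`, `fake_transport` and `slow_transport` are hypothetical stand-ins for a real transport.

```python
import asyncio


async def async_get(device, prop, transport, timeout_s=1.0):
    """Issue a Get without blocking the caller; fail early if no reply arrives."""
    try:
        return await asyncio.wait_for(transport(device, prop), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise ConnectionError(
            f"no reply from {device}/{prop} within {timeout_s}s"
        ) from None


async def fake_transport(device, prop):
    # Stand-in for a network round trip to a healthy server.
    await asyncio.sleep(0.01)
    return {"device": device, "property": prop, "value": 42}


async def slow_transport(device, prop):
    # Stand-in for a server that does not answer in time.
    await asyncio.sleep(10)


async def demo():
    # Two Gets are in flight at the same time; neither blocks the other.
    replies = await asyncio.gather(
        async_get("DEV1", "Setting", fake_transport),
        async_get("DEV2", "Setting", fake_transport),
    )
    try:
        await async_get("DEV3", "Setting", slow_transport, timeout_s=0.05)
        failed = False
    except ConnectionError:
        failed = True
    return replies, failed
```

Calling `asyncio.run(demo())` returns the two replies in request order and reports the third request as failed after the timeout.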
New RDA3: Summary of requirements

Unchanged
 Device/Property model
 Set of basic operations (Get, Set, Subscribe)

Fixes & improvements
 Subscription mechanism
 Connection management
 Diagnostics & statistics

New functionality
 Policies for subscription management (client & server)
 Client priorities
 Server-side subscription tree
 Extended Data support
 Standardise First-Update concept
Agenda
Technical evaluation of the transport layer
Middleware transport requirements
Lightweight
Desirable
Friendly API, documentation
Request/reply & pub/sub patterns
Asynchronous
Performance & Scalability
Mandatory
Stability, Maturity & Longevity
Active community
Open source license
C++/Java
Fundamental
Linux/Windows
Over TCP/IP LAN
Evaluation process –> our criteria
Appearance
 Creators: specification, documentation
 Users: forums, bug reports
 Internet
Simple usage
 Download: licensing
 Compile: Linux & gcc
 Run examples
Testing
 Communication patterns
 Performance
 Exceptional situations
 QoS
 Configuration

CRITERIA
 API, look & feel, documentation
 Resources, binary size, memory
 Community, maturity
 Communication patterns
 QoS
 Performance

Andrzej Dworak, ICALEPCS 2011
Evaluated middleware products
All opinions are based only on our knowledge and evaluation. Each of the
products, depending on the requirements, may constitute a good solution.
CoreDX, OpenAMQ, RTI DDS, Qpid, ZeroMQ, OpenSpliceDDS, RabbitMQ, YAMI, Ice, omniORB, JacORB, MQTT RSMB, Thrift, Mosquitto
Products comparison (according to the criteria)
Criteria: Sync, async & msg patterns; QoS; Dependencies & memory footprint; Performance; Look & feel, API, docs; Community & maturity

Product   Score
ZeroMQ    6
Ice       5
YAMI4     4
RTI       3
Qpid      3
CORBA     2
Thrift    2
Conclusions

Several good middleware solutions available
The choice is dictated by the most critical requirements
Not easy -> performance matters but also ease of use, community, …
Prototyping was done with the most promising candidates:
 ZeroMQ, Ice & YAMI

Finally, we chose ZeroMQ (http://www.zeromq.org/)
 Asynchronous & non-blocking communication
 0-copy & lock-free data structures, message queues
 Nice API, good documentation & active community
New RDA3 Java – Sync Get round-trip time
[Figure: Sync Get round-trip time (1kB message payload). x-axis: number of clients (0 to 1000); y-axis: round-trip (ms), 0 to 18; series: max, average]
Test setup: 1kB message payload, cs-ccr-* machines, 1 server host & 10 client hosts
New RDA3 Java – subscription notification latency
[Figure: Subscription notification latency (1kB message payload). x-axis: number of clients (0 to 1000); y-axis: latency (ms), 0 to 250; series: min, max, average]
Test setup: 1kB message payload, cs-ccr-* machines, 1 server host & 10 client hosts
New RDA3 Java – subscription notification latency
[Figure: Subscription notification latency (a closer look). x-axis: number of clients (0 to 200); y-axis: latency (ms), 0 to 6; series: min, max, average]
Test setup: 1kB message payload, cs-ccr-* machines, 1 server host & 10 client hosts
Agenda
Changes in the MW Architecture in LS1
Current MW Architecture
[Diagram, top to bottom; legend: User written, Middleware, Central services]
Clients (user written): Java Control Programs; VB, Excel, LabView; C++ Programs; Administration console
Client-side APIs: JAPC API; Passerelle C++; RDA Client API (C++/Java), Device/Property Model
CMW Infrastructure: CORBA-IIOP
Central services: Directory Service; RBAC A1 Service; Configuration Database (CCDB)
Server-side API: RDA Server API (C++/Java), Device/Property Model
Servers (with CMW integration): Virtual Devices (Java); FESA Server; FGC Server; PS-GM Server; PVSS Gateway; More Servers
Physical Devices (BI, BT, CRYO, COLL, QPS, PC, RF, VAC, …)
Changes in MW Architecture in LS1
[Diagram, top to bottom; legend: User written, Middleware, Central services, Upgrade in LS1]
Clients (user written): Java Control Programs; VB, Excel, LabView; C++ Programs; Administration console
Client-side APIs: JAPC API; Passerelle C++; RDA Client API (C++/Java), Device/Property Model
CMW Infrastructure: ZeroMQ
Central services: Directory Service; RBAC A1 Service; Configuration Database (CCDB)
Server-side API: RDA Server API (C++/Java), Device/Property Model
Servers (with CMW integration): Virtual Devices (Java); FESA Server; FGC Server; PS-GM Server; PVSS Gateway; More Servers
Physical Devices (BI, BT, CRYO, COLL, QPS, PC, RF, VAC, …)
Agenda
MW Upgrade milestones in 2013
MW Upgrade Milestones in 2013
Milestone                                  Completed by
RDA3 Java (client/server) (alpha)          June’13
RDA3 C++ server (alpha)                    July’13
RDA3 integration with: FESA, FGC, PVSS     July-Oct’13
RDA3 C++/Java (client/server) validated    September’13
New JAPC release with RDA3 Java            September’13
New FESA3.2 release with RDA3              December’13
Tests with eqp.                            Winter’13/14
End LS1                                    August’14

End-of-Life for RDA2: LS2
MW Upgrade strategy in LS1 and towards LS2

No BIG-BANG migration but gradual
Backward compatible (connection-wise) new RDA3 client library
 New RDA3 clients can communicate with RDA2 & RDA3 servers
 FESA3 will exist with both: old RDA2 (FESA3.1) and new RDA3 (FESA3.2)

[Diagram: clients: Old JAPC with old RDA2 client, New JAPC with new RDA3 client; servers: old RDA2 server (FESA2.10), old RDA2 server (FESA3.1), new RDA3 server (FESA3.2); RDA2 -> RDA3 Gateway: only for justified, exceptional cases]
 Client apps will migrate during LS1
 FEC developers should migrate to FESA3.2 ASAP
LS1: Changes in JAPC

New major JAPC version -> upgrade for RDA3 (September’13)
 Public API backward compatible
 Possible API extensions, but always compatible
 Announcement via accsoft-java-announce list

Required Actions for JAPC Users
 Update JAPC jars (via CommonBuild)
 Re-release your product (via CommonBuild)
 New JAPC will support communication with RDA2 & RDA3 servers
LS1: Changes in RDA

New major version: RDA3 (June’13 – alpha version)
 Public API NOT backward compatible
 New protocol, new architecture, new design
 Same Device/Property model & Get/Set/Subscribe calls
 Announcement via cmw-news & accsoft-java-announce lists

Required Actions for RDA Users
 For Java: Use new version of JAPC (API unchanged)
 For Java: New JAPC will support communication with RDA2 & RDA3 servers
 For C++: Upgrade user code to new RDA3 API
 For C++: RDA3 will support communication with RDA2 & RDA3 servers

Consequences if NO Action -> staying with old RDA2
 NOT possible to communicate with new RDA3 servers (FESA3, FGC, etc.)
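
The compatibility rules stated in this deck (new RDA3 clients talk to both RDA2 and RDA3 servers; old RDA2 clients cannot reach RDA3 servers, except through the RDA2 -> RDA3 gateway foreseen for justified, exceptional cases) can be summarised as a small truth function. The sketch below is only a restatement of those rules; the `Rda` enum and `can_communicate` are hypothetical names, not part of any CMW library.

```python
from enum import Enum


class Rda(Enum):
    RDA2 = 2
    RDA3 = 3


def can_communicate(client: Rda, server: Rda, via_gateway: bool = False) -> bool:
    """Return True if a client of one RDA generation can reach a server of another."""
    if client is Rda.RDA3:
        return True        # new client library connects to RDA2 and RDA3 servers
    if server is Rda.RDA2:
        return True        # old client, old server: unchanged
    return via_gateway     # old client, new server: only through the gateway


print(can_communicate(Rda.RDA3, Rda.RDA2))                    # True
print(can_communicate(Rda.RDA2, Rda.RDA3))                    # False
print(can_communicate(Rda.RDA2, Rda.RDA3, via_gateway=True))  # True
```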
Agenda
Risk assessment and mitigation
Risk assessment and mitigation
Risk: Wrong product developed (wrong requirements)
 Mitigation: Early and continuous involvement of clients & experts
Risk: Product is (too) late
 Mitigation: Careful planning and follow-up
 Mitigation: Fall-back to less ambitious goals
Risk: Product has bugs or incompatibilities
 Mitigation: Early, continuous testing (unit and functional tests)
Risk: Bugs affect operations
 Mitigation: Gradual migration
 Mitigation: Fast deployment of bugfixes
Risk: Wrong product developed (wrong requirements)
Mitigation: Early and continuous involvement of clients & experts

We have involved clients and experts since 2010
 Requirements review with all major clients
 Technical discussions with eqp. experts

Iterative development involving the Review team
 Design meetings (API and internals) since January 2013
 Alpha versions will be available for feedback and validation several months before the final release
 Feedback is continuously integrated in development (= iterative)
Risk: Product is (too) late
Mitigation: Careful planning and follow-up; fall-back to less ambitious goals

Planning prepared and followed by the MW team
 Taking into account needs and priorities of other CO projects and clients

Regular follow-up
 In CO internally by TEC coordinator
 In informal meetings with the MW experts (as done so far)

Fall-back to less ambitious goals
 Plan priorities of functionality
 Drop (postpone) work with lower priority
Risk: Product has bugs or incompatibilities
Mitigation: Early, continuous testing (unit, functional & integration tests)

Unit tests to assess quality inside the MW project
 Required dev. phase in the MW team

Functionality tests in CO Testbed
 Functionality of CMW only

Integration tests to check interoperability
 Integration with FESA in CO Testbed
 Integration with FGC in FGC Lab
Risk: Bugs affect operations
Mitigation: Gradual Migration (1)
No BIG-BANG migration but gradual
 Backward compatible (connection-wise) new RDA3 client library
 New RDA3 clients can talk to old RDA2 servers
 FESA3 will exist with both: old RDA2 and new RDA3

[Diagram: clients: Old JAPC with old RDA2 client, New JAPC with new RDA3 client; servers: old RDA2 server (FESA2), old RDA2 server (FESA3), new RDA3 server (FESA3)]
Risk: Bugs affect operations
Mitigation: Gradual Migration (2)

Deploy first on systems controlled by the MW team
 E.g. Proxies, Gateways

Gain experience and confidence
Start deployment with less critical systems first
Risk: Bugs affect operations
Mitigation: Fast deployment of bugfixes

If (in spite of all) something goes wrong in operations
 Fast reaction from the MW team

In CO, we will study the need and mechanisms to also upgrade servers quickly
Conclusions

We have to replace CORBA with a new solution

We collected updated user requirements

MW upgrade will be performed during LS1

Interoperability between RDA2 and RDA3

Gradual control system migration until LS2

End-of-Life for RDA2: LS2