Composing Just Enough Middleware
Christopher D. Gill
cdgill@cse.wustl.edu
Department of Computer Science and Engineering
Washington University, St. Louis, MO, USA
E71 CS 6785 Programming Languages Seminar
Friday, October 17, 2003

Talk Outline
• Part I: Introduction to middleware (30 minutes + Q&A)
• Part II: Composing just enough middleware (60 minutes + Q&A)
• Questions are welcome and encouraged (during and after each part)

Part I: Introduction to Middleware
• Middleware is a "glue" layer between
  – "Fixed" infrastructure (hardware, operating system)
  – Variable (application-specific) parts of a system
• Middleware raises the level of abstraction
  – At which developers program the system
[Figure: a client sends a message to a server; lots of details are hidden by the middleware]

Distributed Object Computing (DOC)
• Assumes objects, references, methods
• Also, an Object Request Broker (ORB)
• Historically, a user-space software layer
  – Between the operating system and the application
  – Middleware may be pushed into the OS or hardware
    • Sensor nets, on-chip FPGAs, hardware threading
  – But for this talk, assume it is
    • implemented on top of threads, sockets, and timers
[Figure: a client invokes method() on an object through a reference; lots of details are again hidden by the middleware]

Maintaining the DOC Illusion
• A "reference" to an object
  – Encodes IP address, ID
• A "wire format" is defined for invocation messages
  – Client stubs "marshal"
  – Server skeletons "un-marshal"
• A servant implements an object in a server
• All other details are ORB implementation features
  – Thread pools, sockets, etc.
  – Usually abstracted away
[Figure: client → object reference → stub → ORB → IIOP message → ORB → skeleton → servant]

A Simple DOC Invocation Path
• Client side: the client invokes a method call on the remote object → the stub creates a collection of parameters → parameters are marshalled (CDR marshaller) → the IIOP request is sent (ORB core)
• Server side: the IIOP request is received (ORB core) → parameters are unmarshalled (skeleton, CDR demarshaller) → the servant object is located (simple object adapter) → the method call is dispatched to the servant → the server method is invoked

But What's Really Inside an ORB?
• The ORB connects stubs to skeletons
• Uses threads, sockets, reactors, tables, etc.
• Ensures 0/1 delivery
  – Delivered at most once to the recipient
  – The caller may get an exception
    • User level: the servant throws
    • ORB level: e.g., a network failure
• Object adapters support multiple servants
[Figure: inside the ORB — reactor, timers, thread pool, socket send_n/recv_n, IIOP message, lookup in the POA, upcall to the skeleton]

Marshalling (IIOP message)
• Standard CORBA GIOP message format
  – Mapped to TCP/IP protocols
• IDL parameter types
  – Primitive types, Arrays, Sequences, Structures, Interfaces, Anys, Typecodes
• Marshalling and un-marshalling are relatively expensive
  – Apply optimizations, e.g., co-location, where possible

Wait-On-Reply Strategy (2-Way Calls)
• Wait on connection
  – Low overhead
  – Does not interfere with other calls
  – But may lead to deadlock
• Wait on reactor
  – Higher overhead
  – May block other calls
  – Does not lead to the same deadlock
[Figure: nested upcall scenario — waiting on the connection deadlocks; the deadlock is avoided by waiting on the reactor]

Socket/Thread Multiplexing
• Connection cache
  – Serial socket re-use
• Thread pool
  – Serial thread re-use
• Reactors support concurrent multiplexing
  – Of calls onto a thread
• Can multiplex calls onto a socket concurrently too
  – Leader/Followers
  – Half-Sync/Half-Async (see the sketch below)
[Figure: Half-Sync/Half-Async design pattern (POSA2) — worker threads, buffered requests, asynchronous requests]
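The Half-Sync/Half-Async structure above can be summarized in a short sketch. This is a minimal illustration using standard C++ threads rather than the ACE reactor and task classes the ORBs actually use; the names (RequestQueue, the worker count, the stand-in "asynchronous half") are illustrative assumptions, not nORB or TAO code.

  // Half-Sync/Half-Async sketch: an asynchronous layer enqueues work, a pool of
  // synchronous worker threads dequeues and runs it. Illustrative only.
  #include <condition_variable>
  #include <cstddef>
  #include <functional>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  class RequestQueue {                       // the queueing layer between the halves
  public:
    void put(std::function<void()> job) {
      { std::lock_guard<std::mutex> guard(lock_); jobs_.push(std::move(job)); }
      ready_.notify_one();
    }
    std::function<void()> get() {            // blocks until a job is available
      std::unique_lock<std::mutex> guard(lock_);
      ready_.wait(guard, [this] { return !jobs_.empty(); });
      std::function<void()> job = std::move(jobs_.front());
      jobs_.pop();
      return job;
    }
  private:
    std::mutex lock_;
    std::condition_variable ready_;
    std::queue<std::function<void()> > jobs_;
  };

  int main() {
    RequestQueue queue;

    // Synchronous half: worker threads block on the queue and run upcalls at
    // their own pace; an empty job is the shutdown sentinel.
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
      workers.emplace_back([&queue] {
        for (;;) {
          std::function<void()> job = queue.get();
          if (!job) break;                   // shutdown sentinel
          job();                             // run the upcall synchronously
        }
      });
    }

    // Asynchronous half (stand-in for a reactor): never blocks on application
    // work; it only demultiplexes "events" and enqueues the resulting requests.
    std::thread async_half([&queue] {
      for (int request = 0; request < 8; ++request)
        queue.put([request] { /* demarshal and dispatch the upcall here */ });
    });

    async_half.join();
    for (std::size_t i = 0; i < workers.size(); ++i)
      queue.put(std::function<void()>());    // one sentinel per worker
    for (auto& worker : workers) worker.join();
    return 0;
  }

The queue is the boundary between the halves: the reactive layer never blocks on application work, and the worker threads never touch the sockets directly.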
Where does Real-Time Fit In?
• Real-time is about predictability
  – "Real-time != real fast" (predictable real fast is good)
• Different real-time enforcement mechanisms
  – Static → efficiency of mechanisms (e.g., overhead)
  – Dynamic → flexibility of policies (e.g., utilization)
  – Hybrid → combine both (e.g., utilization + isolation)
• Different notions of scheduling strategy possible
  – E.g., EDF, MLF, RMS, MAU
• Various kinds of schedulable entities
  – E.g., messages, events, OS or distributable threads

Static Real-Time Lanes
• RT CORBA 1.0 spec implemented in TAO
  – Irfan Pyarali's dissertation work
• Each lane composed of thread(s) + socket(s)
  – Dedicated statically to each lane a priori
• Each lane has a priority value
  – The highest-priority lane is most eligible at a resource
• Efficient enforcement by OS and network layers
  – E.g., fixed thread + RSVP priorities
• Dispatching decision function
  – Priority value → lane assignment (the lane does the rest)

Real-Time Object Adapters
• Efficiency vs. flexibility trade-offs in an upcall
  – Irfan Pyarali, Aniruddha Gokhale, Doug Schmidt
• If the space of object IDs is known a priori
  – Can use hashing for O(1) servant lookup
  – But must waste some space on empty table slots
  – And the server must assign IDs (limits the spec slightly)
• Offers another scheduling point in the ORB
  – Can adapt priorities at these points
  – Restricts priority inversion (say, if no RSVP)
• Promotes efficiency (Gokhale et al.)
  – Reductions in locking and copying during the upcall

Kokyu Dispatching Framework
• "Kokyu" is a Japanese word
  – Means "breath", but also timing
• Prioritized threads
  – Generalize lanes
  – Isolate dispatch lanes
  – Extend to dynamic and hybrid cases
• Queues
  – Order requests within each lane
• Timers
  – Pace periodic requests
• Configuration tuples: <#, prio, Q type, timer> (see the sketch below)
• Implicit projection of scheduling onto the OS
[Figure: dispatching configuration — a dispatcher with a static RMS lane for mandatory work and a static LLF (laxity) lane for optional work, plus timers]
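A minimal sketch of one reading of the <#, prio, Q type, timer> tuples follows. The field names, enum values, and the example hybrid configuration are illustrative assumptions, not the actual Kokyu types.

  // Illustrative representation of a Kokyu-style dispatching configuration.
  #include <vector>

  enum QueueType { STATIC_PRIO, DEADLINE, LAXITY };  // e.g., RMS-, EDF-, LLF-ordered queues

  struct LaneConfig {
    int       lane_number;   // "#":      which dispatching lane this tuple configures
    int       os_priority;   // "prio":   OS thread priority assigned to the lane
    QueueType queue_type;    // "Q type": how requests are ordered within the lane
    bool      has_timer;     // "timer":  whether the lane paces periodic requests
  };

  // Example: a hybrid configuration in the spirit of RMS+LLF -- a statically
  // ordered lane for mandatory work above a laxity-ordered lane for optional work.
  std::vector<LaneConfig> make_hybrid_configuration() {
    std::vector<LaneConfig> lanes;
    LaneConfig mandatory = { 0, 20, STATIC_PRIO, true  };  // priority values are placeholders
    LaneConfig optional_ = { 1, 10, LAXITY,      false };
    lanes.push_back(mandatory);
    lanes.push_back(optional_);
    return lanes;
  }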
Kokyu in a Real-Time Event Channel
[Figure: suppliers push events through proxies, filtering, and correlation into a Kokyu dispatcher (static, static, and laxity lanes), which delivers them to consumers]
• Well-known environment
  – Harrison M.S. thesis; O'Ryan and Gill dissertations
  – Boeing, Stanford, KSU collaborations
• Schedule event pushes as in the Kokyu experiments
  – Application: ~70 components "flown" in a flight simulator
  – Static and hybrid static/dynamic scheduling strategies: RMS, MUF, RMS+LLF
  – Empirical studies of isolation and effectiveness

Kokyu with (Distributable) Threads
[Figure: clients and servants connected through stubs, skeletons, the POA, and the ORB, with a Kokyu dispatcher on each side]
• E.g., schedule DSRT CORBA 2.0 thread eligibility
• Work in progress (2004)
  – Thrall, Torri, Mgeta (M.S.); Zhang, Subramonian (D.Sc.)
  – Boeing, BBN, URI, OU, KU collaborations
• Many open issues
  – Distributable thread identity, mechanism trade-offs
  – Template meta-programming, configuration logic/types

Part I: Middleware Summary
• Middleware is ultimately about abstraction
  – Which details to reveal and which to hide
• Optimization is possible up to a point; then we hit trade-offs
  – Affects abstraction, policy, and mechanism choices
• We've surveyed a number of issues
  – Themes and examples will re-appear in Part II
• Set the stage for a discussion of composition
  – Both what to compose and how to compose it

Part II: Just Enough Middleware
• Motivating example: active vibration damping
  – Illustrates the networked embedded systems domain
• Constraints: optimizations and trade-offs
• ORB customization for the example application
  – Footprint reduction approach in nORB
  – Real-time optimizations in ACE, nORB, TAO
• Beyond optimization to trade-offs
  – Footprint vs. timeliness (add/remove features)
  – Deadlock vs. feasibility (change a feature)
• Composition logics, types, models

An Aside: Why Not minimumCORBA?
• Pros
  – minimumCORBA is designed for resource-constrained systems
  – Maintains interoperability with full-featured CORBA
• Cons
  – Designed for a particular point in the larger design space
  – Resource constraints may be too stringent (or just different)
  – Thus, "one size fits all" may cost too much in key cases
• Our approach
  – Provide a flexible, efficient substrate for feature composition
• Consequences for "just enough" ORB design
  – Follows the spirit if not the letter of minimumCORBA
  – Provides fine-grain, tailored feature subsetting/removal
  – Maintains appropriate interoperability

Networked Embedded Systems
• Active vibration damping
[Figure: a structure with piezoelectric transducers; actuator excitation and sensor measurements connect to a set of compute nodes]
• Sensors: 2 kHz sampling, F: deflection → voltage (100 Hz reporting rate)
• Computing nodes
  – Coordination protocols (ping scheduling)
  – Closed sensor, control, actuator loop
• Actuators: continuous, G: voltage → deflection (change value when told)

QoS Constraints on Time and Space
• Given more than one dimension
  – E.g., real-time, footprint
  – Tight constraints are commonly the case
• Optimizations
  – Improve in one dimension
  – At no cost to the other (may even improve it too)
• Trade-offs
  – Improve in one dimension
  – At a cost to the other
[Figure: footprint vs. execution-time plane — optimizations move inward in one or both dimensions; trade-offs move along a frontier]

Footprint Reduction Approach in ACE
• Doxygen and other tools show the ACE dependency graph
• Prune unused classes and interactions first
• Refactor for fine-grain composition of what remains

Refining the ACE Substrate
• Decouple concerns
  – Re-factoring needed
  – The diagram also shows the problem
• Pattern-oriented role decomposition
  – With inheritance-based composition
  – Reactor, Acceptor, Connector, Event Handler, Svc Handler
[Figure: class diagram relating ACE_Event_Handler, ACE_Service_Object, PS_Event_Handler, ACE_Task_Base, ACE_Task, ACE_Svc_Handler, and the peer stream]

A Starting Point for Just Enough Middleware
• The simple DOC invocation path shown in Part I: client → stub → CDR marshaller → ORB core → (IIOP request) → ORB core → skeleton / CDR demarshaller → simple object adapter → servant

nORB Design Approach
• Implement the simple DOC invocation path
• Customize TAO's CORBA IDL compiler
• Benchmark nORB, TAO, and ACE empirically
  – Using a representative coordination protocol
  – Pay careful attention to time/space trade-offs
• Cycle: design → implement → benchmark → ...

Customized CORBA IDL Compiler
• Subset of standard CORBA IDL, IIOP 1.0 only
  – Optimizes marshalling time and message sizes
  – All primitive CORBA types (boolean, long, float, ...)
  – Arrays, Sequences, Structures, Interfaces
• Re-factored TAO IDL compiler
  – nORB-specific back end
• Custom mapping using the C++ STL
  – Easier programming model to use
  – I.e., fewer memory management pitfalls

nORB Performance Enhancements
• Critical-path optimizations similar to TAO
  – Gather-write technique used to send requests (see the sketch below)
  – Memory allocators instead of new/delete
  – Direct upcall model
  – Single-read optimization for server-side requests
• Capture key build options:
  ACE_COMPONENTS=FOR_TAO exceptions=0 rtti=0 inline=0 threads=0 debug=0
  optimize=1 ami=0 corba_messaging=0 rt_corba=0 shared_libs=0
  static_libs_only=1 DEFFLAGS=-DACE_USE_RCSID=0 minimum_corba=1
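The gather-write bullet above can be illustrated with a short sketch using the POSIX writev() call: the protocol header and the already-marshalled body are handed to the kernel in one system call instead of being copied into a single contiguous buffer first. The header layout, names, and error handling are simplified assumptions for illustration, not the actual nORB or TAO request path.

  #include <sys/types.h>
  #include <sys/uio.h>     // writev, struct iovec
  #include <vector>

  // Hypothetical fixed-size request header; not the actual GIOP layout.
  struct RequestHeader {
    char     magic[4];     // e.g., "GIOP"
    unsigned body_length;  // length of the marshalled body that follows
  };

  // Send header + body with a single gather-write instead of memcpy + write.
  // (Real code must also loop on partial writes and handle errors.)
  bool send_request(int socket_fd,
                    const RequestHeader& header,
                    const std::vector<char>& body)
  {
    iovec chunks[2];
    chunks[0].iov_base = const_cast<RequestHeader*>(&header);
    chunks[0].iov_len  = sizeof(header);
    chunks[1].iov_base = const_cast<char*>(body.empty() ? 0 : &body[0]);
    chunks[1].iov_len  = body.size();

    ssize_t sent = writev(socket_fd, chunks, 2);   // one syscall, no extra copy
    return sent == static_cast<ssize_t>(sizeof(header) + body.size());
  }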
Benchmarking Studies
• With Venkita Subramonian, Guoliang Xing, Ron Cytron
• Early results reported at WORDS '03 (Guadalajara)
• Test application
  – Distributed graph coloring
  – A simple distributed constraint satisfaction problem
  – Represents, e.g., ping node scheduling in our example
  – We used it as a touchstone for footprint and performance
[Figure: a small mesh of nodes V, W, X, Y, Z exchanging color messages with their neighbors]
• Compare three implementations, using ACE, TAO, and nORB

Comparing ACE, nORB and TAO
[Figure: one round at node Y — receive colorX and colorZ, choose, store, compare, then send colorY and improveY to the neighbors]
• ~500,000 repeated trials to generate a large sample population
  – Better confidence in finer-grain distinctions between time bounds
  – Time measured for each asynchronous message-passing round to complete
  – 100 nodes in a 10x10 square mesh (interior nodes have 4 neighbors)
  – Four 2.53 GHz P4, 512 MB RAM, KURT-Linux boxes over 100 Kb/s Ethernet

Node Footprint Comparison
• Node application alone costs 164 KB
• Middleware layer with only ACE costs 212 KB
• Middleware layer with nORB+ACE costs 345 KB (133 KB over ACE)
• Middleware layer with TAO+ACE costs ~1.7 MB (~1.2 MB over nORB+ACE)

  Footprint (KB)   ACE    TAO    nORB   compile-optimized TAO   compile-optimized nORB
  Node             376    1800    567          1738                     509
  NodeRegistry     324    1778    549          1725                     492

Experimental Trials
• ~500,000 repeated trials to generate a large sample population
  – Want confidence in fine-grain distinctions between time bounds
  – Measure the time of each message-passing round
  – 100 nodes in a 10x10 square mesh on 4 networked machines
  – 2.53 GHz P4, 512 MB RAM, KURT-Linux, 100 Kb/s Ethernet

Optimizing TAO Cost per Round
[Figure: distributions of TAO round times under successive optimizations — g++ -O3 compile-time optimization, null locks vs. POA and reactor locks, wait-on-select-reactor vs. wait-on-TP-reactor, single read; notice the tails]

Optimizing nORB Cost per Round
[Figure: distributions of nORB round times under successive optimizations — g++ -O3 compile-time optimization, null locks vs. SOA and reactor locks, single read, wait-on-connection; tighter distributions than with TAO]

Impact on Algorithm Convergence
[Figure: convergence of the coloring algorithm — TAO is better in the average case, but the nORB distribution is tighter; ORB cost marked at roughly 6 Hz, 4 Hz, 3 Hz, and 2.5 Hz]

Soft Real-Time Bounds on Round Times
[Figure: round-time bounds in msec for TAO, nORB, compile-optimized TAO, compile-optimized nORB, runtime-optimized TAO, runtime-optimized nORB, and ACE at the 99%, 98%, 95%, 90%, and 80% levels — bounds for nORB are tighter than for TAO at or above 90%]

Hard Real-Time Bounds on Round Times
[Figure: round-time bounds in msec at the 99.9%, 99.99%, 99.999%, 99.9999%, and 100% levels — bounds for nORB are much tighter than for TAO as we approach 100%; the ACE values are anomalous (a termination artifact)]
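The bound values in the two preceding plots are empirical percentiles of the measured round times. A minimal sketch of how such a bound can be computed from a sample (a nearest-rank percentile; the actual analysis scripts used for the study may have differed):

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // Smallest observed round time (msec) that bounds the given fraction of the
  // samples, e.g., fraction = 0.99 for the soft 99% bound, 1.0 for the
  // observed worst case.
  double bound_for_fraction(std::vector<double> round_times_msec, double fraction) {
    if (round_times_msec.empty())
      return 0.0;                                    // no samples, no bound
    std::sort(round_times_msec.begin(), round_times_msec.end());
    std::size_t last  = round_times_msec.size() - 1;
    std::size_t index = static_cast<std::size_t>(fraction * last + 0.5);  // nearest rank
    if (index > last)
      index = last;
    return round_times_msec[index];
  }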
Time and Space Design Map
[Figure: footprint cost vs. time cost — TAO with runtime and compile-time optimization, nORB with runtime and compile-time optimization, hand-crafted ACE, and (in progress) hash lookup and single marshal]

Beyond Optimization to Trade-Offs
• For a given application, customization is possible
  – Usually more a matter of engineering than research
• However, beyond a certain point we hit trade-offs
  – Finding those trade-offs is interesting systems research
  – Leads to deep research questions in CS theory and logic
• We looked at combinations of features in nORB
  – Leading to time and space trade-offs
• We'll consider the impact of a single feature
  – Interesting due to interactions with other features
• And the use of typesystems to generate configurations

Call Reply Configuration Use-Cases
• Strategy used to wait for replies
  – Wait on connection
  – Wait on reactive mechanism
• Interleaved processing of incoming requests
  – No interference with other calls
• Blocking factors affected by
  – The call graph
  – The deployment of servants
  – Thread pool sizes
• Choose the strategy based on system characteristics
[Figure: nested upcall scenario — waiting on the connection deadlocks; the deadlock is avoided by waiting on the reactor]

Logics and Typesystems
• Can describe the configuration problem informally
  – But computability and time complexity are serious issues
• Can describe the problem formally in first-order logic
  – But may be computationally infeasible for complex systems
  – Horn clauses etc. may help, but only in some cases
• Another approach: apply re-factoring here as well
  – Push evaluation down into the universe of discourse
  – Thus simplifying the logic so it's tractable (even real-time!)
• A typesystems approach may help
  – Compute static modes (state space explosion)
  – Compute dynamic modes (halting problems)
  – Behavioral types are interesting (Henzinger and Lee)

Dispatching Configuration Example
• QoS attributes are based on the scheduling policy
• Bundle together all QoS attributes in one descriptor
• Can we generate the appropriate QoS descriptor?
  – Use a configurator to generate the attributes
  – The scheduling policy is the input to the generator
[Figure: scheduling policy → QoS Descriptor Generator → QoS descriptor]

C++ Template Meta-Programming
• Mechanism to embed generators in C++
  – Completely within the purview of the C++ language
• Meta-information represented using
  – Member traits, traits classes, traits templates
• Compile-time control-flow constructs
  – Template metafunctions, e.g., IF, THEN, ELSE, CASE
  – Conditional compilation based on evaluation of type expressions
• Issues
  – Advanced usage of C++ templates
  – Compiler support is an issue

Generator for QoS Descriptor

  enum Disp_Rule_t { RMS, EDF, MLF, MUF, other };

  template <Disp_Rule_t> struct QoSDesc { };

  // template specializations
  template <> struct QoSDesc<RMS> {
    long period;     // fields specific to RMS
  };
  template <> struct QoSDesc<EDF> {
    long deadline;   // fields specific to EDF
  };
  template <> struct QoSDesc<MLF> {
    // fields specific to MLF
  };

  template <Disp_Rule_t disp_rule>
  struct QoSDescriptorGenerator {
    typedef CASE<EDF, QoSDesc<EDF>,
            CASE<RMS, QoSDesc<RMS>,
            CASE<MLF, QoSDesc<MLF> > > > disp_rule_case_list;
    typedef typename SWITCH<disp_rule, disp_rule_case_list>::RET QoSDescriptor_;
    typedef QoSDescriptor_ RET;
  };

  typedef QoSDescriptorGenerator<EDF>::RET QoSDescriptor;
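The CASE and SWITCH metafunctions used above are assumed from a generative-programming support library and are not defined on the slide. A minimal sketch of how such compile-time constructs could be written follows (illustrative only; in a real program these definitions would precede the generator, and the library actually used may define them differently):

  // (Uses the Disp_Rule_t enumeration declared with the generator above.)

  // Compile-time IF: selects one of two types based on a boolean constant.
  template <bool Condition, class Then, class Else>
  struct IF                    { typedef Then RET; };
  template <class Then, class Else>
  struct IF<false, Then, Else> { typedef Else RET; };

  struct NilCase { };  // marks the end of a case list (and "no match")

  // One entry in a case list: a tag value, the type it selects, and the tail of the list.
  template <Disp_Rule_t tag, class Type, class Next = NilCase>
  struct CASE {
    static const Disp_Rule_t tag_ = tag;
    typedef Type type_;
    typedef Next next_;
  };

  // Compile-time SWITCH: walks the case list until a tag matches.
  template <Disp_Rule_t tag, class CaseList>
  struct SWITCH {
    typedef typename IF<(tag == CaseList::tag_),
                        typename CaseList::type_,
                        typename SWITCH<tag, typename CaseList::next_>::RET>::RET RET;
  };
  template <Disp_Rule_t tag>
  struct SWITCH<tag, NilCase> { typedef NilCase RET; };  // ran off the end: no match

  // e.g., SWITCH<EDF, CASE<EDF, QoSDesc<EDF> > >::RET is QoSDesc<EDF>

With definitions along these lines, the generator on the previous slide resolves QoSDescriptorGenerator<EDF>::RET to QoSDesc<EDF> entirely at compile time, so the chosen scheduling policy pays no run-time dispatch cost.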
Related Work
• Minimalist middleware frameworks
  – UBI-core at UIUC, other projects
    • Note that making the lowest-level substrate robust is an issue
• Composition logics
  – Task Scheduler Logic at U. Utah
  – WUGLE project at WUSTL
• Instrumentation and history analysis
  – DSKI/DSUI and event history work at KU
• Extending/exploiting type systems
  – Ptolemy at U.C. Berkeley
  – Kokyu template meta-programming at WUSTL
    • RMA, dispatcher composition in the C++ typesystem

Concluding Remarks
• Use generative middleware programming to compose and configure fine-grained infrastructure
  – Leverage features of existing programming languages
• Design substrates for use in system generation
  – "System aspect frameworks"
• Drive infrastructure configuration and adaptation strategies from logics, types, and algebras
  – While avoiding NP-hard cases and halting problems
• From "Just Enough Middleware" we want to generate "Just The Right Middleware" each time