Composing Just Enough Middleware
Christopher D. Gill
cdgill@cse.wustl.edu
Department of Computer Science and Engineering
Washington University
St. Louis, MO, USA
E71 CS 6785 Programming Languages Seminar
Friday, October 17, 2003
Talk Outline
• Part I: Introduction to middleware
(30 minutes + Q&A)
• Part II: Composing just enough middleware
(60 minutes + Q&A)
• Questions are welcome and encouraged
(During and after each part)
Part I: Introduction to Middleware
• Middleware is a “glue” layer between
– “Fixed” infrastructure (hardware, operating system)
– Variable (application-specific) parts of a system
• Middleware raises the level of abstraction
– At which developers program the system
(Diagram: Client sends a message to Server; lots of details hidden by middleware)
Distributed Object Computing (DOC)
• Assumes objects, references, methods
• Also, an Object Request Broker (ORB)
• Historically, a user-space software layer
– Between the operating system and the application
– Middleware may be pushed into OS, hardware
• Sensor nets, on-chip FPGAs, hardware threading
– But for this talk, assume middleware is
• Implemented on top of threads, sockets, and timers
(Diagram: Client invokes method() on an Object through a reference; lots of details again hidden by middleware)
Maintaining the DOC Illusion
• A “reference” to an object
– Encodes IP address, ID
• A “wire format” is defined for invocation messages
– Client stubs “marshal”
– Server skeletons “un-marshal”
• A servant implements an object in a server
• All other details are ORB implementation features
– Thread pools, sockets, etc.
– Usually abstracted away
(Diagram: Client → object reference → Stub → ORB → IIOP message → ORB → Skeleton → Servant)
A Simple DOC Invocation Path
Client side
– Client: client invokes a method call on the remote object
– Client Stub: stub creates a collection of parameters
– CDR Marshaller: parameters are marshalled
– ORB Core: IIOP Request sent
Server side
– ORB Core: IIOP Request received
– CDR Demarshaller: parameters are unmarshalled
– Server Skeleton: dispatch method call to servant
– Simple Object Adapter: servant object located
– Server: server method invoked
But What’s Really Inside an ORB?
• ORB connects stubs to skeletons
• Uses threads, sockets, reactors, tables, etc.
• Ensures 0/1 delivery
– At most one recipient once
– Caller may get exception
• User level: servant throws
• ORB level: e.g., network
• Object adapters support multiple servants
(Diagram labels: upcall to skeleton; lookup; POA; thread pool; reactor; timers; socket with send_n, recv_n; IIOP message between ORBs)
Marshalling (IIOP message)
• Standard CORBA GIOP message format
– Mapped to TCP/IP protocols
• IDL parameter types
– Primitive types
– Arrays
– Sequences
– Structures
– Interfaces
– Anys
– Typecodes
• Marshaling and un-marshaling are relatively expensive
– Apply optimizations, e.g., co-location, where possible
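The marshal/un-marshal round trip performed by stubs and skeletons can be sketched as follows. This is a minimal illustration, not the actual CDR encoding: the `Encoder` and `Decoder` classes are hypothetical names, the layout is fixed little-endian, and alignment padding and byte-order flags (which real CDR handles) are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical encoder: appends values to a byte buffer in a fixed
// little-endian layout (an assumption; real CDR also aligns and flags order).
class Encoder {
public:
  void put_ulong(std::uint32_t v) {
    for (int i = 0; i < 4; ++i)
      buf_.push_back(static_cast<std::uint8_t>((v >> (8 * i)) & 0xFFu));
  }
  // A string is written as its length followed by its raw characters.
  void put_string(const std::string& s) {
    put_ulong(static_cast<std::uint32_t>(s.size()));
    buf_.insert(buf_.end(), s.begin(), s.end());
  }
  const std::vector<std::uint8_t>& bytes() const { return buf_; }
private:
  std::vector<std::uint8_t> buf_;
};

// Hypothetical decoder: reads values back in the order they were written.
class Decoder {
public:
  explicit Decoder(const std::vector<std::uint8_t>& b) : buf_(b), pos_(0) {}
  std::uint32_t get_ulong() {
    assert(pos_ + 4 <= buf_.size());
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
      v |= static_cast<std::uint32_t>(buf_[pos_ + i]) << (8 * i);
    pos_ += 4;
    return v;
  }
  std::string get_string() {
    std::uint32_t n = get_ulong();
    assert(pos_ + n <= buf_.size());
    std::string s(buf_.begin() + pos_, buf_.begin() + pos_ + n);
    pos_ += n;
    return s;
  }
private:
  std::vector<std::uint8_t> buf_;
  std::size_t pos_;
};
```

A stub would call the `put_` operations for each parameter in order, and the skeleton would call the matching `get_` operations in the same order. Every value is copied at least once on each side, which hints at why co-location (skipping the round trip entirely) pays off.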
Wait-On-Reply Strategy (2-Way Calls)
• Wait on connection
– Low overhead
– Does not interfere with other calls
– But may lead to deadlock
• Wait on reactor
– Higher overhead
– May block other calls
– Does not lead to same deadlock
(Diagrams: Client C calls a Servant in the Server through Reactors, steps 1 to 6, and the Servant issues a Callback to the Client while the Client waits. Waiting on the connection: deadlock here. Waiting on the reactor: deadlock avoided.)
Socket/Thread Multiplexing
• Connection cache
– Serial socket re-use
• Thread pool
– Serial thread re-use
• Reactors support concurrent multiplexing
– Calls onto a thread
• Can multiplex calls onto a socket concurrently too
– Leader/followers
– Half-sync/Half-async
(Diagram: Half-Sync/Half-Async Design Pattern (POSA2); asynchronous requests are buffered and handed to a worker thread)
Where does Real-Time Fit In?
• Real-time is about predictability
“Real-time != real fast” (predictable real fast is good)
• Different real-time enforcement mechanisms
– Static → efficiency of mechanisms (e.g., overhead)
– Dynamic → flexibility of policies (e.g., utilization)
– Hybrid → combine both (e.g., utilization + isolation)
• Different notions of scheduling strategy possible
– E.g., EDF, MLF, RMS, MAU
• Various kinds of schedulable entities
– E.g., messages, events, OS or distributable threads
Static Real-Time Lanes
• RT CORBA 1.0 spec implemented in TAO
– Irfan Pyarali’s dissertation work
• Each lane composed of thread(s)+socket(s)
– Dedicated statically to each lane a priori
• Each lane has a priority value
– Highest priority lane is most eligible at a resource
• Efficient enforcement by OS, network layers
– E.g., fixed thread + RSVP priorities
• Dispatching decision function
Priority value → lane assignment (lane does the rest)
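The dispatching decision function above (priority value → lane assignment) can be sketched as a simple lookup. This is a hedged illustration only: the `Lane` struct and `assign_lane` function are hypothetical names, and the matching rule (pick the highest lane whose priority does not exceed the request's) is an assumption made for the example, not a statement of what RT CORBA or TAO requires.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical lane record; a real lane would also own its dedicated
// threads and sockets.
struct Lane {
  int priority;
};

// Returns the index of the lane that should handle a request, or -1 if no
// lane is eligible. Assumes `lanes` is sorted by ascending priority.
// Assumed rule: choose the highest lane whose priority <= request priority.
int assign_lane(const std::vector<Lane>& lanes, int request_priority) {
  int chosen = -1;
  for (std::size_t i = 0; i < lanes.size(); ++i) {
    if (lanes[i].priority <= request_priority)
      chosen = static_cast<int>(i);
    else
      break;  // lanes are sorted, so no later lane can match
  }
  return chosen;
}
```

Because the lanes are fixed a priori, the decision reduces to a table lookup; once a request is assigned, the lane's own threads and sockets do the rest.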
Real-Time Object Adapters
• Efficiency vs. flexibility trade-offs in an upcall
– Irfan Pyarali, Aniruddha Gokhale, Doug Schmidt
• If space of object IDs is known a priori
– Can use hashing for O(1) servant lookup
– But must waste some space of empty table slots
– And server must assign IDs (limits spec slightly)
• Offers another scheduling point in the ORB
– Can adapt priorities at these points
– Restrict priority inversion (say if no RSVP)
• Promotes Efficiency (Gokhale et al.)
– Reductions in locking and copying during upcall
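The O(1) servant lookup trade-off can be sketched with a directly indexed table, the simplest case of hashing in which the server-assigned object ID is the slot index. All names here (`Servant`, `Doubler`, `ServantTable`) are hypothetical and not taken from TAO; the sketch only shows why lookup is constant time, why empty slots waste space, and why the server must assign the IDs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical servant interface with a single upcall.
struct Servant {
  virtual ~Servant() {}
  virtual int upcall(int arg) = 0;
};

// Example servant used only for illustration.
struct Doubler : Servant {
  int upcall(int arg) override { return 2 * arg; }
};

// Fixed-capacity table: unused slots stay null (the wasted space).
class ServantTable {
public:
  explicit ServantTable(std::size_t capacity) : slots_(capacity, nullptr) {}

  // The server assigns the ID: the first free slot, or -1 if the table is
  // full. Activation scans the table; only lookup is O(1).
  int activate(Servant* s) {
    assert(s != nullptr);
    for (std::size_t i = 0; i < slots_.size(); ++i)
      if (slots_[i] == nullptr) {
        slots_[i] = s;
        return static_cast<int>(i);
      }
    return -1;
  }

  // O(1) lookup: the ID indexes the table directly.
  Servant* find(int id) const {
    if (id < 0 || static_cast<std::size_t>(id) >= slots_.size())
      return nullptr;
    return slots_[static_cast<std::size_t>(id)];
  }

  void deactivate(int id) {
    if (find(id) != nullptr) slots_[static_cast<std::size_t>(id)] = nullptr;
  }

private:
  std::vector<Servant*> slots_;
};
```

An upcall then costs one bounds check and one array access before the servant method runs, which keeps the lookup step predictable.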
Kokyu Dispatching Framework
• “Kokyu” is a Japanese word
– Means “breath”, but also timing
• Generalizes lanes
– To dynamic, hybrid cases
• Prioritized Threads
– Isolate dispatch lanes
• Queues
– Order requests within each lane
• Timers
– Pace periodic requests
• Configuration (tuples)
<#, prio, Q type, timer>
• Implicit projection of scheduling onto OS
(Diagram labels: dispatching configuration; Dispatcher; RMS; LLF; static; laxity; mandatory; optional; timers)
Kokyu in a Real-Time Event Channel
• Well known environment
– Harrison M.S. thesis, O’Ryan, Gill dissertations
– Boeing, Stanford, KSU collaborations
• Schedule event pushes as in Kokyu experiments
– Application: ~70 components “flown” in flight simulator
– Static, hybrid static/dynamic scheduling strategies: RMS, MUF, RMS+LLF
– Empirical studies of isolation and effectiveness
(Diagram: Suppliers → Event Channel → Consumers; Event Channel labels: Proxy, Filtering, Correlation, Dispatcher with static and laxity lanes, Proxy)
Kokyu with (Distributable) Threads
• E.g., schedule DSRT CORBA 2.0 thread eligibility
• Work in progress (2004)
– Thrall, Torri, Mgeta (M.S.); Zhang, Subramonian (D.Sc.)
– Boeing, BBN, URI, OU, KU collaborations
• Many open issues
– Distributable thread identity, mechanism trade-offs
– Template meta-programming, configuration logic/types
(Diagram labels: Clients, Stub, Dispatcher, ORB, Skeleton, POA, Servants)
Part I: Middleware Summary
• Middleware is ultimately about abstraction
– Which details to reveal and which to hide
• Optimization is possible, but then we hit trade-offs
– Affects abstraction, policy, mechanism choices
• We’ve surveyed a number of issues
– Themes and examples will re-appear in part II
• Set the stage for discussion of composition
– Both what to compose and how to compose it
Part II: Just Enough Middleware
• Motivating example: active vibration damping
– Illustrates networked embedded systems domain
• Constraints: optimizations and trade-offs
• ORB customization for example application
– Footprint reduction approach in nORB
– Real-time optimizations in ACE, nORB, TAO
• Beyond optimization to trade-offs
– Footprint vs. timeliness (add/remove features)
– Deadlock vs. feasibility (change a feature)
• Composition logics, types, models
An Aside: Why Not minimumCORBA?
Pros
• minimumCORBA designed for resource-constrained systems
• Maintains interoperability with full-featured CORBA
Cons
• Designed for a particular point in the larger design space
• Resource constraints may be too stringent (or just different)
• Thus, “one size fits all” may cost too much in key cases
Our approach
• Provide a flexible efficient substrate for feature composition
Consequences for “just enough” ORB design
• Follows the spirit if not the letter of minimumCORBA
• Provides fine grain tailored feature subsetting/removal
• Maintains appropriate interoperability
Networked Embedded Systems
Active vibration damping
– Sensors: 2kHz sampling
F: deflection → voltage
(100Hz reporting rate)
– Computing nodes
• Coordination protocols (ping scheduling)
• Closed sensor, control, actuator loop
– Actuators: continuous
G: voltage → deflection
(Change value when told)
(Diagram: Structure with Piezoelectric Transducers; Sensor Measurements flow to compute nodes, which return Actuator Excitation)
QoS Constraints on Time and Space
• Given > 1 dimension
– E.g., real-time, footprint
– Tight constraints are commonly the case
• Optimizations
– Improve in a dimension
– At no cost to the other (may even improve too)
• Trade-offs
– Improve in a dimension
– At cost to other
(Diagram: footprint vs. execution time axes, with “optimize” arrows along each axis and a “trade off” arrow between them)
Footprint Reduction Approach in ACE
• Doxygen, other tools show ACE dependency graph
• Prune unused classes and interactions first
• Refactor for fine-grain composition of what remains
Refining the ACE Substrate
• Decouple concerns
– Re-factoring needed
• Also shows problem
– With inheritance based composition
• Pattern-oriented role decomposition
– Reactor
– Acceptor
– Connector
– Event Handler
– Svc Handler
(Diagram labels: ACE_Event_Handler, ACE_Service_Object, ACE_Task_Base, ACE_Task, ACE_Svc_Handler; ACE_Event_Handler, PS_Event_Handler; Peer stream)
A Starting Point for Just Enough Middleware
Client side
– Client: client invokes a method call on the remote object
– Client Stub: stub creates a collection of parameters
– CDR Marshaller: parameters are marshalled
– ORB Core: IIOP Request sent
Server side
– ORB Core: IIOP Request received
– CDR Demarshaller: parameters are unmarshalled
– Server Skeleton: dispatch method call to servant
– Simple Object Adapter: servant object located
– Server: server method invoked
nORB Design Approach
• Implement simple DOC invocation path
• Customize TAO’s CORBA IDL compiler
• Benchmark nORB, TAO, ACE empirically
– Using a representative coordination protocol
– Pay careful attention to time/space trade-offs
• Cycle: design → implement → benchmark …
Customized CORBA IDL Compiler
• Subset of standard CORBA IDL, IIOP 1.0 only
– Optimizes marshaling time, message sizes
– All primitive CORBA types (boolean, long, float, …)
– Arrays, Sequences, Structures, Interfaces
• Re-factored TAO IDL compiler
– nORB specific back end
• Custom mapping using C++ STL
– Easier programming model to use
– I.e., fewer memory management pitfalls
nORB Performance Enhancements
• Critical path optimizations similar to TAO
– Gather-write technique used to send requests
– Memory allocators instead of new/delete
– Direct upcall model
– Single-read optimization for server side requests
• Capture key build options
ACE_COMPONENTS=FOR_TAO
exceptions=0
rtti=0
inline=0
threads=0
debug=0
optimize=1
ami=0
corba_messaging=0
rt_corba=0
shared_libs=0
static_libs_only=1
DEFFLAGS=-DACE_USE_RCSID=0
minimum_corba=1
Benchmarking Studies
• With Venkita Subramonian, Guoliang Xing, Ron Cytron
• Early results reported at WORDS ’03 (Guadalajara)
• Test application
– Distributed graph coloring
– Simple distributed constraint satisfaction problem
– Represents, e.g., ping node scheduling in our example
– We used it as a touchstone for footprint & performance
• Compare 3 implementations using ACE, TAO, nORB
(Diagram: nodes V, W, X, Y, Z exchanging color messages colorv, colorw, colorx, colory, colorz)
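For readers unfamiliar with the test problem, here is a small sequential sketch of graph coloring on a square mesh. It is not the distributed protocol used in the benchmarks (whose message-level details the slides do not spell out); it only illustrates the constraint being satisfied: no two neighboring nodes share a color. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<int> > Graph;  // adjacency lists

// Builds a rows x cols square mesh; interior nodes get 4 neighbors.
Graph make_mesh(int rows, int cols) {
  assert(rows > 0 && cols > 0);
  Graph g(static_cast<std::size_t>(rows * cols));
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c) {
      int id = r * cols + c;
      if (r > 0) g[id].push_back(id - cols);
      if (r + 1 < rows) g[id].push_back(id + cols);
      if (c > 0) g[id].push_back(id - 1);
      if (c + 1 < cols) g[id].push_back(id + 1);
    }
  return g;
}

// Picks the smallest color not used by any already-colored neighbor.
// A color of -1 means "not yet colored".
int choose_color(const Graph& g, const std::vector<int>& color, int node) {
  std::vector<bool> used(g[node].size() + 1, false);
  for (std::size_t i = 0; i < g[node].size(); ++i) {
    int c = color[g[node][i]];
    if (c >= 0 && c < static_cast<int>(used.size())) used[c] = true;
  }
  int c = 0;
  while (used[c]) ++c;  // terminates: at most degree(node) colors are used
  return c;
}

// Greedy sequential coloring: visit nodes in index order.
std::vector<int> color_greedy(const Graph& g) {
  std::vector<int> color(g.size(), -1);
  for (std::size_t n = 0; n < g.size(); ++n)
    color[n] = choose_color(g, color, static_cast<int>(n));
  return color;
}

// Checks the constraint: every node is colored and differs from its neighbors.
bool is_proper(const Graph& g, const std::vector<int>& color) {
  for (std::size_t n = 0; n < g.size(); ++n) {
    if (color[n] < 0) return false;
    for (std::size_t i = 0; i < g[n].size(); ++i)
      if (color[n] == color[g[n][i]]) return false;
  }
  return true;
}
```

In a distributed version each node would run something like `choose_color` locally and exchange its choice with neighbors by message, which is what makes the problem a convenient touchstone for middleware cost per round.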
Comparing ACE, nORB and TAO
• ~500,000 repeated trials to generate large sample population
– Better confidence in finer-grain distinctions between time bounds
– Time for each asynchronous message passing round to complete
– 100 nodes in 10x10 square mesh (interior nodes have 4 neighbors)
– Four 2.53GHz P4 512MB RAM KURT-Linux boxes over 100Kb/s Ethernet
(Diagram: timeline of one round at node Y; labels: colorX, colorY, colorZ, improveX, improveY, improveZ, choose, store, compare)
Node Footprint Comparison
Middleware layer with only ACE costs 212KB
Middleware layer with nORB+ACE costs 345KB (133KB over ACE)
Middleware layer using TAO+ACE costs ~1.7MB (~1.2MB over nORB+ACE)
Node application alone costs 164KB
Footprint in KB (bar chart data)
                          Node    NodeRegistry
ACE                        376             324
TAO                       1800            1778
nORB                       567             549
compile optimized TAO     1738            1725
compile optimized nORB     509             492
Experimental Trials
• ~500,000 repeated trials to generate large sample population
– Want confidence in fine-grain time bounds distinctions
– Measure time of each message passing round
– 100 nodes in 10x10 square mesh on 4 networked machines
– 2.53GHz P4 512MB RAM KURT-Linux, 100Kb/s Ethernet
(Diagram: timeline of one round at node Y; labels: colorX, colorY, colorZ, improveX, improveY, improveZ, choose, store, compare)
Optimizing TAO Cost per Round
(Plot: round time distributions for successive TAO configurations; notice tails)
– POA and reactor locks
– wait-on-TP-reactor
– null locks
– wait-on-select-reactor
– g++ -O3 compile time optimization
– single read
Optimizing nORB Cost per Round
(Plot: round time distributions for successive nORB configurations; tighter distributions than with TAO)
– SOA and reactor locks
– wait-on-connection
– null locks
– g++ -O3 compile time optimization
– single read
Impact on Algorithm Convergence
• TAO better in average case
• nORB distribution is tighter
(Plot annotations: ORB cost; ~6Hz, 4Hz, 2.5Hz, 3Hz)
Soft Real-Time Bounds on Round Times
• Bounds for nORB tighter than for TAO at or above 90%
(Plot, at previous plots resolution: bound in msec, 0 to 70, vs. percentage of samples bounded, 80% to 99%; series: TAO, nORB, compile optimized TAO, compile optimized nORB, runtime optimized TAO, runtime optimized nORB, ACE)
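A bound of this kind (the round time within which a given percentage of samples fall) can be computed from the raw samples as sketched below. This is a generic empirical-quantile calculation offered as an assumption about how such plots are typically derived; the slides do not state the exact method used, and `bound_for_fraction` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the smallest sample value v such that at least `fraction` of the
// samples are <= v. `fraction` must lie in (0, 1]; e.g., 0.99 for a 99% bound.
double bound_for_fraction(std::vector<double> samples, double fraction) {
  assert(!samples.empty());
  assert(fraction > 0.0 && fraction <= 1.0);
  std::sort(samples.begin(), samples.end());
  // Number of samples that must fall at or below the bound.
  std::size_t k = static_cast<std::size_t>(
      std::ceil(fraction * static_cast<double>(samples.size())));
  if (k == 0) k = 1;
  return samples[k - 1];
}
```

With roughly 500,000 samples, a 99.999% bound still leaves only about five samples above it, which is why a large sample population matters for distinguishing bounds close to 100%.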
Hard Real-Time Bounds on Round Times
• Bounds for nORB much tighter than for TAO as we approach 100%
• ACE values anomalous… (termination artifact)
(Plot: bound in msec, 0 to 250, vs. percentage of samples bounded, 99.9% to 100.00%; series: TAO, nORB, compile optimized TAO, compile optimized nORB, runtime optimized TAO, runtime optimized nORB, ACE)
Time and Space Design Map
(Diagram: footprint cost vs. time cost)
– TAO: runtime optimization, compile optimization
– hash lookup, single marshal (in progress)
– nORB: runtime optimization, compile optimization
– ACE (hand crafted)
Beyond Optimization to Trade-Offs
• For a given application, customization is possible
– Usually more a matter of engineering than research
• However, beyond a certain point we hit trade-offs
– Finding those trade-offs is interesting systems research
– Leads to deep research questions in CS theory, logic
• We looked at combinations of features in nORB
– Leading to time and space trade-offs
• We’ll consider the impact of a single feature
– Interesting due to interactions with other features
• And the use of typesystems to generate configurations
Call Reply Configuration Use-Cases
• Strategy used to wait for replies
– Wait on connection
• No interference with other calls
– Wait on reactive mechanism
• Interleaved processing of incoming requests
• Blocking factors affected
– Call graph
– Deployment of servants
– Thread pool sizes
• Choose strategy based on system characteristics
(Diagrams, nested upcall scenario: Client C calls a Servant in the Server through Reactors, steps 1 to 6, and the Servant issues a Callback to the Client while the Client waits. Waiting on the connection: deadlock here. Waiting on the reactor: deadlock avoided.)
Logics and Typesystems
• Can describe configuration problem informally
– But computability and time complexity are serious issues
• Can describe problem formally in first-order logic
– But may be computationally infeasible for complex systems
– Horn clauses etc. may help, but only in some cases
• Another approach: apply re-factoring here as well
– Push evaluation down into universe of discourse
– Thus simplifying logic so it’s tractable (even real-time!)
• Typesystems approach may help
– Compute static modes (state space explosion)
– Compute dynamic modes (halting problems)
– Behavioral types are interesting (Henzinger and Lee)
Dispatching Configuration Example
• QoS attributes based on scheduling policy
• Bundle together all QoS attributes in one descriptor
• Can we generate the appropriate QoS descriptor?
– Use a configurator to generate the attributes
– Scheduling policy as input to generator
(Diagram: Scheduling policy → QoS Descriptor Generator → QoS Descriptor)
C++ Template Meta-Programming
• Mechanism to embed generators in C++
– Completely within the purview of C++ language
• Metainformation represented using
– Member traits, Traits classes, Traits templates
• Compile-time control-flow constructs
– Template metafunctions, e.g., IF, THEN, ELSE, CASE
– Conditional compilation based on evaluation of type expressions
• Issues
– Advanced usage of C++ templates
– Compiler support an issue
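The ideas above (traits templates carrying metainformation, and a compile-time IF selecting between types) can be sketched in a few lines. This is an illustrative sketch only: `IF` follows the common metafunction convention of exposing its result as `RET`, and the policy and queue names (`StaticPolicy`, `DynamicPolicy`, `FifoQueue`, `DeadlineQueue`, `PolicyTraits`, `QueueFor`) are hypothetical, not taken from Kokyu.

```cpp
#include <cassert>
#include <string>

// Compile-time IF: RET is Then when cond is true, Else otherwise.
template <bool cond, class Then, class Else>
struct IF { typedef Then RET; };
template <class Then, class Else>
struct IF<false, Then, Else> { typedef Else RET; };

// Hypothetical queue implementations to choose between.
struct FifoQueue {
  static std::string name() { return "fifo"; }
};
struct DeadlineQueue {
  static std::string name() { return "deadline"; }
};

// Hypothetical scheduling policy tags.
struct StaticPolicy {};
struct DynamicPolicy {};

// Traits template: metainformation about a policy, defaulting to static.
template <class Policy>
struct PolicyTraits { static const bool is_dynamic = false; };
template <>
struct PolicyTraits<DynamicPolicy> { static const bool is_dynamic = true; };

// Generator: selects the queue type for a policy entirely at compile time.
template <class Policy>
struct QueueFor {
  typedef typename IF<PolicyTraits<Policy>::is_dynamic,
                      DeadlineQueue, FifoQueue>::RET RET;
};
```

For instance, `QueueFor<DynamicPolicy>::RET` names `DeadlineQueue`, and no run-time branch is ever executed to make that choice; the cost is the more advanced template usage and compiler support noted under Issues.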
Generator for QoS Descriptor
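enum Disp_Rule_t { RMS, EDF, MLF, MUF, other };

// Primary template: no QoS fields for rules without a specialization
template <Disp_Rule_t>
struct QoSDesc {};

// Template specializations
template <>
struct QoSDesc<RMS>
{
  long period;    // fields specific to RMS
};
template <>
struct QoSDesc<EDF>
{
  long deadline;  // fields specific to EDF
};
template <>
struct QoSDesc<MLF>
{
  // fields specific to MLF
};

// Minimal stand-ins for the IF/CASE/SWITCH metafunctions the generator uses
// (assumed here for illustration; the slide does not show their definitions)
struct NilCase {};
template <int tag_, class Type_, class Next_ = NilCase>
struct CASE
{
  enum { tag = tag_ };
  typedef Type_ Type;
  typedef Next_ Next;
};
template <bool cond, class Then, class Else>
struct IF { typedef Then RET; };
template <class Then, class Else>
struct IF<false, Then, Else> { typedef Else RET; };
template <int tag, class Case>
struct SWITCH
{
  typedef typename SWITCH<tag, typename Case::Next>::RET NextRET;
  typedef typename IF<(tag == Case::tag), typename Case::Type, NextRET>::RET RET;
};
template <int tag>
struct SWITCH<tag, NilCase> { typedef NilCase RET; };  // no case matched

template <Disp_Rule_t disp_rule>
struct QoSDescriptorGenerator
{
  // The case list is not a dependent name, so no typename is needed here
  typedef CASE<EDF, QoSDesc<EDF>,
          CASE<RMS, QoSDesc<RMS>,
          CASE<MLF, QoSDesc<MLF> > > >
    disp_rule_case_list;
  typedef typename SWITCH<disp_rule, disp_rule_case_list>::RET
    QoSDescriptor_;
  typedef QoSDescriptor_ RET;
};

// Selected at compile time: QoSDescriptor is QoSDesc<EDF>
typedef QoSDescriptorGenerator<EDF>::RET QoSDescriptor;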
Related Work
• Minimalist middleware frameworks
– UBI-core at UIUC, other projects
• Note that making lowest level substrate robust is an issue
• Composition logics
– Task Scheduler Logic at U. Utah
– WUGLE project at WUSTL
• Instrumentation and history analysis
– DSKI/DSUI and event history work at KU
• Extending/exploiting type systems
– Ptolemy at U.C. Berkeley
– Kokyu template meta-programming at WUSTL
• RMA, dispatcher composition in C++ typesystem
Concluding Remarks
• Use generative middleware programming to
compose and configure fine-grained infrastructure
– Leverage features of existing programming languages
• Design substrates for use in system generation
– “system aspect frameworks”
• Drive infrastructure configuration and adaptation
strategies from logics, types, algebras
– While avoiding NP-hard cases and halting problems
• From “Just Enough Middleware” we want to
generate “Just The Right Middleware” each time