The Next Generation Internet: Unsafe at any Speed?
Ken Birman
Dept. of Computer Science, Cornell University

Convergent Trends
- Existing Internet exhibiting brownouts, security and quality-of-service problems
- Talk of a “next generation Internet” offering 10 to 100-fold performance improvements
- A new generation of networked applications includes large numbers of “critical” ones

Typical Critical Applications
- Medical monitoring and clinical databases
- Community health information networks
- Remote home care and remote telesurgery
- Integrated modular avionics systems
- Air traffic control
- Free flight, 4-D navigation

Medical Networks
- Contacted a number of technical and business people in this field (HP Careview, EMTEK, Hospital for Sick Children)
- Asked:
  - What are the trends? How are networks changing healthcare?
  - How are these systems made secure & reliable?
  - Got any good stories for me?

An ICU Computer System
[Diagram: doctor’s office, bedside, clinical data server, digital library and online PDR, laboratories, pharmacy]

… a field in transition
- During 1980’s, hospitals used largely dedicated systems
- Client-server architectures now becoming dominant, but trend is a recent one
- Systems ran in physical isolation and had limited, mission-specific functionality

Important distinction
- Medical monitoring equipment, computer controlled “devices”
  - These “practice medicine”
  - FDA regulated, like a drug
  - Software subjected to extreme verification methods, safety certification is costly and hard
  - Extends to the IEEE “medical information bus” for connecting bedside devices
- Clinical data systems
  - By definition, not considered safety critical
  - Maintain the legally binding “patient record”
  - Think of a database system. Human checks all entries, even data obtained directly from devices or lab reports.

Traditional Approach
- Each runs as a separate network
- Developed completely independently
- No interconnection of any kind

Networking technology?
- Monitoring network is increasingly a dedicated “real-time” LAN; this permits configuration flexibility, remote telemetry, even adjustment of monitoring devices
- Clinical database system increasingly connected to laboratories, community health information networks (CHIN’s), physician’s office, insurance and HMO’s

Platform choices?
- Overwhelming trend is to introduce standard PC’s and workstations, standard Internet technologies
- Forced migration from dedicated platforms to shared, standard network platforms
- Web access now common from PC’s that run clinical database software

… blurring the distinction
- Increasingly, see monitoring network cross-connected to the clinical data network
- Some physical isolation: not yet common to see an IV perfusion drip controllable over an internet within a hospital
- “Perimeter” security using passwords, firewalls. But medical security needs are unusual; mismatched to standard solutions.

Creep of “critical” role
- Technically, clinical data system is non-critical
- But increasingly, the system actually is critical: doctors and nurses depend upon the
  - Accuracy and timeliness of reporting
  - Correct data for lab results, vitals, medications
- FDA is simply late to catch up with trends
- Moreover, already seeing Windows 95 and MS Access as basis for such systems

Consumer / society “pull”
- Intensive and growing cost pressures
- Desire for freedom from medical system, home care
- Consolidation of hospitals
- HMO’s want to control care plan
- … create trend towards remote telemedicine, even robot telesurgery, CHIN’s

Vision: A Virtual Private Network
- Application shares the network with untrusted agents but is isolated from them.

Reality?
- Current VPN support approximates this, but configuration potentially awkward, slow
- Many CHIN’s won’t use VPN’s
- By running over the Internet, CHIN’s are exposed to bandwidth fluctuations and denial of service from many causes

Good stories?
- Many cases of security or privacy violations (EMTEK has a good one).
- HP told me that some hackers accidentally disrupted a cardiac monitor in the Boston area a few years ago (trying to track this down)
- Nutty nursing aide in Britain changed orders, discharged patients, scheduled tests…
- HP Careview, starved for bandwidth, flickers on and offline in some critical care units...

Broad picture?
- Application trends outstripping technology
- Decision making is by societal consensus, cost pressures, reflects HMO needs.
- Hospital executives insisting on “standards”
- Hospital network of future: PC’s, off-the-shelf Internet software, standard Web stuff.
- Critical or not, like it or not, it’s happening!

What about aviation?
- Much use of computer technologies
  - Flight management system (controls airplane)
  - Flaps, engine controls (critical subsystems)
  - Navigational systems
  - TCAS (collision avoidance system)
- Air traffic control system on ground
  - In-flight, approach, international “hand-off”
  - Airport ground system (runways, gates, etc)

ATC system components
[Diagram: onboard, radar, X.500 directory, controllers, air traffic database (flight plans, etc)]

… similar turmoil
- On-board systems moving to COTS, integrated modular avionics
- Boeing 777 SafeBus a success story
- Unlikely it could be replicated with standard O/S and standard ATM or LAN hardware
- Emergence of “4-D navigation” (free flight) systems: ground network penetrates level-A critical cockpit components.

Free flight
[Diagram: transponder and GPS, ground system, on-board conflict alert and resolution system]

Future avionics systems
- Ground systems rely increasingly on automation, have form of a highly available, highly critical network.
  - Built using standard PC’s, software tools
  - Ground network becomes critical to flight safety
- On-board avionics are basically a dedicated real-time LAN built with standard PC’s but perhaps non-standard O/S.
  - One platform, many apps.
  - Safety validation of components replaces current validation of system. Think “plug ‘n play”

The list goes on
- Disaster warning and response coordination
- Power management (grid control)
- Banking, stock markets, trading systems
- Computer-controlled vehicles
- Military intelligence, command and control
- Critical business applications

Commercial Off The Shelf
- Build using “COTS”
  - Standard components
  - Buy off the shelf, then harden them
- Intended to be cheaper, easier to maintain
- As a practical matter, there is nothing else on the shelf!
- Roll-your-own solutions abandon powerful tools that make modern computing great!

Technology Mountain
[Diagram labels: COTS, Reliable]

Next Generation Internet
- Current Internet looks frail
- Only government investment can address security, reliability, scalability and performance problems of the Internet
- Expectation is that we’ll build it quickly, hence that we “basically” know how today

Next Generation Internet
- Concrete details?
  - Seeks 10 to 100-fold performance improvement
  - Originally expected to provide IPv6 interface
  - Originally expected to implement
    - Long IP addresses
    - IPsec, DNSSEC
    - Quality of Service options “over” some form of Diffserv (or RSVP) mechanism

Reality check
- Both IPv6 and RSVP now uncertain due to resistance from mainstream IPv4 crowd
  - RSVP resource use on routers grows as O(n^2)
  - IPv6 would outmode a huge existing investment
- How likely is it that the NGI will solve the practical problems identified earlier?
- How does one build a “secure, reliable, scalable, high performance” network application, anyhow? Do we in fact know how to do this?

Glimpse of the IPv4 crowd
- They gave us TCP/IP, core internet services, stuff on which we run email, web
- They elevate the end-to-end argument to a religion (basically: “packets, not circuits”)
- Little experience with critical applications

What about QoS?
- Best scheme: Diffserv
  - Uses an edge-classification of packets; routers look at just a byte or two
- But routers distort flow dynamics
  - You send 50 packets per second…
  - … but within the network, a router might see a burst of 100, then a second of silence
- Consequence is that Diffserv will be at best stochastic (and it also can’t handle routing changes)

… a troubling implication
- It seems unlikely that the NGI will easily support isolation of critical subsystems with the range of properties required
- More likely: a tool for building virtual circuits (one-one connections) that run at very high speeds
- Missing “connection” is the step from the network to the robust application

What do we need?
- Isolation of functions
  - Critical functionality compartmentalized
  - Components only interact through well-defined interfaces with well-defined semantics
  - Developer “proves” that implementation respects interface definition and semantics
- On the other hand, adequate performance is fundamental to providing robustness

Evidence for these claims?
- This is how modern avionics modules are built (wing flap and engine control, flight management system, inertial navigation)
- Process is extremely costly and works only for very small pieces of software
- SafeBus on Boeing 777 allows such software to share platform by creating very strong firewalls between components

Agenda emerges
- Find ways to divide and conquer
  - Transform big nasty system into smaller independent modules
  - Run them in an environment that has strong properties, which the modules exploit
  - Resulting system has strong properties too
- Can we apply this to familiar distributed computing problems?

Philosophy
- Imagine a network as an abstract data type
  - An “Overlay Network” or ON
- We can instantiate it multiple times, “condition” each copy with desired quality properties:
  - A Virtual Overlay Network or VON
- How to introduce properties?
  - Mixture of resource reservation at routers, on a per-ON basis, and management actions at edges

A VON
- Looks like a dedicated Internet, although hosted on a shared infrastructure
- Supports guarantees of properties such as
  - Bandwidth
  - Noise level
  - Security and freedom from denial of service
- Treated as an aggregate, not a set of pt-to-pt connections!

Making Vision a Reality
1) NGI needs to give us the ON mechanism
2) We need to implement VONs using fairly standard protocols over the base ONs
3) Must be able to produce specialized solutions for reliability/security needs
4) Solutions amenable to selective use of formal tools

NGI hooks?
- Diffserv and RSVP won’t do it
  - Creates an O(n^2) resource reservation problem
  - Problem is that both schemes are fundamentally connection oriented, and VON concept is fundamentally multipoint in nature
  - Hence these point-to-point QoS mechanisms are not suitable for supporting VONs

Any other options?
- Switches supporting “flows” already exist
  - MCI, Sprint, AT&T already sell each other dedicated bandwidth with isolation
  - This is on a scale of perhaps 10’s of flows and hence classification is easy
- VONs might mean that a switch would see thousands, but such scaling seems well within technical feasibility

Router understands flows
[Diagram: what the router “looks like” versus what it “acts like”, with Flow 1, Flow 2, Flow 3 and “Everything else”]

Things to notice
- A flow in this sense aggregates all the traffic for one ON; the identifier is for the ON, not the endpoints
- Classification task is thus much smaller, and resources needed to support this are linear in the number of ONs that pass through the switch, not the number of potential connections

Each ON is like a dedicated network
- An ON has
  - A bandwidth guarantee (router sets resources aside on its behalf)
  - Perhaps latency guarantees
  - Can offer isolation between flows
- But not much else

NGI part of the picture
- NGI needs to give us “raw” ON’s but also:
  - Robust routing infrastructure
  - Naming
  - Ability to build an ON tolerant of “one link or router failure”
- Many building blocks are already in place
- But the core Internet community is balking on all forms of QoS: isolation or other guarantees seen as inconsistent with end-to-end philosophy

But suppose we get our wish
- Next President declares “moral equivalent of war” after continuing Internet siege shuts down his web site during election: “Let there be Overlay Networks!”
- Then what?

Our new goal?
- Create VONs by adding properties to ONs
- User sees VON as a set of end-points with minimum guarantees, like isolation, between them
- We need a way to strengthen these properties
  - E.g.
    manage security keys, manage RSVP parameters, routing, network name space
- We may also need ways to reliably communicate (1-1, 1-n patterns)

VONs as abstract data types
- Focus on the processes and network
- Think of the ON interface as an abstract type
- “Add” encryption by substituting a module that looks the same but encrypts messages
[Diagram: three endpoints, each above an ON module, then each above an encrypt module]

Layered Microprotocols in Horus
- Interface to Horus is extremely flexible
- Horus manages group abstraction
- Group semantics (membership, actions, events) defined by stack of modules
- Horus / Ensemble stacks plug-and-play modules to give design flexibility to developer
[Diagram: modules vsync, filter, encrypt, ftol, sign]

Same stack under each endpoint
[Diagram: an ftol / vsync / encrypt stack under each of three endpoints]

Multiple VONs in single application
- Yellow group for video communication
- Green for control and coordination
[Diagram: ftol / vsync / encrypt stacks for each group]

Examples of reliability models
- Virtual synchrony model: emerged from our work on Isis, now widely accepted
- Bimodal multicast model: probabilistic and has neat performance properties but weaker logical consistency guarantees
- Secure group communication
- Multimedia channels…

Virtual Synchrony Model
[Diagram: timeline of processes p, q, r, s, t with group views G0={p,q}, G1={p,q,r,s}, G2={q,r,s}, G3={q,r,s,t}. Events: r, s request to join; r, s added, state xfer; p fails (crash); t requests to join; t added, state xfer]

Virtual Synchrony Tools
- Various forms of replication:
  - Replicated data, replicate an object, state transfer for starting new replicas...
  - 1-many event streams (“network news”)
- Load-balanced and fault-tolerant request execution
- Management of groups of nodes or machines in a network setting

Stock Exchange Problem: Vsync. multicast is too “fragile”
- Most members are healthy….
- … but one is slow

Measured Impact of Perturbation
[Figure: “Effect of Perturbation”; throughput (msgs/sec) versus amount perturbed]

The problem gets worse as the system scales up
[Figure: “Virtually synchronous Ensemble multicast protocols”; average throughput on nonperturbed members versus perturb rate, for group sizes 32, 64 and 96]

Why does stability matter?
- Swiss Stock Exchange
  - Exchange is fully electronic [FTCS-27 paper]
  - Uses Isis SDK to distribute all bids/offers and all trades. Every “node” has the picture
  - But this means that entire trading history available to 50 member banks & firms and hundreds or thousands of traders!
  - Unstable node could bring exchange to its knees.
- Similar issues seen in many other settings

Pbcast has a probabilistic reliability model
- Either almost all destinations receive the message or almost none do so
- This is strong enough to use in applications with critical reliability needs (but not necessary for all their communication purposes; put side by side with virtual synchrony)

[Figure 5: Graphs of analytical results. Panels: “Pbcast bimodal delivery distribution” (p{#processes=k} versus number of processes to deliver pbcast), “Scalability of Pbcast reliability”, “Effects of fanout on reliability”, “Fanout required for a specified reliability”. Other axis labels: P{failure}, fanout, #processes in system. Legend entries: Predicate I, Predicate II, Predicate I for 1E-8 reliability, Predicate II for 1E-12 reliability]

Pbcast has stable throughput
- Gets this from a mixture of gossip-style local repair with several innovations to avoid overload when some process fails
- We implemented the protocol and experimentally confirmed this

[Figure: “Histogram of throughput for Ensemble's FIFO Virtual Synchrony Protocol” and “Histogram of throughput for pbcast”; probability of occurrence versus inter-arrival spacing (ms); series for Traditional Protocol and Pbcast, each with .05 and .45 sleep probability; throughput measured at unperturbed process]

[Figure: “High Bandwidth measurements with varying numbers of sleepers”; x-axis: probability of sleep event; series for Traditional and Pbcast with 1, 3 and 5 sleepers]

[Figure: throughput (msgs/sec) versus perturb rate; panels: 8 nodes - 2 perturbed processes, 16 nodes - 4 perturbed processes, 96 nodes - 24 perturbed processes, 128 nodes - 32 perturbed processes]

Now we have several styles...
- Each style or model yields a VON with different properties
- Application might not “see” the multicast stack
  - Instead, the environment in which the application runs could see the stack and use it on behalf of the application
  - For example, a library could use stack to maintain the keys with which it authenticates actions…

Formal methods
- With so much riding on VON, we need strong guarantees that the stack works!
- If protocols can be formally proved correct, confidence will be far stronger
- Can we use formal tools on network protocols built in this compositional manner?

Exploiting formal methods
- Van Renesse and Hayden: code stack with language having strong semantics
  - They used O’Caml dialect of ML
- Now we can bring formal tools to bear on issues of correctness:
  - Using Nuprl system for this
  - Basically, it automates proofs and program transformations

Initial Progress?
- Presented in 1999 ACM SOSP paper
- Have formalized the transformations used to optimize stacks for high performance
- We show that from one initial stack, we can produce multiple optimized stacks for common cases. Yields big speedups!
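
To make the idea of collapsing a layered stack into one function and specializing it for a common case more concrete, here is a minimal Python sketch. It is not the Ensemble or Nuprl code (which the slides say is written in O’Caml and checked with formal tools). The toy layers `sign`, `encrypt` and `frame`, the helpers `compose` and `specialize_for_length`, and the choice of "fixed message length" as the common case are all hypothetical stand-ins, and equivalence is only spot-checked on sample inputs, not proved.

```python
from functools import reduce
from typing import Callable, List

Layer = Callable[[bytes], bytes]

def sign(msg: bytes) -> bytes:
    # Toy "signature": append a one-byte checksum (stand-in for a real MAC).
    return msg + bytes([sum(msg) % 256])

def encrypt(msg: bytes) -> bytes:
    # Toy "encryption": XOR each byte with a fixed key (not real cryptography).
    return bytes(b ^ 0x5A for b in msg)

def frame(msg: bytes) -> bytes:
    # Prefix a two-byte big-endian length header.
    return len(msg).to_bytes(2, "big") + msg

def compose(layers: List[Layer]) -> Layer:
    """Collapse a stack of layers into one function; layers[0] runs first."""
    return reduce(lambda f, g: lambda m: g(f(m)), layers, lambda m: m)

def specialize_for_length(n: int, generic: Layer) -> Layer:
    """Hand-specialized send path for the assumed common case len(msg) == n."""
    header = (n + 1).to_bytes(2, "big")  # constant once n is fixed
    def fast(msg: bytes) -> bytes:
        if len(msg) != n:
            return generic(msg)  # uncommon case: fall back to the full stack
        checksum = sum(msg) % 256
        # One fused pass: sign and encrypt together, then the precomputed header.
        return header + bytes(b ^ 0x5A for b in msg) + bytes([checksum ^ 0x5A])
    return fast

generic_send = compose([sign, encrypt, frame])
fast_send = specialize_for_length(8, generic_send)

# Spot-check (not a proof) that both paths agree on sample messages.
for sample in (b"12345678", b"abcdefgh", b"short", b""):
    assert fast_send(sample) == generic_send(sample)
```

The design point mirrored here is the guard: the optimized path is only taken when the common-case predicate holds, and everything else falls through to the original stack, so the two remain interchangeable from the caller's point of view.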

Steps
- Transform Ensemble stack into a single function in a functional style
- Use partial evaluation to produce optimized version for common cases
- Use theorem proving to establish that stacks provide desired properties
- Transform back to imperative style
- Resulting code is optimized yet retains properties of original stack

Optimization Example
[Diagram: “Common case?” test alongside ftol / vsync / encrypt stacks]
- Original code is simple but inefficient
- Optimized code for “common case” is provably equivalent yet inefficiencies are eliminated
- We do nearly as well as hand-optimization and can automatically handle much bigger stacks!

Wrapping things up
- By building better networks, and isolating protocol components and system components…
- … and adopting a modular architecture
- … and selectively using formal methods
- … we make it more and more practical to gain both high performance and other desired properties, such as reliability, security, stability, etc.

Potential
- NGI lets critical applications share network with untrusted ones
[Diagram label: VONs]

But will it happen?
- Current political agenda focuses on speed and e-commerce transactions
- End-to-end community resists giving any guarantees no matter how simple
- And NGI focus is exclusively on point-to-point QoS, which seems unscalable
- … denying us the one primitive building block on which the whole concept depends!

Conclusions?
- The world needs better networks!
- Improve them by improved opportunity for modularity, isolation, guarantees of security and quality of service: VONs and layers built over them
- Lacking this, we face very serious problems simply going forward in directions to which society is already committed.

More info
http://www.cs.cornell.edu/ken/unsafe.ps
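
As a back-of-the-envelope illustration of why point-to-point QoS "seems unscalable" while per-overlay flows do not, the following Python sketch counts reservation entries under two assumptions that are not taken from the slides: in the point-to-point case every pair of endpoints in an overlay needs its own reservation at a router they all traverse, and in the per-overlay case that router keeps one aggregate entry per ON. The workload (10 overlays of 100 endpoints) and the function names are hypothetical.

```python
from typing import List

def pairwise_reservations(overlay_sizes: List[int]) -> int:
    """Entries if every pair of endpoints in each overlay is reserved separately."""
    return sum(n * (n - 1) // 2 for n in overlay_sizes)

def per_overlay_reservations(overlay_sizes: List[int]) -> int:
    """Entries if the router keeps a single aggregate flow per overlay."""
    return len(overlay_sizes)

# Hypothetical workload: 10 overlays of 100 endpoints each.
sizes = [100] * 10
print("point-to-point:", pairwise_reservations(sizes))
print("per-overlay:   ", per_overlay_reservations(sizes))
```

Under these assumptions the first count grows quadratically with overlay size while the second depends only on the number of overlays, which is the contrast the talk draws between connection-oriented reservation and aggregate flows.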