High Performance Data Streaming in a Service Architecture
Jackson State University Internet Seminar, November 18 2004
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
gcf@indiana.edu
http://www.infomall.org
http://www.grid2002.org

Abstract
We discuss a class of HPC applications characterized by large scale simulations linked to large data streams coming from sensors, data repositories and other simulations.
Such applications will increase in importance to support “data-deluged science”.
We show how Web service and Grid technologies offer significant advantages over traditional approaches from the HPC community.
We cover Grid workflow (contrasting it with dataflow) and how Web Service (SOAP) protocols can achieve high performance.

Parallel Computing
Parallel processing is built on breaking problems up into parts and simulating each part on a separate computer node
There are several ways of expressing this breakup into parts with software:
• Message Passing as in MPI or
• OpenMP model for annotating traditional languages
• Explicitly parallel languages like High Performance Fortran
And several computer architectures are designed to support this breakup:
• Distributed Memory with or without custom interconnect
• Shared Memory with or without good cache
• Vectors with usually good memory bandwidth

The Six Fundamental MPI routines
MPI_Init (argc, argv) -- initialize
MPI_Comm_rank (comm, rank) -- find process label (rank) in group
MPI_Comm_size (comm, size) -- find total number of processes
MPI_Send (sndbuf,count,datatype,dest,tag,comm) -- send a message
MPI_Recv (recvbuf,count,datatype,source,tag,comm,status) -- receive a message
MPI_Finalize( ) -- End Up

Whatever the Software/Parallel Architecture …..
The software is a set of linked parts
• Threads, processes sharing the same memory, or independent programs on different computers
And the parts must pass information between them in order to synchronize themselves and ensure they really are working on the same problem
The same of course is true in any system
• Neurons pass electrical signals in the brain
• Humans use a variety of information passing schemes to build communities: voice, book, phone
• Ants and Bees use chemical messages
Systems are built of parts, and in interesting systems the parts communicate with each other; this communication expresses “why it is a system” and not a bunch of independent bits

A Picture from 20 years ago

Passing Information
Information passing between parts covers a wide range in size (number of bits electronically) and “urgency”
Communication Time = Latency + (Information Size)/Bandwidth
From Society we know that we choose multiple mechanisms with different tradeoffs
• Planes have high latency and high bandwidth
• Walking has low latency but low bandwidth
• Cars are somewhat in between these cases
We can always think of information being transferred as a message
• Whether an airplane passenger, sound waves or a posted letter
• Whether an MPI message, a UNIX pipe between processes or a method call between threads

Parallel Computing and Message Passing
We worked very hard to get a better programming model for parallel computing that removed the need for the user to
• Explicitly decompose the problem and derive a parallel algorithm for the decomposed parts
• Write MPI programs expressing the explicit decomposition
This effort wasn’t so successful, and on distributed memory machines (including BlueGene/L) at least, message passing of MPI style is the execution model even if one uses a higher level language
So for parallelism, we are forced to use message passing, and this is efficient but intellectually hard

The Latest Top 5 in Top500

What about Web Services?
• Web Services are distributed computer programs that can be in any language (Fortran .. Java .. Perl .. Python)
• The simplest implementations involve XML messages (SOAP) and programs written in net friendly languages like Java and Python
• Here is a typical e-commerce use:
[Figure labels: Payment, Credit Card, Security, Catalog, Warehouse, shipping; services exposed through WSDL interfaces]

Internet Programming Model
Web Services are designed as the latest distributed computing programming paradigm, motivated by the Internet and the expectation that enterprise software will be built on the same software base
Parallel Computing is centered on DECOMPOSITION
Internet Programming is centered on COMPOSITION
The components of e-commerce (catalog, shipping, search, payment) are NATURALLY separated (although they are often mistakenly integrated in older implementations)
These same components are naturally linked by Messages
MPI is replaced by SOAP and the COMPOSITION model is called Workflow
Parallel Computing and the Internet have the same execution model (processes exchanging messages) but very different REQUIREMENTS

Requirements for MPI Messaging
[Figure: timeline labeled tcalc, tcomm, tcalc]
MPI and SOAP Messaging both send data from a source to a destination
• MPI supports multicast (broadcast) communication
• MPI specifies destination and a context (in comm parameter)
• MPI specifies data to send
• MPI has a tag to allow flexibility in processing in source processor
• MPI has calls to understand context (number of processors etc.)
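The pieces of an MPI message listed above (destination, context, data, tag, and calls to query the context) can be sketched with a toy in-process communicator. This is only an illustration using Python threads, not MPI and not any real MPI binding: `ToyComm`, `Message` and `run_ring` are hypothetical names, and the ring exchange is an assumed example that does not come from the slides.

```python
import threading
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Message:
    source: int  # rank of the sender
    tag: int     # user-chosen label used to match a send with a receive
    data: Any    # the payload


class ToyComm:
    """Hypothetical stand-in for an MPI communicator: a context plus a group of ranks."""

    def __init__(self, size: int) -> None:
        self.size = size  # the number a call like MPI_Comm_size would report
        self._pending: List[List[Message]] = [[] for _ in range(size)]
        self._cond = threading.Condition()

    def send(self, data: Any, source: int, dest: int, tag: int) -> None:
        # Deposit the message in the destination's mailbox and wake any waiters.
        with self._cond:
            self._pending[dest].append(Message(source, tag, data))
            self._cond.notify_all()

    def recv(self, rank: int, source: int, tag: int) -> Any:
        # Block until a message with the requested source and tag has arrived.
        with self._cond:
            while True:
                for i, msg in enumerate(self._pending[rank]):
                    if msg.source == source and msg.tag == tag:
                        return self._pending[rank].pop(i).data
                self._cond.wait()


def run_ring(size: int = 4) -> List[Any]:
    """Each rank sends rank*10 to its right neighbour and receives from its left."""
    comm = ToyComm(size)
    received: List[Any] = [None] * size

    def worker(rank: int) -> None:
        right = (rank + 1) % comm.size
        left = (rank - 1) % comm.size
        comm.send(rank * 10, source=rank, dest=right, tag=7)
        received[rank] = comm.recv(rank, source=left, tag=7)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(run_ring(4))
```

The tag and source are matched on the receiving side here, which is the role the `source` and `tag` arguments play in the MPI_Recv signature quoted earlier.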
MPI requires very low latency and high bandwidth so that tcomm/tcalc is at most 10
• BlueGene/L has bandwidth between 0.25 and 3 Gigabytes/sec/node and latency of about 5 microseconds
• Latency determined so Message Size/Bandwidth > Latency

BlueGene/L MPI I, II and III
[Three figure slides from http://www.llnl.gov/asci/platforms/bluegene/papers/6almasi.pdf; the third is annotated “500 Megabytes/sec”]

Requirements for SOAP Messaging
Web Services have much the same requirements as MPI, with two differences where MPI is more stringent than SOAP
• Latencies are inevitably 1 (local) to 100 milliseconds, which is 200 to 20,000 times that of BlueGene/L
  1) 0.000001 ms – CPU does a calculation
  2) 0.001 to 0.01 ms – MPI latency
  3) 1 to 10 ms – wake-up a thread or process
  4) 10 to 1000 ms – Internet delay
• Bandwidths for many business applications are low, as one just needs to send enough information for ATM and Bank to define transactions
SOAP has MUCH greater flexibility in areas like security, fault tolerance and “virtualizing addressing” because one can run a lot of software in 100 milliseconds
• Typically takes 1-3 milliseconds to gobble up a modest message in Java and “add value”

Ways of Linking Software Modules
[Figure: three ways of linking modules]
• Closely coupled Java/Python …: Module A and Module B, METHOD CALL BASED, .001 to 1 millisecond
• Coarse Grain Service Model: Service A and Service B exchange Messages, MESSAGE BASED, 0.1 to 1000 millisecond latency
• EVENT BASED with brokered messages: a Publisher posts events and a “Listener” subscribes to events, with Service A and Service B linked through a “Message Queue in the Sky”

MPI and SOAP Integration
Note SOAP specifies format, and interfaces through WSDL
MPI only specifies interface, and so interoperability between different MPIs requires additional work
• IMPI http://impi.nist.gov/IMPI/
Pervasive networks can support high bandwidth (Terabits/sec soon) but the latency issue is not resolvable in a general way
Can combine MPI interfaces with SOAP messaging but I don’t think this has been done
Just as walking, cars, planes and phones coexist with different properties, so SOAP and MPI are both good and should be used where appropriate

NaradaBrokering http://www.naradabrokering.org
We have built a messaging system that is designed to support traditional Web Services but has an architecture that allows it to support high performance data transport as required for scientific applications
• We suggest using this system whenever your application can tolerate 1-10 millisecond latency in linking components
• Use MPI when you need much lower latency
Use SOAP approach when MPI interfaces are required but latency is high
• As in linking two parallel applications at remote sites
Technically it forms an overlay network, supporting in software features often done at IP level

[Figure: Mean transit delay for message samples in NaradaBrokering, different communication hops (hop-2, hop-3, hop-5, hop-7). Axes: Transit Delay (Milliseconds) against Message Payload Size (Bytes)]
Test setup: Pentium-3, 1GHz, 256 MB RAM, 100 Mbps LAN, JRE 1.3, Linux

[Figure: Standard deviation for message samples in NaradaBrokering, different communication hops (hop-2, hop-3, hop-5, hop-7), internal machines. Horizontal axis: Message Payload Size (Bytes)]

Average Video Delays for one broker; divide by N for N load balanced brokers
[Figure: Latency (ms) against # Receivers, for one session and for multiple sessions, at 30 frames/sec]

NB-enhanced GridFTP
Adds Reliability and Web Service Interfaces to GridFTP
Preserves parallel TCP performance and offers choice of transport and Firewall penetration

Role of Workflow
[Figure labels: Service-1, Service-2, Service-3]
Programming SOAP and Web Services (the Grid):
Workflow describes linkage between services
As distributed, linkage must be by messages
Linkage is two-way and has both control and data
Apply to multi-disciplinary, multi-scale linkage, multi-program linkage, link visualization
to simulation, GIS to simulations and visualization filters to each other
Microsoft-IBM specification BPEL is the currently preferred Web Service XML specification of workflow

Example workflow
Here a sensor feeds a datamining application (we are extending datamining in DoD applications with Grossman from UIC)
The data-mining application drives a visualization

Example Flood Simulation workflow
[Figure labels: Data Archives, Runoff Model, Flow Model, GIS, Grid Services; captions: “Link Distributed Data and Applications”, “SOAP Messages And Events”]

SERVOGrid Codes, Relationships
[Figure labels: Elastic Dislocation Inversion, Viscoelastic FEM, Viscoelastic Layered BEM, Elastic Dislocation, Pattern Recognizers, Fault Model BEM]
This linkage is called Workflow in Grid/Web Service parlance

Two-level Programming I
• The Web Service (Grid) paradigm implicitly assumes a two-level Programming Model
• We make a Service (same as a “distributed object” or “computer program” running on a remote computer) using conventional technologies
  – C++, Java or Fortran Monte Carlo module
  – Data streaming from a sensor or Satellite
  – Specialized (JDBC) database access
• Such services accept and produce data from users, files and databases
[Figure labels: Service, Data]
• The Grid is built by coordinating such services, assuming we have solved the problem of programming the service

Two-level Programming II
The Grid is discussing the composition of distributed services with the runtime interfaces to Grid as opposed to UNIX pipes/data streams
[Figure labels: Service1, Service2, Service3, Service4]
Familiar from use of UNIX Shell, PERL or Python scripts to produce real applications from core programs
Such interpretative environments are the single processor analog of Grid Programming
Some projects like GrADS from Rice University are looking at integration between service and composition levels but the dominant effort looks at each level separately

3 Layer Programming Model
• Application (level 1 Programming): MPI, Fortran, C++ etc.
• Application Semantics (Metadata, Ontology), Level 2 “Programming”: Semantic Web
• Basic Web Service Infrastructure: Web Service 1, WS 2, WS 3, WS 4
• Workflow (level 3) Programming: BPEL
Workflow will be built on top of NaradaBrokering as messaging layer

Structure of SOAP
• SOAP defines a very obvious message structure with a header and a body, just like email
• The header contains information used by the “Internet operating system”
  – Destination, Source, Routing, Context, Sequence Number …
• The message body is partly further information used by the operating system and partly information for the application; the latter is not looked at by the “operating system” except to encrypt it, compress it etc.
  – Note WS-Security supports separate encryption for different parts of a document
• Much discussion in the field revolves around what is referenced in the header
• This structure makes it possible to define VERY sophisticated messaging

Deployment Issues for “System Services”
“System Services” (handlers/filters) are ones that act before the real application logic of a service
They gobble up part of the SOAP header identified by the namespace they care about and possibly part or all of the SOAP body
• e.g. the XML elements in header from the WS-RM namespace
They return a modified SOAP header and body to the next handler in the chain
[Figure: a SOAP message (Header, Body) passing through a chain of handlers: WS-RM Handler, WS-…….. Handler; e.g. ……. could be WS-Eventing, WS-Transfer ….]

Fast Web Service Communication I
• Internet Messaging systems allow one to optimize message streams at the cost of “startup time”
• Web Services can deliver the fastest possible interconnections with or without reliable messaging
• Typical results from Grossman (UIC) comparing Slow SOAP over TCP with binary and UDP transport (latter gains a factor of 1000)

Record Count   SOAP/XML (Pure SOAP)      WS-DMX/ASCII (SOAP over UDP)   WS-DMX/Binary (Binary over UDP)
               MB      µ       σ/µ       MB      µ       σ/µ            MB      µ       σ/µ
10000          0.93    2.04    6.45%     0.5     1.47    0.61%          0.28    1.45    0.38%
50000          4.65    8.21    1.57%     2.4     1.79    0.50%          1.4     1.63    0.27%
150000         13.9    26.4    0.30%     7.2     2.09    0.62%          4.2     1.94    0.85%
375000         34.9    75.4    0.25%     18      3.08    0.29%          10.5    2.11    1.11%
1000000        93      278     0.11%     48      3.88    1.73%          28      3.32    0.25%
5000000        465     7020    2.23%     242     8.45    6.92%          140     5.60    8.12%

Fast Web Service Communication II
• Mechanism only works for streams (sets of related messages)
• SOAP header in streams is constant except for sequence number (Message ID), time-stamp ..
• One needs two types of new Web Service Specification:
• “WS-StreamNegotiation” to define how one can use WS-Policy to send messages at start of a stream to define the methodology for treating remaining messages in stream
• “WS-FlexibleRepresentation” to define new encodings of messages

Fast Web Service Communication III
• Then use “WS-StreamNegotiation” to negotiate stream in Tortoise SOAP (ASCII XML over HTTP and TCP)
  – Deposit basic SOAP header through connection; it is part of context for stream (linking of 2 services)
  – Agree on firewall penetration, reliability mechanism, binary representation and fast transport protocol
  – Naturally transport UDP plus WS-RM
• Use “WS-FlexibleRepresentation” to define encoding of a Fast transport (on a different port) with messages just having “FlexibleRepresentationContextToken”, Sequence Number, Time stamp if needed
  – RTP packets have essentially this structure
  – Could add stream termination status
• Can monitor and control with original negotiation stream
• Can generate different streams optimized for different end-points

Data Deluged Science
In the past, we worried about data in the form of parallel I/O or MPI-IO, but we didn’t consider it as an enabler of new algorithms and new ways of computing
Data assimilation was not central to HPCC
DoE ASC was set up because it didn’t want test data!
Now particle physics will get 100 petabytes from CERN
• Nuclear physics (Jefferson Lab) in same situation
• Use around 30,000 CPUs simultaneously 24x7
Weather, climate, solid earth (EarthScope)
Bioinformatics curated databases (Biocomplexity only 1000’s of data points at present)
Virtual Observatory and SkyServer in Astronomy
Environmental Sensor nets

Weather Requirements

Data Deluged Science Computing Paradigm
[Figure labels: Data, Assimilation, Information, Simulation, Informatics, Model, Ideas, Computational Science, Datamining, Reasoning]

Virtual Observatory Astronomy Grid
Integrate Experiments
[Figure labels: Radio, Far-Infrared, Visible, Dust Map, Visible + X-ray, Galaxy Density Map]

DAME: Data Deluged Engineering
Rolls Royce and UK e-Science Program: Distributed Aircraft Maintenance Environment
[Figure labels: In flight data; ~5000 engines; ~ Gigabyte per aircraft per Engine per transatlantic flight; Airline; Global Network such as SITA; Ground Station; Engine Health (Data) Center; Maintenance Centre; Internet, e-mail, pager]

USArray Seismic Sensors

[Figure labels: Site-specific Irregular Scalar Measurements; Constellations for Plate Boundary-Scale Vector Measurements; Ice Sheets; Volcanoes; PBO; Greenland; Long Valley, CA; Topography 1 km; Stress Change; Northridge, CA; Earthquakes; Hector Mine, CA]

Data Deluged Science Computing Architecture
[Figure labels: OGSA-DAI Grid Services; Grid; Data Assimilation; HPC Simulation; Analysis, Control, Visualize; Distributed Filters massage data for simulation]

Data Assimilation
Data assimilation implies one is solving some optimization problem which might have Kalman Filter like structure
$$\min_{\text{Theoretical Unknowns}} \sum_{i=1}^{N_{obs}} \frac{\left(\text{Data}_i(\text{position}, \text{time}) - \text{Simulated\_Value}_i\right)^2}{\text{Error}_i^2}$$
Due to data deluge, one will become more and more dominated by the data (Nobs much larger than number of simulation points).
Natural approach is to form for each local (position, time) patch the “important” data combinations so that optimization doesn’t waste time on large error or insensitive data.
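The weighted least squares objective above can be sketched for the smallest possible case, as a hedged illustration only. The assumptions here are not from the slides: there is a single theoretical unknown `a`, the simulated value is linear in it (`Simulated_Value_i = a * response_i`), and the numbers are made up. The function names `chi_squared` and `fit_amplitude` are hypothetical.

```python
from typing import Sequence


def chi_squared(data: Sequence[float],
                simulated: Sequence[float],
                errors: Sequence[float]) -> float:
    """Sum over observations of ((Data_i - Simulated_Value_i) / Error_i)^2."""
    return sum(((d - s) / e) ** 2 for d, s, e in zip(data, simulated, errors))


def fit_amplitude(data: Sequence[float],
                  response: Sequence[float],
                  errors: Sequence[float]) -> float:
    """Closed-form minimizer of chi_squared when Simulated_Value_i = a * response_i.

    Setting the derivative with respect to a to zero gives
    a = sum(d*g/e^2) / sum(g*g/e^2).
    """
    num = sum(d * g / e ** 2 for d, g, e in zip(data, response, errors))
    den = sum(g * g / e ** 2 for g, e in zip(response, errors))
    return num / den


if __name__ == "__main__":
    # Made-up observations: the last one has a very large error bar.
    response = [1.0, 2.0, 3.0, 4.0]
    data = [2.0, 4.1, 5.9, 50.0]
    errors = [0.1, 0.1, 0.1, 100.0]

    a_all = fit_amplitude(data, response, errors)
    a_good = fit_amplitude(data[:3], response[:3], errors[:3])
    print("fit with all data:        ", a_all)
    print("fit without the noisy one:", a_good)
    print("chi^2 at the fit:", chi_squared(data, [a_all * g for g in response], errors))
```

In this toy case the large-error observation barely moves the fitted value, which is the sense in which an optimizer would waste time on large error or insensitive data.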
Data reduction is done in a natural distributed fashion, NOT on the HPC machine, as distributed computing is most cost effective if calculations are essentially independent
• Filter functions must be transmitted from the HPC machine

Distributed Filtering
$N_{obs,\,\text{local patch}} \gg N_{filtered,\,\text{local patch}} \approx \text{Number\_of\_Unknowns}_{\text{local patch}}$
In the simplest approach, filtered data are obtained by linear transformations on the original data, based on Singular Value Decomposition of the least squares matrix
[Figure: geographically distributed sensor patches on the Distributed Machine reduce Nobs (local patch 1) to Nfiltered (local patch 1) and Nobs (local patch 2) to Nfiltered (local patch 2); the HPC Machine factorizes the matrix to a product of local patches, sends the needed filter and receives the filtered data]
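One way to read the SVD-based filter is sketched below for a single local patch. It rests on assumptions that are not in the slides: the simulation is linearized, so that Simulated_Value_i = Σ_j A_ij x_j for the patch unknowns x, and the weighted matrix M has full column rank. The symbols A, M, b, x, y and N_unk (the number of unknowns on the patch) are introduced here for illustration.

```latex
% Assumption: linearized model, Simulated_Value_i = \sum_j A_{ij} x_j on one local patch.
\[
\chi^2(x) = \sum_{i=1}^{N_{obs}}
  \frac{\bigl(\mathrm{Data}_i - \sum_j A_{ij} x_j\bigr)^2}{\mathrm{Error}_i^2}
  = \lVert b - M x \rVert^2,
\qquad
b_i = \frac{\mathrm{Data}_i}{\mathrm{Error}_i},
\quad
M_{ij} = \frac{A_{ij}}{\mathrm{Error}_i}.
\]
% Thin SVD of the weighted least squares matrix (full column rank assumed).
\[
M = U \Sigma V^{T},
\qquad
U \in \mathbb{R}^{N_{obs} \times N_{unk}},
\quad
U^{T} U = I .
\]
% The filter is U^T: it maps N_obs weighted observations to N_unk filtered values.
\[
y = U^{T} b \in \mathbb{R}^{N_{unk}},
\qquad
\chi^2(x) = \lVert y - \Sigma V^{T} x \rVert^2
          + \lVert (I - U U^{T})\, b \rVert^2 .
\]
```

The last term does not depend on x, so under these assumptions the optimization only needs the N_unk filtered values y rather than all N_obs observations. The filter U^T is what would be sent to each sensor patch, and y is what would come back.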