Rapid Prototyping and Deployment of Distributed Web / Grid Services in a Service Oriented Architecture using Scripting
Thesis Proposal
Harshawardhan Gadgil
hgadgil@cs.indiana.edu
http://www.hpsearch.org

Outline
- Motivation
- Literature Survey
- Research Issues
- HPSearch Architecture
- Contributions and Milestones
- Applications
- Summary

Motivation
- Critical infrastructure systems connect disparate data sources, high-performance computing applications and visualization services for real-time data processing.
- Real-time data processing
  - Results are required in real time.
  - Data is available in streams.
  - Data requires pre-processing (e.g. filtering to remove unwanted parts).
- Scalability
  - Potentially large number of data sources (static, dynamic) or data processing elements (services)
- Unpredictable behavior
  - Fault tolerance is a key factor.
  - E.g. incorporate new data sources or processing units on the fly

Motivation (contd.)
- System Management
  - Increasing application complexity implies more metadata.
  - Proper management is required to ensure smooth functioning of the system.
  - Easy access is required to manage system characteristics.

Motivation: Streaming Data Processing
- Critical infrastructure systems (scientific applications)
  - Real-time streaming sources exist (e.g. sensors, satellite stations) OR static data sources (databases containing previously warehoused observations)
  - Data filtering / transformation is essential in most cases to convert data to the proper format for the processing application
  - Real-time processing required; crucial for critical infrastructure applications
- Audio/video applications
  - Real-time sources (e.g. collaborative sessions) OR static data sources (stored A/V files)
  - Pre-processing required to modify A/V characteristics: format (encoding), bit rate (quality), etc.
  - Real-time processing is crucial for collaborative environments

Literature Survey
- Services (Web / Grid)
- Scripting languages
  - Benefits
  - Possible problems
- Handling data flow in applications
  - File-based vs. streaming
- Workflow systems
  - Enable gluing high-performance components
  - GUI-based building and programming flavor
- Component-based architectures
- Messaging systems (for high-throughput data transfer)
- System management

Service
"Service is a logical manifestation of a logical / physical resource (DB, programs, devices, humans etc.) and/or some application logic exposed to network" - Web Service Grids: An Evolutionary Approach (2004)
- Web Services
  - Simple mechanism for distributed computing
  - Language independent, firewall friendly
- Grid Services
  - Are essentially Web Services
  - Transient (can be created, destroyed, or die naturally)
  - State: maintained between calls to the Web Service

Scripting Languages
- Benefits
  - Enable rapid prototyping (smaller code size, less development time)
  - Less effort to
    - Perform complex tasks
    - Interface with the OS (hosting environment)
    - Glue code to tie programs together
  - Usually portable
  - Primarily for plugging existing components together
- However, some disadvantages too
  - Weak typing
  - Less structure, difficult to maintain
- Some examples
  - Rhino (JavaScript for Java)
  - Perl, VBScript, (P/J)ython
- Scripting vs. GUI builders
  - GUI builders: ease of involvement of a novice design engineer
  - Scripting: provides more flexibility through direct access

Scripting Environments Hosting Services
- OGSI:Lite & WSRF:Lite
  - Based on Perl
  - Rapidly deploy grid services
- Matlab / Jython from GEODISE
  - GEODISE: suite of CAD integrated with distributed grid-enabled computing, data, analysis and knowledge resources
  - Uses Matlab to provide programmatic access to GEODISE functions, along with an existing suite of Matlab tools
  - Jython is used to provide a hosting environment using the Java CoG Kit.

Data Flow in Applications
- Real-time processing required.
- Typically, data transfer involves temporary storage of data.
  - This data may be transferred using files (e.g. GridFTP).
  - Every component of the chain reads data from an input file and writes processed data to an output file.
- Time and space are critical in real-time applications; hence file-based transfer is undesirable for them.
- Tools exist to automate data transfer and invoke applications (e.g. GridAnt, Karajan)

Workflow Architectures
- Triana: graphical PSE to compose scientific applications
  - Composed of one or more Triana engines.
  - Distributed version: data transfer takes place using JXTA pipes.
- Taverna
  - Can interact with arbitrary services.
  - Plugins mediate / operate the service in each case
  - Uses the XScufl (derived from WSFL) workflow language.
- Kepler
  - Java packages for design and execution.
  - Has a graphical interface for composing complex workflows
  - Can wrap existing code written in different languages, e.g. a Perl script or a Matlab script

Component Architectures
- XCAT @ IU-Extreme
  - Connects components (Provides and Uses ports)
  - Jython-based scripting for application management tasks (create application, set properties, invoke application)
  - Data transfer by GridFTP between components; Globus Reliable File Transfer (fault tolerance).
- Many other systems
  - Focus mainly on invocation of services, as in a workflow

Messaging Systems
- JXTA: P2P middleware; JMS for communication
- Pastry
  - Fault-tolerant P2P middleware
  - Based on distributed hash tables
  - No real-time routing possible
- NaradaBrokering @ IU (http://www.naradabrokering.org)
  - Event-brokering system designed to run on a large network of cooperating brokers.
  - Implements high-performance protocols (message transit time < 1 ms per broker)
  - Order-preserving, optimized message transport
  - Interface with reliable storage for persistent events
  - Fault-tolerant data transport
  - Support for different underlying transport implementations such as TCP, UDP, Multicast, SSL, RTP

System Management
- Increasing complexity of systems implies an increasing amount of metadata to be managed
- Provide access to the system and management of system metadata - WS-Management
  - E.g. performance metrics, logs, service metadata
- Require the ability to query system data and take actions affecting the characteristics of the system.
  - E.g. Perl provides hooks to query system data

Research Issues
- Support for streaming data processing.
  - Data transfer and processing in real time
  - Data transfer is carried out between the end-points (sender and recipient) without the flow engine mediating - Grid Services Flow Language
- Design a run-time system that allows merging data sources, data filtering and processing applications, and visualization tools in a service-oriented architecture
  - Assume all components are available as Web (Grid) services.
- Scalability is an issue: adding data sources or processing applications (services) should not degrade system performance
- Fault tolerance: services and data sources may be lost. Allow the system to detect faults and to discover and incorporate new components.

Research Issues
- System management interface: allow access to the system and manipulation of its characteristics by querying system metadata
  - Create a virtual topology for application deployment
  - Query performance metrics to design policies that change routing substrate characteristics (e.g. add new brokers, or links between existing brokers, to aid efficient routing)
  - Discover services / brokers / topics of interest, to dynamically rewire components with data streams.
  - Replay events: useful for achieving recovery after failure

HPSearch
- Binds URIs to a scripting language
  - We use Mozilla Rhino (a JavaScript implementation; see http://www.mozilla.org/rhino), but the principles may be applied to any other scripting language
- Every resource may be identified by a URI, and HPSearch allows us to manipulate the resource using that URI.
  - E.g. read from a web address and write it to a local file:

    x = "http://trex.ucs.indiana.edu/data.txt";
    y = "file:///u/hgadgil/data.txt";
    r = new Resource("Copier");
    r.port[0].subscribeFrom(x);   /* read from */
    r.port[0].publishTo(y);       /* write to */
    f = new Flow();
    f.addStartActivities(r);
    f.start("1");

- Support for the WS-Addressing construct is under investigation

HPSearch (contd.)
- Currently provides bindings for the following:
  - file://
  - socket://ip:port
  - http://, ftp://
  - topic://
  - jdbc:
- Host objects to do specific tasks
  - WSDL: invoke web services using SOAP
  - PerfMetrics: bind NaradaBrokering performance metrics; store published metrics and allow querying
  - Resource: every data source / filter / sink is a resource
  - Flow: create a data flow between resources
- For more information, visit http://www.hpsearch.org

Architecture
Consists of
- SHELL
  - Front end to scripting.
- TASK_SCHEDULER (FLOW_ENGINE)
  - Distributes tasks among co-operating engines for load-balancing purposes.
- WSPROXY
  - An AXIS web service that wraps an actual service.
  - The behavior of the service can be controlled by making simple WS calls to this proxy.
  - Can be controlled by any workflow engine
  - WSProxy handles streaming data communication on behalf of the service.
  - The service only sees input and output streams.
    - These could be files, a remote data stream, a file transferred via HTTP / FTP, or results from a database query
  - Can be deployed in standard Web Service containers (such as Tomcat)

Architecture: WSProxy Interfaces
- Runnable
  - More control over execution (start, suspend, resume, stop…)
  - Basic idea: read a block of data, process it, write it out
  - Ideal for designing quick filtering applications that process data in streams.
- Wrapped
  - Wraps an existing service (executables [*.exe], Matlab scripts, shell / Perl scripts, etc.)
  - Less control: can only start and stop
  - Ideal for wrapping existing programs / services to expose them as a pluggable component / web service

HPSearch Architecture Overview
[Architecture diagram. Labels: HPSearch Kernel (Request Handler, JavaScript Shell, Task Scheduler, Flow Handler, URIHandler, DBHandler, WSDLHandler, WSProxyHandler, Other Objects); Files; Sockets; Topics; DataBase; Web Service; Web Service EP; Broker Network; WSProxy and Service pairs; further HPSearch Kernels.]

So what is the overhead?
- Partial results as of now
- Measured on a 1.6 GHz Pentium 4 machine with 256 MB RAM running Java 1.4.1_02, NB version 0.98 rc2, Rhino 1.5R3
- Shell init: 2085 ms (average)
- Results from the RDAHMM script (26 lines, small script): about 15 ms per line to execute (average)
- Task distribution (2 engines, 4 tasks): 3897.645 ms
- WSProxy init (depends on the number of streams to initialize): 700 to 2000 ms (approximate values using System.currentTimeMillis)

Contribution of this Thesis
- Stream and service management: program data flows
  - Incorporate static and dynamic data sources
  - WSProxy ensures that data flows directly between components (services) without the HPSearch engine mediating.
    - Useful for streaming large amounts of data without clogging the controller.
  - Scalable?
    - We use NB as our messaging substrate, which can handle a large number of clients
    - All components (data sources, data processing and visualization applications) are clients.
    - HPSearch manages streams, and connects and steers components.
  - Fault tolerant?
    - Failure of a data source or data filter (processing application) is possible.
    - HPSearch can use the discovery service to invoke new services (in lieu of failed services) and reconnect components via streams to continue the data flow

Contribution of this Thesis (contd.)
- System management: scripting admin tasks
  - Creating a network (virtual broker network) topology
  - Querying performance metrics
  - Topic / broker discovery
- Rapid deployment of applications
  - Deploy network topology
  - Set application properties
  - Deploy application
- In short: provide alternative programmatic (scripting) access to remote services / resources

Milestones
- Implement a WS front end to the shell
  - Remotely submit a script for execution, possibly through a portal
- WSProxy / Handler: fault tolerance to handle situations when
  - The machine hosting the WSProxy dies
  - The broker used by the proxy dies
  - The HPSearch engine dies
- Design an application interface
  - Allow users to create applications using this interface
  - Set application properties; allow modification of application properties at runtime using scripting
- NB admin objects
  - NaradaBroker, PerfMetrics, NBDiscovery, ReplayService

Milestones (contd.)
- Design a stream negotiation module to allow WSProxy to negotiate stream characteristics
  - Select the best possible transport and other QoS elements for data transfer between two services (for a particular stream)
- Applications to demonstrate the use
  - Audio / video mixer application
  - Multiple data sources and data filtering applications joined in a chain.
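
The stream negotiation milestone could be pictured roughly as follows. This is only a sketch under stated assumptions, written in plain modern JavaScript rather than in the HPSearch shell: the function negotiateTransport, the TRANSPORTS table and its cost numbers are hypothetical and not part of HPSearch or NaradaBrokering. Only the transport names are taken from the transports listed for NaradaBrokering (Multicast is left out for brevity).

```javascript
// Illustrative transport properties. The reliable/secure flags reflect the usual
// behavior of these protocols; the cost numbers are made-up assumptions.
const TRANSPORTS = {
  UDP: { reliable: false, secure: false, cost: 1 },
  RTP: { reliable: false, secure: false, cost: 2 },
  TCP: { reliable: true, secure: false, cost: 3 },
  SSL: { reliable: true, secure: true, cost: 4 },
};

// Hypothetical helper: pick the cheapest transport that both endpoints support
// and that satisfies every required QoS flag. Returns null if none qualifies.
function negotiateTransport(senderTransports, receiverTransports, required) {
  // Keep only known transports offered by both sides.
  const common = senderTransports.filter(
    (name) =>
      receiverTransports.includes(name) &&
      Object.prototype.hasOwnProperty.call(TRANSPORTS, name)
  );
  // Drop transports that miss any flag the stream requires.
  const acceptable = common.filter((name) =>
    Object.keys(required).every(
      (flag) => !required[flag] || TRANSPORTS[name][flag] === true
    )
  );
  if (acceptable.length === 0) {
    return null;
  }
  acceptable.sort((a, b) => TRANSPORTS[a].cost - TRANSPORTS[b].cost);
  return acceptable[0];
}

// Example: a stream between two services that must be delivered reliably.
const chosen = negotiateTransport(
  ["TCP", "UDP", "SSL"],
  ["SSL", "TCP"],
  { reliable: true }
);
console.log(chosen); // prints TCP
```

A real negotiation module would presumably exchange these offers between two WSProxy instances and weigh more QoS elements than the two flags used here.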
Applications: Streaming Data Filtering
[Diagram. Labels: Sensor Source (GPS data); Data Filter (filters the input data to get only the estimate and error values); RDAHMM (analyzes the data); Matlab Plotting Script (graph); (Distributed) Services; HPSearch Kernel - TSE instances.]

Applications: Creating a Virtual Broker Network for Deploying Applications

    b = new NaradaBroker("school.cs.indiana.edu");
    b.create("");
    /* OR b.create("file:///u/hgadgil/alternateConfig.conf"); */
    b.connectTo("156.56.104.170", "5045", "t", "");
    b.requestNodeAddress("156-56-104-170.bl-dhcp.indiana.edu:5045", "0");

    c = new NaradaBroker("trex.ucs.indiana.edu");
    c.create("");
    c.connectTo("156.56.104.170", "5045", "t", "");
    c.requestNodeAddress("tcp://156-56-104-170.bl-dhcp.indiana.edu:5045", "0");

[Diagram. Labels: school.cs.indiana.edu; 156.56.104.170; trex.ucs.indiana.edu; trex.cs.indiana.edu; HPSearch Shell.]

Applications: Invoking Arbitrary Web Services

    /* loanAmt is assumed to be set earlier in the script */
    approved = false;
    userID = "111-22-3333";
    if (loanAmt < 10000)
        approved = true;
    else {
        wsRA = new WSDL("http://www.riskAssessor.com/services/RiskAssessor");
        risk = wsRA.invoke("assessRisk", userID, loanAmt);
        if (risk > 50)
            approved = false;
        else
            approved = true;
    }
    print("Loan Approved: " + approved);

[Flowchart: if loanAmt < 10000, then approved = true; otherwise risk = WS_riskAssessor(userID, loanAmt); if risk > 50, then approved = false, else approved = true; print result.]

Summary
- This thesis addresses
  - Managing data streams (dynamic and static)
  - Connecting data sources and data processing components (available as Web Services) to process data in real time for critical infrastructure applications
  - Developing a general-purpose scripting architecture (like Perl) for a multitude of tasks
- The goal is to create an architecture that is
  - Pluggable / extensible
  - Manageable
  - Programmable
- Similar to the UNIX pipe-filter architecture, but implemented on a distributed scale
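
The pipe-filter comparison can be illustrated on a single machine with a small sketch. It is written in plain modern JavaScript (it relies on generators, so it is not meant for the Rhino-based HPSearch shell), and every name in it (source, keepIf, transform, pipe, the sample observations) is hypothetical. In HPSearch the stages would be services connected by streams rather than in-process generators.

```javascript
// Source: emits records one at a time, like a sensor publishing to a stream.
function* source(records) {
  for (const record of records) {
    yield record;
  }
}

// Filter stage: passes on only the records that satisfy the predicate.
function keepIf(predicate) {
  return function* (input) {
    for (const record of input) {
      if (predicate(record)) {
        yield record;
      }
    }
  };
}

// Transform stage: rewrites each record as it flows past.
function transform(fn) {
  return function* (input) {
    for (const record of input) {
      yield fn(record);
    }
  };
}

// Connects stages left to right, like `a | b | c` in a UNIX shell.
function pipe(input, ...stages) {
  return stages.reduce((stream, stage) => stage(stream), input);
}

// Made-up GPS-like observations, for illustration only.
const observations = [
  { station: "A", estimate: 1.5, error: 0.2, comment: "ok" },
  { station: "B", estimate: null, error: null, comment: "dropout" },
  { station: "A", estimate: 1.7, error: 0.1, comment: "ok" },
];

// Drop incomplete observations, then keep only the estimate and error values.
const filtered = Array.from(
  pipe(
    source(observations),
    keepIf((obs) => obs.estimate !== null),
    transform((obs) => ({ estimate: obs.estimate, error: obs.error }))
  )
);
console.log(JSON.stringify(filtered));
```

Each stage only sees an input stream and produces an output stream, which mirrors how a service behind a WSProxy only sees its input and output streams.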