The Future of MOCHA Nick Roussopoulos October 5, 2001 The Problem Distributed and heterogeneous data sources • Data Sources for an enterprise are: – Distributed • Internet, intranets, extranets – Heterogeneous • Web servers, relational databases, file systems – Mission-critical • Weather service, ocean temperature, stock status, … – Costly to replace or upgrade • Risk of breaking it and loss of investment Stanford Oct 5, 2001 Nick Roussopoulos 2 The Problem Client Client Client Client Client High volume access from everywhere Client Client Client Client Client Client Client Client Client Client Client Client Client Client Client Internet Oracle 8i Stanford Oct 5, 2001 Informix XML Data Nick Roussopoulos Text Data 3 Client-Server 2-tier architecture complex FAT clients Client Client Client Client Client Bad Idea Internet Oracle 8i Stanford Oct 5, 2001 Informix XML Data Nick Roussopoulos Text Data 4 Middleware 3-tier architecture Thin & fit clients Client Client Client Client Client Client Integration Server Catalog Internet Translator Translator Translator Translator Oracle 8i Informix XML Data Text Data Stanford Oct 5, 2001 Nick Roussopoulos 5 Nice but… • Most middleware solutions are static • Not flexible for dynamic environments • Not scalable to hundreds of client and server sites • Development cost is high • One-site-at-a-time at a fixed cost • Maintenance cost is high • Upgrades are practically redevelopments Stanford Oct 5, 2001 Nick Roussopoulos 6 A dynamic world needs Code extensibility & auto-deployment • Need for user-defined types and functions – Polygon – Composite() – image aggregation • Porting and manual installation of code (C/C++) – Operating System – Hardware Platform • High cost of code maintenance – Updates on all platforms – Version management • Security in hostile platforms Stanford Oct 5, 2001 Nick Roussopoulos 7 Code Deployment Problem Client Client Integration Server Catalog Internet Translator Translator Translator Translator Oracle 8i Informix XML Data Text Data Stanford Oct 5, 2001 Nick Roussopoulos 8 Query Processing • Query execution options – Limited by site-dependent software • Composite() – must be ported before use • Most processing done at the Integration Server – Powerful Data Servers are under-utilized • I/O Nodes – Excessive data movement over the network • Network bottleneck • Slow internet access Stanford Oct 5, 2001 Nick Roussopoulos 9 Query Processing Problem Client Client Integration Server 100MB 100MB Translator Stanford Oct 5, 2001 200MB 200MB Translator Translator 100MB Oracle 8i Catalog Internet Translator 200MB Informix XML Data Nick Roussopoulos Text Data 10 Solution MOCHA Middleware Based On a Code SHipping Architecture Stanford Oct 5, 2001 Nick Roussopoulos 11 MOCHA Solution: Ship Java Code Mochlets Code Repository Catalog Oracle Informix QPC DAP DAP Q Q Texas Q Q Client Virginia Stanford Oct 5, 2001 Q Maryland Virginia Q Q Q Q Select location, Composite(image) From Rasters Where week BETWEEN t1 and t2 Group By location Nick Roussopoulos 12 MOCHA Solution: Filter Data @ Source 200MB 100MB tuples Code Repository Catalog Oracle DAP 150KB results Texas 150KB results 150KBQPC 200KB results results Virginia 350KB Informix tuples 200KB DAP results 200KB Maryland results results 350KB results Client Virginia Stanford Oct 5, 2001 Select location, Composite(image) From Rasters Where week BETWEEN t1 and t2 Group By location Nick Roussopoulos 13 Software architecture DBMS DAP Code Repository Catalog QPC OS File DAP Client Stanford Oct 5, 2001 Nick Roussopoulos 14 QPC: The Query Processing Coordinator QPC Controls and Coordinates Query Execution Client API Query Parser XML Catalog Query Optimizer Code Repository Catalog Manager Execution Engine Proc. Interface SQL & XML Code Loader DAP Access API Stanford Oct 5, 2001 DAP Nick Roussopoulos 15 DAP: The Data Access Provider DAP Provides QPC with Remote Access to the Data DAP Access API Control Module Execution Engine Proc. Interface SQL & XML Code Loader Data Source Access Layer JDBC I/O API Stanford Oct 5, 2001 DOM JNI Nick Roussopoulos Data Source 16 Data Server: Storage System • Stores and Manages the data sets – database, web server, file system, XML repository Data Server Stanford Oct 5, 2001 Nick Roussopoulos 17 Processing a Query in MOCHA Query Parsing Resource Discovery Query: Query Optimization Metadata and Control Exchange Select location, Composite(image) From Rasters Where week BETWEEN t1 and t2 Group By location Code Deployment Phase Query Execution Stanford Oct 5, 2001 Table Rasters location image week band Nick Roussopoulos 18 Plan Generation Coordination Thread Client Execution Thread Client Execution Thread QPC Code Repository Catalog DAP DAP Informix Oracle Stanford Oct 5, 2001 Nick Roussopoulos Select location, Composite(image) From Rasters Where week BETWEEN t1 and t2 Group By location 19 Automatic Code Deployment Coordination Thread Client Execution Thread Client Execution Thread QPC Code Repository Catalog DAP DAP Informix Oracle Stanford Oct 5, 2001 Nick Roussopoulos Select location, Composite(image) From Rasters Where week BETWEEN t1 and t2 Group By location 20 Data Processing Coordination Thread Client Execution Thread Client Execution Thread QPC Code Repository Catalog DAP DAP Informix Oracle Stanford Oct 5, 2001 Nick Roussopoulos Select location, Composite(image) From Rasters Where week BETWEEN t1 and t2 Group By location 21 Features of MOCHA • Automatic code deployment • “Plug-N-Play” • no system-wide installations • Metadata and Schema Mapping framework • XML, RDF • easy to exchange and map schemas • semi-automatic mapping • Query optimization based on code shipping – reduce data movement overhead • filters at the source • expands at the client • metrics for code (operator) placement • optimization for selection, union and join plans Stanford Oct 5, 2001 Nick Roussopoulos 22 MOCHA Demo: Global Land Cover Facility • Integrates the following DAP sites – University of New Hampshire (Webster), NASA GSFC, UMD-CS, UMD-Geography, UMD-UMIACS SP-2 HPSS • GLCF hosts the QPC • Operations supported: – – – – Coverage queries Visualization of preview images for Data sets MODIS, TM, AVHRR GIS Features • Dynamic Sub-setting of TM scenes • Composites of GIS Features and AVHRR images Stanford Oct 5, 2001 Nick Roussopoulos 23 Multi-Sensor Analysis of the Los Alamos Fire Event Using MOCHA • Data Synergy and Multi-Resolution Instrument Analysis using MOCHA – Access data residing at various data sources – Utilize image processing tools • Fire Analysis required a multi-resolution approach – MOCHA is independent of instrument or resolution specifics • High Resolution: IKONOS and TM data • Moderate Resolution: 250m MODIS • Coarse Resolution: AVHRR and DMSP Stanford Oct 5, 2001 Nick Roussopoulos 24 MOCHA Search Utility Stanford Oct 5, 2001 Nick Roussopoulos 25 MOCHA Search Utility (cont’d) Stanford Oct 5, 2001 Nick Roussopoulos 26 MOCHA Search Utility (cont’d) Stanford Oct 5, 2001 Nick Roussopoulos 27 MOCHA Query Results Stanford Oct 5, 2001 Nick Roussopoulos 28 MOCHA ETM+ Subsetting Utility Stanford Oct 5, 2001 Nick Roussopoulos 29 May 9, 2000 Los Alamos (Bands 1,2,3) Stanford Oct 5, 2001 Nick Roussopoulos 30 May 9, 2000 Los Alamos (Bands 7,5,4) Stanford Oct 5, 2001 Nick Roussopoulos 31 Multi-Sensor Query Stanford Oct 5, 2001 Nick Roussopoulos 32 Tabular Query Results Stanford Oct 5, 2001 Nick Roussopoulos 33 MODIS: May 11, 2000: During Fire Stanford Oct 5, 2001 Nick Roussopoulos 34 MODIS: May 24, 2000: After Fire Stanford Oct 5, 2001 Nick Roussopoulos 35 DMSP: Night Visibility of Fire Stanford Oct 5, 2001 Nick Roussopoulos 36 IKONOS 4m resolution Stanford Oct 5, 2001 Nick Roussopoulos 37 IKONOS 4m Subset Stanford Oct 5, 2001 Nick Roussopoulos 38 IKONOS 1m resolution Stanford Oct 5, 2001 Nick Roussopoulos 39 IKONOS 1m Subset Stanford Oct 5, 2001 Nick Roussopoulos 40 MOCHA Metadata Publishing Framework • Provides information about system resources • Data sources • schemas and mappings • user-defined types and functions • Automates operation of MOCHA • Incremental system growth • neither fixed nor hardwired parameters • no extension by re-compilation • Share metadata with others (Internet) • machine readable form Stanford Oct 5, 2001 Nick Roussopoulos 41 MOCHA Catalog Organization • Metadata about “resources” – – – – Local and global tables UDF data types and operators Schema mapping rules DAPs • Each one has Uniform Resource Identifier (URI) global namespace – e.g.: mocha://cs1.umd.edu/EarthSci/Polygon • Modeled with RDF, serialized with XML easy to understand, use and exchange Stanford Oct 5, 2001 Nick Roussopoulos 43 RDF Model: Data Types mocha://cs1.umd.edu/EarthSci/Raster user1@cs.umd.edu Raster Raster.class Stanford Oct 5, 2001 cs1.umd.edu/EarthSci Nick Roussopoulos 1 megabyte 44 XML Serialization: Data Types • W3C Standards • Easy to specify using GUI tools • Easy to exchange • Crawlers can harvest it • Stored in – DB – File System Stanford Oct 5, 2001 <rdf:Description about= “mocha://cs1.umd.edu/EarthSci/Raster”> <mocha:Type>Raster</mocha:Type> <mocha:Class> Raster.class </mocha:Class> <mocha:Repository> cs1.umd.edu/EarthSci </mocha:Repository> <mocha:Size> 1 MB</mocha:Size> <mocha:Creator>user1@cs1.umd.edu </mocha:Creator> </rdf:Description> Nick Roussopoulos 45 Other Resources in MOCHA • Local and Global tables – data sources + columns + types • UDF Functions – argument types + return type – code repository • Schema mapping rules • DAPs – URL – login information Stanford Oct 5, 2001 Nick Roussopoulos 46 Schema Mapping in MOCHA • Direct column mappings • Complex Expressions RastersMD point1 point2 photo date band Stanford Oct 5, 2001 Rasters rect() week() Nick Roussopoulos location image week band 47 MOCHA Schema Mapping Rules • Use XML to encode mapping rules • Schema mapping sub-plans – leaf nodes Plan Tree SMP SMP Stanford Oct 5, 2001 <MapList> <mi mapped = “direct”> <mocha:Column>image</mocha:Column> <mocha:Expr>photo</mocha:Expr> </mi> <mi mapped = “expression”> <mocha:Column> location </mocha:Column> <mocha:Expr> rect(point1, point2) </mocha:Expr> </mi> … SMP Nick Roussopoulos 48 MOCHA Optimization Framework • Query optimization based on heuristics • cost = network + CPU + I/O • Network is the dominant factor (WAN) • optimize for it first • CPU and I/O are cheaper • optimize for them later • Operator placement: Enhanced Hybrid Shipping • Code • Data Stanford Oct 5, 2001 Nick Roussopoulos 50 Operator Placement in MOCHA • Data-Reducing Operators – “Filter” the data – aggregates, predicates, projections, semi-joins • Composite(), Overlaps() , AvgEnergy() Composite() Push to the DAPs • Return distilled results • Less data movement Stanford Oct 5, 2001 Nick Roussopoulos 51 Operator Placement in MOCHA • Data-Inflating Operators • “Expand” the data • projections, image processing, some joins … • DoubleResolution(), RotateSolid() DoubleRes() • Pull to the QPC • Data Shipping policy [FJK96] • Only send back raw arguments • Less data movement Stanford Oct 5, 2001 Nick Roussopoulos 52 Placement Metric: VRF Volume Reduction Factor: Given operator f and relation R, then VDT VRF ( f ) VDA •VDT - volume of data transmitted after applying f to R •VDA - volume of data originally present in R f is Data-Reducing VRF < 1 f is Data-Inflating VRF 1 Composite() Stanford Oct 5, 2001 DoubleRes() Nick Roussopoulos 53 Goal: Plans with small CVRF Cumulative Volume Reduction Factor: Given a plan P to solve query Q over relations R1, …, Rn CVDT CVRF ( P) CVDA • CVDT - volume of data transmitted by applying all operators in P to R1, …, Rn • CVDA- volume of data originally present in R1, …, Rn Search Space Optimizer searches for plans that move minimal amount of data. CVRF(Plan) [0,1] Stanford Oct 5, 2001 Nick Roussopoulos 54 MOCHA Query Optimizer • System R style – – – – Left-deep plans (joins at QPC) cost: execution time (network + CPU + I/O) operator placement : VRF and plan cost selections, unions and joins • Placement Policy: Enhanced Hybrid Shipping – Code Shipping: operators at DAPs – Data Shipping: operators at QPC – generalizes Hybrid Shipping [FJK96] Stanford Oct 5, 2001 Nick Roussopoulos 55 Sequoia 2000 Benchmark • Goals of first experiment: – Measure how good code shipping can be – Validate heuristics being proposed • VRF • CVRF • Configured MOCHA with plans that place operators – at DAP with code shipping – at QPC with data shipping Stanford Oct 5, 2001 Nick Roussopoulos 56 Reducing vs. Inflating • Query classes Performance – composites • 99% data reduction • 4-1 better performance – clipping and expansion • 80% data reduction • 3-1 better performance DB Running Time (secs) – Q1: Composite of all images – Q2: Clipping and sub-setting – Q3: Double resolution of images 2000 1800 1600 1400 1200 1000 800 600 400 200 0 NET MISC DAP QPC QPC QPC DAP Q1 DAP Q2 Q3 Query Class Validates heuristics Stanford Oct 5, 2001 CPU Nick Roussopoulos 57 VRF vs. Selectivity DB Consider 50% selectivity • DAP CVRF = 0.01 • QPC CVRF = 1 NET MISC 800 700 600 500 400 300 200 0 .25 .50 .75 DAP QPC DAP QPC DAP QPC DAP DAP 0 QPC 100 QPC • Selectivity and cardinality not enough for distributed predicate placement Running Time (secs) • CPU 1 Selectivity VRF is a better metric Stanford Oct 5, 2001 Nick Roussopoulos 58 WAN Experiment • Sites used: – – – – – University of Maryland (QPC) University of Puerto Rico Oregon Graduate Institute University of North Dakota University of Alabama Stanford Oct 5, 2001 Nick Roussopoulos 59 Union with Data-Reducing Resouce Usage for Q6 700 Usage Time (secs) Execution Time (secs) Execution Time Q6 600 500 400 300 200 100 0 1200 1000 800 600 400 200 0 DS QS DS EHS Execution Policy Q6: Select landuse, location From polygons Where perimeter(location) > 2000.0 Sites: UPR and OGI Stanford Oct 5, 2001 QS EHS Execution Policy • EHS is the better option – Filters data – 2-1 better performance – Minimal resource usage Nick Roussopoulos 60 Union with Reducing and Inflating Resource Usage for Q5 2000 Usage Time (secs) Execution Time (secs) Execution of Q5 1500 1000 500 0 DS QS 3500 3000 2500 2000 1500 1000 500 0 DS EHS Q5: Select landuse, location, triangulate(location) From Polygons Where perimeter(location) > 2000.0 Stanford Oct 5, 2001 EHS Execution Policy Execution Policy Sites: UPR and OGI QS EHS is better than DS and QS • 2-1 better than QS • 6-1 better than DS • Consumes least resources Nick Roussopoulos 61 Join with Data-Reducing Resouce Usage for Q8 700 600 500 400 300 200 100 0 Usage Time (secs) Execution Tim e (secs) Execution Time Q8 DS QS EHS 1400 1200 1000 800 600 400 200 0 DS Execution Policy Stanford Oct 5, 2001 EHS Execution Policy Q8: Select P.landuse, R.location, R.week From polygons P, rasters R Where overlaps(P.location, R.location) And perimeter(P.location) > 2000.0 Sites: UPR and OGI QS • EHS is the better option • 3-1 better performance – Minimal resource usage • Same pattern as with unions – Data movement is the key Nick Roussopoulos 62 MOCHA System Status • Operational MOCHA prototype – It’s real! – over 40,000 lines of 100% Java code (JDK 1.3) – People involved: • Manuel Rodriguez-Martinez (lead) • • • • • Mike McGann Steve Kelley Vadim Katz John Towshend, Frank Lindsay, Ben White (Geographers) Joseph JaJa (Algorithms) – Tested with NASA ESIP Federation • Los Alamos fire – Supports: Oracle, Postgres, Informix, Sybase, HPSS Stanford Oct 5, 2001 Nick Roussopoulos 64 Features of MOCHA • • • • Automatic Code Deployment Scalable middleware architecture Query optimization based on data movement reduction Metadata publishing framework [RMR00a] • RDF and XML • Publish schemas, mappings, types and functions • Drives automatic code deployment • Schema mapping rules expressed in XML • attach as leaf nodes in query plan • extensible Stanford Oct 5, 2001 Nick Roussopoulos 65 MOCHA Publications • Research papers and talks – ACM SIGMOD 2000 – EDBT 2000 • Demos – ACM SIGMOD 2000 – SSDBM 2001 – NASA ESIP meetings and workshops – U.S. National Academy of Sciences Stanford Oct 5, 2001 Nick Roussopoulos 66 The Future of MOCHA A Million Site MOCHA Stanford Oct 5, 2001 Nick Roussopoulos 67 The Future of MOCHA • The role of MOCHA in distributed software systems – – – – – – – sensors satellites network switches and routers laptops, palm computers custom-built devices cars, planes, boats people (fireman), animals (whales) Stanford Oct 5, 2001 Nick Roussopoulos 68 Network of MOCHA enabled sensors • Sensors are deployed in an area using ad hoc network techniques DAP DAP • Sensors run Java JDK 1.3 DAP DAP DAP DAP • Lighter Sensors run Java JDK 1.3 Micro Edition DAP DAP DAP DAP DAP DAP DAP Stanford Oct 5, 2001 Nick Roussopoulos 69 Organization of sensors • Sensors are grouped together for specific goal or service • data acquisition • data aggregation, analysis • data streaming • Group leaders are responsible for Leader Normal Sensor Groups • establishing themselves (broadcast, voting, …) • coordination among sensors • making decisions (agents) • participate in other higher level groups (hybrid P2P) Stanford Oct 5, 2001 Nick Roussopoulos 70 Concrete Example (from NASA) • Constellation of Satellites (with sensors) • A group observes Gamma radiation – aggregates measurements – determines an important radiation event • Group leader tells other peer group leaders to instruct their sensors to observe the Gamma radiation event (reaction). • system adapts to changes in the environment Stanford Oct 5, 2001 Nick Roussopoulos 71 MOCHAs Code Shipping feature for • upgrades to fix bugs • fresh code to gather data – at different resolution – new aggregates or functions • dynamically configured code – application-specific security protocol – location-dependent encryption Stanford Oct 5, 2001 Nick Roussopoulos 72