The Future of MOCHA Nick Roussopoulos October 5, 2001

The Problem: Distributed and heterogeneous data sources
• Data Sources for an enterprise are:
– Distributed
• Internet, intranets, extranets
– Heterogeneous
• Web servers, relational databases, file systems
– Mission-critical
• Weather service, ocean temperature, stock status, …
– Costly to replace or upgrade
• Risk of breaking it and loss of investment
Stanford Oct 5, 2001
Nick Roussopoulos
The Problem
High volume access from everywhere
[Figure: many clients reach Oracle 8i, Informix, XML data and text data sources over the Internet.]
Client-Server 2-tier architecture: complex FAT clients
Bad Idea
[Figure: fat clients connect over the Internet directly to Oracle 8i, Informix, XML data and text data sources.]
Middleware 3-tier architecture: Thin & fit clients
[Figure: clients, an Integration Server with a Catalog, the Internet, and one Translator in front of each data source (Oracle 8i, Informix, XML data, text data).]
Nice but…
• Most middleware solutions are static
• Not flexible for dynamic environments
• Not scalable to hundreds of client and server sites
• Development cost is high
• One-site-at-a-time at a fixed cost
• Maintenance cost is high
• Upgrades are practically redevelopments
A dynamic world needs Code extensibility & auto-deployment
• Need for user-defined types and functions
– Polygon
– Composite() – image aggregation
• Porting and manual installation of code (C/C++)
– Operating System
– Hardware Platform
• High cost of code maintenance
– Updates on all platforms
– Version management
• Security in hostile platforms
Code Deployment Problem
[Figure: clients, an Integration Server with a Catalog, the Internet, and one Translator in front of each data source (Oracle 8i, Informix, XML data, text data).]
Query Processing
• Query execution options
– Limited by site-dependent software
• Composite() – must be ported before use
• Most processing done at the Integration Server
– Powerful Data Servers are under-utilized
• I/O Nodes
– Excessive data movement over the network
• Network bottleneck
• Slow internet access
Query Processing Problem
[Figure: the same 3-tier layout (clients, Integration Server with Catalog, Internet, Translators, Oracle 8i, Informix, XML data, text data), with links labeled 100MB and 200MB to show raw data moving from the sources to the Integration Server.]
Solution
MOCHA
Middleware Based On a Code SHipping Architecture
MOCHA Solution: Ship Java Code Mochlets
[Figure: a client, the QPC with its Code Repository and Catalog, and two DAPs in front of Oracle and Informix; the query Q travels along every link; site labels: Texas, Maryland, Virginia.]
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
MOCHA Solution: Filter Data @ Source
[Figure: the DAPs read 100MB and 200MB of tuples from Oracle and Informix, send 150KB and 200KB of results to the QPC, and the QPC returns 350KB of results to the client; site labels: Texas, Maryland, Virginia.]
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
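The arithmetic behind this slide can be sketched in a few lines of Python. The volumes are the ones shown in the figure; the function name, the 1 MB = 1024 KB conversion, and the idea that each byte is counted once on the source-to-QPC link are illustrative assumptions, not part of MOCHA.

```python
KB = 1
MB = 1024 * KB  # assumption: binary megabytes


def network_volume_kb(source_volumes_kb, result_volumes_kb, ship_code):
    """Volume sent from the sources to the QPC, in KB.

    With data shipping the raw tuples travel over the network; with code
    shipping only the results computed at the source travel.
    """
    return sum(result_volumes_kb if ship_code else source_volumes_kb)


tuples = [100 * MB, 200 * MB]   # raw tuples read at the two sources
results = [150 * KB, 200 * KB]  # Composite() results produced at the DAPs

data_shipping = network_volume_kb(tuples, results, ship_code=False)
code_shipping = network_volume_kb(tuples, results, ship_code=True)
print(data_shipping, code_shipping)
```

Under these assumptions, evaluating Composite() at the DAPs moves 350 KB instead of roughly 300 MB.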
Software architecture
[Figure: a client, the QPC with its Code Repository and Catalog, and two DAPs, one in front of a DBMS and one in front of an OS file.]
QPC: The Query Processing Coordinator
QPC Controls and Coordinates Query Execution
[Figure: QPC components: Client API, Query Parser, Query Optimizer, Catalog Manager, Execution Engine (Proc. Interface, SQL & XML), Code Loader and DAP Access API, shown alongside the XML Catalog, the Code Repository and a DAP.]
DAP: The Data Access Provider
DAP Provides QPC with Remote Access to the Data
[Figure: DAP components: DAP Access API, Control Module, Execution Engine (Proc. Interface, SQL & XML), Code Loader and Data Source Access Layer (JDBC, I/O API, DOM, JNI), sitting on top of the Data Source.]
Data Server: Storage System
• Stores and Manages the data sets
– database, web server, file system, XML repository
Processing a Query in MOCHA
• Query Parsing
• Resource Discovery
• Query Optimization
• Metadata and Control Exchange
• Code Deployment Phase
• Query Execution
Query:
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
Table Rasters: location, image, week, band
Plan Generation
[Figure: two clients, a QPC with a Coordination Thread and one Execution Thread per client, the Code Repository, the Catalog, and two DAPs in front of Informix and Oracle.]
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
Automatic Code Deployment
[Figure: two clients, a QPC with a Coordination Thread and one Execution Thread per client, the Code Repository, the Catalog, and two DAPs in front of Informix and Oracle.]
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
Data Processing
[Figure: two clients, a QPC with a Coordination Thread and one Execution Thread per client, the Code Repository, the Catalog, and two DAPs in front of Informix and Oracle.]
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
Features of MOCHA
• Automatic code deployment
• “Plug-N-Play”
• no system-wide installations
• Metadata and Schema Mapping framework
• XML, RDF
• easy to exchange and map schemas
• semi-automatic mapping
• Query optimization based on code shipping
– reduce data movement overhead
• filters at the source
• expands at the client
• metrics for code (operator) placement
• optimization for selection, union and join plans
MOCHA Demo: Global Land Cover Facility
• Integrates the following DAP sites
– University of New Hampshire (Webster), NASA GSFC, UMD-CS,
UMD-Geography, UMD-UMIACS SP-2 HPSS
• GLCF hosts the QPC
• Operations supported:
– Coverage queries
– Visualization of preview images for
– Data sets MODIS, TM, AVHRR
– GIS Features
• Dynamic Sub-setting of TM scenes
• Composites of GIS Features and AVHRR images
Multi-Sensor Analysis of the Los Alamos Fire Event Using MOCHA
• Data Synergy and Multi-Resolution Instrument Analysis using MOCHA
– Access data residing at various data sources
– Utilize image processing tools
• Fire Analysis required a multi-resolution approach
– MOCHA is independent of instrument or resolution specifics
• High Resolution: IKONOS and TM data
• Moderate Resolution: 250m MODIS
• Coarse Resolution: AVHRR and DMSP
MOCHA Search Utility
MOCHA Search Utility (cont’d)
MOCHA Search Utility (cont’d)
MOCHA Query Results
MOCHA ETM+ Subsetting Utility
May 9, 2000 Los Alamos (Bands 1,2,3)
May 9, 2000 Los Alamos (Bands 7,5,4)
Multi-Sensor Query
Tabular Query Results
MODIS: May 11, 2000: During Fire
MODIS: May 24, 2000: After Fire
DMSP: Night Visibility of Fire
IKONOS 4m resolution
IKONOS 4m Subset
IKONOS 1m resolution
IKONOS 1m Subset
MOCHA Metadata Publishing Framework
• Provides information about system resources
• Data sources
• schemas and mappings
• user-defined types and functions
• Automates operation of MOCHA
• Incremental system growth
• neither fixed nor hardwired parameters
• no extension by re-compilation
• Share metadata with others (Internet)
• machine readable form
MOCHA Catalog Organization
• Metadata about “resources”
– Local and global tables
– UDF data types and operators
– Schema mapping rules
– DAPs
• Each one has a Uniform Resource Identifier (URI)
→ global namespace
– e.g.: mocha://cs1.umd.edu/EarthSci/Polygon
• Modeled with RDF, serialized with XML
→ easy to understand, use and exchange
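As a rough illustration of how such a URI could be taken apart, the sketch below uses Python's standard `urllib.parse`. The reading of the parts (host, namespace path, resource name) and the function name `split_mocha_uri` are assumptions made for this example; the slides do not define the URI grammar.

```python
from urllib.parse import urlparse


def split_mocha_uri(uri):
    """Split a mocha:// URI into (host, namespace segments, resource name).

    The three-way split is an assumed interpretation, not MOCHA's own API.
    """
    parts = urlparse(uri)
    if parts.scheme != "mocha":
        raise ValueError(f"not a mocha URI: {uri!r}")
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        raise ValueError(f"URI names no resource: {uri!r}")
    return parts.netloc, segments[:-1], segments[-1]


host, namespace, name = split_mocha_uri("mocha://cs1.umd.edu/EarthSci/Polygon")
print(host, namespace, name)
```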
RDF Model: Data Types
[Figure: RDF graph for the resource mocha://cs1.umd.edu/EarthSci/Raster with the values Raster, Raster.class, cs1.umd.edu/EarthSci, 1 megabyte and user1@cs.umd.edu.]
XML Serialization: Data Types
• W3C Standards
• Easy to specify using GUI tools
• Easy to exchange
• Crawlers can harvest it
• Stored in
– DB
– File System
<rdf:Description about="mocha://cs1.umd.edu/EarthSci/Raster">
  <mocha:Type>Raster</mocha:Type>
  <mocha:Class>Raster.class</mocha:Class>
  <mocha:Repository>cs1.umd.edu/EarthSci</mocha:Repository>
  <mocha:Size>1 MB</mocha:Size>
  <mocha:Creator>user1@cs1.umd.edu</mocha:Creator>
</rdf:Description>
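A catalog entry like the one above could be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`. The enclosing `rdf:RDF` element and the `mocha` namespace URI are assumptions added so that the fragment parses on its own; the slide shows neither, and the function name `read_type_entries` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MOCHA_NS = "http://example.org/mocha#"  # hypothetical namespace URI

DOC = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:mocha="{MOCHA_NS}">
  <rdf:Description about="mocha://cs1.umd.edu/EarthSci/Raster">
    <mocha:Type>Raster</mocha:Type>
    <mocha:Class>Raster.class</mocha:Class>
    <mocha:Repository>cs1.umd.edu/EarthSci</mocha:Repository>
    <mocha:Size>1 MB</mocha:Size>
    <mocha:Creator>user1@cs1.umd.edu</mocha:Creator>
  </rdf:Description>
</rdf:RDF>"""


def read_type_entries(xml_text):
    """Return {resource URI: {property name: value}} for each Description."""
    root = ET.fromstring(xml_text)
    entries = {}
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        # Strip the "{namespace}" prefix ElementTree puts on each tag.
        props = {child.tag.split("}", 1)[1]: (child.text or "").strip()
                 for child in desc}
        entries[desc.get("about")] = props
    return entries


catalog = read_type_entries(DOC)
print(catalog["mocha://cs1.umd.edu/EarthSci/Raster"]["Class"])
```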
Other Resources in MOCHA
• Local and Global tables
– data sources + columns + types
• UDF Functions
– argument types + return type
– code repository
• Schema mapping rules
• DAPs
– URL
– login information
Schema Mapping in MOCHA
• Direct column mappings
• Complex Expressions
[Figure: local table RastersMD (point1, point2, photo, date, band) mapped to global table Rasters (location, image, week, band), with rect() and week() as the complex expressions.]
MOCHA Schema Mapping Rules
• Use XML to encode mapping rules
• Schema mapping sub-plans (SMP)
– leaf nodes
[Figure: plan tree with SMPs as its leaf nodes.]
<MapList>
  <mi mapped="direct">
    <mocha:Column>image</mocha:Column>
    <mocha:Expr>photo</mocha:Expr>
  </mi>
  <mi mapped="expression">
    <mocha:Column>location</mocha:Column>
    <mocha:Expr>rect(point1, point2)</mocha:Expr>
  </mi>
  …
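To make the two rule kinds concrete, here is a minimal Python sketch that loads the two rules shown on the slide and applies them to one local row. Everything beyond the rule text is assumed for illustration: the `mocha` namespace URI, the closing `</MapList>` tag, the bounding-box behavior of `rect()`, the naive expression parsing, and the helper names `load_rules` and `map_row`.

```python
import xml.etree.ElementTree as ET

MOCHA_NS = "http://example.org/mocha#"  # hypothetical namespace URI

RULES = f"""<MapList xmlns:mocha="{MOCHA_NS}">
  <mi mapped="direct">
    <mocha:Column>image</mocha:Column>
    <mocha:Expr>photo</mocha:Expr>
  </mi>
  <mi mapped="expression">
    <mocha:Column>location</mocha:Column>
    <mocha:Expr>rect(point1, point2)</mocha:Expr>
  </mi>
</MapList>"""


def rect(p1, p2):
    """Hypothetical stand-in for rect(): bounding box from two corners."""
    return (min(p1[0], p2[0]), min(p1[1], p2[1]),
            max(p1[0], p2[0]), max(p1[1], p2[1]))


FUNCTIONS = {"rect": rect}


def load_rules(xml_text):
    """Return a list of (kind, global column, expression) tuples."""
    rules = []
    for mi in ET.fromstring(xml_text).findall("mi"):
        column = mi.find(f"{{{MOCHA_NS}}}Column").text.strip()
        expr = mi.find(f"{{{MOCHA_NS}}}Expr").text.strip()
        rules.append((mi.get("mapped"), column, expr))
    return rules


def map_row(local_row, rules):
    """Build a global-schema row from a local row."""
    out = {}
    for kind, column, expr in rules:
        if kind == "direct":
            out[column] = local_row[expr]
        else:
            # Naive parse of "name(arg1, arg2)"; enough for this example.
            name, args = expr.rstrip(")").split("(", 1)
            values = [local_row[a.strip()] for a in args.split(",")]
            out[column] = FUNCTIONS[name](*values)
    return out


row = {"point1": (0, 5), "point2": (10, 0), "photo": b"pixels"}
mapped = map_row(row, load_rules(RULES))
print(mapped)
```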
MOCHA Optimization Framework
• Query optimization based on heuristics
• cost = network + CPU + I/O
• Network is the dominant factor (WAN)
• optimize for it first
• CPU and I/O are cheaper
• optimize for them later
• Operator placement: Enhanced Hybrid Shipping
• Code
• Data
Operator Placement in MOCHA
• Data-Reducing Operators
– “Filter” the data
– aggregates, predicates, projections, semi-joins
• Composite(), Overlaps() , AvgEnergy()
→ Push to the DAPs
• Return distilled results
• Less data movement
Operator Placement in MOCHA
• Data-Inflating Operators
• “Expand” the data
• projections, image processing, some joins …
• DoubleResolution(), RotateSolid()
• Pull to the QPC
• Data Shipping policy [FJK96]
• Only send back raw arguments
• Less data movement
Placement Metric: VRF
Volume Reduction Factor: given operator f and relation R,
$$\mathrm{VRF}(f) = \frac{VDT}{VDA}$$
• VDT - volume of data transmitted after applying f to R
• VDA - volume of data originally present in R
f is Data-Reducing ⇔ VRF < 1 (e.g. Composite())
f is Data-Inflating ⇔ VRF ≥ 1 (e.g. DoubleRes())
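The definition translates directly into code. The sketch below is only an illustration: the function names are made up, and the volumes in the two examples are hypothetical numbers, not measurements from MOCHA.

```python
def vrf(vdt, vda):
    """Volume Reduction Factor: data transmitted after f / data present in R."""
    if vda <= 0:
        raise ValueError("VDA must be positive")
    return vdt / vda


def classify(vrf_value):
    """Data-reducing when VRF < 1, data-inflating otherwise."""
    return "data-reducing" if vrf_value < 1 else "data-inflating"


# Hypothetical volumes, in arbitrary units.
reducing = vrf(vdt=1, vda=100)     # an aggregate that keeps 1% of the input
inflating = vrf(vdt=400, vda=100)  # an operator whose output is 4x its input
print(classify(reducing), classify(inflating))
```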
Goal: Plans with small CVRF
Cumulative Volume Reduction Factor: given a plan P to solve query Q over relations R1, …, Rn,
$$\mathrm{CVRF}(P) = \frac{CVDT}{CVDA}$$
• CVDT - volume of data transmitted by applying all operators in P to R1, …, Rn
• CVDA - volume of data originally present in R1, …, Rn
The optimizer searches the search space for plans that move a minimal amount of data: $\mathrm{CVRF}(\mathrm{Plan}) \in [0,1]$.
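A plan-level version of the same ratio can be sketched as follows. The two candidate plans, their per-relation transfer volumes, and the names `cvrf`, `plans` and `best` are hypothetical; a real optimizer would estimate these volumes rather than be handed them.

```python
def cvrf(transmitted, original):
    """Cumulative Volume Reduction Factor of a plan.

    transmitted: volume moved over the network for each relation under the plan
    original:    volume originally present in each relation
    """
    return sum(transmitted) / sum(original)


sizes = [100, 200]  # hypothetical sizes of R1 and R2, arbitrary units

# Hypothetical candidate plans and the volume each one ships per relation.
plans = {
    "data_shipping": [100, 200],  # raw relations travel to the QPC
    "code_shipping": [1, 2],      # operators run at the DAPs, results travel
}

# Pick the plan that moves the least data relative to what is stored.
best = min(plans, key=lambda name: cvrf(plans[name], sizes))
print(best, cvrf(plans[best], sizes))
```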
MOCHA Query Optimizer
• System R style
– Left-deep plans (joins at QPC)
– cost: execution time (network + CPU + I/O)
– operator placement: VRF and plan cost
– selections, unions and joins
• Placement Policy: Enhanced Hybrid Shipping
– Code Shipping: operators at DAPs
– Data Shipping: operators at QPC
– generalizes Hybrid Shipping [FJK96]
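The placement rule can be sketched per operator. This is a deliberate simplification: the slide says placement uses both VRF and plan cost, while the sketch looks at VRF alone. The `Operator` class, the `place` function and the VRF values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    vrf: float  # estimated Volume Reduction Factor (hypothetical values below)


def place(op):
    """Code shipping (run at the DAP) for data-reducing operators,
    data shipping (run at the QPC) for data-inflating ones."""
    return "DAP" if op.vrf < 1 else "QPC"


operators = [Operator("Composite", 0.01), Operator("DoubleResolution", 4.0)]
placement = {op.name: place(op) for op in operators}
print(placement)
```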
Sequoia 2000 Benchmark
• Goals of first experiment:
– Measure how good code shipping can be
– Validate heuristics being proposed
• VRF
• CVRF
• Configured MOCHA with plans that place operators
– at DAP with code shipping
– at QPC with data shipping
Reducing vs. Inflating
• Query classes
– Q1: Composite of all images
– Q2: Clipping and sub-setting
– Q3: Double resolution of images
→ Performance
– composites
• 99% data reduction
• 4-1 better performance
– clipping and expansion
• 80% data reduction
• 3-1 better performance
[Figure: bar chart of running time (secs, 0 to 2000) at DAP vs. QPC for query classes Q1, Q2 and Q3, broken down into DB, CPU, NET and MISC.]
→ Validates heuristics
VRF vs. Selectivity
• Selectivity and cardinality not enough for distributed predicate placement
• Consider 50% selectivity
– DAP → CVRF = 0.01
– QPC → CVRF = 1
[Figure: bar chart of running time (secs, 0 to 800) at DAP vs. QPC for selectivities 0, .25, .50, .75 and 1, broken down into DB, CPU, NET and MISC.]
→ VRF is a better metric
WAN Experiment
• Sites used:
– University of Maryland (QPC)
– University of Puerto Rico
– Oregon Graduate Institute
– University of North Dakota
– University of Alabama
Union with Data-Reducing
Q6:
Select landuse, location
From polygons
Where perimeter(location) > 2000.0
Sites: UPR and OGI
[Figure: two bar charts, "Execution Time Q6" (execution time in secs) and "Resource Usage for Q6" (usage time in secs), each by execution policy: DS, QS, EHS.]
• EHS is the better option
– Filters data
– 2-1 better performance
– Minimal resource usage
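To show why the Q6 predicate is data-reducing, the sketch below evaluates it the way a DAP might, so that only qualifying rows would be sent on. The polygon representation (a list of (x, y) vertices), the sample rows and the function names are assumptions for illustration; MOCHA's actual perimeter() operator is not shown in the slides.

```python
import math


def perimeter(vertices):
    """Perimeter of a closed polygon given as a list of (x, y) vertices."""
    closed = vertices[1:] + vertices[:1]
    return sum(math.dist(a, b) for a, b in zip(vertices, closed))


def q6_at_dap(rows, threshold=2000.0):
    """Apply Q6's projection and predicate at the source."""
    return [(r["landuse"], r["location"])
            for r in rows if perimeter(r["location"]) > threshold]


# Hypothetical rows: one large square (perimeter 2400), one small (400).
rows = [
    {"landuse": "forest", "location": [(0, 0), (600, 0), (600, 600), (0, 600)]},
    {"landuse": "urban", "location": [(0, 0), (100, 0), (100, 100), (0, 100)]},
]
shipped = q6_at_dap(rows)
print(len(shipped), "of", len(rows), "rows leave the DAP")
```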
Union with Reducing and Inflating
Q5:
Select landuse, location,
triangulate(location)
From Polygons
Where perimeter(location) > 2000.0
Sites: UPR and OGI
[Figure: two bar charts, "Execution of Q5" (execution time in secs) and "Resource Usage for Q5" (usage time in secs), each by execution policy: DS, QS, EHS.]
→ EHS is better than DS and QS
• 2-1 better than QS
• 6-1 better than DS
• Consumes least resources
Join with Data-Reducing
Q8:
Select P.landuse, R.location, R.week
From polygons P, rasters R
Where overlaps(P.location, R.location)
And perimeter(P.location) > 2000.0
Sites: UPR and OGI
[Figure: two bar charts, "Execution Time Q8" (execution time in secs) and "Resource Usage for Q8" (usage time in secs), each by execution policy: DS, QS, EHS.]
• EHS is the better option
– 3-1 better performance
– Minimal resource usage
• Same pattern as with unions
– Data movement is the key
MOCHA System Status
• Operational MOCHA prototype
– It’s real!
– over 40,000 lines of 100% Java code (JDK 1.3)
– People involved:
• Manuel Rodriguez-Martinez (lead)
• Mike McGann
• Steve Kelley
• Vadim Katz
• John Townshend, Frank Lindsay, Ben White (Geographers)
• Joseph JaJa (Algorithms)
– Tested with NASA ESIP Federation
• Los Alamos fire
– Supports: Oracle, Postgres, Informix, Sybase, HPSS
Features of MOCHA
• Automatic Code Deployment
• Scalable middleware architecture
• Query optimization based on data movement reduction
• Metadata publishing framework [RMR00a]
• RDF and XML
• Publish schemas, mappings, types and functions
• Drives automatic code deployment
• Schema mapping rules expressed in XML
• attach as leaf nodes in query plan
• extensible
MOCHA Publications
• Research papers and talks
– ACM SIGMOD 2000
– EDBT 2000
• Demos
– ACM SIGMOD 2000
– SSDBM 2001
– NASA ESIP meetings and workshops
– U.S. National Academy of Sciences
The Future of MOCHA
A Million Site MOCHA
The Future of MOCHA
• The role of MOCHA in distributed software systems
– sensors
– satellites
– network switches and routers
– laptops, palm computers
– custom-built devices
– cars, planes, boats
– people (firemen), animals (whales)
Network of MOCHA-enabled sensors
• Sensors are deployed in an area using ad hoc network techniques
• Sensors run Java JDK 1.3
• Lighter Sensors run Java JDK 1.3 Micro Edition
[Figure: a field of sensors, each labeled DAP.]
Organization of sensors
• Sensors are grouped together for a specific goal or service
– data acquisition
– data aggregation, analysis
– data streaming
• Group leaders are responsible for
– establishing themselves (broadcast, voting, …)
– coordination among sensors
– making decisions (agents)
– participating in other higher level groups (hybrid P2P)
[Figure: groups of normal sensors, each group with a leader.]
Concrete Example (from NASA)
• Constellation of Satellites (with sensors)
• A group observes Gamma radiation
– aggregates measurements
– determines an important radiation event
• Group leader tells other peer group leaders to instruct
their sensors to observe the Gamma radiation event
(reaction).
• system adapts to changes in the environment
MOCHA's Code Shipping feature for
• upgrades to fix bugs
• fresh code to gather data
– at different resolution
– new aggregates or functions
• dynamically configured code
– application-specific security protocol
– location-dependent encryption