URL: http://www.csee.umbc.edu/~kalpakis/monet
• Motivation
– Many catalogs in astronomy, few old and most of the new ones, contain data for very large number of spatial objects
– Fast methods to access these catalogs are needed
– Budget constraints limit the amount of $$ allocated to S/W and H/W to support fast access to the catalogs
– Thus, low-cost solutions that achieve efficient and effective access to very large catalogs are needed
• We demonstrate
– how to achieve fast access to catalog data on a low cost desktop for a very large catalog using an Object-Relational Database System (ORDBMS)
– how to enhance the functionality of the system by extending the
ORDBMS with IDL
2
• We use the USNO-A2.0 catalog (called the Monet catalog here) which has approximately 500 million stars (points)
Image of the Monet catalog found at: http:/www.nofs.navy.mil/projects/pmm/universe.html
3
•
We experiment with a sample from the Monet catalog and the following four approaches (schemas/datablades/indexes)
– Relational only
• Traditional relational datatypes with B-tree indexes (clustered) on dec-RA and RA-dec ( B schemas )
– Object-Relational
• Datatypes from the Geodetic Datablade (created for the EOSDIS project for geo-location) with an R-tree index on geo-location ( G schemas )
• Datatypes from Shapes2 Informix datablade with R-tree index on points
( S schemas )
• Datatypes from our custom-build
SimpleShape datablade with R-tree index on points built bottom up using Hilbert coding of the point’s coordinates ( I schemas )
4
•
Spatial Window (Online catalog access)
Find all stars within a user-defined bounding rectangle
•
Spatial Or-Window (Online catalog access)
Find all the stars within at least one of two different bounding rectangles
•
Spatial Self-Join (Catalog mining)
Find all stars that are within a specified box centered on each star
• Spatial Multi-Join (Catalog Correlations)
– Spatial chain-join
Find a spatial region in a “chain-joined” sequence of tables (catalogs)
– Spatial Star-join
Find a spatial region in a “star-joined” sequence of tables (catalogs)
5
• The custom-made datablade SimpleShape extends the Informix Dynamic
Server 2000 by providing
– a fixed length opaque type (UDT) storing 4 small floats (16 bytes total)
– full R-tree index support including new bottom-up index building
– full B-tree index support based on 2 dimensional Hilbert Curves
– the
SimplePoint UDT holds the coordinates of a 2 dimensional point
– the
SimpleBox UDT holds the lower left and upper right SimplePoints of a rectangle to make a box
– input/output functions and constructors
6
• Each star in the catalog comes packaged as three integers.
– (int RA, int DEC, int MAG)
– RA = (RA in decimal hours) * 15 * 3600 * 100
– DEC = ((DEC in decimal degrees) + 90) * 3600 * 100
– MAG = combination of field , blue , red values plus flags
•
The catalog data are stored in a database table with schema (coordinates
SimplePoint, magnitude integer)
– the actual data in the table are stored in their original (compressed) format as above
– user-defined functions (UDFs) change coords and magnitude into meaningful fields
7
• the RA and DEC fields are stored in a SimplePoint point (dec, ra) of type
SimplePoint
• the data format is not changed from packaged monet data
• data conversion takes place in the DBMS when displaying results
• the MAG field is stored as a single integer field in DBMS directly
• additional utility UDFs are created
– Ra() takes a SimplePoint and returns RA in decimal hours
– Dec() takes a SimplePoint and returns DEC in decimal degrees
– MonetBox() contructs a SimpleBox from RA and DEC in decimal hours and decimal degrees respectively
– Blue() takes MAG and returns the blue magnitude
– Red() takes MAG and returns the red magnitude
– Field() takes MAG and returns the field of the star
8
• Custom High Performance Loader
– Built from the Virtual Table Interface (VTI) of Informix Server
– Maps binary OS file into database table
– Only select queries with NO where clause can be issued on table
• Reasons for development
– Lack of Informix High Performance Loader for PC platform
– Bug in regular table load command for PC platform
– USNO-A2.0 is distributed in binary format
– absense of string conversions make loading fast
9
• R-tree index on coordinates
– built using new bottom-up algorithm and the 2nd order Hilbert values of the points
– allows fast window queries on entire catalog
– faster build times than B-tree index
– takes about 1 day to index the entire catalog
• Second order Hilbert values
– space filling curve converts two dimensions into one dimension
– points are sorted based on their Hilbert value and stored adjacently in R-tree leaf pages
– Hilbert value code and R-tree bottom-up index building algorithms are built into SimpleShape software
10
• Sample SQL statements using the SimpleShape datablade
– create table test (product_id int, location SimplePoint)
– insert into test values (1, SimplePoint(2.333,5.77))
– insert into test values (2, ‘point(4.123, 8.234)’ )
– create table products (product_id int, location_x int, location_y int)
– insert into test select product_id, SimplePoint(location_x, location_y) from products
– select * from test where within (location, ‘box(0, 0, 100, 100)’ )
• Sample SQL statements for the USNO-A2.0 specialization
– create table monet_data(coords SimplePoint, mag int)
– select Ra(coords), Dec(coords), Red(mag), Blue(mag), Field(mag) from monet_data where within (coords, monetbox (-3.22, 22.45, 1.23, 22.78) )
– select (Blue(mag) - Red(mag)) as energy, count(*) as count from monet_data where within (coords, monetbox (-3.22, 22.45, 1.23, 22.78) ) group by energy order by count desc
11
• Custom High Performance Loader
– create table load_data(value1 int, value2 int) using vti_load(file=‘/tmp/data.bin’);
– select * from load_data
– insert into real_table select * from load_data
• Loading Monet Data
– create table monet_all (coords SimplePoint, mag int)
– create table monet_load(ra int, dec int, mag int) using vti_load(file=‘/tmp/zone0000.cat’)
– insert into monet_all select SimplePoint(dec,ra), mag from monet_load
• Indexing Monet Data
– create index monet_all_idx on monet_all (coords SIMPLE_OPS) using rtree
• Selecting Monet Data
– select * from monet_all where within (coords, MonetBox(dec,ra,dec,ra))
12
•
The experiments with our custom datablade were done on a system with the following configuration
– Hardware (costing less than $2K)
• AMD Duron 650MHz Processor
• FIC model AZ11 motherboard w/IDE ATA-66 controller
• 256MB of PC100 RAM
• Two IBM Ultra/100 7200RPM IDE disks model DTLA307045
– Software
• Linux Mandrake version 7.0
• Updated Linux Kernel version 2.4.0-test4
• Informix Dynamic Server 2000 version 9.20.UC1
• Custom datablade and data loading software
13
• The experiments with other datablades and the traditional relational solution were done on a system with the following configuration
– Hardware
• Sun UltraSparc 60 with 512 MB memory
• two 8GB and two 4GB SCSI disks
– Software
• Solaris 2.6
• Informix Dynamic Server 9.14 with the Universal DataServer Option
• Geodetic and Shapes2 datablades
14
Elapsed time to load data from the Monet catalog
Schema 100K stars 500M stars
Relational
Geodetic
Shapes2
SimpleShape
45.829 secs
233.618 secs
161.714 secs
27.216 secs
65 hrs
334 hrs
226 hrs
32 hrs
Elapsed time in creating R-tree indexes for the three datablades
Schema
Geodetic
Shapes2
SimpleShape
100K stars
448 secs
256 secs
8 secs
500M stars
607 hrs
371 hrs
18 hrs
15
• Loading and Indexing all the 500M stars
Schema
Relational
Geodetic
Shapes2
SimpleShape
Load Time Index Time
B-tree
3
14
10
1
6
90
57
Size Index Size
R-tree
13GB
25 190GB
B-tree
56GB
66GB
R-tree
104GB
15 140GB
1 15GB
86GB
14GB
77GB
20GB
All times in days
The numbers for the Relational, Geodetic, and Shapes2 schemas are estimates based on regression from measurements for up to 140K stars .
•
Beware: Results are not for the faint of heart!
16
Loading times for the Monet catalog using the SimpleShape datablade
17
Bulding an R-tree index for the Monet catalog using the SimpleShape datablade
Elapsed time
Page Reads
18
• Number of page reads for various queries and schemas for 60K stars
•
The query window(s) used are 0.0666x0.0666 decimal degrees
Query
Spatial Self-Join
Spatial Or-Window
2-Chain Spatial Join
3-Star Spatial Join
Schema
Relational Geodetic Shapes2 SimpleShape
4440
22
1005023
35
60443
44
6775
14
23
24
49
50
87
43
19
12
19
• Performance of the custom-datablade (schema) for queries against a table with
60K stars
Query
Spatial Window
Spatial Self-Join
Spatial OR Window
2-Chain Spatial Join
3-Chain Spatial Join
4-Chain Spatial Join
3-Star Spatial Join
4-Star Spatial Join
Elapsed time (secs) Page reads
0.071
38.9
3072.000
0.163
2.377
9.492
44.396
9.532
44.370
6775.0
13.6
18.5
26.1
27.6
11.9
24.1
20
21
22
• Powerful data-mining applications can be built by providing an interface between an object-relational DBMS (Informix) and a statistical/scientific-computing package (IDL).
•
The IDL extension to Informix allows users to tap the data-analysis and visualization capabilities of IDL through
Informix.
•
Implementation is based on UDFs and the RPC mechanism .
Informix
DBMS
IDL user functions
Efficient storage
and retrieval
Indexing and
clustering of data
Concurrency
control
Security
Data-analysis
routines
Visualization capabilities
IDL server
Data-analysis
routines
Visualization capabilities
23
•
The IDL interface provides two main functions
• idl_exec is a simple user defined function that executes the idl-script for each row of the table
• idl_agg is a user defined aggregate function.
•
The functions are used along with casting functions that cast the pointer returned, to appropriate data type.
Data type Last IDL Data type Cast
.
returned statement returned function from IDL in script by function
.
Integer $RETURN pointer to int toInt
.
Double $RETURN pointer to float toFloat
.
Binary arr $BLOB LO_Handle toBlob structure $RETURN pointer to ROW toRow of lvarchars
2D Array $RETURN ptr to collection toList of rows
Computing distance from the origin using idl_exec
The return variable is a double, i.e. float in Informix. So we use toFloat as casting function.
Select toFloat (
Columns of the table that are used as arguments by idl_exec .$1
x ;
$2
y
$RETURN used to return a double idl_exec ( ROW ( x , y ),
"r = double( sqrt ( ( $ 1)^2 + ( $ 2)^2 ) ) ; $RETURN r ”
)
) from Points;
IDL script
24
Clustering using idl_agg: We use the CLUST_WTS function in IDL that computes the weights (the cluster centers) of an mxn array and returns an array of cluster centers. idl_agg, in this case, returns a collection of rows that can be made to look like a virtual table.
The return variable is a 2D-array. toList is casting function for the pointer to collection of rows that idl_exec returns
The initialization IDL script ; executed once for each group select toList ( idl_agg ( ROW ( ra, dec ),
ROW ( "count=-1",
Iteration script; executed once for every row in the group; here it assigns values to the matrix
)
"count=count+1 &if count eq 0 then y = [$1, $2] else y = [[y], [$1,$2]]",
"w = CLUST_WTS (y, N_CLUSTERS =3); $RETURN w ”
) ) :: list( ROW (a float, b float) not null) from monet_temp ;
Casts the collection into required list.
Final script; executed once at the end of each group; here it clusters data using some IDL routines
$RETURN used to return a collection of rows of two doubles each.
25
Getting a plot using idl_agg : We use a IDL UDF, MONET_JPEG that creates a plot with 1 st two arguments (ra and dec) as the axis and the next two (red & blue) determining the color of the point. MONET_JPEG returns the jpeg of the plot in binary. The binary array is returned as a blob by idl_agg.
Since we have $BLOB , idl_agg returns an LO_H andle to the blob. Hence toBlob is used for casting.
select toBlob ( idl_agg ( ROW ( ra, dec, blue, red ) ,
Initialization script
Iteration script
ROW (
Final script
'flag=0&i=long(-1)&temp=dblarr(4,1000)',
'i=i+1&if i ge 1000 then begin if flag eq 0 then begin a=temp & flag=1&end else a=[[a],[temp]] & i=0 & end & temp(*,i) = [$1,$2,$3,$4]',
'if i gt 0 then begin if flag eq 0 then a=temp(*,0:i) else a=[[a], [temp(*,0:i)]]
& end & b = MONET_JPEG (a(1,*),a(0,*),a(2,*),a(3,*)); $BLOB b’
)
)
) from monet_all;
$BLOB used to return a blob that contains the jpeg image
26