A System for Fast Low-Cost Access to Very Large Catalogs

advertisement

A System for Fast Low-Cost Access to Very

Large Catalogs

Michael Riggs, Vashundara Puttagunta, Maitreyee

Pasad, Konstantinos Kalpakis, and Jeanne Behnke

CSEE, UMBC and NASA GSFC

URL: http://www.csee.umbc.edu/~kalpakis/monet

Introduction

Motivation

– Many catalogs in astronomy, few old and most of the new ones, contain data for very large number of spatial objects

– Fast methods to access these catalogs are needed

– Budget constraints limit the amount of $$ allocated to S/W and H/W to support fast access to the catalogs

– Thus, low-cost solutions that achieve efficient and effective access to very large catalogs are needed

We demonstrate

– how to achieve fast access to catalog data on a low cost desktop for a very large catalog using an Object-Relational Database System (ORDBMS)

– how to enhance the functionality of the system by extending the

ORDBMS with IDL

2

Test Catalog Used

We use the USNO-A2.0 catalog (called the

Monet

catalog here) which has approximately 500 million stars (points)

Image of the Monet catalog found at: http:/www.nofs.navy.mil/projects/pmm/universe.html

3

Schemas Used

We experiment with a sample from the Monet catalog and the following four approaches (schemas/datablades/indexes)

– Relational only

• Traditional relational datatypes with B-tree indexes (clustered) on dec-RA and RA-dec (

B schemas

)

– Object-Relational

• Datatypes from the Geodetic Datablade (created for the EOSDIS project for geo-location) with an R-tree index on geo-location (

G schemas

)

• Datatypes from Shapes2 Informix datablade with R-tree index on points

(

S schemas

)

• Datatypes from our custom-build

SimpleShape datablade with R-tree index on points built bottom up using Hilbert coding of the point’s coordinates (

I schemas

)

4

Types of Queries Tested

Spatial Window

(Online catalog access)

Find all stars within a user-defined bounding rectangle

Spatial Or-Window

(Online catalog access)

Find all the stars within at least one of two different bounding rectangles

Spatial Self-Join

(Catalog mining)

Find all stars that are within a specified box centered on each star

Spatial Multi-Join

(Catalog Correlations)

Spatial chain-join

Find a spatial region in a “chain-joined” sequence of tables (catalogs)

Spatial Star-join

Find a spatial region in a “star-joined” sequence of tables (catalogs)

5

Our Custom Datablade

The custom-made datablade

SimpleShape

extends the Informix Dynamic

Server 2000 by providing

– a fixed length opaque type (UDT) storing 4 small floats (16 bytes total)

– full R-tree index support including new bottom-up index building

– full B-tree index support based on 2 dimensional Hilbert Curves

– the

SimplePoint UDT holds the coordinates of a 2 dimensional point

– the

SimpleBox UDT holds the lower left and upper right SimplePoints of a rectangle to make a box

– input/output functions and constructors

6

Format of the USNO-A2.0

Each star in the catalog comes packaged as three integers.

– (int RA, int DEC, int MAG)

– RA = (RA in decimal hours) * 15 * 3600 * 100

– DEC = ((DEC in decimal degrees) + 90) * 3600 * 100

– MAG = combination of

field

,

blue

,

red

values plus flags

The catalog data are stored in a database table with schema (coordinates

SimplePoint, magnitude integer)

– the actual data in the table are stored in their

original

(compressed) format as above

– user-defined functions (UDFs) change coords and magnitude into meaningful fields

7

Specializing the SimpleShape to the

USNO-A2.0 Catalog

• the RA and DEC fields are stored in a SimplePoint point (dec, ra) of type

SimplePoint the data format is not changed from packaged monet data data conversion takes place in the DBMS when displaying results the MAG field is stored as a single integer field in DBMS directly additional utility UDFs are created

– Ra() takes a SimplePoint and returns RA in decimal hours

– Dec() takes a SimplePoint and returns DEC in decimal degrees

– MonetBox() contructs a SimpleBox from RA and DEC in decimal hours and decimal degrees respectively

– Blue() takes MAG and returns the blue magnitude

– Red() takes MAG and returns the red magnitude

– Field() takes MAG and returns the field of the star

8

Custom Data Loader

• Custom High Performance Loader

– Built from the Virtual Table Interface (VTI) of Informix Server

– Maps binary OS file into database table

– Only select queries with NO where clause can be issued on table

• Reasons for development

– Lack of Informix High Performance Loader for PC platform

– Bug in regular table load command for PC platform

– USNO-A2.0 is distributed in binary format

– absense of string conversions make loading fast

9

Indexing for the Monet Data

• R-tree index on coordinates

– built using new bottom-up algorithm and the 2nd order Hilbert values of the points

– allows fast window queries on entire catalog

– faster build times than B-tree index

– takes about 1 day to index the entire catalog

• Second order Hilbert values

– space filling curve converts two dimensions into one dimension

– points are sorted based on their Hilbert value and stored adjacently in R-tree leaf pages

– Hilbert value code and R-tree bottom-up index building algorithms are built into SimpleShape software

10

Sample SQL Statements for the

SimpleShape Datablade

Sample SQL statements using the SimpleShape datablade

– create table test (product_id int, location SimplePoint)

– insert into test values (1, SimplePoint(2.333,5.77))

– insert into test values (2, ‘point(4.123, 8.234)’ )

– create table products (product_id int, location_x int, location_y int)

– insert into test select product_id, SimplePoint(location_x, location_y) from products

– select * from test where within (location, ‘box(0, 0, 100, 100)’ )

Sample SQL statements for the USNO-A2.0 specialization

– create table monet_data(coords SimplePoint, mag int)

– select Ra(coords), Dec(coords), Red(mag), Blue(mag), Field(mag) from monet_data where within (coords, monetbox (-3.22, 22.45, 1.23, 22.78) )

– select (Blue(mag) - Red(mag)) as energy, count(*) as count from monet_data where within (coords, monetbox (-3.22, 22.45, 1.23, 22.78) ) group by energy order by count desc

11

More Sample SQL Statements

• Custom High Performance Loader

– create table load_data(value1 int, value2 int) using vti_load(file=‘/tmp/data.bin’);

– select * from load_data

– insert into real_table select * from load_data

• Loading Monet Data

– create table monet_all (coords SimplePoint, mag int)

– create table monet_load(ra int, dec int, mag int) using vti_load(file=‘/tmp/zone0000.cat’)

– insert into monet_all select SimplePoint(dec,ra), mag from monet_load

• Indexing Monet Data

– create index monet_all_idx on monet_all (coords SIMPLE_OPS) using rtree

• Selecting Monet Data

– select * from monet_all where within (coords, MonetBox(dec,ra,dec,ra))

12

System Configuration

The experiments with our custom datablade were done on a system with the following configuration

– Hardware (costing less than $2K)

• AMD Duron 650MHz Processor

• FIC model AZ11 motherboard w/IDE ATA-66 controller

• 256MB of PC100 RAM

• Two IBM Ultra/100 7200RPM IDE disks model DTLA307045

– Software

• Linux Mandrake version 7.0

• Updated Linux Kernel version 2.4.0-test4

• Informix Dynamic Server 2000 version 9.20.UC1

• Custom datablade and data loading software

13

System Configuration

The experiments with other datablades and the traditional relational solution were done on a system with the following configuration

– Hardware

• Sun UltraSparc 60 with 512 MB memory

• two 8GB and two 4GB SCSI disks

– Software

• Solaris 2.6

• Informix Dynamic Server 9.14 with the Universal DataServer Option

• Geodetic and Shapes2 datablades

14

Experimental Results

Elapsed time to

load

data from the

Monet

catalog

Schema 100K stars 500M stars

Relational

Geodetic

Shapes2

SimpleShape

45.829 secs

233.618 secs

161.714 secs

27.216 secs

65 hrs

334 hrs

226 hrs

32 hrs

Elapsed time in creating R-tree indexes for the three datablades

Schema

Geodetic

Shapes2

SimpleShape

100K stars

448 secs

256 secs

8 secs

500M stars

607 hrs

371 hrs

18 hrs

15

Experimental Results

Loading and Indexing all the 500M stars

Schema

Relational

Geodetic

Shapes2

SimpleShape

Load Time Index Time

B-tree

3

14

10

1

6

90

57

R-tree

25

15

1

Size

13GB

190GB

140GB

15GB

Index Size

B-tree

56GB

66GB

86GB

14GB

R-tree

104GB

77GB

20GB

All times in days

The numbers for the Relational, Geodetic, and Shapes2 schemas are estimates based on regression from measurements for up to 140K stars

.

Beware: Results are not for the faint of heart!

16

Experimental Results

Loading times for the Monet catalog using the SimpleShape datablade

17

Experimental Results

Bulding an R-tree index for the Monet catalog using the SimpleShape datablade

Elapsed time

Page Reads

18

Experimental Results

Number of page reads for various queries and schemas for 60K stars

The query window(s) used are 0.0666x0.0666 decimal degrees

Query

Spatial Self-Join

Spatial Or-Window

2-Chain Spatial Join

3-Star Spatial Join

Relational

4440

22

Schema

Geodetic

1005023

35

23

24

49

50

Shapes2 SimpleShape

60443

44

6775

14

87

43

19

12

19

Experimental Results

Performance of the custom-datablade (schema) for queries against a table with

60K stars

Query

Spatial Window

Spatial Self-Join

Spatial OR Window

2-Chain Spatial Join

3-Chain Spatial Join

4-Chain Spatial Join

3-Star Spatial Join

4-Star Spatial Join

Elapsed time (secs)

0.071

3072.000

0.163

2.377

9.492

44.396

9.532

44.370

Page reads

38.9

6775.0

13.6

18.5

26.1

27.6

11.9

24.1

20

21

22

IDL Extension for the Informix Server

Powerful data-mining applications can be built by providing an interface between an object-relational DBMS (Informix) and a statistical/scientific-computing package (IDL).

The IDL extension to Informix allows users to tap the data-analysis and visualization capabilities of IDL through

Informix.

Implementation is based on UDFs and the

RPC

mechanism

.

Informix

DBMS

IDL user functions

 Efficient storage

 and retrieval

Indexing and

 clustering of data

Concurrency

 control

Security

Data-analysis

 routines

Visualization capabilities

IDL server

Data-analysis

 routines

Visualization capabilities

23

The IDL interface provides two main functions

idl_exec

is a simple user defined function that executes the idl-script for each row of the table

idl_agg

is a user defined aggregate function.

The functions are used along with casting functions that cast the pointer returned, to appropriate data type.

Data type Last IDL Data type Cast

.

returned statement returned function from IDL in script by function

.

Integer

$RETURN pointer to int toInt

.

Double

$RETURN pointer to float toFloat

.

Binary arr

$BLOB

LO_Handle toBlob structure

$RETURN

2D Array

$RETURN pointer to

ROW toRow of lvarchars ptr to collection toList of rows

Computing distance from the origin using idl_exec

The return variable is a double, i.e. float in Informix. So we use

toFloat

as casting function.

Select toFloat (

Columns of the table that are used as arguments by

idl_exec

.$1

 x ;

$2

 y

$RETURN used to return a double

idl_exec (

ROW

( x , y ),

"r = double( sqrt ( (

$

1)^2 + (

$

2)^2 ) ) ;

$RETURN r ”

)

) from Points;

IDL script

24

Clustering using idl_agg: We use the

CLUST_WTS

function in IDL that computes the weights (the cluster centers) of an mxn array and returns an array of cluster centers. idl_agg, in this case, returns a collection of rows that can be made to look like a virtual table.

The return variable is a 2D-array.

toList

is casting function for the pointer to collection of rows that

idl_exec

returns

The initialization IDL script ; executed once for each group

select toList ( idl_agg

(

ROW

( ra, dec ),

ROW

( "count=-1",

Iteration script; executed once for every row in the group; here it assigns values to the matrix

)

"count=count+1 &if count eq 0 then y = [$1, $2] else y = [[y], [$1,$2]]",

"w =

CLUST_WTS

(y,

N_CLUSTERS

=3);

$RETURN w ”

) ) :: list(

ROW

(a float, b float) not null) from monet_temp ;

Casts the collection into required list.

Final script; executed once at the end of each group; here it clusters data using some IDL routines

$RETURN used to return a collection of rows of two doubles each.

25

Getting a plot using idl_agg

: We use a

IDL UDF,

MONET_JPEG

that creates a plot with 1 st two arguments (ra and dec) as the axis and the next two (red & blue) determining the color of the point.

MONET_JPEG returns the jpeg of the plot in binary. The binary array is returned as a blob by idl_agg.

Since we have

$BLOB

, idl_agg returns an

LO_H andle to the blob. Hence

toBlob

is used for casting.

select toBlob ( idl_agg (

ROW

( ra, dec, blue, red ) ,

Initialization script

Iteration script

ROW

(

Final script

'flag=0&i=long(-1)&temp=dblarr(4,1000)',

'i=i+1&if i ge 1000 then begin if flag eq 0 then begin a=temp & flag=1&end else a=[[a],[temp]] & i=0 & end & temp(*,i) = [$1,$2,$3,$4]',

'if i gt 0 then begin if flag eq 0 then a=temp(*,0:i) else a=[[a], [temp(*,0:i)]]

& end & b =

MONET_JPEG

(a(1,*),a(0,*),a(2,*),a(3,*));

$BLOB b’

)

)

) from monet_all;

$BLOB used to return a blob that contains the jpeg image

26

Download