SciDB Tutorial:
Technical Overview and Best Practices
Overview
• What is SciDB?
• Historical Context, Project Goals and Motives, current Status
• Architecture, Installation
• How to install the software
• Application Development
• Basic Schemas, Queries, Data Loading, Client Options
• Advanced Schemas, Plugins, Math
• Managing dimensions
• User-defined types, functions, operators, etc.
• General Advice and Best Practice
• Will be scattered throughout the tutorial
• Conclusions and Closing
Running Time: 1 hour, with 30 minutes for discussion / Q&A.
Background, Motivation and Status
• XLDB – Survey of Scientific Data Management (2008)
  • Who they were: Astronomy, Remote Sensing, Geology
  • What they wanted: Provenance, Dense Arrays, Legacy Data Format support
  • What they didn’t want: SQL was a non-starter in this problem domain
  • What was difficult: the “big data” file explosion was unwieldy, as was parallel data processing
• SciDB and Paradigm4
  • SciDB is an open-source (GPL-3) platform, available through http://www.scidb.org/forum
  • Paradigm4 is a commercial (venture-backed) company that sponsors SciDB development and makes a living selling “Enterprise” features to customers who can pay for them.
What We Learned Since 2010
• Little real-world enthusiasm for Provenance
• About half of our use-cases emphasize sparse arrays
• Most data arrives in .csv files or UTF-8 triples
• Commercial demand for: time-series, statistical/numeric analysis, funky-flavored OLAP
• Industries: Bio-IT, Industrial Sensor Data, Financial Analytics
• Notable Science Successes: NIH 1000 Genomes (400 TB), NERSC (128 instances, 100+ TB)
Why SciDB?
Big analytics without big hassles:
• R, Python, Matlab, Julia, …
• MPP database
• Array data model
• Complex analytics
• Commodity clusters or cloud
SciDB Architectural Overview
[Architecture diagram: SciDB clients (iquery, ‘R’, Java, Python) connect to one or more SciDB Coordinator nodes. A PostgreSQL instance serves as the persistent system catalog service. Each SciDB node, coordinator and worker alike, runs a SciDB Engine with its own local store, and nodes cooperate over SciDB inter-node communication.]
SciDB Architectural Overview
• Massively Parallel Data Management
• Installs onto a cluster of physical nodes
• You can install N SciDB instances per 1 physical node
• (Optional) Data redundancy for reliability
• Orthodox Query Processing
• Client / server connections
• RESTful API using a mid-tier shim (mostly for ‘R’ and Python)
• Parsing and plan generation on the Coordinator(s)
• Limited query optimization
• Physical plan distribution to the Worker(s)
• Run-time coordination of data movement
Installation and Configuration
• Download the user guide PDF from the forum
• http://www.scidb.org/forum
• Also at:
• http://www.paradigm4.com/HTMLmanual/14.8/scidb_ug/
• See Section 2 for scripted install instructions
• See Appendix A for non-scripted steps (yum / apt-get)
• Cluster install script:
• http://github.com/Paradigm4/deployment
• Non-root option for RHEL or CentOS
• How-to video:
• http://www.paradigm4.com/scidb-installation-video/
SciDB Configuration Guide
• Basic Set of Questions:
• How many physical compute nodes?
• How many physical disks per node?
• How many CPU cores per node?
• How many concurrently connected users?
• How much DRAM per node?
• Use the config.ini generator
http://htmlpreview.github.io/?https://raw.github.com/Paradigm4/configurator/master/config.14.8.html
SciDB Client Options
• C/C++ Client Library libscidbclient.so
• Used inside the iquery client
• Low level, based on the Array API
• ‘shim’ – a 3-tier web server model:
  ‘R’ / ‘Python’ clients → https → shim → libscidbclient → SciDB
• JDBC Driver implemented in Java
SciDB: Data Model Description
• ‘Arrays’ instead of ‘Relations’
• Multi-dimensional (up to 99-D)
• Multi-attribute (theoretically limited to 2^64, but we test arrays to 1,000)
• Extensible (i.e., user-definable) types, functions, aggregates, and operators
• Straightforward Theoretical Mapping
  • Array.dimensions === Relation.key
• Constraints and Data Integrity Rules
• Arrays can be dense or sparse
• Dimension lengths can be constrained or unbounded
• Sophisticated (maybe too sophisticated?) missing information management
• Subtle but Significant Differences between Array model and Relational
• Dimensions implicitly order cells (not true of SQL tuples)
• Underlying algebra more explicit in the query language (AFL)
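To make the dimensions-as-key mapping concrete, here is a small plain-Python sketch (purely illustrative; it says nothing about how SciDB stores data): a sparse array modeled as a mapping from dimension coordinates to attribute values, with a relational-style group-by over its cells. The coordinate and byte values are invented.

```python
# Illustrative sketch only: a sparse 3-D array modeled as a mapping
# from dimension coordinates (the "key") to attribute values.
# This mirrors Array.dimensions === Relation.key, not SciDB internals.
from collections import defaultdict

calls = {
    # (CALLING, CALLED, WHEN) -> (bytes,)
    (17, 42, 1000): (128,),
    (17, 42, 2000): (256,),
    (99, 42, 1500): (64,),
}

def lookup(array, *coords):
    """Return the attribute tuple at the given coordinates, or None
    if the cell is empty (the array is sparse)."""
    return array.get(coords)

def summarize(array):
    """Relational-style GROUP BY (CALLING, CALLED) over the cells."""
    summary = defaultdict(lambda: [0, 0])  # key -> [total_bytes, total_calls]
    for (calling, called, _when), (nbytes,) in array.items():
        summary[(calling, called)][0] += nbytes
        summary[(calling, called)][1] += 1
    return dict(summary)
```

Because the dimensions form the key, each coordinate tuple identifies at most one cell; the group-by collapses the WHEN dimension just as the CALL_SUMMARY example on the next slide does.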
For Example
CREATE ARRAY CALLS
< bytes : int32 DEFAULT 16>
[ CALLING=0:*,168000,0, CALLED=0:*,168000,0,
WHEN=0:*,100000,0 ];
CREATE ARRAY CALL_SUMMARY
< total_bytes : int32 NULL, total_calls : int64 >
[ CALLING=0:*,168000,0, CALLED=0:*,168000,0 ];
CREATE ARRAY MODIS
< probe : double, rgb : int64, q : double >
[ LAT=-90000000:90000000, 10000, 100,
LONG=-180000000:180000000, 10000, 100,
WHEN=0:*,100000,0 ];
Queries: Composable Array Algebra
• Analogs from Relational / OLAP
• project, filter, join, group-by, union (merge)
• window, cross join
• theta-join
• Non-Relational Operators
• regrid, multi-dimensional window, cumulate
• Pure Numerical / Mathematical Operators
• multiply, transpose, reverse, gaussian-dc, gesvd, tsvd, gemm, spgemm
• Exotics
• Operators are extensible (hard to do!)
• Users have added their own gaussian smooth, feature detection,
fourier transforms, histogram …
AQL Examples
-- Straightforward SQL-like queries.
SELECT SUM ( C.bytes ) AS total_bytes,
       COUNT ( * ) AS total_calls
INTO CALL_SUMMARY
FROM CALLS AS C
GROUP BY C.CALLING, C.CALLED;
-- More exotic: n-dimensional windowing.
SELECT MEDIAN ( M.probe ) AS M_Probe
FROM MODIS AS M
WINDOW AS ( PARTITION BY
    LAT 50 PRECEDING AND 50 FOLLOWING,
    LONG 50 PRECEDING AND 50 FOLLOWING );
AFL Query Language
filter (
  apply (                 -- “Query” to compute a single iteration of
    apply (               -- Conway’s Game of Life over a 2D array.
      join (
        window (          -- Compute the number of "live" cells in 3x3
          Life, 1, 1, 1, 1, sum ( alive ) AS sum_n
        ) AS N_Step,
        Life AS P_Step    -- Join neighbor count (N_Step) with previous
      ),                  -- state of game (was Life, aliased as P_Step)
      neighbor_count,
      N_Step.sum_n - P_Step.alive    -- Trim out cell itself if alive
    ),
    next_alive,           -- Apply Conway’s Life rules to neighbor_count
    iif ( P_Step.alive = 1,
          iif ( ( neighbor_count < 2 OR neighbor_count > 3 ), 0, 1 ),
          iif ( ( 3 = neighbor_count ), 1, 0 )
        )
  ),
  next_alive = 1 )
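As a sanity check on the rules encoded above, here is a plain-Python sketch of one Life iteration over a small grid (illustrative only; it mirrors the window-then-trim-then-rules logic, not SciDB's execution).

```python
# Plain-Python sketch of one Game of Life step, mirroring the AFL
# query's logic: sum live cells in each 3x3 window, subtract the
# cell itself, then apply Conway's rules.

def life_step(grid):
    """Return the next generation of a 2-D list-of-lists 0/1 grid."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # 3x3 window sum (window(Life, 1, 1, 1, 1, sum(alive)))
            window_sum = sum(
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            )
            neighbor_count = window_sum - grid[r][c]  # trim cell itself
            if grid[r][c] == 1:
                # Survives only with 2 or 3 neighbors.
                nxt[r][c] = 0 if (neighbor_count < 2 or neighbor_count > 3) else 1
            else:
                # Born with exactly 3 neighbors.
                nxt[r][c] = 1 if neighbor_count == 3 else 0
    return nxt

# A "blinker" oscillates between a horizontal and a vertical bar.
blinker = [[0, 0, 0],
           [1, 1, 1],
           [0, 0, 0]]
```

Running the step twice on the blinker should return the original grid, which is a handy way to convince yourself the window/trim/rules decomposition is right.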
Data Loading into SciDB ( 1 / 2 )
• Lengthy Tutorial with Scripting at:
http://www.scidb.org/forum/viewtopic.php?f=11&t=1308#p2724
• Support for multiple file format options
– Text, binary, OPAQUE and in 14.12, tsv
– Performance (relative load speed): OPAQUE ~ 100×, binary ~ 10×, text ~ 1×
• Simplest Method:
  1. load ( file, Load_Array, format ) into a 1-D load array:
     < X, Y, data1, data2, … datan > [ Row ]
  2. store ( redimension ( Load_Array, Target ), Target ) into the target shape:
     < data1, data2, … datan > [ X, Y ]
Data Loading into SciDB ( 2 / 2 )
• load( file, array, format ) ->
store ( input ( file, format ), array )
• redimension(…) is expensive (sort)
• insert(…) substitutes for store(…)
– Difference: insert appends, doesn’t overwrite
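For the common case where data arrives as (x, y, value) triples, here is a minimal Python sketch of producing CSV text in the one-cell-per-row shape a 1-D load array expects. The sample rows and attribute layout are hypothetical.

```python
# Sketch: generate CSV text of (x, y, value) triples, one cell per
# row, matching the < X, Y, value > [ Row ] load-array shape above.
# The sample data is invented for illustration.
import csv
import io

def write_triples(rows):
    """Write (x, y, value) rows as CSV text ready for a 1-D load array."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for x, y, value in rows:
        writer.writerow([x, y, value])
    return buf.getvalue()

sample = [(0, 0, 1.5), (0, 1, 2.5), (7, 3, -1.0)]
text = write_triples(sample)
```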
Chunk Sizing ( 1 / 2 )
• Really: List of Per-Dimension Chunk Lengths
• Easy when the data is completely dense
• Harder when the data is sparse (and skewed)
Brief Review:
Per-dimension chunk length is part of the CREATE ARRAY
dimension specification.
CREATE ARRAY …
< data >
[ dimension = 0 : * , length, overlap ];
Chunk length is a measure in logical space.
Set the per-dimension chunk lengths of your array so that
the average number of cells-per-chunk ~= 1,000,000
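The rule of thumb above can be sketched as a quick calculation, assuming you can estimate the array's fill density. This is a back-of-the-envelope aid only, not the logic of calculate_chunk_length.py.

```python
# Sketch: estimate per-dimension chunk lengths so that the expected
# number of non-empty cells per chunk is roughly 1,000,000.
# Inputs are your assumptions: logical dimension extents and an
# estimated fill density (1.0 for a completely dense array).
import math

TARGET_CELLS_PER_CHUNK = 1_000_000

def chunk_lengths(extents, density=1.0):
    """extents: list of per-dimension lengths in logical space.
    Returns one chunk length per dimension, splitting the logical
    volume evenly across dimensions (the n-th root)."""
    ndims = len(extents)
    # Logical cells a chunk must span to hold ~1M non-empty cells.
    logical_cells = TARGET_CELLS_PER_CHUNK / density
    per_dim = logical_cells ** (1.0 / ndims)
    return [min(extent, math.ceil(per_dim)) for extent in extents]

# Dense 100,000 x 100,000 array: 1000 x 1000 chunks.
dense = chunk_lengths([100_000, 100_000])
# Same shape at 1% density: chunks 10x longer on each side.
sparse = chunk_lengths([100_000, 100_000], density=0.01)
```

Note this even split is only a starting point; skewed sparse data (the hard case the slide mentions) is exactly where the shipped tool earns its keep.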
Chunk Sizing ( 2 / 2 )
• calculate_chunk_length.py
– Ships with 14.8
– Soon to be internalized (see below)
• Give it:
1. Name of an array that has data to be placed into your
eventual target array.
2. Specification of the target’s dimensional “shape”, with
“?” indicating the values you want it to compute.
$ calculate_chunk_length.py modis_load "latitude=?:?,?,?, longitude=?:?,?,?"
latitude=180937:393179,37916,0, longitude=-1206600:-921726,37969,0
• Internalized Version (14.12)
AFL%> CREATE ARRAY Target
< data >
[ latitude=?:?,?,?,
longitude=?:?,?,? ] AS modis_load;
Best Practice Tip: Give these tools as much data as you can to seed them.
SciDB Query Writing ( 1 / 3 )
• AFL Queries are trees of operators
• Every operator:
1. Accepts one or more arrays as inputs
–Plus additional parameters
2. Returns one array to the operator above it
–Or the client if the root operator of the query
Best Practice Tip: Use the built-in help.
AFL% help('filter');
{i} help
{0} 'Operator: filter Usage: filter(<input>, <expression>)'
AFL%
SciDB Query Writing ( 2 / 3 )
• SciDB is an example of a “meta-data driven” system
• Built-in tools to help you discover what’s possible:
  • list('operators') – list of operators callable from AFL
  • list('functions') – list of functions usable in AFL apply(…) or filter(…) and in AQL
  • list('arrays') – list of arrays in the SciDB database
  • list('queries') – list of currently running queries in the installation
  • list('chunk map') – list of all the chunks in the installation
  • show ( array_name ) – returns the shape of the named array
  • show ( 'query string', 'afl|aql' ) – returns the shape of the array produced by the query
• These operators return arrays, like any other data, and can be used in queries.
SELECT *
FROM list ('functions')
WHERE regex ( signature, 'bool(.*)string,string(.*)' );
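In spirit, the query above is just a regex filter over catalog rows. Here is a plain-Python sketch with invented stand-ins for list('functions') output:

```python
# Sketch of what the AQL query above does: filter a catalog of
# function signatures with a regular expression. The sample rows
# are invented stand-ins for list('functions') output.
import re

functions = [
    {"name": "contains", "signature": "bool contains(string,string)"},
    {"name": "abs",      "signature": "double abs(double)"},
    {"name": "regex",    "signature": "bool regex(string,string)"},
]

# Same pattern the slide's query uses.
pattern = re.compile(r"bool(.*)string,string(.*)")

matches = [f["name"] for f in functions if pattern.search(f["signature"])]
```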
SciDB Query Writing ( 3 / 3 )
The query below illustrates several ideas:
• Composability of operators
• Meta-data as a data source
• Useful details about physical design

aggregate (
  filter (
    cross_join (
      filter (
        list('arrays'),
        name = 'Foo'
      ) AS A,
      list('chunk map') AS C
    ),
    A.uaid = C.uaid
  ),
  min ( name ) AS Name_Of_Array,
  count ( * ) AS Number_of_Physical_Chunks,
  avg ( nelem ) AS Avg_Number_of_Cells_Per_Chunk
);
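The same join-and-aggregate can be mimicked in plain Python with hand-made stand-ins for list('arrays') and list('chunk map'); all rows below are invented for illustration.

```python
# Sketch: join arrays to chunk_map on uaid, then aggregate per-chunk
# cell counts, mirroring the AFL query above. Rows are invented.

arrays = [
    {"name": "Foo", "uaid": 7},
    {"name": "Bar", "uaid": 9},
]
chunk_map = [
    {"uaid": 7, "nelem": 900_000},
    {"uaid": 7, "nelem": 1_100_000},
    {"uaid": 9, "nelem": 10},
]

def chunk_stats(array_name):
    """filter + cross_join + aggregate over the metadata stand-ins."""
    uaids = {a["uaid"] for a in arrays if a["name"] == array_name}
    nelems = [c["nelem"] for c in chunk_map if c["uaid"] in uaids]
    return {
        "name_of_array": array_name,
        "number_of_physical_chunks": len(nelems),
        "avg_cells_per_chunk": sum(nelems) / len(nelems) if nelems else None,
    }
```

The point of the exercise is the one the slide makes: because list(…) returns ordinary arrays, physical-design questions become ordinary queries.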
Plugins and Extensions
Functions, aggregates, types, operators and macros
• Implement your plugin in C/C++
  – Compile it into a shared object library file (say, plugin.so)
  – Copy the plugin .so file to the pluginsdir on each instance
  – Load the library using the SciDB command:
    AFL% load_library('dense_linear_algebra')
    AFL% unload_library('dense_linear_algebra')
  – Restart SciDB after unloading a library
  – AFL% list('libraries') shows what is currently loaded
Example implementations of all flavors of C/C++
extensions can be found in the examples directory.
SciDB Plugins
Functions, aggregates, types and operators, provided by P4 as well as the community:
• https://github.com/Paradigm4/dmetric
• https://github.com/Paradigm4/r_exec
• https://github.com/Paradigm4/knn
• https://github.com/Paradigm4/superfunpack
• …
• https://github.com/wangd/SciDB-HDF5
• https://github.com/slottad/scidb-genotypes
• https://github.com/parkerabercrombie/SciDB-GDAL
• https://github.com/mkim48/TensorDB
• https://github.com/tshead/scidb-string
• https://github.com/ljiangjl/Percentile-in-SciDB
Notes on Macros ( 1 / 2 )
• Macros are basic “functional” extensions
– Not written in C/C++ - more like AFL syntax
– First stab at a much more elaborate scheme
• How are they implemented?
– Currently, implement them in a (central) file:
  AFL%> load_module ( 'macro_file.txt' );
– Examples in lib/scidb/modules/prelude.txt
– Limited: naïve “macro expansion” approach
Notes on Macros ( 2 / 2 )
• Example: In a file named /tmp/macro.txt
array_chunk_details ( __BAR__ ) = aggregate (
filter (
cross_join (
filter (list ('arrays'), name = __BAR__ ) AS A,
list('chunk map') AS C
), A.uaid = C.uaid
), count(*) AS num_chunks,
min ( C.nelem ) AS min_cells_per_chunk,
max ( C.nelem) AS max_cells_per_chunk,
avg ( C.nelem ) AS avg_cells_per_chunk
);
• Load the macro using load_module('/tmp/macro.txt')
• Invoke it like any other operator:
AFL%> array_chunk_details ( 'Foo' );
Conclusions
Too much to pack into one slide!
Download