SciDB Tutorial: Technical Overview and Best Practices

Overview
• What is SciDB?
  – Historical context, project goals and motives, current status
• Architecture and installation
  – How to install the software
• Application development
  – Basic schemas, queries, data loading, client options
• Advanced schemas, plugins, math
  – Managing dimensions
  – User-defined types, functions, operators, etc.
• General advice and best practice
  – Scattered throughout the tutorial
• Conclusions and closing

Running time: 1 hour, with 30 minutes for discussion / Q&A.

Background, Motivation and Status
• XLDB – Survey of Scientific Data Management (2008)
  – Who they were: astronomy, remote sensing, geology
  – What they wanted: provenance, dense arrays, legacy data format support
  – What they didn’t want: SQL, a non-starter in this problem domain
  – What was difficult: the unwieldy “big data” file explosion, and parallel data processing
• SciDB and Paradigm4
  – SciDB is an open-source (GPL-3) platform, available through http://www.scidb.org/forum
  – Paradigm4 is a commercial (venture-backed) company that sponsors SciDB development and makes a living selling “Enterprise” features to customers who can pay for them.

What We Learned Since 2010
• Little real-world enthusiasm for provenance
• About half of our use-cases emphasize sparse arrays
• Most data arrives in .csv files or UTF-8 triples
• Commercial demand for: time-series, statistical/numeric analysis, funky-flavored OLAP
• Industries: Bio-IT, industrial sensor data, financial analytics
• Notable science successes: NIH 1000 Genomes (400 TB), NERSC (128 instances, 100+ TB)

Why SciDB?
• Big analytics without big hassles
• Works with the tools you know: R, Python, Matlab, Julia, …
• An MPP database built on an array data model
• Complex analytics on commodity clusters or in the cloud

SciDB Architectural Overview
[Architecture diagram: a SciDB client (iquery, ‘R’, Java, Python) connects to one or more coordinator nodes; a PostgreSQL instance provides the persistent system catalog service; coordinator and worker nodes each run a SciDB engine over a local store, and the nodes talk to one another via SciDB inter-node communication.]

• Massively parallel data management
  – Installs onto a cluster of physical nodes
  – You can install N SciDB instances per physical node
  – (Optional) data redundancy for reliability
• Orthodox query processing
  – Client / server connections
  – RESTful API via a mid-tier shim (mostly for ‘R’ and Python)
  – Parsing and plan generation on the coordinator(s)
  – Limited query optimization
  – Physical plan distribution to the workers
  – Run-time coordination of data movement

Installation and Configuration
• Download the user guide PDF from the forum
  – http://www.scidb.org/forum
  – Also at: http://www.paradigm4.com/HTMLmanual/14.8/scidb_ug/
  – See Section 2 for script instructions
  – See Appendix A for non-scripted steps (yum / apt-get)
• Cluster install script:
  – http://github.com/Paradigm4/deployment
  – Non-root option for RHEL or CentOS
• How-to video:
  – http://www.paradigm4.com/scidb-installation-video/

SciDB Configuration Guide
• Basic set of questions:
  – How many physical compute nodes?
  – How many physical disks per node?
  – How many CPU cores per node?
  – How many concurrently connected users?
  – How much DRAM per node?
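The answers to those sizing questions become entries in the cluster’s config.ini. Below is a minimal sketch for a hypothetical three-node cluster; the host names, paths, and instance counts are illustrative assumptions, not values from this tutorial, so generate a properly tuned file with the configurator tool rather than copying this one.

```ini
# Hypothetical config.ini for a database named "mydb":
# one coordinator node plus two worker nodes, 4 instances per node.
[mydb]
server-0=coord.example.com,3        ; coordinator node; "3" = additional instances
server-1=worker1.example.com,3
server-2=worker2.example.com,3
db_user=mydb
install_root=/opt/scidb/14.8
pluginsdir=/opt/scidb/14.8/lib/scidb/plugins
base-path=/home/scidb/db-data       ; where each instance keeps its local store
redundancy=1                        ; optional data redundancy for reliability
```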
• Use the config.ini generator:
  http://htmlpreview.github.io/?https://raw.github.com/Paradigm4/configurator/master/config.14.8.html

SciDB Client Options
• C/C++ client library, libscidbclient.so
  – Used inside iquery
  – Low level, based on the Array API
• ‘shim’ – a 3-tier web server model
  – ‘R’ and Python clients speak https to shim, which talks to SciDB through libscidbclient
• JDBC driver, implemented in Java

SciDB: Data Model Description
• ‘Arrays’ instead of ‘Relations’
  – Multi-dimensional (up to 99-D)
  – Multi-attribute (theoretically limited to 2^64 attributes, but we test arrays up to 1,000)
  – Extensible (i.e. user-definable) types / functions / aggregates / operators
• Straightforward theoretical mapping
  – Array.dimensions === Relation.key
• Constraints and data integrity rules
  – Arrays can be dense or sparse
  – Dimension lengths can be constrained or unbounded
  – Sophisticated (maybe too sophisticated?) missing-information management
• Subtle but significant differences between the array model and the relational model
  – Dimensions implicitly order cells (not true of SQL tuples)
  – The underlying algebra is more explicit in the query language (AFL)

For Example

  CREATE ARRAY CALLS
    < bytes : int32 DEFAULT 16 >
    [ CALLING=0:*,168000,0, CALLED=0:*,168000,0, WHEN=0:*,100000,0 ];

  CREATE ARRAY CALL_SUMMARY
    < total_bytes : int32 NULL, total_calls : int64 >
    [ CALLING=0:*,168000,0, CALLED=0:*,168000,0 ];

  CREATE ARRAY MODIS
    < probe : double, rgb : int64, q : double >
    [ LAT=-90000000:90000000,10000,100,
      LONG=-180000000:180000000,10000,100,
      WHEN=0:*,100000,0 ];

Queries: A Composable Array Algebra
• Analogs from Relational / OLAP
  – project, filter, join, group-by, union (merge)
  – window, cross_join
  – theta-join
• Non-relational operators
  – regrid, multi-dimensional window, cumulate
• Pure numerical / mathematical operators
  – multiply, transpose, reverse, gaussian-dc, gesvd, tsvd, gemm, spgemm
• Exotics
  – Operators are extensible (hard to do!)
  – Users have added their own: Gaussian smooth, feature detection, Fourier transforms, histogram, …

AQL Examples

  -- Straightforward SQL-like queries.
  SELECT SUM ( C.total_bytes ) AS total_bytes,
         COUNT ( * ) AS total_calls
    INTO CALL_SUMMARY
    FROM CALLS AS C
   GROUP BY C.CALLING, C.CALLED;

  -- More exotic: n-dimensional windowing.
  SELECT MEDIAN ( M.probe ) AS M_Probe
    FROM MODIS AS M
  WINDOW AS ( PARTITION BY LAT 50 PRECEDING AND 50 FOLLOWING,
              LONG 50 PRECEDING AND 50 FOLLOWING );

AFL Query Language

  -- Query to compute a single iteration of Conway’s Game of Life
  -- over a 2-D array named Life.
  filter (
    apply (
      apply (
        join (
          window (                      -- Count the "live" cells in each
            Life, 1, 1, 1, 1,           -- 3x3 neighborhood.
            sum ( alive ) AS sum_n
          ) AS N_Step,
          Life AS P_Step                -- Join the neighbor counts (N_Step) with the
        ),                              -- previous state of the game (P_Step).
        neighbor_count,
        N_Step.sum_n - P_Step.alive    -- Trim out the cell itself, if alive.
      ),
      next_alive,                       -- Apply Conway’s Life rules to neighbor_count:
      iif ( P_Step.alive = 1,           -- a live cell survives with 2 or 3 neighbors,
            iif ( ( neighbor_count < 2 OR neighbor_count > 3 ), 0, 1 ),
            iif ( ( 3 = neighbor_count ), 1, 0 ) )  -- a dead cell revives with exactly 3.
    ),
    next_alive = 1
  )

Data Loading into SciDB ( 1 / 2 )
• Lengthy tutorial with scripting at:
  http://www.scidb.org/forum/viewtopic.php?f=11&t=1308#p2724
• Support for multiple file format options
  – Text, binary, OPAQUE and, as of 14.12, tsv
  – Performance: OPAQUE loads roughly 10x faster than binary, and binary roughly 10x faster than text
• Simplest method:
  1. load ( file, Load_Array, format )
     into a staging array shaped < X, Y, data1, data2, …, datan > [ Row ]
  2.
     store ( redimension ( Load_Array, Target ), Target )
     into a target array shaped < data1, data2, …, datan > [ X, Y ]

Data Loading into SciDB ( 2 / 2 )
• load ( file, array, format ) is shorthand for store ( input ( file, format ), array )
• redimension(…) is expensive (it sorts)
• insert(…) substitutes for store(…)
  – The difference: insert appends, it doesn’t overwrite

Chunk Sizing ( 1 / 2 )
• Really: a list of per-dimension chunk lengths
• Easy when the data is completely dense
• Harder when the data is sparse (and skewed)

Brief review: the per-dimension chunk length is part of the CREATE ARRAY dimension specification.

  CREATE ARRAY … < data > [ dimension = 0 : *, length, overlap ];

Chunk length is a measure in logical space. Set the per-dimension chunk lengths of your array so that the average number of cells per chunk is about 1,000,000.

Chunk Sizing ( 2 / 2 )
• calculate_chunk_length.py
  – Ships with 14.8
  – Soon to be internalized (see below)
• Give it:
  1. The name of an array that already holds data destined for your eventual target array.
  2. A specification of the target’s dimensional “shape”, with “?” marking the values you want it to compute.

  $ calculate_chunk_length.py modis_load "latitude=?:?,?,?, longitude=?:?,?,?"
  latitude=180937:393179,37916,0, longitude=-1206600:-921726,37969,0

• Internalized version (14.12):

  AFL% CREATE ARRAY Target < data > [ latitude=?:?,?,?, longitude=?:?,?,? ] AS modis_load;

Best practice tip: give these tools as much data as you can to seed them.

SciDB Query Writing ( 1 / 3 )
• AFL queries are trees of operators
• Every operator:
  1. Accepts one or more arrays as inputs
     – Plus additional parameters
  2. Returns one array to the operator above it
     – Or to the client, if it is the root operator of the query

Best practice tip: use the built-in help.

  AFL% help('filter');
  {i} help
  {0} 'Operator: filter
  Usage: filter(<input>, <expression>)'
  AFL%

SciDB Query Writing ( 2 / 3 )
• SciDB is an example of a “meta-data driven” system
• Built-in tools help you discover what’s possible:
  – list('operators') – the operators callable from AFL
  – list('functions') – the functions usable in AFL apply(…) or filter(…), and in AQL
  – list('arrays') – the arrays in the SciDB database
  – list('queries') – the queries currently running in the installation
  – list('chunk map') – all the chunks in the installation
  – show ( array_name ) – returns the shape of the named array
  – show ( 'query string', 'afl|aql' ) – returns the shape of the array the query would produce
• These operators return arrays, like any other data, and can be used in queries:

  SELECT *
    FROM list('functions')
   WHERE regex ( signature, 'bool(.*)string,string(.*)' );

SciDB Query Writing ( 3 / 3 )

This query illustrates several ideas: the composability of operators, meta-data as a data source, and useful details about an array’s physical design.

  aggregate (
    filter (
      cross_join (
        filter ( list('arrays'), name = 'Foo' ) AS A,
        list('chunk map') AS C
      ),
      A.uaid = C.uaid
    ),
    min ( name ) AS Name_Of_Array,
    count ( * ) AS Number_of_Physical_Chunks,
    avg ( nelem ) AS Avg_Number_of_Cells_Per_Chunk
  );

Plugins and Extensions
Functions, aggregates, types, operators and macros
• Implement your plugin in C/C++
  – Compile it into a shared object library file (say, plugin.so)
  – Copy the plugin .so file to the plugins directory on each instance
  – Load the library using the SciDB commands:

  AFL% load_library('dense_linear_algebra');
  AFL% unload_library('dense_linear_algebra');  -- takes effect after a scidb restart
  AFL% list('libraries');

• Example implementations of all flavors of C/C++ extensions can be found in the examples directory.
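The meta-data operators and library management tie together neatly: after load_library(), the new entries appear in list('libraries') and list('operators'). The sketch below assumes the list('operators') output carries `name` and `library` attributes, as it does in the releases discussed here; treat it as an illustrative pattern rather than a verbatim transcript.

```
AFL% load_library('dense_linear_algebra');

-- Which operators were registered by something other than the core engine?
AFL% filter ( list('operators'), library <> 'scidb' );

-- The same meta-data queried from AQL, like ordinary data:
AQL% SELECT name FROM list('operators') WHERE library = 'dense_linear_algebra';
```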
SciDB Plugins
Functions, aggregates, types and operators, provided by Paradigm4 as well as the community:
• https://github.com/Paradigm4/dmetric
• https://github.com/Paradigm4/r_exec
• https://github.com/Paradigm4/knn
• https://github.com/Paradigm4/superfunpack
• https://github.com/wangd/SciDB-HDF5
• https://github.com/slottad/scidb-genotypes
• https://github.com/parkerabercrombie/SciDB-GDAL
• https://github.com/mkim48/TensorDB
• https://github.com/tshead/scidb-string
• https://github.com/ljiangjl/Percentile-in-SciDB
• …

Notes on Macros ( 1 / 2 )
• Macros are basic “functional” extensions
  – Not written in C/C++ – they read more like AFL syntax
  – A first stab at a much more elaborate scheme
• How are they implemented?
  – Currently, you implement them in a (central) file:

  AFL% load_module ('macro_file.txt');

  – Examples in lib/scidb/modules/prelude.txt
  – Limited: a naïve “macro expansion” approach

Notes on Macros ( 2 / 2 )
• Example: in a file named /tmp/macro.txt

  array_chunk_details ( __BAR__ ) =
  aggregate (
    filter (
      cross_join (
        filter ( list('arrays'), name = __BAR__ ) AS A,
        list('chunk map') AS C
      ),
      A.uaid = C.uaid
    ),
    count ( * ) AS num_chunks,
    min ( C.nelem ) AS min_cells_per_chunk,
    max ( C.nelem ) AS max_cells_per_chunk,
    avg ( C.nelem ) AS avg_cells_per_chunk
  );

• Load the macro using load_module('/tmp/macro.txt')
• Invoke it like any other operator:

  AFL% array_chunk_details ( 'Foo' );

Conclusions
Too much to pack into one slide!