Jacek Becla
SLAC National Accelerator Laboratory

Outline
• LSST (challenges, baseline)
• SciDB
• Summary

LSST Database Challenges
• Scale
  – ~50 billion astronomical objects
  – ~1,000 observations per object
  – At least 1 release per year
  – 30-50 PB catalog
• Query complexity
  – Spatial and temporal correlations

Database Challenges (II)
• Near-live updates: 1/3 of the Object Catalog every 24 h
• Provenance tracking, virtual data
• Would be nice…
  – Better APIs
  – Tighter integration with images
  – Better support for data uncertainty

Baseline Architecture
• Shared-nothing MPP database
  – On top of an open source DBMS (such as MySQL)
• Large tables partitioned
  – With overlaps!
• Shared scans for high-volume queries
• Heavily indexed for low-latency, low-volume queries

Qserv Architecture
Deploying for wide use by LSST science collaborations on a ~10 TB data set this year

Qserv Highlights
• Distributed processing (map) and result combining (reduce) use an off-the-shelf DBMS
• Built for analyses on immutable data sets
• Overlapping partitioning, fixed chunks; 1st level materialized, 2nd level on the fly
• Have a working prototype
  – Alpha state
  – Large-scale tests (100+ nodes) underway
  – Shared scans planned for Q1 next year
  – Beta planned for mid next year
  – All open source

Open source DBMS for scientific research
Special thanks to Paul Brown and Marilyn Matz (SciDB/Zetics) for help with the contents of the SciDB slides

Arrays Are the Basic Data Structure
    CREATE ARRAY Reads ( Read::CHAR(1), Confidence::FLOAT )
      [ SEQ_ID=0:*, POSITION=0:50 ]
Array has
• name - 'Reads' in this case
• element structure - a tuple of named, typed attributes
• dimensional specification - named whole number index ranges
• Arrays can be sparse (missing values)

Data Model
• Nested multi-dimensional arrays
  – Cells can be tuples or other arrays
• Time is an automatically supported extra dimension
  – No overwrites
• Extensible type system
• Ragged arrays allow each row/column to have a different dimensionality
[Figure: two-dimensional array with axes Dimension 1 and Dimension 2; the cell at (0,0) holds a1 = 3, a2 = 6.52, a3 = (5,2)]

Data Model: More Good Stuff
• Support for multiple flavors of ‘null’
  – Array cells can be ‘EMPTY’
  – User-definable, context-sensitive treatment
• Native support for uncertain or probabilistic data
  – Error bars, range data, normal probability distribution functions
• Support for provenance and named versions
  – Records lineage of derived data

Architecture
• Shared-nothing cluster
  – 10s to 1000s of nodes
  – Commodity hardware
  – TCP/IP between nodes
  – Linear scale-up
• Each node has a processor and storage
• Queries refer to arrays as if not distributed
• Query planner optimizes queries for efficient data access & processing
• Query plan runs on a node’s local executor & storage manager
• Runtime supervisor coordinates execution
[Figure: layered architecture diagram. Application Layer: Java, C++, …; language-specific UI; doesn’t require JDBC, ODBC. Server Layer: Runtime Supervisor; Query Interface and Parser (AQL, an extension of SQL; also supports UDFs); Plan Generator. Storage Layer: Node 1, Node 2, Node 3, each with a Local Executor and Storage Manager.]

Array-oriented Storage Manager
• Optimized for both dense and sparse array data
  – Different data storage, compression, and access
• Arrays are “chunked” (in multiple dimensions)
• Chunks are partitioned across a collection of nodes
• Chunks have ‘overlap’ to support neighborhood operations
• No overwrite
• Replication provides efficiency and back-up
• Shared scans provide efficient I/O
• In-situ data, including netCDF and HDF5

Internals - 3 steps
1. Vertical partitioning into multiple single-attribute arrays. Compressed.
2. Divide the single-value array into overlapping partitions or chunks
3. Assign chunks to physical nodes in a massively parallel architecture

Query Language & Operators
• Array Query Language (AQL)
  – Declarative SQL-like language extended for working with array data
  – Not currently available: coming in next release
• Array Internal Language (AIL)
  – Functional language for internal implementation of the operators
  – Available now
• Array-specific operators
  – Aggregate, Apply, Filter, Join, Regrid, Subsample, et al
• Linear algebra, statistical, and other applied math operators
  – Matrix operations, clustering, covariance, linear and logistic regression, et al
• Extensible with Postgres-style user-defined functions
  – e.g., to ‘cook’ images, perform customized search, FFT, feature-detect
• Interfaces to other open source packages
  – MATLAB, R, SAS

Array Query Language (AQL)
    CREATE ARRAY Test_Array
      < A: integer NULLS, B: double, C: USER_DEFINED_TYPE >
      [ I=0:99999,1000,10, J=0:99999,1000,10 ]
      PARTITION OVER ( Node1, Node2, Node3 )
      USING block_cyclic();
• Attribute names: A, B, C
• Index names: I, J
• Chunk size: 1000
• Overlap: 10

Array Query Language (AQL)
    SELECT Geo-Mean ( T.B )
    FROM Test_Array T
    WHERE T.I BETWEEN :C1 AND :C2
      AND T.J BETWEEN :C3 AND :C4
      AND T.A = 10
    GROUP BY T.I;
• Geo-Mean ( T.B ): user-defined aggregate on attribute B in array T
• BETWEEN predicates on T.I and T.J: subsample
• T.A = 10: filter
• GROUP BY T.I: group-by
So far as SELECT / FROM / WHERE / GROUP BY queries are concerned, there is little logical difference between AQL and SQL

Matrix Multiply
    CREATE ARRAY TS_Data < A1:int32, B1:double >
      [ I=0:99999,1000,0, J=0:3999,100,0 ]
    (slide annotation: 10K x 4K)
    multiply ( project ( TS_Data, B1 ), transpose ( project ( TS_Data, B1 ) ) )
    (slide annotation: project on one attribute)
• Smaller of the two arrays is replicated at all nodes
  – Scatter-gather
• Each node does its “core” of the bigger array with the replicated smaller one
• Produces a distributed answer

User-defined functions (UDFs)
• Will run in parallel
• Enables
  – Operations not yet natively supported
  – Image processing
  – Domain-specific custom operations

Traditional RDBMS vs Arrays
• Data model
  – In many places, n-d arrays are better than tables
    • Simulating arrays on top of tables costs ~100x
  – Locality and adjacency are natural in n-dimensional space
  – Tracking uncertainty, versioning, updates, units, etc. becomes just another dimension
• Operations
  – Need array operators and parallel user-defined functions more than SQL
  – Think regrid, smooth, transpose, matrix multiply, curve fitting, covariance, not join

Analytics in SciDB
• Advanced
  – MATLAB, SAS, R, S+ like functionality
• Scalable: disk based, massively parallel
  – Does not fall off a performance cliff when main memory is exceeded
  – Add resources to do more work
• Integrated with data management
  – Data not cycled between database and analytics engine

SciDB Team
• ~25 people, including world-class database pioneers
• Mostly volunteers from academic, science and industrial communities

SciDB POCs
• LSST (demoed at VLDB)
• Quantitative finance
• Genomic sequencing
• Tests with LHC ATLAS tag data

Just For Fun
• Given a densely populated 10K x 10K array:
    CREATE ARRAY Big < A1:int32, B1:double, C1:char >
      [ I=0:9999,1000,0, J=0:9999,1000,0 ]
• “What is the probability that the value of B1 increases from J-1 to J, over a small range of [ I -> I, J->J ], when C1 is the same for the next and previous?”
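For readers who think in plain loops rather than array operators, the question can be read as the following Python sketch. It is only an illustration on in-memory nested lists, not SciDB code: the function name `prob_increase` and its arguments are made up here, and treating the window bounds as inclusive is an assumption about how the slide intends the range.

```python
def prob_increase(b1, c1, i_lo, j_lo, i_hi, j_hi):
    """Among cells (i, j) in the window whose C1 equals the C1 of the
    previous cell (i, j-1), return the fraction where B1 increased.
    Bounds are inclusive (an assumption for this sketch)."""
    if j_lo < 1:
        raise ValueError("j_lo must be >= 1 so that J-1 exists")
    n = 0  # cells with matching C1 and an increase in B1
    d = 0  # cells with matching C1
    for i in range(i_lo, i_hi + 1):
        for j in range(j_lo, j_hi + 1):
            if c1[i][j] != c1[i][j - 1]:
                continue  # C1 changed between J-1 and J: not counted
            d += 1
            if b1[i][j] - b1[i][j - 1] > 0:
                n += 1
    return n / d if d else float("nan")


# Tiny hand-checkable example: one row, window J = 1..3.
b1 = [[1.0, 2.0, 1.0, 3.0]]
c1 = [["a", "a", "a", "b"]]
p = prob_increase(b1, c1, 0, 1, 0, 3)
print(p)  # 0.5: two cells keep the same C1, one of them shows an increase
```

The slide's own formulation follows; it expresses the same two counts (numerator N and denominator D) with subsample, join, apply, filter and count on the distributed array.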
    apply (
      join (
        count (
          filter (
            project (
              filter (
                apply (
                  join (
                    subsample ( Big, 5000, 5000, 5500, 5500 ) AS Ar1,
                    subsample ( Big, 5000, 4999, 5500, 5499 ) AS Ar2
                  ),
                  'Diff', Ar1.B1 - Ar2.B1
                ),
                Ar1.C1 = Ar2.C1
              ),
              'Diff'
            ),
            Diff > 0
          )
        ) AS N,
        count (
          project (
            filter (
              apply (
                join (
                  subsample ( Big, 5000, 5000, 5500, 5500 ) AS Ar1,
                  subsample ( Big, 5000, 4999, 5500, 5499 ) AS Ar2
                ),
                'Diff', Ar1.B1 - Ar2.B1
              ),
              Ar1.C1 = Ar2.C1
            ),
            'Diff'
          )
        ) AS D
      ),
      'P', abs(N.count) / abs(D.count)
    )

Roadmap
• R0.5 available: very early stage “work-in-progress”
  – Basic functionality
    • iquery interpreter
    • Internal query language - AIL
    • Small set of operators - create, aggregate, join, filter, subsample
  – Minimal ‘wiki-style’ documentation
  – Breathes, crawls, hiccups
• R0.75 targeted for end-of-year
  – AQL - the SQL-like array query language
  – Error handling
  – Scalable math operations
  – Better documentation & stability
  – Linux only
• R1.0 beta in April 2011
  – More functionally complete (UDFs, uncertainty, provenance et al)
  – Robust, high performance

SciDB and LSST
• SciDB inspired by LSST needs
  – Promises to address all key challenges
  – And more…
• Detailed evaluation planned (R0.75 or R1.0)
• An off-the-shelf, open source, well-supported system is much more attractive than maintaining a custom-built solution over the long term

Summary
• Baseline LSST: shared-nothing architecture on top of an open source DBMS
• SciDB: a promising new option
  – Shared nothing, scalable
  – Focuses on complex analytics, spatial and temporal correlations
  – Strong team behind it
• …and rapidly growing interest!
  – More at http://scidb.org
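As a closing illustration of the chunk size and overlap parameters from the Test_Array example (1000 and 10) and of the "assign chunks to nodes" step, here is a minimal Python sketch. The helper names `chunk_bounds` and `place_chunks` are hypothetical, and the round-robin placement is only a stand-in assumption for `block_cyclic()`, whose actual behavior is not described in these slides.

```python
from itertools import product


def chunk_bounds(lo, hi, size, overlap):
    """Split the inclusive index range lo..hi into fixed-size chunks.
    Each entry is (core_lo, core_hi, stored_lo, stored_hi): the stored
    range extends the core by `overlap` cells on each side, clipped to
    the array bounds, so neighborhood operations near a chunk edge can
    run without fetching data from another node."""
    bounds = []
    for core_lo in range(lo, hi + 1, size):
        core_hi = min(core_lo + size - 1, hi)
        stored_lo = max(lo, core_lo - overlap)
        stored_hi = min(hi, core_hi + overlap)
        bounds.append((core_lo, core_hi, stored_lo, stored_hi))
    return bounds


def place_chunks(dims, nodes):
    """Map every multi-dimensional chunk index to a node, round-robin.
    `dims` is a list of (lo, hi, size, overlap) tuples, one per dimension.
    Round-robin is an assumption made for this sketch."""
    counts = [len(chunk_bounds(*dim)) for dim in dims]
    placement = {}
    for n, index in enumerate(product(*(range(c) for c in counts))):
        placement[index] = nodes[n % len(nodes)]
    return placement


# Dimensions I and J of Test_Array: 0..99999, chunk size 1000, overlap 10.
dim = (0, 99999, 1000, 10)
bounds = chunk_bounds(*dim)
placement = place_chunks([dim, dim], ["Node1", "Node2", "Node3"])
print(len(bounds), bounds[1], len(placement), placement[(0, 0)])
```

With these parameters each dimension yields 100 chunks, and an interior chunk stores 1020 cells per dimension (1000 core cells plus 10 on each side).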