Beyond SDSS: SDSS-III, PS1 and the Cloud
Ani Thakar
Center for Astrophysical Sciences and Institute for Data Intensive Engineering and Science (IDIES), The Johns Hopkins University
SSDM, Edinburgh, Oct 27-28, 2010

Outline
• SDSS-III
  – The SDSS is dead, long live the SDSS!
  – New science program, new instruments
  – Data management
• Pan-STARRS (PS1)
  – Survey overview
  – DM challenges
  – System architecture
• SSDM in the Cloud?
  – Experiments with Amazon EC2 and SQL Azure
  – Moving data into the cloud is the biggest challenge

SDSS-III Surveys
• BOSS: Baryon Oscillation Spectroscopic Survey
  – Map the spatial distribution of LRGs and quasars
  – Detect the characteristic scale imprinted by baryon acoustic oscillations in the early universe
• APOGEE: APO Galactic Evolution Experiment
  – High-resolution, high-S/N IR spectroscopy of 100k stars
  – Penetrate the dust that obscures the inner Galaxy
• SEGUE-2: Sloan Extension for Galactic Understanding and Exploration 2
  – Mapping the outer Milky Way with spectra of over 200k stars
  – Doubling the SEGUE sample of SDSS-II
• MARVELS: Multi-object APO Radial Velocity Exoplanet Large-area Survey
  – Monitor radial velocities of 11,000 bright stars

SDSS-III Instruments
• Same telescope and camera as SDSS-I & II
• New spectrographs
  – APOGEE high-resolution spectrograph (R ~ 20,000 in the H band)
  – BOSS: 2 identical spectrographs rebuilt from the SDSS-II spectrographs, 1000 fibers/plate
  – MARVELS: new-technology (DFDI) spectrograph (R ~ 6,000–10,000)
• SEGUE-2 used the SDSS-II spectrographs
  – Additional SEGUE-2 data will use the BOSS instruments

SDSS-III Data Management
• Data 2-3x the size of SDSS-II (10-12 TB/copy)
• Schema changes
  – Different instruments, science objectives
• Use basically the same technology
  – Evolution rather than revolution
  – Single-server DB model
    • DB file partitioning across multiple disk volumes
    • Upgrade to SQL Server 2008 (better partitioning)
  – Multiple copies for load balancing
    • Segregation of queries for different workloads
• Site change from FNAL to JHU

SDSS-III Data Loading and Access
• Data loading
  – Upgraded version of the SDSS sqlLoader pipeline
  – New photo and spectro pipelines at NYU
• Data access
  – Use the SDSS workhorses (upgraded for SDSS-III): SkyServer, ImgCutout, CasJobs, and MyDB (see the query sketch below)
(Diagram: SkyServer and ImgCutout, the synchronous and visual query tools, run against the DR8_1 and DR8_2 copies of the database; CasJobs, the asynchronous query tool, runs against DR8_3; MyDB holds the user databases. Each copy is spread over Data1/Data2 disk volumes.)
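To make this access path concrete, here is a minimal example of the kind of synchronous SkyServer query (or asynchronous CasJobs job) these servers handle. The table and column names (PhotoObj, SpecObj, bestObjID) follow the standard SDSS CAS schema; the magnitude and redshift cuts are purely illustrative and not from the talk.

```sql
-- A small SDSS CAS query: r-band selected objects with spectroscopic redshifts.
SELECT TOP 10
       p.objID, p.ra, p.dec, p.r,   -- photometric object and r-band model magnitude
       s.z AS redshift              -- spectroscopic redshift
FROM   PhotoObj AS p
JOIN   SpecObj  AS s ON s.bestObjID = p.objID
WHERE  p.r BETWEEN 17.0 AND 17.5
  AND  s.z > 0.1;
```

Short queries like this run synchronously in SkyServer; anything long-running goes through CasJobs, with results landing in the user's MyDB.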
Pan-STARRS

Pan-STARRS Overview
• Panoramic Survey Telescope & Rapid Response System
• Wide-field imaging survey led by the University of Hawaii / IfA
• Ultimately 4 x 1.8 m telescopes (PS4)
  – Haleakala Observatory, Maui (Hawaii)
• 1.4-gigapixel camera (largest to date)
  – 7 sq. deg. in each 30-sec exposure
  – 6,000 sq. deg. per night
• NEOs and asteroid mapping are a major goal

PS1 Prototype
• PS1 is a single-telescope prototype
  – International PS1 Science Consortium
    • MPIA, MPE, JHU, CfA, Durham, NCU (Taiwan)
  – Started May 2010, complete in ~3.5 yrs
• PS1 science
  – 3π survey: 3/4 of the sky, 30,000 sq. deg.
    • Complete one pass in ~1 week
    • Each part of the sky observed 3-4 times per month
  – Other surveys: MDF, Solar System Sweet Spot, etc.

PS1 Data Challenges
• Data (especially ingest) more complex than SDSS
  – The time domain requires constantly loading new information, updating what is already there, recalibrating, etc.
• Volume much bigger
  – 25x SDSS-II (100 TB)
  – A single (monolithic) DB is no longer feasible
  – Move to the distributed GrayWulf architecture
  – Makes loading even more challenging
  – Work around SQL Server's distributed limitations
    • Distributed Partitioned Views (see the sketch after the data-layout diagram below)
    • Preceded the Parallel Data Warehouse (“Madison”) release

PSPS: Published Science Products Subsystem
• Will manage the PS1 catalog data
• PSPS will not receive image files, which are retained by the IPP
• Three significant PS1 I/O threads:
  – Ingest detections and initial celestial object data from the IPP
  – Ingest moving object data from MOPS
  – User queries on detection/object data records
Courtesy: J. Heasley, U. Hawaii

Data accessed via PSPS
• The MOPS database
• The ODM database (JHU): science data on astronomical objects generated by the IPP
  – 3π survey
  – MDF, Pan-Planets, & PAndromeda
  – Solar System Sweet Spot Survey
• CasJobs/MyDB Query Workbench (JHU)
  – Local space to hold query results
  – Local analysis environment
  – Download, upload, plotting interface, sharing
• Data products from the PS1 Science Servers

PSPS Components
• PS1 Science Interface (PSI)
  – The “link” with the human user
• Data Retrieval Layer (DRL)
  – The “gatekeeper” of the data collections
• PS1 data collection managers
  – Object Data Manager (ODM, JHU)
  – Solar System Data Manager (SSDM)
• Other (future/PS4) data collection managers, e.g.
  – “Postage stamp” cutouts
  – Metadata database (OTIS)
  – Cumulative sky image server
  – Filtered transient database, etc.
(Diagram: the PSI client and other software connect through the DRL to the data collection managers (ODM, SSDM, and future DMs), which are fed by the IPP and MOPS.)
Courtesy: J. Heasley, U. Hawaii

ODM GrayWulf Architecture
(Diagram: “The Pan-STARRS Science Cloud”. Behind the cloud, data-valet workflows run on the admin and load-merge machines: CSV files from the data creators (the telescope and the Image Processing Pipeline, IPP) are validated, with exception notification, loaded into Load DBs, merged into the Cold slice DBs, and then flipped into the Warm and Hot slice DBs on the production machines. Data flows in one direction, except for error recovery via the slice fault-recovery workflow. On the user-facing side, astronomers (the data consumers) run queries and workflows against the MainDB distributed views through the CasJobs query service and MyDB.)
Courtesy: M. Nieto-Santisteban, STScI

ODM Distributed Data Layout
(Diagram: CSV files from the image pipeline are ingested by six load-merge nodes, which build the cold copies of the 16 slice databases S1-S16. Eight slice nodes serve the warm and hot copies, two slices per node, with each slice also kept on a neighboring node. Two head nodes host the Main database.)
Courtesy: M. Nieto-Santisteban, STScI
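The MainDB distributed views in the two diagrams above are where the Distributed Partitioned Views workaround from the PS1 Data Challenges slide comes in. Below is a minimal sketch of the pattern, assuming illustrative linked-server, database, table, column, and zone-range names rather than the actual PSPS ones.

```sql
-- Sketch of a distributed partitioned view on the head node (names are illustrative).
-- Each slice server holds the detections for a contiguous range of declination
-- zones and is registered on the head node as a linked server.
CREATE VIEW dbo.Detection AS
    SELECT * FROM Slice01.PS1.dbo.Detection_S01    -- e.g. zoneID      0 .. 1349
    UNION ALL
    SELECT * FROM Slice02.PS1.dbo.Detection_S02    -- e.g. zoneID   1350 .. 2699
    UNION ALL
    SELECT * FROM Slice03.PS1.dbo.Detection_S03;   -- e.g. zoneID   2700 .. 4049
GO
-- For SQL Server to prune remote slices at query time, each member table carries
-- a CHECK constraint on the partitioning column, e.g. on the Slice01 server:
--   ALTER TABLE dbo.Detection_S01
--     ADD CONSTRAINT chk_zone_S01 CHECK (zoneID BETWEEN 0 AND 1349);
```

With the constraints in place, a query such as SELECT ... FROM Detection WHERE zoneID BETWEEN 100 AND 200 touches only the first slice server, which is what makes a single logical MainDB over many slice machines practical.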
ODM Logical Data Organization
• A Head node holds information about Objects and metadata
• Slices partition the detection data
  – Across a cluster of nodes
  – By ranges in declination
    • Divide the data into small declination zones
    • Zone partitioning (Gray, Nieto-Santisteban, Szalay 2006)
• Object IDs are chosen to physically group the data
  – Both on the Head and the Slices
  – Sources nearby on the sky are nearby on disk
  – Cuts down the number of killer disk seeks

ObjectID Clusters Data Spatially
• Example: Dec = –16.71611583, RA = 101.287155
  – ZH = 0.008333 (zone height in degrees)
  – ZID = (Dec + 90) / ZH = 08794.0661
  – ObjectID = 087941012871550661
• ObjectID is unique when objects are separated by > 0.0043 arcsec
Courtesy: M. Nieto-Santisteban, STScI

ODM Physical Schema
• Objects are kept on the head (main) node of the system
  – Queries involving objects should be very fast
• Detections and stack attributes are stored on the slice machines
  – Queries that involve only detections can run fast (with a little care)
  – Queries that combine both object and detection attributes can be slower
    • But only if large numbers of objects are involved and non-indexed attributes are used
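The ObjectID encoding above can be checked with a few lines of T-SQL. The digit layout assumed here (five-digit integer zone number, then RA in microdegrees, then the fractional part of the zone number scaled by 10^4) is inferred from the single worked example on the slide, so treat it as a sketch of the idea rather than the production PS1 scheme.

```sql
-- Reconstruct the slide's ObjectID from (RA, Dec), assuming the digit layout
-- [zone integer][RA * 1e6][zone fraction * 1e4] inferred from the example above.
DECLARE @ra  float = 101.287155,
        @dec float = -16.71611583,
        @zh  float = 30.0 / 3600.0;                -- zone height: 30 arcsec = 0.008333... deg
DECLARE @zid float = (@dec + 90.0) / @zh;          -- zone number = 8794.0661...
SELECT CAST(FLOOR(@zid) AS bigint) * 10000000000000         -- integer zone:  08794
     + CAST(ROUND(@ra * 1e6, 0) AS bigint) * 10000          -- RA digits:     101287155
     + CAST(ROUND((@zid - FLOOR(@zid)) * 1e4, 0) AS bigint) -- zone fraction: 0661
       AS ObjectID;                                         -- 87941012871550661
```

Because the declination zone occupies the leading digits, a range predicate on ObjectID maps to a contiguous band of sky, so sources that are nearby on the sky land on nearby pages of the clustered index. That is the property the logical-organization slide relies on to cut down the killer disk seeks, both on the head node and on the slices.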
Data in the Cloud

Cloudy with a chance of … pain?
• Can an entire DB be moved to the cloud as is?
• Restrictions on data size
  – For now, use a subset of the original database
  – In the future, the data will have to be partitioned first
• Migrating the data to the cloud
  – How do we copy the data into the cloud?
  – Will it need to be extracted from the DBMS?
  – Will the schema have to be modified?
  – Will science functionality be compromised?
  – What about query performance?

Cloud Experiments
• Dataset
  – Sloan Digital Sky Survey (SDSS-II)
  – Public dataset, no restrictions
  – Reasonably complex schema and usage model
  – Easy to generate a subset of arbitrary size
  – Only available as a SQL Server database
• Clouds
  – Amazon Elastic Compute Cloud (EC2)
    • AWS wanted to host the public SDSS dataset
  – Microsoft Azure / SQL Azure
    • Natural fit for SQL Server databases

Migrating data to Amazon EC2
• 1 TB data size limit (per instance)
• SDSS DR6 100 GB subset: BestDR6_100GB
  – Actually more like 150 GB
  – Large enough for performance tests, small enough to be migrated in a few days/weeks
• Several manual steps to create the DB instance
• Could not connect to the DB from outside
• Preliminary performance tests, but without optimizing within the cloud

SDSS public dataset on EC2
• Public datasets on Amazon
  – Stored as a snapshot available on AWS
  – Advertised on the AWS blog
  – Anyone can pull it into their own account
• The data is free, but not the usage
  – Create a running instance
  – Multiple instances deployed manually
• First SQL Server dataset on EC2 (?)
  – AWS also created a Linux snapshot of the SQL Server DB!

AWS Blog Entry
http://aws.typepad.com/

How it’s supposed to work
• With other DB dumps, the assumption is that users will set up their own DB and import the data
• The public SDSS dataset serves two kinds of users:
  a) People who currently access the SDSS CAS
  b) General AWS users who are interested in the data
• For (a), we should be able to replicate the same services that SDSS currently has, but using a SQL Server instance on EC2
• For (b), users should have everything they need on the AWS public dataset page for SDSS

Steps to create the DB in EC2
• Create a “snapshot” of the database first
• Create storage for the DB: a 200 GB EBS volume
  – Instantiate the snapshot as a volume of the required size
• Create a SQL Server 2005 Amazon Machine Image (AMI)
  – AMI instance from the snapshot
• Attach the AMI instance to the EBS volume
  – Creates a running instance of the DB
• Get an Elastic IP to point to the instance

Steps to create the Web interface
• Create a (small) volume with Windows Server 2003
• Create an instance from a Windows 2003 AMI (only IIS, no SQL Server)
• Attach the volume to the instance
• Get a public DNS name and admin account
• BUT … could not connect to the SQL Server IP
  – So the outside world cannot connect to the data as yet

Migrating data to Microsoft Azure
• 10 GB data size limit (now 50 GB)
• SDSS DR6 10 GB subset: BestDR6_10GB
• Two ways to migrate the database
  – Script it out of the source DB (very painful)
    • Many options to select
    • Runs out of memory on the server!
  – Use the SQL Azure Migration Wizard (much better!)
    • Runs for a few hours
    • Produces a huge trace of errors, and many items are skipped
    • But it does produce a working DB in Azure

SQL Azure Migration Wizard
(Screenshot of the wizard.)

Unsupported MS-SQL features
• Can’t run command shells from inside the DB
• Global temp objects are disallowed
  – Can’t use performance counters in test queries, so it is hard to benchmark/compare performance
• SQL-CLR function bindings are not supported
  – Can’t use our HTM spatial indexing library
• T-SQL directives
  – e.g., to set the degree of parallelism
• Some built-in T-SQL functions
  – Need to find workarounds for these

Azure sign of success?
• OK, so the data is in Azure, but at what cost?!
  – Meaningful performance comparison not possible
  – Dataset too small
  – Schema features stripped
  – No spatial index, so many queries are crippled
• Connected to the Azure DB from outside the cloud
  – Hooked up the SkyServer Web interface
  – Connected with the SQL Server client (SSMS)
  – Ran simple queries from both

Cloudy Conclusions
• Migrating scientific databases to the cloud is not really feasible at the moment
• Even migrating smaller DBs can be painful
  – Several steps to deploy each copy
  – The cloud may not support full functionality
  – Are the problems limited to SQL Server DBs?
• Large DBs will have to be partitioned
  – Set up distributed databases in the cloud
  – Query distributed databases from outside
• Haven’t even talked about the economics yet

Thank you!
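As a closing illustration of the SQL-CLR limitation noted under “Unsupported MS-SQL features”: on the regular SkyServer a cone search goes through the HTM library, whereas without CLR only coarse, index-friendly approximations remain. fGetNearbyObjEq is one of the standard SkyServer HTM functions (radius in arcminutes); the bounding-box fallback below is only a sketch of a possible workaround, not the one actually adopted.

```sql
-- On the regular SkyServer, a cone search uses the SQL-CLR HTM library:
SELECT p.objID, p.ra, p.dec
FROM   dbo.fGetNearbyObjEq(185.0, 0.0, 2.0) AS n   -- cone of radius 2 arcmin
JOIN   PhotoObj AS p ON p.objID = n.objID;

-- Without SQL-CLR (as on SQL Azure at the time), one crude substitute is an
-- RA/Dec bounding box that needs only plain indexes on (ra, dec) and over-selects:
DECLARE @ra float = 185.0, @dec float = 0.0, @r float = 2.0 / 60.0;  -- degrees
SELECT objID, ra, dec
FROM   PhotoObj
WHERE  dec BETWEEN @dec - @r AND @dec + @r
  AND  ra  BETWEEN @ra - @r / COS(RADIANS(@dec))
               AND @ra + @r / COS(RADIANS(@dec));
```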