Beyond SDSS: SDSS-III, PS1 and the Cloud
Ani Thakar
Center for Astrophysical Sciences and
Institute for Data Intensive Engineering and Science (IDIES)
The Johns Hopkins University
Outline
• SDSS-III
– The SDSS is dead, long live the SDSS!
– New science program, new instruments
– Data management
• Pan-STARRS (PS1)
– Survey overview
– DM challenges
– System architecture
• SSDM in the Cloud?
– Experiments with Amazon EC2 and SQL Azure
– Moving data into the cloud is the biggest challenge
SDSS-III Surveys
• BOSS: Baryon Oscillation Spectroscopic Survey
– Map spatial distribution of LRGs and quasars
– Detect characteristic scale imprinted by baryon acoustic
oscillations in the early universe
• APOGEE: APO Galactic Evolution Experiment
– High-res, high S/N IR spectroscopy of 100k stars
– Penetrate dust that obscures the inner Galaxy
• SEGUE-2: Sloan Extension for Galactic Understanding and Exploration
– Mapping the outer Milky Way with spectra of over 200k stars
– Doubling the SEGUE sample of SDSS-II
• MARVELS: Multi-object APO Radial Velocity Exoplanet Large-area Survey
– Monitor radial velocities of 11,000 bright stars
SDSS-III Instruments
• Same telescope and camera as SDSS-I & II
• New spectrographs
– APOGEE high-resolution spectrograph (R ~ 20,000 in the H band)
– BOSS: 2 identical spectrographs rebuilt from SDSS-II spectrographs – 1000 fibers/plate
– MARVELS: new-technology (DFDI) spectrograph (R ~ 6,000 – 10,000)
• SEGUE-2 used SDSS-II spectrographs
– Additional SEGUE-2 data will use BOSS insts.
SDSS-III Data Management
• Data 2-3x size of SDSS-II (10-12 TB/copy)
• Schema changes
– Different instruments, science objectives
• Use basically the same technology
– Evolution rather than revolution
– Single server DB model
• DB file partitioning between multiple disk volumes (see the sketch below)
• Upgrade to SQL Server 2008 (better partitioning)
– Multiple copies for load balancing
• Segregation of queries for different workloads
• Site change from FNAL to JHU
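
A minimal sketch of what file partitioning across volumes looks like in SQL Server; the database name, filegroups, and drive letters below are illustrative, not the actual SDSS-III layout.

    -- Illustrative only: spread the database across filegroups on separate volumes
    CREATE DATABASE BestDR8
    ON PRIMARY
        (NAME = BestDR8_sys,   FILENAME = 'D:\data\BestDR8_sys.mdf'),
    FILEGROUP PhotoFG
        (NAME = BestDR8_photo, FILENAME = 'E:\data\BestDR8_photo.ndf'),
    FILEGROUP SpectroFG
        (NAME = BestDR8_spec,  FILENAME = 'F:\data\BestDR8_spec.ndf')
    LOG ON
        (NAME = BestDR8_log,   FILENAME = 'G:\data\BestDR8_log.ldf');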
SDSS-III Data Loading and Access
• Data loading
– Upgraded version of SDSS sqlLoader pipeline
– New photo and spectro pipelines at NYU
• Data Access
– Use the SDSS workhorses (upgraded for SDSS-III)
[Diagram: SDSS-III data access architecture. SkyServer and ImgCutout (visual and synchronous query tools) and CasJobs (asynchronous query tools backed by MyDB user databases) sit in front of load-balanced database copies DR8_1, DR8_2 and DR8_3, each split across Data1 and Data2 volumes.]
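
As a reminder of the CasJobs usage model, a typical asynchronous query writes its results into the user's MyDB; a minimal sketch (the target table name is illustrative):

    -- Select from the archive, land the results in the user's MyDB
    SELECT TOP 1000 objID, ra, dec, u, g, r, i, z
    INTO mydb.MyBrightGalaxies     -- "mydb." is the CasJobs MyDB prefix
    FROM PhotoObj
    WHERE type = 3                 -- galaxies
      AND r < 17.5;                -- bright in r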
Pan-STARRS
Pan-STARRS Overview
• Panoramic Survey Telescope & Rapid
Response System
• Wide-field imaging survey led by UHawaii/IfA
• Ultimately 4 x 1.8m telescopes (PS4)
– Haleakala Observatory, Maui (Hawaii)
• 1.4 Gpixel camera (largest to date)
– 7 sq.deg. in each 30-sec exposure
– 6000 sq. deg. per night
• NEO and asteroid mapping is a major goal
PS1 Prototype
• PS1 is a single-telescope prototype
– International PS1 Science Consortium
• MPIA, MPE, JHU, CfA, Durham, NCU (Taiwan)
– Started May 2010, complete in ~ 3.5 yrs
• PS1 science
– 3π survey: 3/4 of the sky, 30,000 sq. deg.
• Complete one pass in ~ 1 week
• Each part of the sky observed 3-4 times per month
– Other surveys – MDF, SS Sweet Spot, etc.
PS1 Data Challenges
• Data (esp. ingest) more complex than SDSS
– Time domain requires that we are constantly
loading new information, updating what is
already there, recalibrating, etc.
• Volume much bigger – 25x SDSS-II (100 TB)
– Single (monolithic) DB no longer feasible
– Move to distributed GrayWulf architecture
– Makes loading even more challenging
– Work around SQL Server distributed limitations
• Distributed Partition Views (see the sketch below)
• Preceded Parallel Data Warehouse (Madison) release
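
A minimal sketch of such a distributed partitioned view over per-slice detection tables; the linked-server, database, and table names are hypothetical:

    -- Each slice server holds one member table; a CHECK constraint on the
    -- partitioning column (e.g., a zone/declination range) lets the
    -- optimizer prune remote slices at query time.
    CREATE VIEW Detection AS
        SELECT * FROM Slice01.PS1.dbo.Detection_S01
        UNION ALL
        SELECT * FROM Slice02.PS1.dbo.Detection_S02
        UNION ALL
        SELECT * FROM Slice03.PS1.dbo.Detection_S03;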
PSPS: Published Science Products Subsystem
• Will manage the PS1
catalog data
• PSPS will not receive
image files, which are
retained by IPP
• Three significant PS1
I/O threads:
– Ingest detections and
initial celestial object
data from IPP
– Ingest moving object
data from MOPS
– User queries on
detection/object data
records
Courtesy: J. Heasley, U. Hawaii
Data accessed via PSPS
• The MOPS database
• The ODM database (JHU): science data on
astronomical objects generated by the IPP
– 3π survey
– MDF, Pan-Planets, & PANdromeda
– Solar System Sweetspot Survey
• CasJobs/MyDB Query Workbench (JHU)
– Local space to hold query results
– Local analysis environment
– Download, upload, plotting interface, sharing
• Data products from PS1 Science Servers
PSPS Components
• PS1 Science Interface (PSI) – the "link" with the human
• Data Retrieval Layer (DRL) – "gatekeeper" of the data collections
• PS1 data collection managers
– Object Data Manager (ODM, JHU)
– Solar System Data Manager (SSDM)
• Other (future/PS4) data collection managers; e.g.,
– "Postage stamp" cutouts
– Metadata database (OTIS)
– Cumulative sky image server
– Filtered transient database, etc.
[Diagram: human users and other client software reach the PSI, which talks to the DRL; the DRL brokers access to the ODM, the SSDM, and future data managers, fed by the IPP and MOPS pipelines.]
Courtesy: J. Heasley, U. Hawaii
ODM GrayWulf Architecture
[Diagram: the Pan-STARRS "science cloud" GrayWulf architecture. On the data-valet side (behind the cloud), CSV files from the Image Processing Pipeline (IPP) pass through validation (with exception notification) into load workflows and load DBs, are combined by merge workflows into cold slice DBs on the admin and load-merge machines, and are then flipped into warm and hot slice DBs on the production machines; data flows in one direction, except for slice-fault recovery. On the user-facing side, astronomers (data consumers) query the hot slices through MainDB distributed views via the CasJobs query service and MyDB.]
Courtesy: M. Nieto-Santisteban, STScI
ODM Distributed Data Layout
[Diagram: CSV files from the image pipeline are ingested by six load-merge nodes, which also keep the cold copies of slices S1-S16. Eight slice nodes each hold two hot slices plus warm copies of two slices whose hot copies live on other nodes, so every slice exists in hot, warm and cold versions. Two head (main) nodes front the cluster.]
Courtesy: M. Nieto-Santisteban, STScI
ODM Logical Data Organization
• A Head node holds information about
Objects and metadata.
• Slices are partitions of the detection data
– Across cluster of nodes
– By ranges in declination
• Divide data into small declination zones
• Zone partitioning (Gray, Nieto-Santisteban, Szalay
2006)
• Object IDs chosen to physically group data (see the sketch below)
– Both on the Head and Slices
– Sources nearby on the sky are nearby on disk
– Cuts down the number of killer disk seeks
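
A minimal sketch of the idea, with hypothetical table and column names; clustering on the zone-encoded objID keeps sky neighbors adjacent on disk:

    -- Clustered primary key on the zone-based objID places nearby sources
    -- on nearby pages, reducing random disk seeks.
    CREATE TABLE Objects (
        objID   bigint NOT NULL,   -- zone + RA encoded (see next slide)
        ra      float  NOT NULL,
        [dec]   float  NOT NULL,
        -- ... photometric attributes ...
        CONSTRAINT pk_Objects PRIMARY KEY CLUSTERED (objID)
    );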
ObjectID Clusters Data Spatially
RA = 101.287155, Dec = –16.71611583
ZH (zone height) = 0.008333 deg
ZID = (Dec + 90) / ZH = 08794.0661
ObjectID = 087941012871550661
ObjectID is unique when objects are separated by > 0.0043 arcsec
Courtesy: M. Nieto-Santisteban, STScI
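
The worked example above can be reproduced in a few lines of T-SQL; the digit layout (5-digit zone, RA to 10^-6 deg, 4-digit in-zone offset) is inferred from this single example and is illustrative rather than the definitive PSPS encoding.

    DECLARE @ra  float = 101.287155;
    DECLARE @dec float = -16.71611583;
    DECLARE @zh  float = 30.0 / 3600.0;           -- zone height 0.008333 deg

    DECLARE @zid  float  = (@dec + 90.0) / @zh;   -- 8794.0661
    DECLARE @zone bigint = FLOOR(@zid);           -- 8794

    DECLARE @objID bigint =
          @zone * 10000000000000                             -- zone: leading digits
        + CAST(ROUND(@ra * 1000000, 0) AS bigint) * 10000    -- RA to 1e-6 deg
        + CAST(ROUND((@zid - @zone) * 10000, 0) AS bigint);  -- in-zone offset

    SELECT @objID AS ObjectID;   -- 87941012871550661 (slide shows a leading zero)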
ODM Physical Schema
• Objects are kept on the head (main) node
of the system
– Queries involving objects should be very fast
• Detections and stack attributes are stored
on the slice machines
– Queries that involve only detections can run
fast (with a little care)
– Those that combine both object and detection
attributes can be slower
• But only if large numbers of objects are involved and non-indexed attributes are used (see the example queries below)
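
Two illustrative queries against hypothetical Objects (head node) and Detection (distributed view over the slices) tables make the distinction concrete:

    -- Fast: touches only the Objects table on the head node
    SELECT objID, ra, dec
    FROM Objects
    WHERE ra  BETWEEN 101.0 AND 101.5
      AND dec BETWEEN -17.0 AND -16.5;

    -- Potentially slower: joins head-node Objects to Detections on the slices
    SELECT o.objID, COUNT(*) AS nDetections
    FROM Objects AS o
    JOIN Detection AS d ON d.objID = o.objID
    WHERE o.ra  BETWEEN 101.0 AND 101.5
      AND o.dec BETWEEN -17.0 AND -16.5
    GROUP BY o.objID;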
Data in the Cloud
Cloudy with a chance of … pain?
• Can entire DB be moved to the cloud as is?
• Restrictions on data size
– For now, use subset of original database
– In future, will have to partition the data first
• Migrating the data to the cloud
– How to copy data into the cloud?
– Will it need to be extracted from the DBMS?
– Will schema have to be modified?
– Will science functionality be compromised?
– What about query performance?
Cloud Experiments
• Dataset
– Sloan Digital Sky Survey (SDSS-II)
– Public dataset, no restrictions
– Reasonably complex schema, usage model
– Easy to generate subset of arbitrary size
– Only available as SQL Server database
• Clouds
– Amazon Elastic Compute Cloud (EC2)
• AWS wanted to host public SDSS dataset
– Microsoft Azure / SQL Azure
• Natural fit for SQL Server databases
Migrating data to Amazon EC2
• 1 TB data size limit (per instance)
• SDSS DR6 100 GB subset: BestDR6_100GB
– Actually more like 150 GB
– Large enough for performance tests, small
enough to be migrated in a few days/weeks
• Several manual steps to create DB
instance
• Could not connect to DB from outside
• A preliminary performance test, but without optimizing within the cloud
SDSS public dataset on EC2
• Public datasets on Amazon
– Stored as snapshot available on AWS
– Advertised on AWS blog
– Anyone can pull into their account
• Data is free, but not the usage
– Create a running instance
– Multiple instances deployed manually
• First SQL Server dataset on EC2 (?)
– AWS also created a LINUX snapshot of SQL
Server DB!
AWS Blog Entry
http://aws.typepad.com/
How it’s supposed to work
• With other DB dumps, the assumption is that users
will set up their own DB and import the data
• Public SDSS dataset to serve two kinds of
users:
a) People who currently access the SDSS CAS
b) General AWS users who are interested in the data
• For a), should be able to replicate the same
services that SDSS currently has, but using a
SQL Server instance on EC2
• For b), users should have everything they need
on the AWS public dataset page for SDSS
Steps to create DB in EC2
• Create “snapshot” of database first
• Create storage for DB: 200 GB EBS volume
– Instantiate snapshot as volume of required size
• Create SQL Server 2005 Amazon Machine
Image (AMI)
– AMI instance from snapshot
• Attach AMI instance to EBS volume
– Creates a running instance of the DB (see the attach sketch below)
• Get Elastic IP to point to instance
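
If the database is not already attached inside the image, bringing it online once the EBS volume is mounted is one more manual step; a minimal sketch with hypothetical file paths:

    -- Attach the SDSS subset from the mounted EBS volume (paths illustrative)
    CREATE DATABASE BestDR6_100GB
    ON (FILENAME = 'E:\sdss\BestDR6_100GB.mdf'),
       (FILENAME = 'E:\sdss\BestDR6_100GB_log.ldf')
    FOR ATTACH;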
Steps to create Web interface
• Create volume (small) with win2003
• Create instance with Win 2003 AMI (only
IIS, no SQL Server)
• Attach volume to instance
• Get public DNS, admin account
• BUT … couldn’t connect to SQL Server IP
• So the outside world cannot connect to the data as yet
Migrating data to Microsoft Azure
• 10 GB data size limit (50 GB now)
• SDSS DR6 10 GB subset: BestDR6_10GB
• Two ways to migrate database
– Script it out of source db (very painful)
• Many options to select
• Runs out of memory on server!
– Use SQL Azure Migration Wizard (much better!)
• Runs for a few hours
• Produces huge trace of errors, many items skipped
• But does produce a working db in Azure
SQL Azure Migration Wizard
Unsupported MS-SQL features
• Can’t run command shells from inside DB
• Global temp objects disallowed
– Can’t use performance counters in test queries,
so hard to benchmark/compare performance
• SQL-CLR function bindings not supported
– Can’t use our HTM spatial indexing library
• T-SQL directives
– e.g., to set the level of parallelism
• Built-in T-SQL functions
– Need to find workarounds for these (examples of affected statements below)
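
Illustrative examples of statements that hit these restrictions (fGetNearbyObjEq is the SkyServer HTM search function; the rest are generic):

    -- Command shell from inside the DB: not allowed
    EXEC xp_cmdshell 'dir';

    -- Global temporary objects: disallowed
    SELECT TOP 10 objID INTO ##scratch FROM PhotoObj;

    -- SQL-CLR binding (HTM spatial search): unsupported
    SELECT * FROM dbo.fGetNearbyObjEq(101.287155, -16.716116, 1.0);

    -- T-SQL directive setting the degree of parallelism
    SELECT COUNT(*) FROM PhotoObj OPTION (MAXDOP 1);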
Azure sign of success?
• Ok, so data in Azure, but at what cost?!
– Meaningful performance comparison not
possible
– Dataset too small
– Schema features stripped
– No spatial index, so many queries crippled
• Connected to Azure DB from outside cloud
– Hooked up the SkyServer web interface
– Connected with SQL Server client (SSMS)
– Ran simple queries from both
Cloudy Conclusions
• Migrating scientific databases to the
cloud not really feasible at the moment
• Even migrating smaller DBs can be painful
– Several steps to deploy each copy
– Cloud may not support full functionality
– Problems limited to SQL Server DBs?
• Large DBs will have to be partitioned
– Set up distributed databases in the cloud
– Query distributed databases from outside
• Haven’t even talked about economics yet
Thank you!