The Key Players





Maria Nieto-Santisteban (JHU)
Ani Thakar (JHU)
Alex Szalay (JHU)
Jim Gray (Microsoft)
Catherine van Ingen (Microsoft)
What is Pan-STARRS?



Pan-STARRS – a new telescope facility
  Four smallish (1.8 m) telescopes, but with an extremely wide field of view
  Can scan the sky rapidly and repeatedly, and can detect very faint objects
Unique time-resolution capability
The project was started by the IfA with help from the Air Force, the Maui High Performance Computing Center, MIT's Lincoln Lab, and Science Applications International Corp. SAIC has dropped out & the JHU database team has joined.
The PS-4 Telescope Array Concept
The PS1 Prototype – Walk before you run!



Pan-STARRS pushes four areas of technology: wide-field imaging telescope, large-format CCD mosaic camera, high-throughput image processing pipeline, & data-intensive database server.
We were advised to build a functional prototype, PS1, to test and integrate these new approaches.
The prototype is now nearing operational readiness on Haleakala, Maui.
The PS1 Science Consortium









University of Hawaii, Institute for Astronomy
Max Planck Society, Institutes in Garching & Heidelberg
Harvard-Smithsonian Center for Astrophysics
Las Cumbres Observatory Global Telescope Network
Johns Hopkins University, Department of Physics and Astronomy
University of Edinburgh, Institute for Astronomy
Durham University, Extragalactic Astronomy & Cosmology Research Group
Queen's University Belfast, Astrophysics Research Centre
National Central University, Taiwan
PS1 Key Science Projects












Population of objects in the inner solar system
Population of objects in the outer solar system (beyond Jupiter)
Low-mass stars, brown dwarfs, & young stellar objects
Search for exo-planets by stellar transits
Structure of the Milky Way and the Local Group
Dedicated deep survey of M31
Massive stars and SN progenitors
Cosmology investigations with variables and explosive transients
Galaxy properties
Active galactic nuclei and high-redshift quasars
Cosmological lensing
Large-scale structure
PS1 Observatory on Haleakala
Telescope and camera operational under interactive or queue control
1.4 Gigapixel Camera Assembly with the L3 corrector lens as the dewar window
[Images: gibbous Moon (1 millisecond exposure); M31; M51 (poster at the January 2008 AAS meeting)]
Astronomy Is Happening Now!

The project has not yet reached its Operational Readiness Review (November 2008), but data taken with PS1 and processed through the system have already been used to:
  Discover brown dwarf candidates
  Discover new asteroids
  Monitor one of the medium deep target fields for supernovae

What is the PSPS?
The Published Science Products Subsystem of Pan-STARRS will:
  Provide access to the data products generated by the Pan-STARRS telescopes and data reduction pipelines
  Provide a data archive for the Pan-STARRS data products
  Provide adequate security to protect the integrity of the Pan-STARRS data products & protect the operational systems from malicious attacks
PSPS Design Driving Requirements







Hold over 1.5×10^11 detections and their supporting metadata for ~5.5×10^9 objects
Support ~100 TBytes of disk storage on hardware that is >99% reliable (a rough sizing check follows below)
Serve as an archive for the Pan-STARRS data products
Provide security for the data stored within the system, against both accidental and intentional actions
Provide users access to the data stored in the system, and the ability to search it
Hold sufficient metadata to allow users to determine the observational legacy and processing history of the Pan-STARRS data products
The PSPS baseline configuration should accommodate future additions of databases (i.e., be expandable)
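As a rough plausibility check on these numbers (the per-row size is an assumption for illustration, not a figure from the project): at a few hundred bytes per detection row,

    1.5×10^11 detections × ~500 bytes/detection ≈ 75 TB

which, once indexes and the ~5.5×10^9 object rows are added, is broadly consistent with the ~100 TByte disk budget.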
What is PSPS? From the PS1 System View



PS1 PSPS will not receive image files, which are retained by the IPP
PSPS is responsible for managing the catalogs of digital data
Three significant PS1 I/O threads:
  Ingest of detections and initial celestial object data from the IPP
  Ingest of moving object data from MOPS
  User queries of detection/object data records
[System diagram: photons from the telescope reach the Gigapixel Camera; images flow to the Image Processing Pipeline (IPP) and on to the Moving Object Processing System (MOPS); detection records flow into the Published Science Products Subsystem (PSPS), where the Object Data Manager (ODM) and Solar System Data Manager (SSDM) are reached by end users through the Web-Based Interface (WBI) and Data Retrieval Layer (DRL)]
PSPS Components Overview/Terminology
[Component diagram: Published Data Clients (the WBI and external, non-PSPS clients) reach the Data Retrieval Layer (DRL) through standard-user and administrator APIs; the DRL connects through a Data Manager API to the data managers (ODM, SSDM, and future components); the ODM receives metadata and detections across the PSPS-IPP interface and the SSDM receives IDs across the PSPS-MOPS interface, with the IPP acting as the preferred science client (data provider)]

DRL: Data Retrieval Layer
  Software clients, not humans, are PDCs
  Connects to the DMs
PDC: Published Data Client
  WBI: Web-Based Interface
  External PDCs (non-PSPS)
DM: Data Manager (generic)
  ODM: Object Data Manager
  SSDM: Solar System Data Manager
Prototype ODM Structure
[Architecture diagram: the Data Transformation Layer (DX) feeds IPP output into the Data Loading Pipeline (DLP), which uses a LoadAdmin server, LoadSupport servers, linked servers, and a PartitionMap to populate the Data Storage (DS) databases; the main PS1 database holds the full Objects, LnkToObj, and Meta tables plus partitioned Detections, while slices P1...Pm hold partitioned Objects, LnkToObj, and Detections tables (with orphans tables and zone indexes) glued together by partitioned views; the Query Manager (QM) and Web-Based Interface (WBI) sit on top]
ODM Components
Cluster Manager (CLM)
Workflow Manager (WFM)
Query Manager (QM)
Performance Monitor
PS1 ODM Database
PS1 Schema Relationships
Detailed Design





Reuse SDSS software as much as possible
Data Transformation Layer (DX) – interface to the IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
  Schema and Test Queries (a minimal schema sketch follows below)
  Database Management System
  Scalable Data Architecture
  Hardware
Query Manager (QM: CasJobs for the prototype)
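To make the Objects / Detections / LnkToObj pattern named in the design concrete, here is a minimal, illustrative T-SQL sketch; the column names and types are assumptions for the sketch, not the actual PS1 schema.

  -- Minimal illustrative sketch of the core table pattern (not the real PS1 schema)
  CREATE TABLE Objects (
      objID   bigint NOT NULL PRIMARY KEY,   -- spatially encoded object identifier
      ra      float  NOT NULL,               -- mean right ascension (degrees)
      dec     float  NOT NULL,               -- mean declination (degrees)
      zoneID  int    NOT NULL                -- declination zone, for spatial searches
  );

  CREATE TABLE Detections (
      detectID bigint  NOT NULL PRIMARY KEY, -- one row per individual detection
      ra       float   NOT NULL,
      dec      float   NOT NULL,
      filterID tinyint NOT NULL,             -- passband of the detection
      obsTime  float   NOT NULL              -- epoch of observation (e.g., MJD)
  );

  -- Many-to-one link: each detection is associated with one celestial object
  CREATE TABLE LnkToObj (
      objID    bigint NOT NULL REFERENCES Objects(objID),
      detectID bigint NOT NULL REFERENCES Detections(detectID),
      PRIMARY KEY (objID, detectID)
  );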
Data Storage – DBMS


Microsoft SQL Server 2005
  Relational DBMS with an excellent query optimizer
Plus:
  Spherical/HTM (C# library + SQL glue)
    Spatial index (Hierarchical Triangular Mesh)
  Zones (SQL library)
    Alternate spatial decomposition with dec zones (see the sketch below)
  Many stored procedures and functions
    From coordinate conversions to neighbor-search functions
  Self-extracting documentation (metadata) and diagnostics
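As a minimal sketch of how the zones decomposition is used in a cone search (the zone height, and the table and column names from the illustrative schema above, are assumptions rather than the actual SQL library):

  -- Zones bucket the sky into fixed-height declination strips, so a cone search
  -- only scans a few zones plus an RA range before any exact distance test.
  DECLARE @zoneHeight float, @ra float, @dec float, @r float;
  SET @zoneHeight = 0.008333;   -- assumed zone height: 30 arcsec, in degrees
  SET @ra  = 10.68;             -- search centre RA (degrees)
  SET @dec = 41.27;             -- search centre Dec (degrees)
  SET @r   = 0.05;              -- search radius (degrees)

  SELECT o.objID, o.ra, o.dec
  FROM   Objects AS o
  WHERE  o.zoneID BETWEEN FLOOR((@dec - @r + 90.0) / @zoneHeight)
                      AND FLOOR((@dec + @r + 90.0) / @zoneHeight)
    AND  o.ra  BETWEEN @ra - @r / COS(RADIANS(@dec)) AND @ra + @r / COS(RADIANS(@dec))
    AND  o.dec BETWEEN @dec - @r AND @dec + @r;
  -- (an exact angular-distance predicate, and RA wrap-around handling, would follow)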
Data Storage – Scalable Architecture

A monolithic database design (a la SDSS) will not do it
SQL Server does not have a cluster implementation
  Do it by hand
Partitions vs. slices
  Partitions are file-groups on the same server
    Parallelize disk accesses on the same machine
  Slices are data partitions on separate servers
  We use both!
Additional slices can be added for scale-out
For PS1, use SQL Server Distributed Partitioned Views (DPVs); a minimal sketch follows below
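A minimal sketch of the DPV idea over two slices; the server, database, and table names are invented for illustration, not the project's actual layout:

  -- Each slice table carries a CHECK constraint on its ObjectID range, so the
  -- optimizer can skip remote slices that cannot hold the rows a query needs.
  -- On slice server PS1_S01 (and similarly, with its own range, on PS1_S02):
  --   CREATE TABLE Detections_s01 (
  --       detectID bigint NOT NULL PRIMARY KEY,
  --       objID    bigint NOT NULL CHECK (objID BETWEEN 0 AND 99999999999),
  --       ...
  --   );

  -- On the head node the slices are registered as linked servers, and a
  -- distributed partitioned view glues them back into one logical table:
  CREATE VIEW Detections AS
      SELECT * FROM PS1_S01.PS1.dbo.Detections_s01
      UNION ALL
      SELECT * FROM PS1_S02.PS1.dbo.Detections_s02;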
Distributed Architecture





The bigger tables will be spatially partitioned across servers called slices
Using slices improves system scalability
Tables are sliced into ranges of ObjectID, which correspond to broad declination ranges
ObjectID boundaries are selected so that each slice has a similar number of objects (one way to compute such boundaries is sketched below)
Distributed Partitioned Views “glue” the data together
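One way (not necessarily the project's actual procedure) to pick equal-population ObjectID boundaries is to rank objects by ObjectID and take each N-tile's extremes; a hedged sketch using the illustrative Objects table from earlier, with 16 slices assumed:

  -- Because ObjectIDs encode position, contiguous objID ranges map to declination
  -- bands; NTILE splits the ordered IDs into 16 equally populated groups.
  WITH ranked AS (
      SELECT objID,
             NTILE(16) OVER (ORDER BY objID) AS sliceNo
      FROM   Objects
  )
  SELECT sliceNo,
         MIN(objID) AS loObjID,    -- lower ObjectID boundary of the slice
         MAX(objID) AS hiObjID,    -- upper ObjectID boundary of the slice
         COUNT(*)   AS nObjects    -- should come out nearly equal across slices
  FROM   ranked
  GROUP BY sliceNo
  ORDER BY sliceNo;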
Adding New Types of Data in the ODM

Because of the interaction between our logical and physical schema, we do not consider it prudent to arbitrarily add new types of data to the ODM.
One area where expansion does fit naturally into our design is the addition of new filters. These can accommodate new detections (perhaps not even coming from Pan-STARRS) that cover all or part (e.g., the Medium Deep Survey fields) of the sky. This would allow including in the data tables observations from other sources (e.g., the GALEX Extended Mission, the Spitzer Warm Mission, UKIRT, CFHT) that range from the far ultraviolet to the far infrared, provided the data are formatted consistently with the ODM logical schema. (A hedged sketch of such a load follows below.)
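A hedged sketch of what folding in an external survey through a new filter might look like; the Filters table, the staging table, and the filterID value are invented for illustration and are not the actual ODM schema:

  -- Register a new passband, then load externally supplied detections that have
  -- already been reformatted to match the ODM logical schema (illustrative only).
  INSERT INTO Filters (filterID, filterName, survey)
  VALUES (12, 'FUV', 'GALEX Extended Mission');

  INSERT INTO Detections (detectID, ra, dec, filterID, obsTime)
  SELECT detectID, ra, dec, 12, obsTime
  FROM   Staging_GALEX_Detections;  -- staging table populated outside the ODM proper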
Client Databases

Client databases can be either:
  Standalone databases attached to the DRL (as shown in the earlier slide), or
  MyDB instances attached to the ODM internal network
These are SQL Server databases with:
  Ownership by individuals, groups, or key projects/science clients
  Unidirectional (ODM to MyDB) write privilege
  Bidirectional read privilege
  Table access which can be defined at the user, group, or world level, allowing selected export of results
  The ability to load data into the MyDB from outside the ODM
An example MyDB-style query is sketched below.
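In the CasJobs-style usage adopted for the prototype Query Manager, query results land in the user's MyDB; a hedged example, with invented table and column names:

  -- Select objects in a narrow declination strip and keep the result in MyDB
  -- (unidirectional write: the ODM writes into MyDB, never the other way around).
  SELECT o.objID, o.ra, o.dec
  INTO   MyDB.DecStrip_10_12       -- new table created in the user's MyDB
  FROM   Objects AS o
  WHERE  o.dec BETWEEN 10.0 AND 12.0;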
Some Lessons Learned



“GrayWulf: Scalable Cluster Architecture for Data Intensive Computing” submitted to the HICSS-09 conference.
Big databases are not created equal -- user query patterns will dictate the data storage model/architecture.
“When” matters -- PS1 has to do things with today's technology & can't count on Moore's law. This also affects how much data you'll have to deal with.
Some Lessons Learned

Resources are accessed by:
  End users who perform analyses on shared databases
  Data valets who maintain shared databases
  Operators who maintain compute & storage
The Approach:
  “20 queries” capture science interests
But which set of 20 queries? Not all users will want to access the tables in the same way. However, there are clear patterns of queries that are common to all users, and we have designed the system to implement them. (One such recurring pattern is sketched below.)
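As a concrete instance of one of those common patterns, a light-curve extraction (all detections of a single object) might look like the following hedged sketch, again using the illustrative tables from earlier; the objID value is hypothetical:

  -- Pull the full detection history of one object through the link table.
  SELECT d.obsTime, d.filterID, d.ra, d.dec
  FROM   LnkToObj   AS l
  JOIN   Detections AS d ON d.detectID = l.detectID
  WHERE  l.objID = 123456789012
  ORDER BY d.obsTime;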
Some Lessons Learned


Resources are accessed by:
  End users who perform analyses on shared databases
  Data valets who maintain shared databases
  Operators who maintain compute & storage
The Approach:
  “20 queries” capture science interests
  Divide & Conquer determines partitioning
This is an area where our team has spent a great deal of effort. There are many possibilities available and it's unclear which is the best. We've decided on a model with objects held in the main database and detections and copies of some smaller tables in the slices. OK, then how do you choose to partition? What RAID model?
Some Lessons Learned


Resources are accessed by:
  End users who perform analyses on shared databases
  Data valets who maintain shared databases
  Operators who maintain compute & storage
The Approach:
  “20 queries” capture science interests
  Divide & Conquer determines partitioning
  Faults Happen – handling must be designed into all data valet processes
This is a second area that has involved a great deal of design effort. In SDSS much of the workflow monitoring and error handling occurred in the loading phase – but the PS1 ODM will be loading all the time. We expect the most potential problems in the load/merge process! We're taking a Sunny, Sticky, and Cloudy day approach to the testing and error handling implementation. Ultimately real data will define the Rainy day case – hopefully it won't be a Cat 5 hurricane!
And Finally