Managing Data for the World Wide Telescope aka: The Virtual Observatory Jim Gray Alex Szalay SLAC Data Management Workshop 1 The Evolution of Science • Observational Science – Scientist gathers data by direct observation – Scientist analyzes data • Analytical Science – Scientist builds analytical model – Makes predictions. • Computational Science – Simulate analytical model – Validate model and makes predictions • Data Exploration Science Data captured by instruments Or data generated by simulator – Processed by software – Placed in a database / files – Scientist analyzes database / files 2 Information Avalanche • In science, industry, government,…. – better observational instruments and – and, better simulations producing a data avalanche Image courtesy C. Meneveau & A. Szalay @ JHU • Examples – BaBar: Grows 1TB/day 2/3 simulation Information 1/3 observational Information – CERN: LHC will generate 1GB/s .~10 PB/y – VLBA (NRAO) generates 1GB/s today – Pixar: 100 TB/Movie BaBar, Stanford P&E Gene Sequencer From http://www.genome.uci.edu/ • New emphasis on informatics: – Capturing, Organizing, Summarizing, Analyzing, Visualizing 3 Space Telescope The Big Picture Experiments & Instruments Other Archives Literature questions facts facts ? answers Simulations The Big Problems • • • • • • Data ingest Managing a petabyte Common schema How to organize it? How to reorganize it How to coexist with others • Query and Vis tools • Support/training • Performance – Execute queries in a minute – Batch query scheduling 4 FTP - GREP • Download (FTP and GREP) are not adequate – – – – You can GREP 1 MB in a second You can GREP 1 GB in a minute You can GREP 1 TB in 2 days You can GREP 1 PB in 3 years. • Oh!, and 1PB ~3,000 disks • At some point we need indices to limit search parallel data search and analysis • This is where databases can help • Next generation technique: Data Exploration – Bring the analysis to the data! 5 The Speed Problem • Many users want to search the whole DB ad hoc queries, often combinatorial • Want ~ 1 minute response • Brute force (parallel search): – 1 disk = 50MBps => ~1M disks/PB ~ 300M$/PB • Indices (limit search, do column store) – 1,000x less equipment: 1M$/PB • Pre-compute answer – No one knows how do it for all questions. 6 Next-Generation Data Analysis • Looking for – Needles in haystacks – the Higgs particle – Haystacks: Dark matter, Dark energy • Needles are easier than haystacks • Global statistics have poor scaling – Correlation functions are N2, likelihood techniques N3 • As data and computers grow at same rate, we can only keep up with N logN • A way out? – Relax notion of optimal (data is fuzzy, answers are approximate) – Don’t assume infinite computational resources or memory • Combination of statistics & computer science 7 Analysis and Databases • Much statistical analysis deals with – – – – – – – – – Creating uniform samples – data filtering Assembling relevant subsets Estimating completeness censoring bad data Counting and building histograms Generating Monte-Carlo subsets Likelihood calculations Hypothesis testing • Traditionally these are performed on files • Most of these tasks are much better done inside a database • Move Mohamed to the mountain, not the mountain to 8 Mohamed. Organization & Algorithms • Use of clever data structures (trees, cubes): – – – – Up-front creation cost, but only N logN access cost Large speedup during the analysis Tree-codes for correlations (A. Moore et al 2001) Data Cubes for OLAP (all vendors) • Fast, approximate heuristic algorithms – No need to be more accurate than cosmic variance – Fast CMB analysis by Szapudi et al (2001) • N logN instead of N3 => 1 day instead of 10 million years • Take cost of computation into account – Controlled level of accuracy – Best result in a given time, given our computing resources 9 World Wide Telescope Virtual Observatory http://www.ivoa.net/ • Premise: Most data is (or could be online) • The Internet is the world’s best telescope: – – – – It has data on every part of the sky In every measured spectral band: optical, x-ray, radio.. As deep as the best instruments (2 years ago). It is up when you are up. The “seeing” is always great (no working at night, no clouds no moons no..). – It’s a smart telescope: links objects and data to literature on them. 10 Why Astronomy? • Community has lots of data • Data is real and well documented – High-dimensional (with confidence intervals) – Spatial, temporal • Diverse and distributed – Many different instruments from many different places and many different times • Community wants to share/cross compare – Can freely share data and algorithms. – “DataMining, Not Data MINE!!” Mark Ellisman, UCSD • They are well organized • Community is small and homogeneous • No commercial or privacy concerns – All the problems are technical or social. 11 The WWT Components • Data Sources – Literature – Archives • Unified Definitions – Units, – Semantics/Concepts/Metrics, Representations, – Provenance • Object model • Classes and methods • Portals 12 Data Sources • Literature online and cross indexed – Simbad, ADS, NED, http://simbad.u-strasbg.fr/Simbad, http://adswww.harvard.edu/, http://nedwww.ipac.caltech.edu/ • Many curated archives online – FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizeR,… – Typically files with English meta-data and some programs • Groups, Researchers, Amateurs Publish – Datasets online in various formats – Data publications are ephemeral (may disappear) – Many have unknown provenance • Documentation varies; some good and some none. 13 Unified Definitions • Universal Content Definitions http://vizier.u-strasbg.fr/doc/UCD.htx – Collated all table heads from all the literature – 100,000 terms reduced to ~1,500 – Rough consensus that this is the right thing. – Refinement in progress as people use UCDs • Defines – Units: • gram, radian, second, janski... – Semantic Concepts / Metrics • Std error, Chi2 fit, magnitude, flux @ passband, velocity, 14 Provenance • Most data will be derived. • To do science, need to trace derived data back to source. • So programs and inputs must be registered. • Must be able to re-run them. • Example: Space Telescope Calibrated Data – Run on demand – Can specify software version (to get old answers) • Scientific Data Provenance and Curation are largely unsolved problems (some ideas but no science). 15 Object Model Your • General acceptance of XML program • Recent acceptance of XML Schema (XSD over DTD) Web Server • Wait-and-See about SOAP/WSDL/… – “ Web Services are just Corba with angle brackets.” – FTP is good enough for me. • Personal opinion: – Web Services are much more than “Corba + <>” – Huge focus on interop – Huge focus on integrated tools Your program Data • But the community says “Show me!” In your address – Many technologists convinced, space but not yet the astronomers Web Service 16 Classes and Methods Your program • First Class: VO table http://www.us-vo.org/VOTable/ – Represents an answer set in XML Web Service Data In your address space • Defined by an XML Schema (XSD) • Metadata (in terms of UCDs) • Data representation (numbers and text) – First method • Cone Search: Get objects in this cone http://voservices.org/cone/ 17 Other Classes Your program • Space-Time class – http://hea-www.harvard.edu/~arots/nvometa/STCdoc.pdf • Image Class (returns pixels) – SdssCutout – Simple Image Access Protocol Web Service Data In your address space http://bill.cacr.caltech.edu/cfdocs/usvo-pubs/files/ACF8DE.pdf – HyperAtlas http://bill.cacr.caltech.edu/usvo-pubs/files/hyperatlas.pdf • Spectral – Simple Spectral Access Protocol – 500K spectra available at http://voservices.net/wave • Query Services – ADQL and SkyNode http://skyservice.pha.jhu.edu/develop/vo/adql/ – And http://SkyQuery.Net • Registry: – see below 18 The Registry • UDDI seemed inappropriate – Complex – Irrelevant questions – Relevant questions missing • Evolved Dublin Core – Represent Datasets, Services, Portals – Needs to be machine readable – Federation (DNS model) – Push & Pull: register then harvest • http://www.ivoa.net/twiki/bin/view/IVOA/IvoaResReg 19 Demo • SkyServer: – navigator showing cutout web service – List: showing many calls and variant use. • SkyQuery: – Show integration of various archives. – Explain spatial join xMatch operator. 20 SkyServer.SDSS.org • A modern Astronomy archive – Raw Pixel data lives in file servers – Catalog data (derived objects) lives in Database – Online query to any and all • Also used for education – 150 hours of online Astronomy – Implicitly teaches data analysis • Interesting things – – – – – – Spatial data search Client query interface via Java Applet Query interface via Emacs Popular Cloned by other surveys (a template design) Web services are core of it. 21 SkyQuery A Prototype WWT • Started with SDSS data and schema • Imported12 other datasets into that spine schema. (a day per dataset plus load time) • Unified them with a portal • Implicit spatial join among the datasets. • All built on Web Services – Pure XML – Pure SOAP – Used .NET toolkit 22 Federation: SkyQuery.Net • Combine 4 archives initially • Added 9 more • Send query to portal, portal joins data from archives. • Problem: want to do multi-step data analysis (not just single query). • Solution: Allow personal databases on portal • Problem: some queries are monsters • Solution: “batch schedule” on portal server, Deposits answer in personal database. 23 SkyQuery Structure • Portal is • Each SkyNode publishes – Plans Query (2 phase) – Schema Web Service – Integrates answers – Database Web Service – Is a web service Image Cutout SDSS INT SkyQuery Portal FIRST 2MASS 24 MyDB http://skyservice.pha.jhu.edu/devel/casjobs/ • Portal allows federation of data but… • Intermediate results may be large. • Intermediate results feed into next analysis step. • Sending them back-and-forth to client is costly and sometimes infeasible. • Solution: create a working DB for client at Portal: MyDB 25 MyDB http://skyservice.pha.jhu.edu/devel/casjobs/ • Anyone can create a personal DB at SkyServer portal. – It is about 100 MB – It is private • • • • • Simple queries done immediately Complex queries done by batch scheduler All queries can create/read/write MyDB tables Very popular with “serious” users. MyDB will be sharable with by a group. 26 Open SkyQuery • SkyQuery being adopted by AstroGrid as reference implementation for OGSA-DAI (Open Grid Services Architecture, Data Access and Integration). • SkyNode basic archive object http://www.ivoa.net/twiki/bin/view/IVOA/SkyNode • SkyQuery Language (VoQL) is evolving. http://www.ivoa.net/twiki/bin/view/IVOA/IvoaVOQL 27 The WWT Components Outline What we learned • Data Sources • Astro is a community of 10,000 • Homogenous & Cooperative • If you can’t do it for Astro, do not bother with 3M bio-info. • Agreement – Literature – Archives • Unified Definitions – Units, – Semantics/Concepts/Metrics, Representations, – Provenance • • • • – Takes time – Takes endless meetings • Big problems are non-technical Object model – Legacy is a big problem. Classes and methods • Plumbing and tools are there Portals But… WWT is a poster child for – What is the object model? the Data Grid. – What do you want to save? – How document provenance? 28