Databases Meet Astronomy
A database view of astronomy data

Jim Gray and Don Slutz, Microsoft Research
Collaborating with:
  Alex Szalay, Peter Kunszt, Ani Thakar @ JHU
  Robert Brunner, Roy Williams @ Caltech
  George Djorgovski, Julian Bunn @ Caltech

Outline
• Astronomy data
• The Virtual Observatory Concept
• The Sloan Digital Sky Survey

Astronomy Data
• In the "old days" astronomers took photos.
• Starting in the 1960s they began to digitize.
• New instruments are digital (100s of GB/night).
• Detectors are following Moore's law.
• Data avalanche: doubling every 2 years.
[Chart, courtesy of Alex Szalay: total area of 3m+ telescopes in the world in m², and total number of CCD pixels in megapixels, as a function of time, 1970-2000. Growth over 25 years is a factor of 30 in glass, 3000 in pixels.]

Astronomy Data
• Astronomers have a few petabytes now.
  – 1 pixel (byte) per square arcsecond ~ 4 TB
  – Multi-spectral, temporal, … → 1 PB
• They mine it looking for:
  – new (kinds of) objects, or more of the interesting ones (quasars)
  – density variations in 400-D space
  – correlations in 400-D space
• Data doubles every 2 years.
• Data is public after 2 years.
• So, 50% of the data is public.
• Some have private access to 5% more data.
• So it is 50% vs. 55%: roughly equal access for everyone.

Astronomy Data
• But… how do I get at that 50% of the data?
• Astronomers have a culture of publishing.
  – FITS files and many tools: http://fits.gsfc.nasa.gov/fits_home.html
  – Encouraged by NASA.
• But data "details" are hard to document. Astronomers want to do it, but it is VERY hard.
  (What programs were used? What were the processing steps? How were errors treated? …)
• The optimistic hope: XML is the answer.
• The reality: XML is syntax and tools. FITS on XML will be good, but explaining the data will still be very difficult.

Astronomy Data
• And by the way, few astronomers have a spare petabyte of storage in their pocket.
• But that is getting better:
  – Public SDSS is 5% of the total.
  – Public SDSS is ~50 GB.
  – It fits on a $200 disk drive today.
  – (More on that later.)
• THESIS: The challenging problems are publishing the data and providing good query & visualization tools.

Astronomy Data
• Lots of it (petabytes).
• Hundreds of dimensions per object.
• Cross-correlation is challenging because:
  – multi-resolution
  – time-varying
  – data is dirty (cosmic rays, airplanes, …)

The Multiwavelength Crab Nebula
X-ray, optical, infrared, and radio views of the nearby Crab Nebula, now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert Brunner @ Caltech.

Even in the "optical", images are very different
Galaxy Image Mosaics: Optical and Near-Infrared
[Image mosaics of the same galaxies in the BJ, RF, IN, J, H, and K bands.]
Slide courtesy of Robert Brunner @ Caltech.

Exploring Parameter Space
Given an arbitrary parameter space:
• Data clusters
• Points between data clusters
• Isolated data clusters
• Isolated data groups
• Holes in data clusters
• Isolated points
(Nichol et al. 2001)
Slide courtesy of Robert Brunner @ Caltech.

Survey Cross-Identification
• Billions of sources
  – High source densities
  – Multi-wavelength: radio to γ-ray
  – All sky: thousands of square degrees
  – A computational challenge
• Probabilistic associations
  – Optimized likelihood ratios
  – A priori astrophysical knowledge important
  – Secondary parameters
• Temporal variability
  – Dynamic & static associations
  – User-defined cross-identification algorithms
• Examples: Optical-Infrared-Radio (quasar-environment survey); radio survey cross-identification (steep-spectrum sources); Optical-Infrared-X-Ray (serendipitous Chandra identification).
Slide courtesy of Robert Brunner @ Caltech.

Data Federation: A Computational Challenge
2MASS vs. DPOSS cross-identification:
• 2MASS J < 15
• DPOSS IN < 18
[Plots: DPOSS matched/unmatched and 2MASS matched/unmatched sources.]
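A database-flavored aside: the sketch below shows the simplest relational form of such a cross-match. It is an illustration under assumed names, not the real 2MASS/DPOSS schema or pipeline: the TwoMass and Dposs tables and their columns are hypothetical, the brute-force join ignores RA wrap-around, and a real cross-identification would use a spatial index (such as the HTM structure described later) plus probabilistic association rather than a naive distance test.

-- Hypothetical sketch: match bright 2MASS sources (J < 15) to DPOSS sources (IN < 18)
-- that lie within 3 arc-seconds, using a small-angle separation test.
select t.objID as twoMassID, d.objID as dpossID
from   TwoMass as t
join   Dposs   as d
  on   abs(t.[dec] - d.[dec]) < 3.0/3600                   -- cheap declination prefilter
where  t.J   < 15                                          -- 2MASS magnitude cut
  and  d.I_N < 18                                          -- DPOSS magnitude cut
  and  sqrt( power((t.ra - d.ra) * cos(radians(t.[dec])), 2)
           + power( t.[dec] - d.[dec], 2) ) * 3600 < 3     -- separation < 3 arc-seconds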
Outline
• Astronomy data
• The Virtual Observatory Concept
• The Sloan Digital Sky Survey

Virtual Observatory
http://www.astro.caltech.edu/nvoconf/ and http://www.voforum.org/
• Premise: Most data is (or could be) online.
• So, the Internet is the world's best telescope:
  – It has data on every part of the sky.
  – In every measured spectral band: optical, x-ray, radio, …
  – As deep as the best instruments (of 2 years ago).
  – It is up when you are up. The "seeing" is always great (no working at night, no clouds, no moons, no …).
  – It's a smart telescope: it links objects and data to the literature on them.

Virtual Observatory: The Age of Mega-Surveys
• Large number of new surveys:
  – multi-TB in size, 100 million objects or more
  – individual archives planned or under way
  – data publication an integral part of the survey
  – software bill a major cost of the survey
• Multi-wavelength view of the sky
  – coverage in more than 13 wavelengths within 5 years
• Impressive early discoveries
  – finding exotic objects by unusual colors: L and T dwarfs, high-z quasars
  – finding objects by time variability: gravitational micro-lensing
Surveys: MACHO, 2MASS, DENIS, SDSS, PRIME, DPOSS, GSC-II, COBE, MAP, NVSS, FIRST, GALEX, ROSAT, OGLE, …
Slide courtesy of Alex Szalay, modified by Jim.

Virtual Observatory: Federating the Archives
• The next-generation mega-surveys are different:
  – top-down design
  – large sky coverage
  – sound statistical plans
  – well controlled/documented data processing
• Each survey has a publication plan.
• Data mining will lead to stunning new discoveries.
• Federating these archives → the Virtual Observatory.
Slide courtesy of Alex Szalay.

Virtual Observatory and Education
• In the beginning, science was empirical.
• Then theoretical branches evolved.
• Now we have a computational branch.
  – The computational branch has been simulation.
  – It is becoming data analysis/visualization.
• The Virtual Observatory can be used to:
  – teach astronomy: make it interactive, demonstrate ideas and phenomena
  – teach computational science skills

Virtual Observatory Challenges
• Size: multi-petabyte. 40,000 square degrees is 2 trillion pixels.
  – One band (at 1 sq arcsec): 4 terabytes
  – Multi-wavelength: 10-100 terabytes
  – Time dimension: >> 10 petabytes
  – Need auto-parallelism tools.
• Unsolved metadata problem
  – Hard to publish data & programs.
  – Hard to find/understand data & programs.
• Current tools inadequate
  – Need new analysis & visualization tools.
• Transition to the new astronomy
  – Sociological issues.

Demo of Virtual Sky
• Roy Williams @ Caltech: Palomar data with links to NED.
• Shows multiple themes; shows links to other sites (NED, VizieR, SIMBAD, …).
• http://virtualsky.org/servlet/Page?T=3&S=21&P=1&X=0&Y=0&W=4&F=1
  and NED @ http://nedwww.ipac.caltech.edu/index.html

Demo of SkyServer
Alex Szalay of Johns Hopkins built SkyServer (based on the TerraServer design).
http://skyserver.fnal.gov/
What Next?
(after the queries, after the web server)
• How do we federate the archives to make a VO?
• "Send XML" is a non-answer, equivalent to "send Unicode".
• Define a set of astronomy objects and methods.
  – Based on UDDI, WSDL, SOAP.
  – Each archive is a service.
• We have started this with TerraService.
  – http://TerraService.net/ shows the idea.
  – Working with Caltech (Brunner, Williams, Djorgovski, Bunn) and JHU (Szalay et al.) on this.

3 Steps to the Virtual Observatory
• Get SDSS and Palomar online.
  – Alex Szalay, Jan Vandenberg, Ani Thakar, …
  – Roy Williams, Robert Brunner, Julian Bunn
• Do queries and cross-ID matches between Caltech and SDSS to expose:
  – schema, units, …
  – dataset problems
  – the typical use scenarios.
• Implement a WebService that lets you get data from both Caltech and SDSS (WSDL/SOAP/…).

SkyServer as a WebService
WSDL + SOAP: just add details.

  Archive ss = new VOService(SkyServer);
  Attributes A[] = ss.GetObjects(ra, dec, radius);
  …

?? What are the objects (attributes, …)?
?? What are the methods (GetObjects(), …)?
?? Is the query language SQL or XQuery or what?

Outline
• Astronomy data
• The Virtual Observatory Concept
• The Sloan Digital Sky Survey

Sloan Digital Sky Survey
http://www.sdss.org/
• For the last 12 years a group of astronomers has been building a telescope (with funding from the Sloan Foundation, NSF, and a dozen universities). $90M.
• Last year was for engineering, calibration, and commissioning. They are making the calibration data public:
  – 5% of the survey: 600 sq degrees, 15 M objects, 60 GB.
  – This data includes most of the known high-z quasars.
  – It has a lot of science left in it, but… that is just the start.
• Now the data is arriving:
  – 250 GB/night (20 nights per year).
  – 100 M stars, 100 M galaxies, 1 M spectra.
• http://www.sdss.org/ and http://www.sdss.jhu.edu/

SDSS: What I Have Been Doing
• Working with Alex Szalay, Don Slutz, and others to define 20 canonical queries and 10 visualization tasks.
• Don Slutz did a first cut of the queries; I'm continuing that work.
• Working with Alex Szalay on building SkyServer and making its data public (sending out 80 GB SQL databases).

Two Kinds of SDSS Data
• 15 M photo objects, with ~400 attributes each.
• 20 K spectra, with ~10 lines per spectrum.

Spatial Data Access (Szalay, Kunszt, Brunner)
http://www.sdss.jhu.edu/ (look at the HTM link)
• Implemented the Hierarchical Triangular Mesh (HTM) as a table-valued function for spatial joins.
• Every object has a 20-deep mesh ID.
• Given a spatial definition, the routine returns up to 500 covering triangles.
• The spatial query is then up to 500 range queries.
• Very fast: thousands of triangles per second.
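To make that concrete, here is a minimal sketch of an HTM-style lookup (using the point from Q1 below). The names fHtmCover, htmIDstart, htmIDend, and the htmID column on sxPhotoObj are illustrative assumptions, not the actual SDSS schema; the pattern is the point: the region is turned into at most ~500 covering-triangle ID ranges, and the spatial query becomes that many range probes on an indexed mesh ID.

-- Hypothetical sketch of a spatial lookup via the HTM table-valued function.
select obj.object_ID, obj.ra, obj.[dec]
from   dbo.fHtmCover('CIRCLE J2000 75.327 21.023 1.0') as cover  -- up to ~500 covering triangles
join   sxPhotoObj as obj
  on   obj.htmID between cover.htmIDstart and cover.htmIDend     -- one range query per triangle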
The 20 Queries
Q1: Find all galaxies without unsaturated pixels within 1' of a given point of ra=75.327, dec=21.023.
Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcsecond, and -10 < super galactic latitude (sgb) < 10, and declination less than zero.
Q3: Find all galaxies brighter than magnitude 22, where the local extinction is > 0.75.
Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity > 0.5, and with the major axis of the ellipse having a declination of between 30" and 60" arc seconds.
Q5: Find all galaxies with a de Vaucouleurs profile (r^1/4 falloff of intensity on disk) and photometric colors consistent with an elliptical galaxy.
Q6: Find galaxies that are blended with a star; output the deblended galaxy magnitudes.
Q7: Provide a list of star-like objects that are 1% rare.
Q8: Find all objects with unclassified spectra.
Q9: Find quasars with a line width > 2000 km/s and 2.5 < redshift < 2.7.
Q10: Find galaxies with spectra that have an equivalent width in Hα > 40 Å (Hα is the main hydrogen spectral line).
Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.
Q12: Create a gridded count of galaxies with u-g > 1 and r < 21.5 over 60 < declination < 70 and 200 < right ascension < 210, on a grid of 2', and create a map of masks over the same grid.
Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u-0.5g-0.2i < 1.25 && r < 21.75; output it in a form adequate for visualization.
Q14: Find stars with multiple measurements that have magnitude variations > 0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.
Q15: Provide a list of moving objects consistent with an asteroid.
Q16: Find all objects with colors similar to those of a quasar at 5.5 < redshift < 6.5.
Q17: Find binary stars where at least one of them has the colors of a white dwarf.
Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is, where the color ratios u-g, g-r, r-i are less than 0.05m.
Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.
Q20: For each galaxy in the BCG data set (brightest cluster galaxy) in 160 < right ascension < 170 and -25 < declination < 35, count the galaxies within 30" of it that have a photoz within 0.05 of that galaxy.
Also some good queries at: http://www.sdss.jhu.edu/ScienceArchive/sxqt/sxQT/Example_Queries.html
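Many of these (Q12, Q13, Q20) are counting and binning queries, which map directly onto GROUP BY. As one hedged illustration, here is a sketch of Q12's gridded count. It assumes a Galaxies view with ra and dec in degrees and u, g, r magnitudes (assumed names, not the actual schema), grids naively in coordinate space without the cos(dec) correction, and leaves out the mask map the full query asks for.

-- Sketch of Q12: count galaxies with u-g > 1 and r < 21.5 on a 2-arcminute (1/30 degree) grid.
select floor(ra    * 30) as raCell,        -- 2' cells in right ascension
       floor([dec] * 30) as decCell,       -- 2' cells in declination
       count(*)          as galaxyCount
from   Galaxies
where  ra    between 200 and 210
  and  [dec] between  60 and  70
  and  (u - g) > 1
  and  r < 21.5
group by floor(ra * 30), floor([dec] * 30)
order by raCell, decCell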
An Easy One
Q7: Provide a list of star-like objects that are 1% rare.
• Found 14,681 buckets; the first 140 buckets have 99%. Time: 104 seconds.
• Disk bound, reads 3 disks at 68 MBps.

select cast((u-g) as int) as ug,
       cast((g-r) as int) as gr,
       cast((r-i) as int) as ri,
       cast((i-z) as int) as iz,
       count(*) as Population
from stars
group by cast((u-g) as int), cast((g-r) as int),
         cast((r-i) as int), cast((i-z) as int)
order by count(*)

Another Easy One
Q15: Provide a list of moving objects consistent with an asteroid.
• Looks hard, but there are 5 pictures of the object at 5 different times (colors), so we can compute velocity.
• The image pipeline computes velocity.
• Computing it from the 5 color x,y positions would also be fast.
• Finds 2,167 objects in 7 minutes at 70 MBps.

select object_id,                                    -- return object ID
       sqrt(power(rowv,2)+power(colv,2)) as velocity
from sxPhotObj                                       -- check each object
where (power(rowv,2) + power(colv,2)) > 50           -- square of velocity
  and rowv >= 0 and colv >= 0                        -- negative values indicate error

A Hard One
Q14: Find stars with multiple measurements that have magnitude variations > 0.1.
• This should work, but SQL Server does not allow table values to be piped to table-valued functions.
  (getNearbyObjEq returns a table of nearby objects.)

select S.object_ID, S1.object_ID              -- return star pairs
from Stars S,                                 -- S is a star
     getNearbyObjEq(S.ra, S.dec, 0.017) as N, -- N within 1 arcsec (3 pixels) of S
     Stars S1                                 -- N == S1 (S1 gets the colors)
where S.Object_ID < N.Object_ID               -- S1 different from S == N
  and N.Type = dbo.PhotoType('Star')          -- S1 is a star (an optimization)
  and N.object_ID = S1.Object_ID              -- N == S1
  and ( abs(S.u-S1.u) > 0.1                   -- one of the colors is different
     or abs(S.g-S1.g) > 0.1
     or abs(S.r-S1.r) > 0.1
     or abs(S.i-S1.i) > 0.1
     or abs(S.z-S1.z) > 0.1 )
order by S.object_ID, S1.object_ID            -- group the answer by parent star

A Hard One: Second Try
Q14: Find stars with multiple measurements that have magnitude variations > 0.1.
• Write a program with a cursor; it ran for 2 days.

--------------------------------------------------------------------------------
-- Table-valued function that returns the binary stars within a certain radius
-- of another (in arc-minutes) (typically 5 arc seconds).
-- Returns the ID pairs and the distance between them (in arc-seconds).
create function BinaryStars(@MaxDistanceArcMins float)
returns @BinaryCandidatesTable table(
    S1_object_ID bigint not null,      -- Star #1
    S2_object_ID bigint not null,      -- Star #2
    distance_arcSec float)             -- distance between them
as
begin
declare @star_ID bigint, @binary_ID bigint;               -- star's ID and binary ID
declare @ra float, @dec float;                            -- star's position
declare @u float, @g float, @r float, @i float, @z float; -- star's colors
---------------- Open a cursor over stars and get position and colors
declare star_cursor cursor for
    select object_ID, ra, [dec], u, g, r, i, z from Stars;
open star_cursor;
while (1=1)                                   -- for each star
begin                                         -- get its attributes
    fetch next from star_cursor
        into @star_ID, @ra, @dec, @u, @g, @r, @i, @z;
    if (@@fetch_status = -1) break;           -- end if no more stars
    insert into @BinaryCandidatesTable        -- insert its binaries
    select @star_ID, S1.object_ID,            -- return star pairs
           sqrt(N.DotProd)/PI()*10800         -- and distance in arc-seconds
    from getNearbyObjEq(@ra, @dec,            -- find objects nearby S,
             @MaxDistanceArcMins) as N,       -- call them N
         Stars as S1                          -- S1 gets N's color values
    where @star_ID < N.Object_ID              -- S1 different from S
      and N.objType = dbo.PhotoType('Star')   -- S1 is a star
      and N.object_ID = S1.object_ID          -- join stars to get colors of S1 == N
      and ( abs(@u-S1.u) > 0.1                -- one of the colors is different
         or abs(@g-S1.g) > 0.1
         or abs(@r-S1.r) > 0.1
         or abs(@i-S1.i) > 0.1
         or abs(@z-S1.z) > 0.1 )
end;                                          -- end of loop over all stars
-------------- Looped over all stars; close cursor and exit.
close star_cursor;
deallocate star_cursor;
return;                                       -- return table
end                                           -- end of BinaryStars
GO
select * from dbo.BinaryStars(.05)

A Hard One: Third Try
Q14: Find stars with multiple measurements that have magnitude variations > 0.1.
• Use the pre-computed Neighbors table.
• Ran in 17 minutes; found 31 K pairs.

-- Plan 2: Use the precomputed Neighbors table.
select top 100
       S.object_ID, S1.object_ID,                 -- return star pairs and distance
       str(N.Distance_mins * 60, 6, 1) as DistArcSec
from Stars S,                                     -- S is a star
     Neighbors N,                                 -- N within 3 arcsec (10 pixels) of S
     Stars S1                                     -- S1 == N has the color attributes
where S.Object_ID = N.Object_ID                   -- connect S and N
  and S.Object_ID < N.Neighbor_Object_ID          -- S1 different from S
  and N.Neighbor_objType = dbo.PhotoType('Star')  -- S1 is a star (an optimization)
  and N.Distance_mins < .05                       -- the 3 arcsecond test
  and N.Neighbor_object_ID = S1.Object_ID         -- N == S1
  and ( abs(S.u-S1.u) > 0.1                       -- one of the colors is different
     or abs(S.g-S1.g) > 0.1
     or abs(S.r-S1.r) > 0.1
     or abs(S.i-S1.i) > 0.1
     or abs(S.z-S1.z) > 0.1 )
-- Found 31,355 pairs (out of 4.4 M stars) in 17 min 14 sec.
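The third try wins because the spatial self-join has been paid for once, at load time, rather than once per query. A minimal sketch of what that precomputed table might look like follows; the column names are taken from the query above, but the DDL is an assumption, not the actual SDSS schema or loading pipeline.

-- Assumed shape of the precomputed Neighbors table: for every object, one row per
-- object found within the load-time search radius (here ~3 arcsec). The composite
-- primary key makes the join from Stars an indexed lookup.
create table Neighbors (
    Object_ID          bigint not null,  -- the source object
    Neighbor_Object_ID bigint not null,  -- an object within the search radius
    Neighbor_objType   int    not null,  -- photo type of the neighbor (star, galaxy, ...)
    Distance_mins      float  not null,  -- angular separation in arc-minutes
    primary key (Object_ID, Neighbor_Object_ID)
)

Paying that spatial join once during loading turns Q14 from a two-day cursor loop into a 17-minute three-way join.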
The Pain of Going Outside SQL
(It's fortunate that all the queries are single statements.)

Count parent objects with set-oriented SQL:
• 503 seconds for 14.7 M objects in 33.3 GB
• 66 MBps, IO bound (30% of one cpu)
• 100 K records per cpu second

select count(*)
from sxPhotoObj
where nChild > 0

Do it with a cursor:
• No cpu parallelism
• CPU bound: 6 MBps, 2.7 K records per second
• 5,450 seconds (10x slower)

declare @count int;
declare @sum int;
set @sum = 0;
declare PhotoCursor cursor for
    select nChild from sxPhotoObj;
open PhotoCursor;
while (1=1)
begin
    fetch next from PhotoCursor into @count;
    if (@@fetch_status = -1) break;
    set @sum = @sum + @count;
end
close PhotoCursor;
deallocate PhotoCursor;
print 'Sum is: ' + cast(@sum as varchar(12))

Summary of Current Status
• 19 of the 20 queries run (still need to check the science).
• Also 14 of 15 "user" queries.
• Run times: on a $3K PC (2 cpus, 4 disks, 256 MB).
[Charts: IO count vs. cpu seconds, showing ~100 IOs per cpu second and ~5 MB per cpu second; cpu and elapsed time for each query (Q01-Q20, Q10A, QSX01).]

Summary of Current Status
• 16 of the queries are simple.
• 2 are iterative, 2 are unknown.
• Many are sequential one-pass and two-pass over the data.
• Covering indices make scans run fast.
• Table-valued functions are wonderful, but the limitations on parameters are a pain.
• Counting is VERY common.
• Binning (grouping by some set of attributes) is common.
• Nobody asked for CUBE, but that may be cultural.

Reflections on the 20 Queries
• Data loading/scrubbing is labor-intensive & tedious: AUTOMATE!!!
• This is 5% of the data, and some queries take an hour.
• But this is not tuned (disk bound).
• All queries benefit from parallelism (both disk and cpu), if you can state the query right, e.g. inside SQL.
• Parallel database machines will do great on this:
  – hash machines
  – data pumps
  – See the paper (in Word or PDF) on my web site.
• Bottom line: SQL looks good. Once you get the answers, you need visualization.

Call to Action
• If you do data visualization: we need you (and we know it).
• If you do databases: here is some data you can practice on.
• If you do distributed systems: here is a federation you can practice on.
• If you do astronomy educational outreach: here is a tool for you.
• The astronomy folks are very good, very smart, and a pleasure to work with, and the questions are cosmic, so …