Data Explosion: Science with Terabytes
Alex Szalay, JHU and Jim Gray, Microsoft Research

Living in an Exponential World
• Astronomers have a few hundred TB now
  – 1 pixel (byte) / sq arc second ~ 4 TB
  – Multi-spectral, temporal, … → 1 PB
• They mine it looking for
  – 1000 new (kinds of) objects, or more of interesting ones (quasars)
  – density variations in 400-D space
  – correlations in 400-D space
• Data doubles every year
• Data is public after 1 year
• So, 50% of the data is public
• Same access for everyone
[Figure: growth of CCDs vs. glass (telescope area), 1970–2000, log scale]

The Challenges
• Exponential data growth: distributed collections, soon Petabytes
• Data Collection → Discovery and Analysis → Publishing
• New analysis paradigm: data federations, move analysis to the data
• New publishing paradigm: scientists are publishers and curators

New Science: Data Exploration
• Data growing exponentially in many different areas
  – Publishing so much data requires a new model
• Multiple challenges for different communities
  – publishing, data mining, data visualization, digital libraries, education, web services
  – astronomy as a ‘poster child’
• Information at your fingertips:
  – Students see the same data as professional astronomers
• More data coming: Petabytes/year by 2010
  – We need scalable solutions
  – Move analysis to the data!
• Same thing happening in all sciences
  – High energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing, …
• Data Exploration: an emerging new branch of science
  – Currently has no owner…

Advances at JHU
• Designed and built the science archive for the SDSS
  – Currently 2 Terabytes, soon to reach 3 TB
  – Built a fast spatial search library
  – Created a novel pipeline for data loading
  – Built the SkyServer, a public access website for SDSS, with over 45M web hits and millions of free-form SQL queries (a sample query is sketched just before Part I below)
• Built the first web services used in science
  – SkyQuery, ImgCutout, various visualization tools
• Leading the Virtual Observatory effort
• Heavy involvement in Grid computing
• Exploring other areas

Collaborative Projects
• Sloan Digital Sky Survey (11 inst)
• National Virtual Observatory (17 inst)
• International Virtual Observatory Alliance (14 countries)
• Grid for Physics Networks (10 inst)
• Wireless sensors for Soil Biodiversity (BES, Intel, UCB)
• Digital Libraries (JHU, Cornell, Harvard, Edinburgh)
• Hydrodynamic Turbulence (JHU Engineering)
• Informal exchanges with NCBI

Directions
We understand how to mine a few terabytes. Directions:
1. We built an environment: now our tools allow new breakthroughs in astrophysics
2. Open collaborations beyond astrophysics (turbulence, sensor-driven biodiversity, bioinformatics, digital libraries, education, …)
3. Attack problems on the 100 Terabyte scale, prepare for the Petabytes of tomorrow

The JHU Core Group
Faculty
• Alex Szalay
• Ethan Vishniac
• Charles Meneveau
Graduate Students
• Tanu Malik
• Adrian Pope
Postdoctoral Fellows
• Tamas Budavari
Research Staff
• George Fekete
• Vivek Haridas
• Nolan Li
• Will O’Mullane
• Maria Nieto-Santisteban
• Jordan Raddick
• Anirudha Thakar
• Jan Vandenberg

Examples
I. Astrophysics inside the database
II. Technology sharing in other areas
III. Beyond Terabytes
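Before turning to Part I, here is the flavor of the "free-form SQL queries" mentioned above. This is a minimal sketch only: the table and column names follow the public SDSS SkyServer schema as commonly documented (PhotoObj, objID, ra, dec, magnitudes, type), and the particular cuts are invented for illustration, not taken from the archive logs.

    -- Minimal SkyServer-style query: faint galaxies in a small patch of sky
    -- with a simple colour cut (illustrative cuts, hypothetical science case).
    SELECT TOP 100
           objID, ra, dec, r, g - r AS gr_colour
    FROM   PhotoObj
    WHERE  type = 3                      -- 3 = GALAXY in the SDSS classification
      AND  r BETWEEN 18 AND 21           -- magnitude range
      AND  ra  BETWEEN 185.0 AND 185.5   -- small patch of sky
      AND  dec BETWEEN -0.5  AND 0.5
    ORDER BY r;

Queries of roughly this complexity, typed directly into the web form, make up the bulk of the SkyServer traffic; the point is that the filtering runs inside the database rather than on the user's desktop.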
I. Astrophysics in the DB
• Studies of galaxy clustering – Budavari, Pope, Szapudi
• Spectro Service: publishing spectral data – Budavari, Dobos
• Cluster finding with a parallel DB-oriented workflow system – Nieto-Santisteban, Malik, Thakar, Annis, Sekhri
• Complex spatial computations inside the DB – Fekete, Gray, Szalay
• Visual tools with the DB – ImgCutout (Nieto), Geometry viewer (Szalay), Mirage+SQL (Carlisle)

The SDSS Photo-z Sample
• All: 50M
• mr < 21: 15M
• 10 stripes: 10M
[Table: galaxy counts per subsample, split by photometric redshift slice (e.g. 0.1 < z < 0.3) and luminosity cut (-20 > Mr, -20 > Mr > -21, -21 > Mr > -22, -21 > Mr > -23, -22 > Mr > -23); subsamples range from 127k to 2.2M galaxies]

The Analysis
• eSpICE: I. Szapudi, S. Colombi and S. Prunet
• Integrated with the database by T. Budavari
• Extremely fast processing:
  – 1 stripe with about 1 million galaxies is processed in 3 mins
  – The usual figure was 10 min for 10,000 galaxies => 70 days
• Each stripe processed separately for each cut
• 2D angular correlation function computed
• w(θ): average with rejection of pixels along the scan
  – Correlations due to the flat-field vector
  – Unavoidable for drift scan

Angular Power Spectrum
• Use photometric redshifts for LRGs
• Create thin redshift slices and analyze angular clustering
• From characteristic features (baryon bumps, etc.) we obtain angular diameter vs. distance → Dark Energy
• HEALPix pixelization in the database
• Each “redshift slice” is generated in 2 minutes
• Using SpICE over 160,000 pixels in N^1.7 time

Large Scale Power Spectrum
• Goal: measure cosmological parameters
  – Cosmological constant or Dark Energy?
• Karhunen-Loeve technique (Vogeley and Szalay 1996)
  – Subdivide slices into about 5K–15K cells
  – Compute the correlation matrix of galaxy counts among cells from a fiducial P(k) + noise model
  – Diagonalize the matrix
  – Expand the data over the KL basis
  – Iterate over parameter values: compute a new correlation matrix, invert, then compute the log likelihood
• SDSS only: Ωm h = 0.26 +/- 0.04, Ωb/Ωm = 0.29 +/- 0.07
• With Ωb = 0.047 +/- 0.006 (WMAP): Ωm h = 0.21 +/- 0.03, Ωb/Ωm = 0.16 +/- 0.03
[Figure: likelihood contours in the Ωb/Ωm vs. Ωm h plane, SDSS and WMAP]
• SDSS: Pope et al. (2004); WMAP: Verde et al. (2003), Spergel et al. (2003)

Numerical Effort
• Most of the time is spent in data manipulation
• Fast spatial searches over data and MC (SQL)
• Diagonalization of 20K x 20K matrices
• Inversions of a few 100K 5K x 5K matrices
• Has the potential to constrain the Dark Energy
• Accuracy enabled by large data set sizes
• But: new kinds of problems
  – Errors driven by the systematics, not by sample size
  – Scaling of analysis algorithms is critical!
• Monte Carlo realizations with a few 100M points in SQL

Cluster Finding
Five main steps (Annis et al. 2002):
1. Get Galaxy List – fieldPrep: extracts from the main data set the measurements of interest.
2. Filter – brgSearch: calculates the unweighted BCG likelihood for each galaxy (unweighted by galaxy count) and discards unlikely galaxies.
3. Check Neighbors – bcgSearch: weights the BCG likelihood with the number of neighbors.
4. Pick Most Likely – bcgCoalesce: determines whether a galaxy is the most likely galaxy in the neighborhood to be the center of the cluster.
5. Discard Bogus – getCatalog: removes suspicious results and produces and stores the final cluster catalog.

SQL Server Cluster
Applying a zone strategy, the partition P is divided homogeneously among 3 servers (a SQL sketch of this zoning follows this slide):
• S1 provides a 1 deg buffer on top
• S2 provides a 1 deg buffer on top and bottom
• S3 provides a 1 deg buffer on bottom
[Diagram: stripes P1, P2, P3 native to Servers 1–3, with overlapping 1 deg buffers]
• Total duplicated data = 4 x 13 deg²
• Total duplicated work (one object processed more than once) = 2 x 11 deg²
• Maximum time spent by the thickest partition = 2h 15' (other 2 servers ~ 1h 50')
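The zone idea above can be sketched in a few lines of T-SQL. This is a minimal illustration, not the production workflow: the Galaxy table, the zone height, and this server's declination range are hypothetical placeholders standing in for the fieldPrep output and the real partition boundaries.

    -- Zone-strategy sketch (hypothetical table/column names).
    DECLARE @zoneHeight float = 0.1;                    -- declination degrees per zone
    DECLARE @decLo float = -10.0, @decHi float = 10.0;  -- this server's native strip

    -- Assign every galaxy to a declination zone.
    SELECT objID, ra, dec,
           FLOOR((dec + 90.0) / @zoneHeight) AS zoneID
    INTO   GalaxyZone
    FROM   Galaxy;

    -- Load this server's partition: its native declination strip plus a
    -- 1 deg buffer on each side, so neighbour counting (bcgSearch) never
    -- has to reach across to another server.
    SELECT *
    FROM   GalaxyZone
    WHERE  dec BETWEEN @decLo - 1.0 AND @decHi + 1.0;

The price of the buffers is the duplicated data and duplicated work quoted above; the payoff is that each server runs the whole pipeline independently on its own strip.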
SQL Server vs. Files
SQL Server: resolving a target of 66 deg² requires:
• Step A: Find Candidates
  – Input data = 108 MB covering 104 deg² (72 bytes/row x 1,574,656 rows)
  – Time ~ 6 h on a dual 2.6 GHz
  – Output data = 1.5 MB covering 84 deg² (40 bytes/row x 40,123 rows)
• Step B: Find Clusters
  – Input data = 1.5 MB
  – Time = 20 minutes
  – Output = 0.43 MB covering 66 deg² (40 bytes/row x 11,249 rows)
• Total time = 6h 20'
• Some extra space is required for indexes and other auxiliary tables
• Scales linearly with the number of servers
Files: resolving a target of 66 deg² requires:
• Input data = 66 x 4 x 16 MB ~ 4 GB
• Output data = 66 x 4 x 6 KB = 1.5 MB
• Time ~ 73 hours; using 10 nodes, ~7.3 hours
Notes (buffer size; brgSearch z(0..1) step):
• Files: 0.25 deg buffer, steps of 0.01
• SQL: 0.5 deg buffer, steps of 0.001
Files would require 20–60 times longer to solve this problem with a buffer of 0.5 deg and steps of 0.001.

II. Technology Sharing
• Virtual Observatory
• SkyServer database/website templates
  – Edinburgh, STScI, Caltech, Cambridge, Cornell
• OpenSkyQuery/OpenSkyNodes
  – International standard for federating astro archives
  – Interoperable SOAP implementations working
• NVO Registry Web Service (O’Mullane, Greene)
• Distributed logging and harvesting (Thakar, Gray)
• MyDB: workbench for science (O’Mullane, Li)
• Publish your own data
  – A la the Spectro Service, but for images and databases
• SkyServer → Soil Biodiversity

National Virtual Observatory
• The NSF ITR project “Building the Framework for the National Virtual Observatory” is a collaboration of 17 funded and 3 unfunded organizations
  – Astronomy data centers
  – National observatories
  – Supercomputer centers
  – University departments
  – Computer science/information technology specialists
• PIs: Alex Szalay (JHU), Roy Williams (Caltech)
• Connect the disjoint pieces of data in the world
• Bridge the technology gap for astronomers
• Based on interoperable Web Services

International Collaboration
• Similar efforts now in 14 countries:
  – USA, Canada, UK, France, Germany, Italy, Holland, Japan, Australia, India, China, Russia, Hungary, South Korea, ESO
• Total awarded funding worldwide is over $60M
• Active collaboration among projects
  – Standards, common demos
  – International VO roadmap being developed
  – Regular telecons over 10 time zones
• Formal collaboration: the International Virtual Observatory Alliance (IVOA)
• Aiming to have production services by Jan 2005

Boundary Conditions
• Standards driven by evolving new technologies
  – Exchange of rich and structured data (XML, …)
  – DB connectivity, Web Services, Grid computing
• Application to the astronomy domain
  – Data dictionaries (UCDs)
  – Data models
  – Protocols
  – Registries and resource/service discovery
  – Provenance, data quality
• Dealing with the astronomy legacy
  – FITS data format
  – Software systems
Main VO Challenges
• How to avoid trying to be everything for everybody?
• Database connectivity is essential
  – Bring the analysis to the data
• Core web services
• Higher level applications built on top
• Use the 90-10 rule:
  – Define the standards and interfaces
  – Build the framework
  – Build the 10% of services that are used by 90%
  – Let the users build the rest from the components
[Figure: number of users vs. number of services]

Core Services
• Metadata information about resources
  – Waveband
  – Sky coverage
  – Translation of names to a universal dictionary (UCD)
  – Registry
• Simple search patterns on the resources
  – Spatial search
  – Image mosaic
  – Unit conversions
• Simple filtering, counting, histograms

Higher Level Services
• Built on Core Services
• Perform more complex tasks
• Examples
  – Automated resource discovery
  – Cross-identifications
  – Photometric redshifts
  – Image segmentation
  – Outlier detection
  – Visualization facilities
• Expectation:
  – Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)

Web Services in Progress
• Registry
  – Harvesting and querying
• Data delivery
  – Query-driven queue management
  – Spectro service
  – Logging services
• Graphics and visualization
  – Query-driven vs. interactive
  – Show spatial objects (Chart/Navi/List)
• Footprint/intersect
  – It is a “fractal”
• Cross-matching
  – SkyQuery and SkyNode
  – Ferris-wheel
  – Distributed vs. parallel

MyDB: eScience Workbench
• Prototype of bringing the analysis to the data
• Everybody gets a workspace (database)
• Executes analysis at the data
• Store intermediate results there
• Long queries run in batch
• Results shared within groups
• Only fetch the final results
• Extremely successful – matches the pattern of work
• Next steps: multiple locations, single authentication
• Farther down the road: a parallel workflow system

eEducation Prototype
• SkyServer Educational Projects, aimed at advanced high school students but covering middle school
• Teach how to analyze data and discover patterns, not just astronomy
• 3.7 million project hits, 1.25 million page views of educational content
• More than 4000 textbooks
• On the whole web site: 44 million web hits
• Largely a volunteer effort by many individuals
• Matches the 2020 curriculum
[Figure: SkyServer project page views per month, Apr 2001 – Jan 2004, rising toward ~80,000]

Soil Biodiversity
How does soil biodiversity affect ecosystem functions, especially decomposition and nutrient cycling in urban areas?
• JHU is part of the Baltimore Ecosystem Study, one of the NSF LTER monitoring sites
• High resolution monitoring will capture
  – Spatial heterogeneity of the environment
  – Change over time

Sensor Monitoring
• Plan: use 400 wireless (Intel) sensors, monitoring
  – Air temperature, moisture
  – Soil temperature, moisture, at least at two depths (5 cm, 20 cm)
  – Light (intensity, composition)
  – Gases (O2, CO2, CH4, …)
• Long-term continuous data
• Small (hidden) and affordable (many)
• Less disturbance
• 200 million measurements/year
• Collaboration with Intel and UCB (PI: Szlavecz, JHU)
• Complex database of sensor data and samples (a possible layout is sketched below)
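As a rough illustration of what a sensor archive sized for ~200 million measurements a year might look like, here is a minimal sketch. The table name, column layout, and quantity codes are invented for this example; the actual BES database may differ substantially.

    -- Hypothetical narrow "readings" table; the clustered primary key keeps
    -- each sensor's time series contiguous on disk.
    CREATE TABLE SensorReading (
        sensorID  int      NOT NULL,   -- which of the ~400 motes
        obsTime   datetime NOT NULL,   -- time of the measurement
        quantity  tinyint  NOT NULL,   -- e.g. 1 = air temp, 2 = soil moisture, 3 = CO2
        reading   real     NOT NULL,   -- calibrated value
        PRIMARY KEY (sensorID, obsTime, quantity)
    );

    -- Typical question: daily mean soil moisture for one sensor over a month.
    SELECT CONVERT(date, obsTime) AS obsDay, AVG(reading) AS meanReading
    FROM   SensorReading
    WHERE  sensorID = 42 AND quantity = 2
      AND  obsTime >= '2004-06-01' AND obsTime < '2004-07-01'
    GROUP BY CONVERT(date, obsTime)
    ORDER BY obsDay;

The point of the sketch is the same pattern as the SkyServer: the aggregation runs inside the database, so only a month of daily means, not millions of raw readings, travels to the ecologist's desktop.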
III. Beyond Terabytes
• Numerical simulations of turbulence
  – 100 TB across multiple SQL Servers
  – Storing each timestep, enabling backtracking to initial conditions
  – Also a fundamental problem in cosmological simulations of galaxy mergers
• Will teach us how to do scientific analysis of 100 TBs
• By the end of the decade, several PB / year
  – One needs to demonstrate fault tolerance, fast enough loading speeds, …

Exploration of Turbulence
For the first time, we can now “put it all together”:
• Large scale range, scale ratio O(1,000)
• Three-dimensional in space
• Time evolution and a Lagrangian approach (follow the flow)
Unique turbulence database:
• We will create a database of O(2,000) consecutive snapshots of a 1,024³ simulation of turbulence: close to 100 Terabytes (a query sketch of this access pattern closes this section)
• Analysis cluster on top of the DB
• Treat it as a physics experiment; change configurations every 2 months
Hardware:
• Computational layer: 128 compute nodes, dual Xeon, 2 GB RAM/node = 356 GB total; 30 GB disk/node = 3.8 TB total + 2 TB RAID5 disk system
• Interconnect layer: 8 Gigabit Ethernet switches, 12 ports each, 1 GByte/sec throughput across layers
• Data access layer: 32 database servers, dual Xeon, 3.2 TB disk/node = 102 TB total; 300 MB/sec/node = 9.6 GByte/sec aggregate data access speed

LSST
• Large Synoptic Survey Telescope (2012)
• Few PB/yr data rate
• Repeats the SDSS in 4 nights
• The main issue is data management
• Data volume similar to high energy physics, but needs object granularity
• Very high resolution time series, moving objects, …
• Need to build 100 TB scale prototypes today
• Hierarchical organization of data products

The Big Picture
[Diagram: experiments & instruments, simulations, literature, and other archives all feed facts into the archive; scientists pose questions and get back answers – new SCIENCE!]

The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it? How to reorganize it?
• How to coexist with others
• Query and visualization tools
• Support/training
• Performance
  – Execute queries in a minute
  – Batch query scheduling
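To close, here is the query sketch of the turbulence access pattern promised above: retrieving a small subcube of one stored timestep instead of shipping the full 100 TB to the user. The schema is purely illustrative; a real system would almost certainly pack grid cells into larger blocks rather than store one row per cell, but the row-per-cell layout keeps the idea readable.

    -- Hypothetical layout for the stored velocity snapshots.
    CREATE TABLE Velocity (
        timestep int      NOT NULL,   -- one of the ~2,000 stored snapshots
        ix       smallint NOT NULL,   -- grid indices, 0..1023
        iy       smallint NOT NULL,
        iz       smallint NOT NULL,
        vx       real     NOT NULL,   -- velocity components
        vy       real     NOT NULL,
        vz       real     NOT NULL,
        PRIMARY KEY (timestep, ix, iy, iz)
    );

    -- Move the analysis to the data: fetch only a 16^3 subcube around a
    -- point of interest at one timestep, not the whole snapshot.
    SELECT ix, iy, iz, vx, vy, vz
    FROM   Velocity
    WHERE  timestep = 500
      AND  ix BETWEEN 512 AND 527
      AND  iy BETWEEN 512 AND 527
      AND  iz BETWEEN 512 AND 527;

Because every timestep is stored, the same pattern lets an analysis "rewind the experiment" and follow a fluid particle backwards toward the initial conditions, which is exactly the capability the turbulence database is meant to provide.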