Microsoft Research
SkyServer
Jim Gray
Distinguished Engineer
Microsoft Research, San Francisco

Microsoft Research
• Organization goal: advance the state of the art
• More than 700 staff, 55 areas
• Labs in US, Europe, Asia
• Internationally recognized teams
• University organizational model
  - Open research environment
  - Close ties to universities
• Close working relations with development

My Research Goal
• Information at your fingertips
• Bring all scientific literature and data online
• Focus on large database issues and scalable servers
[Figure: questions go in and answers come out; facts flow in from Experiments & Instruments, Other Archives, Literature, and Simulations]

World Wide Telescope
• Premise: most astronomy data is online
• The Internet is the world's best telescope
  - It has data on every part of the sky
  - In every measured spectral band
  - As deep as the best instruments
  - It is up when you are up
  - The "seeing" is always great (no working at night, no clouds, no moons, no..)
• It's a smart telescope: it links data with literature

SkyServer.SDSS.org
• Built with Johns Hopkins U.
• A modern archive
  - Raw data in file servers
  - Catalog data (derived objects) in a database
  - 10 billion records, 2 TB
• Also used for education
  - 150 hours of online astronomy
• Interesting things
  - Based on Web Services
  - Spatial data search
  - Cloned by other surveys (a design template)
  - Service Oriented Architecture

Data Federations of Web Services
• Massive datasets live near their owners
  - Near instrument software pipeline, apps
  - Near data knowledge and curation
• Each archive publishes a web service
  - Schema: documents the data
  - Methods on objects (queries)
• Uniform access to multiple archives
  - A common global schema
• Scientists get "personalized" extracts
[Figure: several archive databases (DB)]

SkyQuery Structure
• Each SkyNode publishes
  - Schema Web Service
  - Data Query Web Service
• Portal
  - Plans query (2 phase)
  - Integrates answers
  - Is itself a web service
[Figure: SkyQuery Portal connected to Image Cutout, SDSS, FIRST, 2MASS, INT]

Federation: SkyQuery.Net
• Combines 15 archives
• Send a query to the portal; the portal joins data from the archives.
• Problem: want to do multi-step data analysis (not just a single query).
  Solution: allow personal databases on the portal.
• Problem: some queries are monsters.
  Solution: a "batch scheduler" on the portal server deposits the answer in a personal DB.

Current Status: CERN → Pasadena
• Multi-stream TCP/IP: 7.1 Gbps (~900 MBps)
• Single-stream TCP/IP: 6.5 Gbps (~800 MBps)
• File transfer speed: ~450 MBps
• New speed record @ http://ultralight.caltech.edu/lsr-winhec/
[Chart: transfer speed in Mbps by year, 2000 to 2005; vertical axis 0 to 7,000]

Challenge: Move Data from CERN to Remote Centers @ 1 GBps
• Disk-to-disk
• Gigabyte / second data rates
• 80 TB/day
• 30 petabytes by 2008
• 1 exabyte by 2014
[Figure: tiered data distribution. Labels: Experiment, Filter, CERN, Tier 1 (INP3, RAL, INFN, FNAL, …), Tier 2, Tier 3 (Institutes, physics data cache), Tier 4 (Workstations). Link rates shown: ~PBps, ~1 GBps, ~5 GBps, .1 GBps. Graphics courtesy of Harvey Newman @ Caltech]

Summary
• Microsoft Research is active inside and outside Microsoft.
• World Wide Telescope is coming
  - Exemplifies service oriented architecture
  - Built with web services and databases
  - Has interesting spatial database algorithms
• 10 Gbps networking is coming, x64 is coming, and we are investing to make them real.
• Details on my website: http://research.microsoft.com/~Gray

© 2003 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.