Information At Your Fingertips Web Services Jim Gray & Tom Barclay

Microsoft Research

Alex Szalay

Johns Hopkins University

Communications Excitement!!

                          Point-to-Point        Broadcast
Immediate (Network + DB)  conversation, money   lecture, concert
Time Shifted (Database)   mail                  book, newspaper

It's ALL going electronic

Immediate is being stored for analysis (so ALL database)

Analysis & Automatic Processing are being added

Slide borrowed from Craig Mundie

Information Excitement!

• All information will be online (somewhere): text, speech, sound, vision, graphics, spatial, time…

• You might record everything

– read : 10MB/day, 400 GB/lifetime (5 disks today )

– hear : 400MB/day, 16 TB/lifetime (2 disks/year today )

– see : 1MB/s, 40GB/day, 1.6 PB/lifetime (150 disks/year maybe someday)
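The lifetime totals above follow from the daily rates if a lifetime is taken as roughly 40,000 days (about 110 years). That lifetime figure is an assumption inferred from the numbers; the slide does not state it. A minimal sketch of the arithmetic, in decimal units:

```python
# Lifetime storage estimates from the daily recording rates on the slide.
# ASSUMPTION: a "lifetime" is ~40,000 days; the slide does not say so.
LIFETIME_DAYS = 40_000

MB = 10**6
GB = 10**9
TB = 10**12

# Daily rates as quoted on the slide.
daily_bytes = {
    "read": 10 * MB,
    "hear": 400 * MB,
    "see": 40 * GB,
}

# Multiply each daily rate by the assumed number of days in a lifetime.
lifetime_bytes = {activity: rate * LIFETIME_DAYS
                  for activity, rate in daily_bytes.items()}

for activity, total in lifetime_bytes.items():
    print(f"{activity}: {total / TB:g} TB per lifetime")
```

Under the same reading, 40 GB/day at 1 MB/s corresponds to about 40,000 seconds (roughly 11 hours) of recording per day.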

Information at Your Fingertips

– Make it easy to capture & present

– Make it easy to store & organize & access

– Make it easy to analyze & summarize

How much information is there?

• Soon everything can be recorded and indexed

• Most bytes will never be seen by humans.

• Data summarization, trend detection, anomaly detection are key technologies

See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html

See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/

[Figure: scale of information. Prefixes from largest to smallest: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo. Labeled points from largest to smallest: Everything Recorded!, All Books MultiMedia, All LoC books (words), A Movie, A Photo, A Book.]

Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli

How do we get information today?

• Human searches web (with an index)

• Human browses pages

How do we get information tomorrow?

• Agents gather and digest it for us.

• Q: How?

• A: Microsoft Dot Net

– Discovery: UDDI, WSDL

– Explore: SOAP

[Diagram labels: Digital Dashboard, My Agents, SOAP, WSDL, Web Services]

How do you publish information?

• Get the data.

f, g, x, y…

• Conceptualize the data schema

• Provide methods that return data subsets.

– Challenge: how much processing on your server?

• Publish the schema and methods.

• We are exploring these issues.
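The publishing steps above can be sketched as a small in-memory archive: a published schema plus methods that return bounded subsets. Everything named here (Place, PlaceArchive, MAX_ROWS, the sample rows) is hypothetical and only illustrates the idea; the row cap is one possible answer to the "how much processing on your server?" challenge, not the approach the authors chose.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    """Hypothetical published schema: one row per named place."""
    name: str
    lon: float
    lat: float


class PlaceArchive:
    """Toy archive exposing subset-returning methods over Place rows."""

    MAX_ROWS = 100  # server-side cap on rows returned per call (assumption)

    def __init__(self, places):
        self._places = list(places)

    def _in_rect(self, west, south, east, north):
        # Lazily yield the places whose coordinates fall inside the rectangle.
        return (p for p in self._places
                if west <= p.lon <= east and south <= p.lat <= north)

    def count_in_rect(self, west, south, east, north):
        """Cheap call that lets a client size a request before fetching."""
        return sum(1 for _ in self._in_rect(west, south, east, north))

    def list_in_rect(self, west, south, east, north, limit=MAX_ROWS):
        """Return at most min(limit, MAX_ROWS) places in the rectangle."""
        cap = min(limit, self.MAX_ROWS)
        return list(itertools.islice(
            self._in_rect(west, south, east, north), cap))


archive = PlaceArchive([
    Place("Seattle", -122.33, 47.61),
    Place("Redmond", -122.12, 47.67),
    Place("Baltimore", -76.61, 39.29),
])
print(archive.count_in_rect(-123, 47, -122, 48))
print(archive.list_in_rect(-123, 47, -122, 48, limit=1))
```

Pairing a count method with a capped list method keeps the expensive work on the client's side of the decision: the client can ask how big an answer is before asking for it.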

TerraServer Example

• What is TerraServer?

– 3TB Internet Map DB available since June 1998

– USGS photo and topo maps of the US

– Integrated with Home Advisor

– Shows off SQL Server availability & scalability

– Designed for basic computer systems and low speed communications

• What is TerraService?

– A .NET web service

– Makes TerraServer data available to other apps

Application Goals

3 TB

• BIG — 1 TB of data including catalog, temporary space, etc.

• PUBLIC — available on the world wide web

• INTERESTING — to a wide audience

• ACCESSIBLE — using standard browsers (IE, Netscape)

• REAL — a LOB application (users can buy imagery)

• FREE – cannot require an NDA or payment from a user for access

• FAST – usable on low-speed (56kbps) and high-speed (T-1+) connections

• EASY — we do not want a large group to develop, deploy, or maintain the application

• AVAILABLE – always, 24x7x52, 99.99% of the time

• PROGRAMMABLE – .NET applications can integrate TerraServer data into their apps
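To make the 99.99% availability goal concrete, the sketch below converts an availability percentage into the downtime it allows per year. It assumes a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


downtime = allowed_downtime_minutes(99.99)
print(f"99.99% availability allows about {downtime:.1f} minutes of downtime per year")
```

At 99.99% the budget is a little under an hour per year.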

Demo http://terraserver.microsoft.com

Show photo topo gazetteer demographics

Hardware

8 Compaq DL360 “Photon” Web Servers

4 Compaq ProLiant 8500 Db Servers (SQL\Inst1, SQL\Inst2)

Fiber SAN Switches

One SQL database per rack

Each rack contains 4.5 TB

261 total drives / 13.7 TB total

Meta Data: stored on 101 GB “Fast, Small Disks” (18 x 18.2 GB)

Imagery Data: stored on 4 339 GB “Slow, Big Disks” (15 x 73.8 GB)

To add 90 72.8 GB disks in Feb 2001 to create 18 TB SAN

[Diagram labels: disk volumes E through S]

TerraServer Experience

• Successful Web Site

– Met all 8 goals – interesting, big, real, public, fast, easy, accessible, and free

– High Availability – Windows Data Center & Compaq SAN Technology

– Top 1000 Web Site – continues to be popular

• New Feature Requests

– Programmable access to meta-data

– User selectable image sizes, i.e. “a map server”

– Permission to use TerraServer data within server applications

What is a Web Service?

A programmable application component accessible via standard Web protocols

• Provide a Directory of Services on the Internet: UDDI (Universal Description, Discovery, and Integration)

• You can ask a site for a description of the Web Services it offers: SOAP Discovery

• Web Services are defined in terms of the formats and ordering of messages: SOAP Contract Language

• Web Service consumers can send and receive messages using XML: SOAP

• All these capabilities are built using open Internet protocols: XML & HTTP
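To illustrate what "send and receive messages using XML" looks like on the wire, here is a minimal sketch that builds a SOAP 1.1 request envelope with Python's standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace and the parameter names (City, State) are placeholders assumed for illustration, not the real TerraService contract. The method name GetPlaceFacts is taken from the method list in this deck.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/terraservice"           # ASSUMPTION: placeholder namespace


def build_soap_request(method, params):
    """Return a SOAP 1.1 envelope (as a string) calling `method` with `params`."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The method call is a single element inside the Body,
    # with one child element per parameter.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical parameters, for illustration only.
request_xml = build_soap_request("GetPlaceFacts",
                                 {"City": "Seattle", "State": "Washington"})
print(request_xml)
```

In practice a client would generate this envelope from the service's published contract and POST it over HTTP; the sketch stops at constructing the message.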

.NET TerraService Architecture

[Diagram, components as labeled]

Standard Browsers

Smart Clients: Windows Forms, .NET Framework

Map UI: Web Forms

Map Server: Http Handler

TerraServer Web Service

ADO.NET, OLEDB

Existing DB Server: 705 m Rows; three SQL 2000 databases, 1.0 TB Db each

TerraServer Web Services

Terra-Tile-Service

• Query Gazetteer

• Retrieve imagery meta-data

• Retrieve imagery

• Simple Projection conversions

Clients can present TerraServer imagery in new ways.

Landmark-Service

• Geo-coded places, e.g. Schools, Golf Courses, Hospitals, etc.

• Place Polygons, e.g. Zip Codes, Cities, etc.

Allows “overlay” information for Terra-Tile-Service applications.

Web Service Methods

• Place Search

– GetPlaceFacts

– GetPlaceList

– GetPlaceListInRect

– CountPlacesInRect

• Projection

– ConvertLonLatPtToUtmPt

– ConvertUtmPtToLonLatPt

– ConvertLonLatToNearestPlace

– GetTheme

– GetLatLonMetrics

• Tile

– GetAreaFromPt

– GetAreaFromRect

– GetAreaFromTileId

– GetTileMetaFromLonLatPt

– GetTileMetaFromTileId

– GetTile (Image)

• Landmark

– GetLandmarkTypes

– CountOfLandmarkPointsByRect

– GetLandmarkPointsByRect

– CountOfLandmarkShapesByRect

– GetLandmarkShapesByRect

http://terraservice.net

Soil Viewer Uses TerraService

Custom End Product

What Tom Showed You

• Converted a Web Server

– HTML via HTTP GET/POST

– Server returns pictures to people

• to a Web Service

– SOAP service

– Returns XML self-describing data

– Application integrates data (Agriculture and Geo data)
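The practical difference between returning pictures and returning self-describing XML is that a program can pick fields out of the answer by name. The sketch below shows that with Python's standard library. The response document is invented for illustration; it is not the actual TerraService output format.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: an invented response shape, used only to show the idea.
RESPONSE_XML = """
<PlaceFacts>
  <Place><City>Seattle</City><State>Washington</State></Place>
  <Center><Lon>-122.33</Lon><Lat>47.61</Lat></Center>
</PlaceFacts>
"""


def element_to_dict(elem):
    """Recursively convert an element into nested dicts keyed by tag name.

    Leaf elements become their text; repeated sibling tags would overwrite
    each other, which is acceptable for this small illustration.
    """
    children = list(elem)
    if not children:
        return elem.text
    return {child.tag: element_to_dict(child) for child in children}


facts = element_to_dict(ET.fromstring(RESPONSE_XML.strip()))
lon = float(facts["Center"]["Lon"])
lat = float(facts["Center"]["Lat"])
print(facts["Place"]["City"], lon, lat)
```

Because the tags name the data, an application can combine these values with its own data without scraping a page meant for people.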

Rosetta Stone

• Distributed computing → Dot Net + basic services

• Yellow Pages → UDDI – Universal Description, Discovery, and Integration

• ? → Schema, XLANG

• RPC – remote procedure call, CORBA, DCOM, RMI → SOAP – Simple Object Access Protocol

• IDL – interface definition language → WSDL – Web Services Description Language

• XDR – eXternal Data Representation → XML – eXtensible Markup Language

Sky Server

– Like TerraServer: pictures of the sky.

– But also LOTS of data on each object

• Luminosity (multi-spectra), morphology, spectrum

• So, it is a data mining application (a data mining web service)

• Cross-correlation is challenging because

– Multi-resolution

– Data is dirty/fuzzy (error bars, cosmic rays, airplanes…)

– Time varying

• 15M Photo Objects, ~400 attributes

• 50 K Spectro Objects, ~100 attributes + 30 lines

Astronomy Data

• In the “old days” astronomers took photos.

• Starting in the 1960s they began to digitize.

• New instruments are digital (100s of GB/night)

• Detectors are following Moore’s law.

• Data avalanche: double every year

[Figure: growth of telescope glass and CCDs, 1970 to 2000, log scale from 0.1 to 1000; series: CCDs, Glass. Courtesy of Alex Szalay.]

Total area of 3m+ telescopes in the world in m², total number of CCD pixels in megapixels, as a function of time. Growth over 25 years is a factor […] pixels.

Astronomy Data

• Astronomers have a few Petabytes now.

– 1 pixel (byte) / sq arc second ~ 4TB

– Multi-spectral, temporal, … → 1PB

• They mine it looking for new (kinds of) objects, more of the interesting ones (quasars), and density variations and correlations in 400-D space.

• Data doubles every year.

• Data is public after a year.

• So, 50% of the data is public.

• Some have private access to 5% more data.

• So: 50% vs 55% access for everyone
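The 50% figure follows directly from yearly doubling: if the archive doubles every year, everything older than one year is exactly half of the total. A small sketch of that reasoning, assuming clean doubling and a fixed embargo period (both simplifications):

```python
def public_fraction(years, embargo_years=1):
    """Fraction of all data that is public, given yearly doubling.

    Total volume after `years` is proportional to 2**years; the public
    part is whatever existed `embargo_years` ago.
    """
    total = 2 ** years
    public = 2 ** (years - embargo_years)
    return public / total


# The fraction does not depend on how long the archive has been growing.
for years in (1, 5, 10):
    print(years, public_fraction(years))
```

The same function shows how sensitive the result is to the embargo: a two-year embargo would leave only a quarter of the data public.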

Astronomy Data

• But…

• How do I get at that 50% of the data?

• Astronomers have a culture of publishing.

– FITS files and many tools: http://fits.gsfc.nasa.gov/fits_home.html

– Encouraged by NASA.

• Publishing data “details” is difficult.

Astronomers want to do it but it is VERY hard.

(What programs were used? What were the processing steps? How were errors treated?…)

Virtual Observatory

http://www.astro.caltech.edu/nvoconf/ http://www.voforum.org/

• Premise: Most data is (or could be) online

• So, the Internet is the world’s best telescope:

– It has data on every part of the sky

– In every measured spectral band: optical, x-ray, radio..

– As deep as the best instruments (1 year ago).

– It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no…).

– It’s a smart telescope: links objects and data to literature on them.

Virtual Observatory

The Age of Mega-Surveys

• Large number of new surveys

– multi-TB in size, 100 million objects or more

– individual archives planned, or under way

– Data publication an integral part of the survey

– Software bill a major cost in the survey

• Multi-wavelength view of the sky

– coverage in more than 13 wavelengths in 5 years

• Impressive early discoveries

– finding exotic objects by unusual colors

• L,T dwarfs, high-z quasars

– finding objects by time variability

• gravitational micro-lensing

Surveys: MACHO, 2MASS, DENIS, SDSS, PRIME, DPOSS, GSC-II, COBE, MAP, NVSS, FIRST, GALEX, ROSAT, OGLE ...

Slide courtesy of Alex Szalay, modified by Jim

Virtual Observatory

Federating the Archives

• The next generation mega-surveys are different

– top-down design

– large sky coverage

– sound statistical plans

– well controlled/documented data processing

• Each survey has a publication plan

• Data mining will lead to stunning new discoveries

• Federating these archives → Virtual Observatory

Slide courtesy of Alex Szalay

The Multiwavelength Crab Nebula

Crab star, 1053 AD

Nova first sighted 1054 A.D. by Chinese Astronomers

Now: Crab Nebula in X-ray, optical, infrared, and radio

Slide courtesy of Robert Brunner @ CalTech.

Exploring Parameter Space

Given an arbitrary parameter space:

Data Clusters

Points between Data Clusters

Isolated Data Clusters

Isolated Data Groups

Holes in Data Clusters

Isolated Points

Nichol et al. 2001

Slide courtesy of Robert Brunner @ CalTech.

Virtual Observatory and Education

• In the beginning science was empirical.

• Then theoretical branches evolved.

• Now we have computational branches.

– The computational branch has been simulation

– It is becoming data analysis/visualization

• The Virtual Observatory can be used to

– Teach astronomy: make it interactive, demonstrate ideas and phenomena

– Teach computational science skills and the process of scientific discovery

Sloan Digital Sky Survey http://sdss.org/

• A group of astronomers has been building a telescope for the last 12 years (with $90M from the Sloan Foundation, NSF, and a dozen universities)!

• Now data is arriving:

– 250GB/nite (20 nights per year).

– 100 M stars, 100 M galaxies, 1 M spectra.

• Public data at http://sdss.org/

– 5% of the survey: 600 sq degrees, 15 M objects, 60GB.

– This data includes most of the known high z quasars.

– It has a lot of science left in it but… that is just the start.
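Two quick figures can be derived from the numbers on this slide: the raw data volume per year, and the average size of a public object. The sketch below uses decimal units and treats the 60 GB as if it were spread evenly over the 15 M objects, which is a simplification rather than a description of how the release is laid out.

```python
GB = 10**9
TB = 10**12

# Figures quoted on the slide.
bytes_per_night = 250 * GB
nights_per_year = 20
public_bytes = 60 * GB
public_objects = 15_000_000

# Raw data arriving per year at the quoted nightly rate.
raw_bytes_per_year = bytes_per_night * nights_per_year

# Average bytes per object in the public release (simplifying assumption:
# the whole 60 GB is object data).
bytes_per_public_object = public_bytes / public_objects

print(f"{raw_bytes_per_year / TB:g} TB of raw data per year")
print(f"{bytes_per_public_object:g} bytes per public object on average")
```

With the roughly 400 attributes per photo object quoted on the Sky Server slide, that average works out to about ten bytes per attribute.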

Demo of Sky Server

Alex built SkyServer (based on TerraServer design).

http://skyserver.sdss.org/

Demo: famous places navigator data shopping cart spectrum

SQL?

?

Virtual Observatory Challenges

• Size: multi-Petabyte

40,000 square degrees is 2 Trillion pixels

– One band (at 1 sq arcsec): 4 Terabytes

– Multi-wavelength: 10-100 Terabytes

– Time dimension: >> 10 Petabytes

– Need auto parallelism tools

• Unsolved Meta-Data problem

– Hard to publish data & programs

– Hard to find/understand data & programs

• Current tools inadequate

– new analysis & visualization tools

• Transition to the new astronomy

– Sociological issues
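The size figures in the first bullet can be checked with a few lines of arithmetic. Taken literally, one pixel per square arcsecond over 40,000 square degrees gives about 0.5 trillion pixels, so the 2 trillion pixels and 4 Terabytes on the slide need further assumptions. One reading that reproduces them, assumed here and not stated on the slide, is pixels 0.5 arcsecond on a side (4 per square arcsecond) stored at 2 bytes each.

```python
ARCSEC_PER_DEG = 3600
SKY_SQ_DEG = 40_000

# Area of the surveyed sky in square arcseconds.
sq_arcsec = SKY_SQ_DEG * ARCSEC_PER_DEG ** 2

# ASSUMPTIONS (not stated on the slide): 0.5-arcsecond pixels, 2 bytes each.
PIXELS_PER_SQ_ARCSEC = 4
BYTES_PER_PIXEL = 2

pixels = sq_arcsec * PIXELS_PER_SQ_ARCSEC
one_band_bytes = pixels * BYTES_PER_PIXEL

print(f"{sq_arcsec:.3g} square arcseconds")
print(f"{pixels:.3g} pixels")
print(f"{one_band_bytes / 10**12:.2f} TB for one band")
```

Other combinations of pixel size and bytes per pixel would also land near 4 Terabytes, so this should be read as a plausibility check rather than the authors' derivation.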

3-steps to Virtual Observatory

• Get SDSS and Palomar online

– Alex Szalay, Jan Vandenberg, Ani Thakar….

– Roy Williams, Robert Brunner, Julian Bunn

• Do queries and crossID matches with CalTech and SDSS to expose

– Schema, Units,…

– Dataset problems

– the typical use scenarios.

• Implement WebServices at CalTech and SDSS

The Challenges

• How to federate the Archives to make a VO?

• The hope: XML is the answer.

• The reality: XML is syntax and tools:

FITS on XML will be good but…..

Explaining the data will still be very difficult.

• Define Astronomy Objects and Methods.

– Based on UDDI, WSDL, SOAP.

– Each archive is a service

• http://TerraService.net/ shows the idea.

– Working with Caltech

(Brunner, Williams, Djorgovski, Bunn)

– But, how does data mining work?

SkyServer as a WebService

WSDL+SOAP just add details

Archive ss = new VOService(SkyServer);

Attributes[] A = ss.GetObjects(ra, dec, radius);

?? What are the objects (attributes…)?

?? What are the methods (GetObjects()...)?

?? What query language? SQL, Xquery…?
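As one possible answer to "what are the methods?", the sketch below implements the GetObjects(ra, dec, radius) call from the pseudocode as a cone search over an in-memory catalog. The class names, the choice of degrees for all three arguments, and the sample objects are assumptions made for illustration; a real archive would sit behind a web service and use a spatial index rather than a linear scan.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SkyObject:
    """Hypothetical catalog row: an id plus a position in degrees."""
    obj_id: int
    ra: float
    dec: float


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


class InMemoryArchive:
    """Toy stand-in for an archive service."""

    def __init__(self, objects):
        self._objects = list(objects)

    def get_objects(self, ra, dec, radius):
        """Return objects within `radius` degrees of (ra, dec)."""
        return [o for o in self._objects
                if angular_separation_deg(ra, dec, o.ra, o.dec) <= radius]


ss = InMemoryArchive([
    SkyObject(1, 185.00, 0.00),
    SkyObject(2, 185.05, 0.02),
    SkyObject(3, 190.00, 5.00),
])
nearby = ss.get_objects(ra=185.0, dec=0.0, radius=0.1)
print([o.obj_id for o in nearby])
```

Even this toy version forces the open questions on the slide: the units of radius, the attributes each object carries, and whether a positional search is enough or a query language is needed.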

Summary

• All information at your fingertips.

• How do we publish information so that our agents can digest it?

• Example: TerraServer -> TerraService

• The Virtual Observatory Concept

– The Internet is the world’s best telescope

• For astronomy

• For teaching astronomy and

• For teaching computational science
