Presentation Slides - Data Science Research Center


Accelerating data-intensive science by outsourcing the mundane

Ian Foster www.ci.anl.gov

www.ci.uchicago.edu


The data deluge

Genomic sequencing output doubles every 9 months

>300 public centers

MACHO et al.: 1 TB

Palomar: 3 TB

2MASS: 10 TB

GALEX: 30 TB

Sloan: 40 TB

Pan-STARRS: 40,000 TB

100,000 TB

1330 molec. bio databases

Nucleic Acids Research (96 in Jan 2001)

2004: 36 TB

2012: 2,300 TB

Climate model intercomparison project (CMIP) of the IPCC

Big science has achieved big successes

LIGO: 1 PB data in last science run, distributed worldwide

Robust production solutions

Substantial teams and expense

Sustained, multi-year effort

Application-specific solutions, built on common technology

OSG: 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010

ESG: 1.2 PB climate data delivered to 23,000 users; 600+ pubs


But small science is struggling

More data, more complex data

Ad-hoc solutions

Inadequate software, hardware

Data plan mandates


Medium-scale science struggles too!

Dark Energy Survey receives 100,000 files each night in Illinois

They transmit files to Texas for analysis … then move results back to Illinois

Process must be reliable, routine, and efficient

The cyberinfrastructure team is not large

Blanco 4m on Cerro Tololo

Image credit: Roger Smith/NOAO/AURA/NSF

The challenge of staying competitive

"Well, in our country," said Alice … "you'd generally get to somewhere else, if you run very fast for a long time, as we've been doing."

"A slow sort of country!" said the Queen. "Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!"

Current approaches are unsustainable


Small laboratories

– PI, postdoc, technician, grad students

– Estimate 5,000 across US university community

– Average ill-spent/unmet need of 0.5 FTE/lab?

Medium-scale projects

– Multiple PIs, a few software engineers

– Estimate 500 across US university community

– Average ill-spent/unmet need of 3 FTE/project?

Total 4000 FTE: at ~$100K/FTE => $400M/yr

Plus computers, storage, opportunity costs, …
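The arithmetic behind the 4000 FTE and $400M/yr totals can be checked with a short sketch. All inputs are the slide's own estimates (several are marked with a question mark there); the constant and function names are hypothetical, chosen only for this illustration.

```python
# Inputs copied from the slide; every one of them is the speaker's rough estimate.
SMALL_LABS = 5_000          # small laboratories across US universities
FTE_PER_LAB = 0.5           # ill-spent/unmet need per lab
MEDIUM_PROJECTS = 500       # medium-scale projects
FTE_PER_PROJECT = 3         # ill-spent/unmet need per project
COST_PER_FTE_USD = 100_000  # the slide's "~$100K/FTE"


def unmet_need_fte() -> float:
    """Total ill-spent or unmet effort, in FTE, across labs and projects."""
    return SMALL_LABS * FTE_PER_LAB + MEDIUM_PROJECTS * FTE_PER_PROJECT


def annual_cost_usd() -> float:
    """Yearly cost of that effort at the assumed per-FTE rate."""
    return unmet_need_fte() * COST_PER_FTE_USD


if __name__ == "__main__":
    print(f"{unmet_need_fte():.0f} FTE")
    print(f"${annual_cost_usd() / 1e6:.0f}M/yr")
```

Labs contribute 2500 FTE and projects 1500 FTE, which is how the slide arrives at 4000 FTE in total.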

And don’t forget administrative costs

42% of the time spent by an average PI on a federally funded research project was reported to be expended on administrative tasks related to that project rather than on research

Source: Federal Demonstration Partnership faculty burden survey, 2007

You can run a company from a coffee shop

Because businesses outsource their IT

Web presence

Email (hosted Exchange)

Calendar

Telephony (hosted VOIP)

Human resources and payroll

Accounting

Customer relationship mgmt

Software as a Service (SaaS)

And often their large-scale computing too

Web presence

Email (hosted Exchange)

Calendar

Telephony (hosted VOIP)

Human resources and payroll

Accounting

Customer relationship mgmt

Data analytics

Content distribution

Software as a Service (SaaS)

Infrastructure as a Service (IaaS)

Let’s rethink how we provide research IT

Accelerate discovery and innovation worldwide by providing research IT as a service

Leverage software-as-a-service to provide millions of researchers with unprecedented access to powerful tools; enable a massive shortening of cycle times in time-consuming research processes; and reduce research IT costs dramatically via economies of scale

Time-consuming tasks in science

Run experiments

Collect data

Manage data

Move data

Acquire computers

Analyze data

Run simulations

Compare experiment with simulation

Search the literature

Communicate with colleagues

Publish papers

Find, configure, install relevant software

Find, access, analyze relevant data

Order supplies

Write proposals

Write reports

…


Data movement can be surprisingly difficult


[Diagram: moving data from site A to site B]

Discover endpoints; determine available protocols; negotiate firewalls; configure software; manage space; determine required credentials; configure protocols; detect and respond to failures; determine expected performance; determine actual performance; identify, diagnose, and correct network misconfigurations; integrate with file systems; …

It took 2 weeks and much help from many people to move 10 TB between California and Tennessee. (2007 BES report)

Globus Online’s SaaS/Web 2.0 architecture

Web interface

HTTP REST interface

POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>
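As a rough illustration of what submitting a transfer over the REST interface might look like, the sketch below builds (but does not send) a POST request to the URL shown on the slide. The slide does not show the contents of the transfer document, so the JSON layout and its field names are hypothetical placeholders, and authentication is omitted entirely.

```python
import json
import urllib.request

# URL exactly as shown on the slide.
TRANSFER_URL = "https://transfer.api.globusonline.org/v0.10/transfer"


def build_transfer_request(source: str, destination: str,
                           src_path: str, dst_path: str) -> urllib.request.Request:
    """Build, without sending, a POST request carrying a transfer document."""
    # Hypothetical document layout: these field names are illustrative only
    # and are not taken from the slides or from the service's documentation.
    transfer_doc = {
        "source_endpoint": source,
        "destination_endpoint": destination,
        "items": [{"source_path": src_path, "destination_path": dst_path}],
    }
    return urllib.request.Request(
        TRANSFER_URL,
        data=json.dumps(transfer_doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Endpoint names follow the command-line example on the same slide.
    req = build_transfer_request("alcf#dtn", "nersc#dtn", "/myfile", "/myfile")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Building the request separately from sending it keeps the sketch runnable offline; a real client would also need credentials from one of the identity providers the slide lists.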

Fire-and-forget data movement

Automatic fault recovery

High performance

No client software install

Across multiple security domains
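To make "fire-and-forget" with "automatic fault recovery" concrete, here is a minimal sketch of one common pattern: retrying a failed step with exponentially growing delays. This is a generic technique shown for illustration, not a description of how Globus Online recovers from faults; `run_with_retries` and `flaky_copy` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], max_attempts: int = 5,
                     base_delay_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call task until it succeeds, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except OSError:
            # Network and file errors derive from OSError; give up on the last try.
            if attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable: the loop always returns or raises")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_copy() -> str:
        """Simulated transfer step that fails twice before succeeding."""
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("simulated network drop")
        return "done"

    # Pass a no-op sleep so the demonstration finishes immediately.
    print(run_with_retries(flaky_copy, sleep=lambda s: None), calls["n"])
```

The user submits the task once and the wrapper absorbs transient failures, which is the behavior the researcher quotes later in the deck contrast with babysitting a transfer by hand.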

Command line interface

ls alcf#dtn:/

scp alcf#dtn:/myfile nersc#dtn:/myfile

OpenID

OAuth

Shibboleth

(Hosted on)

GridFTP servers

FTP servers

Other protocols: HTTP, WebDAV, SRM, …

Globus Connect on local computers

Example application: UC sequencing facility

Mac using Globus Connect

Delivery of data to customer

Mount drive iBi File Server

Sequencing instrument

iBi general-purpose compute cluster

Sequencing-specific compute cluster

Statistics and user feedback

Launched November 2010

>1700 users registered

>500 TB user data moved

>30 million user files moved

>150 endpoints registered

Widely used on TeraGrid/XSEDE; other centers & facilities; internationally

>20x faster than SCP

Faster than hand-tuned


“Last time I needed to fetch 100,000 files from NERSC, a graduate student babysat the process for a month.”

“I expected to spend four weeks writing code to manage my data transfers; with Globus Online, I was up and running in five minutes.”

“Transferred 28 MB in 20 minutes instead of 61 hours. Makes these global climate simulations manageable.”

Moving 586 Terabytes in two weeks

Monitoring provides deep visibility

[Figure: transfer monitoring plot; scale labels: Kilobyte, Megabyte, Gigabyte, Terabyte]

20 Terabytes in less than one day

20 Gigabytes in more than two days
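The transfer figures quoted in this deck are easier to compare as mean data rates. The sketch below converts each one to megabits per second, assuming decimal units (1 TB = 10^12 bytes) and treating "less than one day" and "more than two days" as exactly one and two days, so those two results are bounds rather than measurements. The function and dictionary names are hypothetical.

```python
def mean_rate_mbps(terabytes: float, days: float) -> float:
    """Mean rate in megabits per second, assuming 1 TB = 10**12 bytes."""
    bits = terabytes * 1e12 * 8
    seconds = days * 86_400
    return bits / seconds / 1e6


# (volume in TB, duration in days), as quoted on the slides.
TRANSFERS = {
    "10 TB in 2 weeks (2007 BES report)": (10, 14),
    "586 TB in 2 weeks": (586, 14),
    "20 TB in 1 day (lower bound on rate)": (20, 1),
    "20 GB in 2 days (upper bound on rate)": (0.02, 2),
}

if __name__ == "__main__":
    for label, (volume_tb, days) in TRANSFERS.items():
        print(f"{label}: {mean_rate_mbps(volume_tb, days):,.2f} Mbit/s")
```

Under these assumptions the 2007 anecdote works out to roughly 66 Mbit/s, the 586 TB transfer to about 3.9 Gbit/s, the fast monitored transfer to at least 1.85 Gbit/s, and the slow one to at most about 0.93 Mbit/s, a spread of more than three orders of magnitude.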

Common research data management steps

Dark Energy Survey

Galaxy genomics

LIGO observatory

SBGrid structural biology consortium

NCAR climate data applications

Land use change; economics

We have choices of where to compute

Campus systems

– First target for many researchers

XSEDE supercomputers

– 220,000 cores, peer-reviewed awards

– Optimized for scientific computing

Open Science Grid

– 60,000 cores; high throughput

Commercial cloud providers

– Instant access for small tasks

– Expensive for big projects

Users insist that they need everything connected


Towards “research IT as a service”

Scientific data management as a service

GO-Store, GO-Collaborate, GO-Compute, GO-Catalog, GO-Galaxy, GO-Team, GO-Transfer, GO-User

Research data management as a service

GO-User Today

– Credentials and other profile information

GO-Transfer

– Data movement

GO-Team

– Group membership

Fall

GO-Collaborate

– Connect to collaborative tools: Jira, Confluence, …

GO-Store

Prototype

– Access to campus, cloud, XSEDE storage

GO-Catalog

– On-demand metadata catalogs

GO-Compute

– Access to computers

GO-Galaxy

– Share, create, run workflows

SaaS services in action: The XSEDE vision

[Diagram: services and providers connected through standard interfaces]

Services: User, Team, Catalog, Transfer, Compute, ...

Providers: Academic institution, XSEDE service provider, Commercial provider, Data provider, Open Science Grid, ...

InCommon

Data analysis as a service: Early steps

Securely and reliably:

1. Assemble code

2. Find computers

3. Deploy code

4. Run program

5. Access data

6. Store data

7. Record workflow

8. Reuse workflow

We have built such systems for biological, environmental, and economics researchers

[Diagram: VM image, App code, Workflow, Galaxy, Condor, and Data store, annotated with the step numbers above as [1, 2], [3, 4], [5, 6], and [7, 8]]

SaaS economics: A quick tutorial

Lower per-user cost (x10?) via aggregation onto common infrastructure

– $400M/yr → $40M/yr?

Initial “cost trough” due to fixed costs

Per-user revenue permits positive return to scale

Further reduce per-user cost over time

x10 reduction in per-user cost:

– $50K → $5K/yr per lab

– $300K → $30K/yr per project

[Figure: net return ($) against time, starting below zero in the initial cost trough]
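The per-lab and per-project figures on this slide follow from the earlier FTE estimates. The sketch below recomputes them, assuming the deck's ~$100K per FTE and taking the conjectured tenfold reduction at face value; the names are hypothetical and the tenfold factor is the slide's own open question, not a measured result.

```python
COST_PER_FTE_USD = 100_000  # the deck's "~$100K/FTE"
REDUCTION_FACTOR = 10       # the slide's conjectured "x10?"


def cost_before_and_after(fte: float) -> tuple[float, float]:
    """Yearly cost of the unmet need today and after the assumed reduction."""
    before = fte * COST_PER_FTE_USD
    return before, before / REDUCTION_FACTOR


if __name__ == "__main__":
    lab_before, lab_after = cost_before_and_after(0.5)      # per small lab
    proj_before, proj_after = cost_before_and_after(3)      # per medium project
    # 5,000 labs and 500 projects, as estimated earlier in the deck.
    national_after = 5_000 * lab_after + 500 * proj_after
    print(f"lab: ${lab_before:,.0f} -> ${lab_after:,.0f} per year")
    print(f"project: ${proj_before:,.0f} -> ${proj_after:,.0f} per year")
    print(f"national total after reduction: ${national_after / 1e6:.0f}M/yr")
```

The per-unit results match the slide ($50K to $5K per lab, $300K to $30K per project), and summing over all labs and projects gives the $40M/yr national figure.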

A national cyberinfrastructure strategy?

To provide more capability for more people at less cost …

Small and medium laboratories and projects

[Diagram: many laboratories (L) and projects (P)]

Create infrastructure

– Robust and universal

– Economies of scale

– Positive returns to scale

Research data management

Collaboration, computation

Research administration

(each delivered “as a service”)

Via the creative use of

– Aggregation (“cloud”)

– Federation (“grid”)

Acknowledgments


Colleagues at UChicago and Argonne

– Steve Tuecke, Ravi Madduri, Kyle Chard, Tanu Malik, and others listed at www.globusonline.org/about/goteam/

Carl Kesselman and other colleagues at other institutions

Participants in the recent ICiS workshop on “Human-Computer Symbiosis: 50 Years On”

NSF OCI and MPS; DOE ASCR; and NIH for support

For more information

www.globusonline.org; Twitter: @globusonline

Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.

Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Globus Online: Radical Simplification of Data Movement via SaaS. Communications of the ACM, 2011.

Thank you!

foster@uchicago.edu

www.globusonline.org

@globusonline