Introduction to Grid Computing
Slides adapted from Midwest Grid School Workshop 2008
(https://twiki.grid.iu.edu/bin/view/Education/MWGS08)
Overview

 Supercomputers, Clusters, and Grids
 Motivations and Applications
 Grid Middleware and Architectures
 Job Management, Data Management
 Grid Security
 National Grids
 Grid Workflow
Computing “Clusters” are today’s Supercomputers

[Diagram: a typical cluster has a cluster-management “frontend”, a few head nodes, gatekeepers and other service nodes, lots of worker nodes, I/O servers (typically RAID fileservers) with disk arrays, and tape backup robots.]
Cluster Architecture

[Diagram: cluster users reach the head node(s), which provide login access (ssh), the cluster scheduler (PBS, Condor, SGE), web service (http), and remote file access (scp, FTP, etc.). Job execution requests and status flow from the head node(s) to the compute nodes (10 to 10,000 PCs with local disks, Node 0 … Node N), all sharing a cluster filesystem that holds applications and data.]
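Work reaches the compute nodes only through the scheduler on the head node. As a minimal sketch (assuming a PBS-style scheduler; the job name, resource request, program and input file are illustrative), a batch job looks like this:

#!/bin/sh
# Request one CPU on one worker node for at most one hour:
#PBS -N example_job
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
# Run a hypothetical application from the directory the job was submitted from:
cd $PBS_O_WORKDIR
./my_app input.dat

The script is handed to the scheduler from the head node with qsub job.pbs and monitored with qstat.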
Scaling up Science: Citation Network Analysis in Sociology

[Figure: the citation network in 1975, 1980, 1985, 1990, 1995, 2000 and 2002. Work of James Evans, University of Chicago, Department of Sociology.]
Scaling up the analysis

 Query and analysis of 25+ million citations
 Work started on desktop workstations
 Queries grew to month-long duration
 With data distributed across the U of Chicago TeraPort cluster:
  50 (faster) CPUs gave a 100X speedup
  Many more methods and hypotheses can be tested!

Higher throughput and capacity enable deeper analysis and broader community access.
Grids consist of distributed clusters

[Diagram: a grid client (application & user interface plus client middleware) uses grid protocols to reach resource, workflow & data catalogs and the grid service middleware at each site. Grid Site 1: Fermilab, Grid Site 2: Sao Paulo, … Grid Site N: UWisconsin; each site fronts its own grid storage and compute cluster.]
Initial Grid driver: High Energy Physics

[Diagram: tiered LHC computing model; image courtesy Harvey Newman, Caltech.]
 There is a “bunch crossing” every 25 nsecs; there are 100 “triggers” per second, and each triggered event is ~1 MByte in size.
 Online System: ~PBytes/sec off the detector, feeding an Offline Processor Farm (~20 TIPS) at ~100 MBytes/sec. 1 TIPS is approximately 25,000 SpecInt95 equivalents.
 Tier 0: the CERN Computer Centre, receiving ~100 MBytes/sec.
 Tier 1: regional centres (France, Germany, Italy, FermiLab ~4 TIPS), each fed at ~622 Mbits/sec (or by air freight, deprecated).
 Tier 2: centres such as Caltech (~1 TIPS each), also connected at ~622 Mbits/sec.
 Institutes (~0.25 TIPS) keep a physics data cache. Physicists work on analysis “channels”; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server.
 Tier 4: physicist workstations, ~1 MBytes/sec.
Grids Provide Global Resources To Enable e-Science
Grids can process vast datasets.

 Many HEP and Astronomy experiments consist of:
  Large datasets as inputs (find datasets)
  “Transformations” which work on the input datasets (process)
  The output datasets (store and publish)
 The emphasis is on the sharing of these large datasets
 Workflows of independent programs can be parallelized.

[Figure: Montage workflow (~1200 jobs, 7 levels) used to create a mosaic of M42 on TeraGrid; nodes are compute jobs and edges are data transfers. NVO, NASA, ISI/Pegasus - Deelman et al.]
For Example: Digital Astronomy

 Digital observatories provide online archives of data at different wavelengths
 Ask questions such as: what objects are visible in the infrared but not in the visible spectrum?
Virtual Organizations

 Groups of organizations that use the Grid to share resources for specific purposes
 Support a single community
 Deploy compatible technology and agree on working policies
  Security policies - difficult
 Deploy different network-accessible services:
  Grid Information
  Grid Resource Brokering
  Grid Monitoring
  Grid Accounting
The Grid Middleware Stack (and course modules)

Layers, from top to bottom:
 Grid Application (M5) (often includes a Portal)
 Workflow system (explicit or ad-hoc) (M6)
 Job Management (M2) | Data Management (M3) | Grid Information Services (M5)
 Grid Security Infrastructure (M4)
 Core Globus Services (M1)
 Standard Network Protocols and Web Services (M1)
Grids consist of distributed clusters

[Diagram: distributed grid sites (Fermilab, Sao Paulo, … UWisconsin), each running grid service middleware in front of its grid storage and compute cluster, reached by grid clients through grid protocols and the resource, workflow & data catalogs.]
Globus and Condor play key roles

 Globus Toolkit provides the base middleware
  Client tools which you can use from a command line
  APIs (scripting languages, C, C++, Java, …) to build your own tools, or use directly from applications
  Web service interfaces
  Higher level tools built from these basic components, e.g. Reliable File Transfer (RFT)
 Condor provides both client & server scheduling
  In grids, Condor provides an agent to queue, schedule and manage work submission
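On the Condor side, a job is described to the scheduler in a small submit file. A minimal sketch (the executable and file names are illustrative):

# analyze.sub -- minimal Condor submit description
universe   = vanilla
executable = analyze
arguments  = citations.dat
output     = analyze.out
error      = analyze.err
log        = analyze.log
queue

It is submitted with condor_submit analyze.sub and watched with condor_q.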
Grid architecture is evolving to a Service-Oriented approach.

...but this is beyond our workshop’s scope.
See “Service-Oriented Science” by Ian Foster.

 Service-oriented applications
  Wrap applications as services
  Compose applications into workflows
 Service-oriented Grid infrastructure
  Provision physical resources to support application workloads

[Diagram: users compose application services into workflows; workflow invocation drives provisioning of the underlying Grid infrastructure. From “The Many Faces of IT as Service”, Foster & Tuecke, 2005.]
Job and resource management
Workshop Module 2
GRAM provides a uniform interface to diverse cluster schedulers.

[Diagram: a user submits work through GRAM, which dispatches it across the Grid to VO sites A-D, each running a different local scheduler: Condor, PBS, LSF, or plain UNIX fork().]
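From the user's side this looks roughly as follows (a hedged sketch using the Globus Toolkit's pre-WS GRAM clients; the host name is illustrative):

# Run a command at a remote site through its PBS jobmanager:
globus-job-run grid.example.edu/jobmanager-pbs /bin/hostname
# Submit in batch; this prints a job contact URL identifying the job:
globus-job-submit grid.example.edu/jobmanager-condor /bin/date
# Query the job later using that contact URL:
globus-job-status <job-contact-printed-above>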
DAGMan

 Directed Acyclic Graph Manager
 DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
 (e.g., “Don’t run job B until job A has completed successfully.”)
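That dependency is written in a small DAG input file (a minimal sketch; the file names are illustrative):

# ab.dag -- run job B only after job A succeeds
JOB A a.submit
JOB B b.submit
PARENT A CHILD B

It is started with condor_submit_dag ab.dag; DAGMan itself runs as a Condor job and submits A and B at the right times.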
Data Management
Workshop Module 3
Data management services provide the mechanisms to find, move and share data

 GridFTP
  Fast, flexible, secure, ubiquitous data transport
  Often embedded in higher level services
 RFT
  Reliable file transfer service using GridFTP
 Replica Location Service
  Tracks multiple copies of data for speed and reliability
 Storage Resource Manager
  Manages storage space allocation, aggregation, and transfer
 Metadata management services are evolving
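For example, a transfer between two GridFTP servers with the Globus Toolkit client (a hedged sketch; host and path names are illustrative):

# Copy a file between two sites using 4 parallel TCP streams:
globus-url-copy -p 4 \
  gsiftp://gridftp.site1.example.org/data/m42.fits \
  gsiftp://gridftp.site2.example.org/data/m42.fits

file:// URLs can be used in place of gsiftp:// for local endpoints.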
Grids replicate data files for faster access

 Effective use of the grid resources – more parallelism
 Each logical file can have multiple physical copies
 Avoids single points of failure
 Manual or automatic replication
  Automatic replication considers the demand for a file, transfer bandwidth, etc.
Grid Security
Workshop Module 4
Grid security is a crucial component

 Problems being solved might be sensitive
 Resources are typically valuable
 Resources are located in distinct administrative domains
  Each resource has its own policies, procedures, security mechanisms, etc.
 Implementation must be broadly available & applicable
  Standard, well-tested, well-understood protocols, integrated with a wide variety of tools
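In day-to-day use this means authenticating once by creating a short-lived proxy credential from your X.509 certificate. A hedged sketch of the GSI command-line tools (the VO name is illustrative and assumes the VO runs a VOMS server):

grid-proxy-init               # create a time-limited proxy certificate (prompts for your key passphrase)
grid-proxy-info               # show the proxy's subject and remaining lifetime
voms-proxy-init -voms myvo    # variant that also embeds VO membership attributes in the proxy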
National Grid
Cyberinfrastructure
Workshop Module 5
Open Science Grid

 50 sites (15,000 CPUs) & growing
 400 to >1000 concurrent jobs
 Many applications + CS experiments; includes long-running production operations
 Up since October 2003
 Diverse job mix

www.opensciencegrid.org
TeraGrid provides vast resources via a number of huge computing facilities.
To efficiently use a Grid, you must locate and monitor its resources.

 Check the availability of different grid sites
 Discover different grid services
 Check the status of “jobs”
 Make better scheduling decisions with information maintained on the “health” of sites
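With a Condor-based client, for instance, the basic checks look like this (a minimal sketch; the job id is illustrative):

condor_status            # which machines/slots are available and what state they are in
condor_q                 # status of your own submitted jobs
condor_q -analyze 1234   # explain why job 1234 has not started yet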
Virtual Organization (VO) Concept

[Diagram: people and resources drawn from Organization A and Organization B (faculty, staff and students; compute servers C1, C2, C3; file server F1 with disks A and B) are regrouped into Virtual Community C, where the same people appear in VO roles (principal investigator, administrator, researcher) and share a subset of the organizations' servers.]

 VO for each application or workload
 Carve out and configure resources for a particular use and set of users
 OSG Resource Selection Service: VORS
Grid Workflow
Workshop Module 6
A typical workflow pattern in image analysis runs many filtering apps.

[Diagram: image pairs (3a, 4a, 5a, 6a, each with .h and .i files) are aligned against a reference image by align_warp jobs, resampled by reslice jobs, combined by softmean into an atlas, then cut by slicer into atlas_x, atlas_y and atlas_z slices and converted to atlas_x.jpg, atlas_y.jpg and atlas_z.jpg. Workflow courtesy James Dobson, Dartmouth Brain Imaging Center.]
Swift scripts: Parallelism via foreach

type imagefile;

(imagefile output) flip(imagefile input) {
  app {
    convert "-rotate" "180" @input @output;
  }
}

imagefile observations[] <simple_mapper; prefix="orion">;
imagefile flipped[]      <simple_mapper; prefix="orion-flipped">;

// The foreach processes all dataset members in parallel;
// the mappers name outputs based on their inputs.
foreach obs, i in observations {
  flipped[i] = flip(obs);
}
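A hedged usage note: the script is handed to the Swift engine from the command line (the file name is illustrative), and the engine dispatches each flip() call as an independent job on the configured sites:

swift flip.swift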
Conclusion: Why Grids?

 New approaches to inquiry based on:
  Deep analysis of huge quantities of data
  Interdisciplinary collaboration
  Large-scale simulation and analysis
  Smart instrumentation
  Dynamically assembling the resources to tackle a new scale of problem
 Enabled by access to resources & services without regard for location & other barriers
Based on:
Grid Intro and Fundamentals Review
Dr. Gabrielle Allen
Center for Computation & Technology
Department of Computer Science
Louisiana State University
gallen@cct.lsu.edu