Introduction to Grid Computing

Slides adapted from the Midwest Grid School Workshop 2008 (https://twiki.grid.iu.edu/bin/view/Education/MWGS08).

Overview
- Supercomputers, clusters, and grids
- Motivations and applications
- Grid middleware and architectures
- Job management, data management
- Grid security
- National grids
- Grid workflow

Computing "Clusters" are today's Supercomputers

[Figure: annotated photograph of a cluster machine room, labeling the main hardware components.]
- Cluster management "frontend"
- I/O servers (typically RAID fileservers) and disk arrays
- A few head nodes, gatekeepers, and other service nodes
- Lots of worker nodes
- Tape backup robots

Cluster Architecture

[Figure: cluster users connect to one or more head nodes, which pass job execution requests and status to the compute nodes.]
- Head node(s) provide login access (ssh), the cluster scheduler (PBS, Condor, SGE), web services (http), and remote file access (scp, FTP, etc.)
- Compute nodes (node 0 ... node N): 10 to 10,000 PCs with local disks
- A shared cluster filesystem holds applications and data

Scaling up Science: Citation Network Analysis in Sociology

[Figure: growth of the citation network from 1975 to 2002.]
Work of James Evans, University of Chicago, Department of Sociology.

Scaling up the analysis
- Query and analysis of 25+ million citations
- Work started on desktop workstations; queries grew to month-long duration
- With data distributed across the U of Chicago TeraPort cluster, 50 (faster) CPUs gave a 100x speedup
- Many more methods and hypotheses can now be tested!

Higher throughput and capacity enable deeper analysis and broader community access.

Grids consist of distributed clusters

[Figure: a grid client (application and user interface, grid client middleware, and resource, workflow, and data catalogs) speaks grid protocols to the grid service middleware at each site — Grid Site 1: Fermilab; Grid Site 2: São Paulo; ...; Grid Site N: UWisconsin — each fronting grid storage and a compute cluster.]

Initial Grid driver: High Energy Physics

[Figure: the tiered LHC computing model; image courtesy Harvey Newman, Caltech.] There is a "bunch crossing" every 25 nsec, and about 100 "triggers" per second; each triggered event is ~1 MByte in size, so the detector's online system filters ~PBytes/sec down to ~100 MBytes/sec of recorded data. The CERN computer centre (Tier 0, ~20 TIPS, where 1 TIPS is approximately 25,000 SpecInt95 equivalents) feeds Tier 1 regional centres (France, Germany, Italy, and FermiLab at ~4 TIPS) over ~622 Mbits/sec links, or by air freight (deprecated). These in turn feed Tier 2 centres (e.g., Caltech, ~1 TIPS each), then institute servers (~0.25 TIPS) holding physics data caches, and finally physicists' workstations (Tier 4) at ~1 MBytes/sec. Physicists work on analysis "channels"; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server.

Grids Provide Global Resources To Enable e-Science

Grids can process vast datasets
- Many HEP and astronomy experiments consist of:
  - Large datasets as inputs (find datasets)
  - "Transformations" which work on the input datasets (process)
  - The output datasets (store and publish)
- The emphasis is on the sharing of these large datasets
- Workflows of independent programs can be parallelized

[Figure: mosaic of M42 created on TeraGrid by the Montage workflow — ~1200 data-transfer and compute jobs across 7 levels. NVO, NASA, ISI/Pegasus; Deelman et al.]

For Example: Digital Astronomy
- Digital observatories provide online archives of data at different wavelengths
- Ask questions such as: what objects are visible in infrared but not in the visible spectrum?
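To make the pattern above concrete — many independent "transformation" jobs run over the members of a dataset — here is a minimal sketch of a Condor submit description file. The executable and file names are hypothetical stand-ins; only the submit-file syntax, the $(Process) macro, and the queue count are Condor's own.

    # process_image and the .fits file names are hypothetical
    universe   = vanilla
    executable = process_image
    arguments  = input_$(Process).fits output_$(Process).fits
    output     = job_$(Process).out
    error      = job_$(Process).err
    log        = analysis.log
    queue 100

Submitted with condor_submit, this queues 100 independent jobs that the scheduler runs wherever worker nodes are free; DAGMan (Module 2, below) adds ordering constraints on top of this pattern.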
Virtual Organizations
- Groups of organizations that use the Grid to share resources for specific purposes
- Support a single community
- Deploy compatible technology and agree on working policies (security policies are the difficult part)
- Deploy various network-accessible services:
  - Grid information
  - Grid resource brokering
  - Grid monitoring
  - Grid accounting

The Grid Middleware Stack (and course modules), from top to bottom:
- Grid application, often including a portal (M5)
- Workflow system, explicit or ad hoc (M6)
- Job management (M2), data management (M3), and grid information services (M5)
- Grid Security Infrastructure (M4)
- Core Globus services (M1)
- Standard network protocols and web services (M1)

Recall: grids consist of distributed clusters — grid clients speak grid protocols to the service middleware fronting each site's storage and compute cluster, as in the figure above.

Globus and Condor play key roles
- The Globus Toolkit provides the base middleware:
  - Client tools which you can use from a command line
  - APIs (scripting languages, C, C++, Java, ...) to build your own tools, or to use directly from applications
  - Web service interfaces
  - Higher-level tools built from these basic components, e.g., Reliable File Transfer (RFT)
- Condor provides both client- and server-side scheduling; in grids, Condor provides an agent to queue, schedule, and manage work submission

Grid architecture is evolving toward a service-oriented approach (beyond this workshop's scope; see "Service-Oriented Science" by Ian Foster)
- Service-oriented applications: wrap applications as services and compose them into workflows
- Service-oriented grid infrastructure: provision physical resources to support application workloads

("The Many Faces of IT as Service", Foster and Tuecke, 2005.)

Job and Resource Management (Workshop Module 2)

GRAM provides a uniform interface to diverse cluster schedulers

[Figure: a user submits jobs through GRAM, which hands them to whatever scheduler each site runs — Condor at VO site A, LSF at VO site B, PBS at VO site C, plain UNIX fork() at VO site D.] A sketch of a command-line submission appears at the end of this module.

DAGMan
- Directed Acyclic Graph Manager
- DAGMan lets you specify the dependencies between your Condor jobs, so it can manage them automatically for you (e.g., "Don't run job B until job A has completed successfully."); see the sketch of a DAG input file below
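As an illustration of GRAM's uniform interface, here is roughly what a pre-web-services GRAM submission looks like using the Globus Toolkit command-line clients. The gatekeeper host name is hypothetical; globus-job-run and the host/jobmanager contact-string form are the toolkit's.

    # run /bin/hostname on a remote site's PBS cluster via its GRAM gatekeeper
    globus-job-run gatekeeper.example.org/jobmanager-pbs /bin/hostname

The same command against /jobmanager-condor or /jobmanager-lsf would target a different local scheduler; the user-facing interface does not change.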
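And a minimal sketch of a DAGMan input file for a diamond-shaped workflow: jobs B and C run only after A succeeds, and D runs only after both B and C. The submit-file names are hypothetical; JOB and PARENT/CHILD are DAGMan's actual directives.

    # diamond.dag
    JOB A a.sub
    JOB B b.sub
    JOB C c.sub
    JOB D d.sub
    PARENT A CHILD B C
    PARENT B C CHILD D

Running condor_submit_dag diamond.dag hands the whole graph to DAGMan, which submits each node's job once its parents have completed successfully.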
Data Management (Workshop Module 3)

Data management services provide the mechanisms to find, move, and share data:
- GridFTP: fast, flexible, secure, ubiquitous data transport; often embedded in higher-level services
- RFT (Reliable File Transfer): a reliable file transfer service built on GridFTP
- RLS (Replica Location Service): tracks multiple copies of data for speed and reliability
- SRM (Storage Resource Manager): manages storage space allocation, aggregation, and transfer
- Metadata management services are still evolving

Grids replicate data files for faster access
- Effective use of grid resources: more parallelism
- Each logical file can have multiple physical copies
- Avoids single points of failure
- Manual or automatic replication; automatic replication considers the demand for a file, transfer bandwidth, etc.

Grid Security (Workshop Module 4)

Grid security is a crucial component
- Problems being solved might be sensitive
- Resources are typically valuable
- Resources are located in distinct administrative domains; each resource has its own policies, procedures, security mechanisms, etc.
- The implementation must be broadly available and applicable: standard, well-tested, well-understood protocols, integrated with a wide variety of tools

National Grid Cyberinfrastructure (Workshop Module 5)

Open Science Grid (www.opensciencegrid.org)
- 50 sites (15,000 CPUs) and growing
- 400 to >1000 concurrent jobs
- Many applications plus CS experiments; includes long-running production operations
- Up since October 2003; diverse job mix

TeraGrid provides vast resources via a number of huge computing facilities.

To efficiently use a grid, you must locate and monitor its resources
- Check the availability of different grid sites
- Discover different grid services
- Check the status of "jobs"
- Make better scheduling decisions with information maintained on the "health" of sites

Virtual Organization (VO) Concept

[Figure: people and resources from two real organizations — the faculty, staff, and students of Organizations A and B, together with their compute servers and file servers — are partly carved out into a cross-cutting virtual community.]
- A VO for each application or workload
- Carve out and configure resources for a particular use and set of users
- OSG Resource Selection Service: VORS

Grid Workflow (Workshop Module 6)

A typical workflow pattern in image analysis runs many filtering apps

[Figure: a brain-imaging workflow, courtesy James Dobson, Dartmouth Brain Imaging Center — several align_warp jobs process input images against a reference, each feeding a reslice job; a softmean step merges the resliced images into an atlas, from which slicer and convert jobs produce the atlas_x, atlas_y, and atlas_z images.]

Swift scripts: parallelism via foreach

    type imagefile;

    (imagefile output) flip(imagefile input) {
      app { convert "-rotate" "180" @input @output; }
    }

    imagefile observations[] <simple_mapper; prefix="orion">;
    imagefile flipped[] <simple_mapper; prefix="orion-flipped">;

    foreach obs, i in observations {
      flipped[i] = flip(obs);
    }

- The foreach processes all dataset members in parallel
- Outputs are named based on inputs (via the mappers)

Conclusion: Why Grids?
- New approaches to inquiry based on:
  - Deep analysis of huge quantities of data
  - Interdisciplinary collaboration
  - Large-scale simulation and analysis
  - Smart instrumentation
- Dynamically assemble the resources to tackle a new scale of problem
- Enabled by access to resources and services without regard for location and other barriers

Based on: "Grid Intro and Fundamentals Review", Dr. Gabrielle Allen, Center for Computation & Technology, Department of Computer Science, Louisiana State University (gallen@cct.lsu.edu).