Grid computing : an
introduction
Lionel Brunie
Institut National des Sciences Appliquées
Laboratoire LIRIS – UMR CNRS 5205 – Equipe DRIM
Lyon, France
A Brain is a Lot of Data! (Mark Ellisman, UCSD)
And comparisons must be made among many.
We need to get to one micron to know the location of every cell. We are just now starting to get to 10 microns.
Data-intensive applications
• Nuclear and high-energy physics
• Simulations
– Earth observation, climate modelling
– Geophysics, earthquake modelling
– Aerodynamics and fluid dynamics
– Pollutant dispersion
• Astronomy: future telescopes will produce more than 10 Petabytes per year!
• Genomics
• Chemistry and biochemistry
• Financial applications
• Medical imaging
Evolution of the performance of computing components
• Network vs. processor performance
– Processor speed doubles every 18 months
– Network speed doubles every 9 months
– Disk storage capacity doubles every 12 months
• 1986 to 2000
– Processors: x 500
– Networks: x 340000
• 2001 to 2010
– Processors: x 60
– Networks: x 4000
Moore's Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan. 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner Perkins Caufield & Byers.
Conclusion: invest in networks!
Hansel and Gretel are lost in the forest of definitions
• Distributed system
• Parallel system
• Cluster computing
• Meta-computing
• Grid computing
• Peer-to-peer computing
• Global computing
• Internet computing
• Network computing
• Cloud computing
Distributed system
• n autonomous computers (sites): n administrators, n data/control flows
• an interconnection network
• User view: one single (virtual) system
– "A distributed system is a collection of independent computers that appear to the users of the system as a single computer" (Distributed Operating Systems, A. Tanenbaum, Prentice Hall, 1994)
• "Traditional" programmer view: client-server
Parallel System
• 1 computer, n nodes : one administrator, one scheduler,
one power source
• memory : it depends
• Programmer view : one single machine executing
parallel codes. Various programming models (message
passing, distributed shared memory, data parallelism…)
Examples of parallel systems
[Diagrams: a CC-NUMA architecture (CPUs and memories linked by an interconnection network) and a shared-nothing architecture (nodes, each with their own CPUs and memory, connected by a network and a peripheral network)]
Cluster computing
• Use of PCs interconnected by a (high-performance) network as a (cheap) parallel machine
• Two main approaches
– dedicated network (based on a high-performance network: Myrinet, SCI, Infiniband, Fibre Channel...)
– non-dedicated network (based on a (good) LAN)
Where are we today ?
• A source for efficient and up-to-date information :
www.top500.org
• The 500 best architectures !
• N° 1 : 1457 (1105) Tflops ! N° 500 : 22 (13) Tflops
• Sum (1-500) = 16953 Tflops
• 31% in Europe, 59% in North America
• 1 Flops = 1 floating point operation per second
• 1 TeraFlops = 1000 GigaFlops
How does it grow?
• in 1993 (prehistoric times!)
– n°1 : 59.7 GFlops
– n°500 : 0.4 GFlops
– Sum = 1.17 TFlops
• in 2004 (yesterday)
– n°1 : 70 TFlops (x1118)
– n°500 : 850 GFlops (x2125)
– Sum = 11274 TFlops and 408629 processors
2007/11 best (http://www.top500.org/): Peak: 596 Tflops!!!
2008/11 best (http://www.top500.org/): Peak: 1457 Tflops!!!
2009/11 best (http://www.top500.org/): Peak: 2331 Tflops!!!
Performance evolution
Projected performance
Architecture distribution
Interconnection network distribution
NEC Earth Simulator
(1st in 2004; 30th in 2007)
• Single-stage crossbar: 2700 km of cables
• A MIMD with distributed memory
• 700 TB disk space
• 1.6 PB mass storage
• Area: 4 tennis courts, 3 floors
BlueGene
• 212992 processors – 3D torus
• Rmax = 478 Tflops ; Rpeak = 596 Tflops
RoadRunner
• 3456 nodes (18 clusters) - 2-stage fat tree Infiniband (optical)
• 1 node = 2 AMD Opteron DualCore + 4 IBM PowerXCell 8i
• Rmax = 1.1 PFlops; Rpeak = 1.5 PFlops
• 3.9 MW (0.35 GFlops/W)
Jaguar
• 224162 cores – Memory: 300 TB – Disk: 10 PB
• AMD x86_64 Opteron Six Core 2600 MHz (10.4 GFlops)
• Rmax = 1759 TFlops – Rpeak = 2331 TFlops
• Power: 6.95 MW
• http://www.nccs.gov/jaguar/
Network computing
• From LAN (cluster) computing to WAN
computing
• Set of machines distributed over a MAN/WAN
that are used to execute parallel loosely coupled
codes
• Depending on the infrastructure (software and hardware), network computing comes in several flavours: Internet computing, P2P, Grid computing, etc.
Meta computing (early 90's)
• Definitions become fuzzy...
• A meta computer = a set of (widely) distributed (high-performance) processing resources that can be associated for processing a parallel, not so loosely coupled, code
• A meta computer = a parallel virtual machine over a distributed system
[Diagram: clusters of PCs, a supercomputer and a visualization cluster interconnected through SANs, a LAN and a WAN]
Internet computing
• Use of (idle) computers interconnected by the Internet for processing large-throughput applications
• Ex: SETI@HOME
– 5M+ users since launch
– 2009/11: 930k users, 2.4M computers; 190k active users, 278k active computers, 2M years of CPU time
– 234 "countries"
– 10^21 floating point operations since 1999
– 769 TFlop/s!
– BOINC infrastructure (Décrypthon, RSA155…)
• Programmer view: a single master, n servants
Global computing
• Internet computing on a pool of sites
• Meta computing with loosely coupled
codes
• Grid computing with poor communication
facilities
• Ex : Condor (invented in the 80’s)
Peer to peer computing
• A site is both client and server : servent
• Dynamic servent discovery by « contamination »
• 2 approaches :
– centralized management : Napster, Kazaa, eDonkey…
– distributed management : Gnutella, KAD, Freenet,
bittorrent…
• Application : file sharing
Grid computing (1)
“coordinated resource sharing and problem
solving in dynamic, multi-institutional virtual
organisations” (I. Foster)
Grid computing (2)
• Information grid
– large access to distributed data (the Web)
• Data grid
– management and processing of very large
distributed data sets
• Computing grid
– meta computer
• Ex : Globus, Legion, UNICORE…
Parallelism vs grids: some reminders
• Grids date back "only" to 1996
• Parallelism is older! (first classification in 1972)
• Motivations:
– need more computing power (weather forecast, atomic simulation, genomics…)
– need more storage capacity (petabytes and more)
– in a word: improve performance! 3 ways...
Work harder --> use faster hardware
Work smarter --> optimize algorithms
Get help --> use more computers!
The performance? Ideally it grows linearly
• Speed-up:
– if TS is the best time to process a problem sequentially,
– then its time should be TP = TS/P with P processors!
– Speedup = TS/TP
– limited (Amdahl's law): any program has a sequential part and a parallel part, TS = F + T//,
– thus the speedup is bounded: S = (F + T//) / (F + T///P) < (F + T//)/F; with normalized time (F + T// = 1), S < 1/F (see the short sketch after this list)
• Scale-up:
– if TPS is the time to process a problem of size S with P processors,
– then TPS should also be the time to process a problem of size n*S with n*P processors
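A tiny numerical illustration of this bound (a Python sketch, not part of the original slides; the 5% serial fraction is an arbitrary example):

# Minimal sketch: Amdahl's law with normalized sequential time,
# i.e. F + T// = 1 where F is the serial fraction of the program.
def speedup(serial_fraction: float, processors: int) -> float:
    """Speedup of a program whose serial fraction is F, run on P processors."""
    f = serial_fraction
    return 1.0 / (f + (1.0 - f) / processors)

if __name__ == "__main__":
    # Even with 1000 processors, a 5% serial part caps the speedup below 1/F = 20.
    for p in (10, 100, 1000):
        print(p, round(speedup(0.05, p), 2))   # -> about 6.9, 16.8, 19.6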
Grid computing
Starting point
• Real need for very high performance infrastructures
• Basic idea : share computing resources
– "The sharing that the GRID is concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering" (I. Foster)
Applications
• Distributed supercomputing
• High throughput computing
• On demand (real time) computing
• Data intensive computing
• Collaborative computing
An Example Virtual Organization:
CERN’s Large Hadron Collider
Worldwide LHC Computing Grid (WLCG)
1800 Physicists, 140 Institutes, 33 Countries
10 PB of data per year ; 50,000 CPUs?
LCG System Architecture
• A four-tier computing model
– Tier-0 (1): CERN, at the accelerator
• Data acquisition and reconstruction
• Distribution of the data to the Tier-1s (~online)
– Tier-1 (11)
• 24x7 access and availability
• Quasi-online data acquisition
• Data service on the grid: recording through a mass-storage service
• Heavy data analysis
• ~10 countries
– Tier-2 (~150)
• Simulation
• End users, batch and interactive data analysis
• ~40 countries
– Tier-3 (~50)
• End users, scientific analysis
LHC
• 40 million collisions per second
• ~100 collisions of interest per second after filtering
• 1-10 MB of data for each collision
• Acquisition rate: 0.1 to 1 GB/sec
• 10^10 collisions recorded each year
• ~10 PetaBytes/year
LCG System Architecture (continued)
[Diagram: the Trigger and Data Acquisition System feeds Tier-0; Tier-0 is connected to the Tier-1s by 10 Gbps links over an Optical Private Network (to almost all sites); the Tier-2s are reached through general-purpose/academic/research networks]
From F. Malek – LCG France
Back to roots (routes)
• Railways, telephone, electricity, roads, banking system
• Complexity, standards, distribution, integration (large/small)
• Impact on society: how the US grew
• Big differences:
– clients (the citizens) are NOT providers (the State or companies)
– small number of actors/providers
– small number of applications
– strong supervision/control
Computational grid
• "Hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities"
• Performance criteria:
– security
– reliability
– computing power
– latency
– throughput
– scalability
– services
Some reminders about parallelism
Sources of parallelism (illustrated on a query over relations R1 and R2):
---> pipeline parallelism
---> intra-operator parallelism
---> inter-operator parallelism
---> inter-query parallelism
Parallel Execution Plan (PEP)
• The "scenario" of the query processing: defines the role played by every processor and its interaction with the other ones
• Built from a plan model, a cost model, search heuristics and a search space
Intrinsic limitations
• Startup time
• Contentions:
– concurrent accesses to shared resources
– sources of contention:
• architecture
• data partitioning
• communication management
• execution plan
• Load imbalance
– response time = that of the slowest process
– NEED to balance data, IO, computations, communications
Parallel Execution Scenario
---> Operator processing ordering
---> degree of inter-operation parallelism
---> access method (e.g. indexed access)
---> characteristics of intermediate relations
(e.g. cardinality)
---> degree of intra-operation parallelism
---> algorithm (e.g. hybrid hash join)
---> redistribution procedures
---> synchronizations
---> scheduling
---> mapping
---> control strategies
Execution control
• Precisely planning the processing of a query is impossible!
– only global and partial information is available
– dynamic parameters
• Hence it is mandatory to control the execution and to dynamically re-optimize the plan
– load balancing
– re-parallelization
End of reminders…
Levels of cooperation
in a computing grid
• End system (computer, disk, sensor…)
– multithreading, local I/O
• Cluster (heterogeneous)
– synchronous communications, DSM, parallel I/O
– parallel processing
• Intranet
– heterogeneity, distributed admin, distributed FS and databases
– low supervision, resource discovery
– high throughput
• Internet
– no control, collaborative systems, (international) WAN
– brokers, negotiation
Grid characteristics
• Large scale
• Heterogeneity
• Multiple administration domains
• Autonomy… and coordination
• Dynamicity
• Flexibility
• Extensibility
• Security
Basic services
• Authentication/Authorization/Traceability
• Activity control (monitoring)
• Resource discovery
• Resource brokering
• Scheduling
• Job submission, data access/migration and execution
• Accounting
Layered Grid Architecture (by analogy to the Internet architecture)
• Application
• Collective: "Coordinating multiple resources" (ubiquitous infrastructure services, app-specific distributed services)
• Resource: "Sharing single resources" (negotiating access, controlling use)
• Connectivity: "Talking to things" (communication (Internet protocols) & security)
• Fabric: "Controlling things locally" (access to, & control of, resources)
Internet Protocol Architecture analogues: Application, Transport, Internet, Link
From I. Foster
Aspects of the Problem
• Need for interoperability when different groups want to share resources
– Diverse components, policies, mechanisms
– E.g., standard notions of identity, means of communication, resource descriptions
• Need for shared infrastructure services to avoid repeated development, installation
– E.g., one port/service/protocol for remote access to computing, not one per tool/application
– E.g., Certificate Authorities: expensive to run
• A common need for protocols & services
From I. Foster
Resources
• Description
• Advertising
• Cataloging
• Matching
• Claiming
• Reserving
• Checkpointing
Resource layers
• Application layer
– tasks, resource requests
• Application resource management layer
– intertask resource management, execution environment
• System layer
– resource matching, global brokering
• Owner layer
– owner policy: who may use what
• End-resource layer
– end-resource policy (e.g. O.S.)
Resource management (1)
• Services and protocols depend on the infrastructure
• Some parameters
– stability of the infrastructure (same set of resources or not)
– freshness of the resource availability information
– reservation facilities
– multiple-resource or single-resource brokering
• Example request: I need from 10 to 100 computing elements (CEs), each with at least 128 MB RAM and a computing power of 50 MIPS (see the sketch below)
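As an illustration only, here is a minimal Python sketch of such a request being matched against a catalogue of computing elements; the ComputeElement record and the catalogue contents are made-up assumptions, not an actual brokering API:

# Minimal sketch (assumed data model, not a grid middleware API): filtering a
# catalogue of computing elements (CEs) against the example request
# "10 to 100 CEs, each with >= 128 MB RAM and >= 50 MIPS".
from dataclasses import dataclass

@dataclass
class ComputeElement:
    name: str
    ram_mb: int
    mips: int

def match(catalogue, min_ram_mb=128, min_mips=50, min_count=10, max_count=100):
    """Return a suitable subset of CEs, or None if the request cannot be satisfied."""
    suitable = [ce for ce in catalogue if ce.ram_mb >= min_ram_mb and ce.mips >= min_mips]
    if len(suitable) < min_count:
        return None                     # not enough matching resources
    return suitable[:max_count]         # cap at the requested maximum

catalogue = [ComputeElement(f"ce{i:03d}", ram_mb=64 * (i % 8 + 1), mips=10 * (i % 12 + 1))
             for i in range(200)]
selection = match(catalogue)
print(len(selection) if selection else "request cannot be satisfied")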
Resource management and scheduling (1)
• Levels of scheduling
– job scheduling (global level; perf: throughput)
– resource scheduling (perf: fairness, utilization)
– application scheduling (perf: response time, speedup, produced data…)
• Mapping/scheduling (see the sketch below)
– resource discovery and selection
– assignment of tasks to computing resources
– data distribution
– task scheduling on the computing resources
– (communication scheduling)
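A minimal sketch of the "assignment of tasks to computing resources" step, assuming independent tasks and a simple work/speed cost model; this greedy least-loaded heuristic is given purely as an illustration, not as the scheduler of any particular middleware:

# Sketch: greedy mapping of independent tasks onto resources, largest task
# first, always choosing the resource that currently finishes earliest.
import heapq

def map_tasks(task_work, resource_speed):
    """Return ({resource: [task indices]}, makespan) under runtime = work / speed."""
    heap = [(0.0, r) for r in resource_speed]        # (current finish time, resource)
    heapq.heapify(heap)
    assignment = {r: [] for r in resource_speed}
    # Scheduling the largest tasks first usually tightens the makespan.
    for idx in sorted(range(len(task_work)), key=lambda i: -task_work[i]):
        finish, r = heapq.heappop(heap)
        assignment[r].append(idx)
        heapq.heappush(heap, (finish + task_work[idx] / resource_speed[r], r))
    makespan = max(t for t, _ in heap)
    return assignment, makespan

tasks = [5, 3, 8, 2, 7, 4]                           # arbitrary work units
speeds = {"clusterA": 2.0, "clusterB": 1.0}          # relative speeds (illustrative)
print(map_tasks(tasks, speeds))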
Resource management and scheduling (2)
• Individual perfs are not necessarily consistent with the
global (system) perf !
• Grid problems
– predictions are not definitive : dynamicity !
– Heterogeneous platforms
– Checkpointing and migration
A Resource Management System example (Globus)
[Diagram: the application expresses its needs in RSL (Resource Specification Language); a Broker specializes the RSL using queries to, and information from, the Information Service; a Co-allocator splits the resulting ground RSL into simple ground RSL requests sent to the local resource managers (GRAM), which drive local schedulers such as LSF, Condor or NQE]
LSF: Load Sharing Facility (task scheduling and load balancing; developed by Platform Computing)
NQE: Network Queuing Environment (batch management; developed by Cray Research)
Resource information (1)
• What is to be stored?
– virtual organizations, people, computing resources, software packages, communication resources, event producers, devices…
– what about data???
• A key issue in such dynamic environments
• A first approach: a (distributed) directory (LDAP), as sketched below
– easy to use
– tree structure
– distribution
– static
– mostly read; inefficient updating
– hierarchical
– poor procedural language
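For illustration, querying such a directory from Python with the ldap3 library might look like the sketch below; the server address, base DN, object class and attribute names are assumptions, not the actual schema of a grid information service such as Globus MDS:

# Illustrative sketch: querying an LDAP-based grid information service.
# Server address, base DN, object class and attribute names are made up.
from ldap3 import Server, Connection, ALL

server = Server("ldap://gis.example.org:2135", get_info=ALL)
conn = Connection(server, auto_bind=True)            # anonymous bind

# LDAP filter: compute elements with at least 10 free CPUs and 128 MB RAM.
conn.search(
    search_base="o=grid",
    search_filter="(&(objectClass=computeElement)(freeCpus>=10)(ramMb>=128))",
    attributes=["hostName", "freeCpus", "ramMb"],
)
for entry in conn.entries:
    print(entry.hostName, entry.freeCpus, entry.ramMb)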
Resource information (2)
• But:
– dynamicity
– complex relationships
– frequent updates
– complex queries
• A second approach: a (relational) database
Programming on the grid: potential programming models
• Message passing (PVM, MPI) (see the sketch below)
• Distributed Shared Memory
• Data Parallelism (HPF, HPC++)
• Task Parallelism (Condor)
• Client/server - RPC
• Agents
• Integration systems (Corba, DCOM, RMI)
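To make the first model concrete, here is a minimal message-passing sketch using mpi4py (one possible Python binding of MPI, chosen here only for illustration; the slides do not prescribe a particular binding):

# Minimal message-passing sketch with mpi4py: rank 0 scatters work,
# every rank computes a partial sum, and rank 0 reduces the results.
# Run with e.g.:  mpiexec -n 4 python partial_sums.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = list(range(1_000))
    chunks = [data[i::size] for i in range(size)]    # one chunk per rank
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)                 # distribute the work
partial = sum(chunk)                                 # local computation
total = comm.reduce(partial, op=MPI.SUM, root=0)     # gather the result

if rank == 0:
    print("total =", total)                          # 499500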
Program execution: issues
• Parallelize the program with the right job structure, communication patterns/procedures, algorithms
• Discover the available resources
• Select the suitable resources
• Allocate or reserve these resources
• Migrate the data
• Initiate computations
• Monitor the executions; checkpoints?
• React to changes
• Collect results
Data management
• It was long forgotten!!!
• Though it is a key issue!
• Issues:
– indexing
– retrieval
– replication
– caching
– traceability
– (auditing)
• And security!!!
(Bruni, not BruniE!!!)
From computing grids to
information grids
From computing grids to information
grids (1)
• Grids lack most of the tools mandatory to share (index, search,
access), analyze, secure, monitor semantic data (information)
• Several reasons :
– history
– money
– difficulty
• Why is it so difficult ?
– Sensitivity but openness
– Multiple administrative domains, multiple actors, heterogeneity, but a single global architecture/view/system
– Dynamicity and unpredictability but robustness
– Wideness but high performance
From computing grids to information grids (2)
ex: the Replica Management Problem
• Maintain a mapping between logical names for files and collections and one or more physical locations (a toy version of this mapping is sketched below)
• Decide where and when a piece of data must be replicated
• Important for many applications
• Example: CERN high-level trigger data
– Multiple petabytes of data per year
– Copy of everything at CERN (Tier 0)
– Subsets at national centers (Tier 1)
– Smaller regional centers (Tier 2)
– Individual researchers have copies of pieces of data
• Much more complex with sensitive and complex data like medical data!!!
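A toy version of that logical-to-physical mapping, sketched in Python (an in-memory stand-in, not the API of an actual Replica Location Service; the LFN/PFN strings are invented):

# Toy replica catalogue: maps a logical file name (LFN) to one or more
# physical file names (PFNs). Purely illustrative.
from collections import defaultdict

class ReplicaCatalogue:
    def __init__(self):
        self._replicas = defaultdict(set)    # LFN -> set of PFNs

    def register(self, lfn: str, pfn: str) -> None:
        """Record that a physical copy of `lfn` exists at `pfn`."""
        self._replicas[lfn].add(pfn)

    def lookup(self, lfn: str) -> set:
        """Return all known physical locations of `lfn`."""
        return set(self._replicas.get(lfn, set()))

catalogue = ReplicaCatalogue()
catalogue.register("lfn:run42/event.dat", "gsiftp://tier0.cern.ch/data/run42/event.dat")
catalogue.register("lfn:run42/event.dat", "gsiftp://tier1.example.org/cache/event.dat")
print(catalogue.lookup("lfn:run42/event.dat"))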
From computing grids to information grids (3):
some (still…) open issues
• Security, security, security (incl. privacy, monitoring, traceability…) at a semantic level
• Access protocols (incl. replication, caching, migration…)
• Indexing tools
• Brokering of data (incl. accounting)
• (Content-based) Query optimization and execution
• Mediation of data
• Data integration, data warehousing and analysis tools
• Knowledge discovery and data mining
Functional View of Grid Data Management
[Diagram: the application relies on a Metadata Service (location based on data attributes), a Replica Location Service (location of one or more physical replicas) and Information Services (state of grid resources, performance measurements and predictions). A Planner handles data location, replica selection and the selection of compute and storage nodes; an Executor initiates data transfers and computations through the Data Movement and Data Access services over the Compute and Storage Resources. Security and Policy apply throughout.]
A bit simplistic, though…
Grid Security (1)
Why Grid Security is Hard
• Resources being used may be extremely valuable & the
problems being solved extremely sensitive
• Resources are often located in distinct administrative domains
– Each resource may have own policies & procedures
• Users may be different
• The set of resources used by a single computation may be large,
dynamic, and/or unpredictable
– Not just client/server
• The security service must be broadly available & applicable
– Standard, well-tested, well-understood protocols
– Integration with wide variety of tools
Grid Security (2)
Various views
• User view
1) Easy to use
2) Single sign-on
3) Run applications: ftp, ssh, MPI, Condor, Web,…
4) User-based trust model
5) Proxies/agents (delegation)
• Resource owner view
1) Specify local access control
2) Auditing, accounting, etc.
3) Integration w/ local system: Kerberos, AFS, license mgr.
4) Protection from compromised resources
• Developer view
– API/SDK with authentication, flexible message protection, flexible communication, delegation, ...
– Direct calls to various security functions (e.g. GSS-API)
– Or security integrated into higher-level SDKs: e.g. GlobusIO, Condor
Grid security (3)
Requirements
• Authentication
• Authorization and delegation of authority
• Assurance
• Accounting
• Auditing and monitoring
• Traceability
• Integrity and confidentiality
Access to data and Mediation
• Heavens, where are the data?
• Use case: a German tourist has a heart attack in Nice
• Data inside the grid ≠ data at the side of the grid!
• Basic idea
– use of metadata/indexes. Problem: indexes are (sensitive) information
• Alternative
– encrypted indexes, use of views, proxies => DSEM/DM2
• Mediation
– no single view of the world => mechanisms for interoperability, ontologies
• Negotiation: a key open issue
Caching
• Motivation:
– Collaborative caching has been proved to be efficient
– Each institution wants to control the access to its data
– No standard exists in grids for caching
• Proposal (sketched below):
– use metadata to collaborate / index data
– a two-level cache: local caches and a global virtual cache
– a single access point to internal data (proxy)
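A minimal sketch of the two-level idea, under the assumption that a site proxy is the only access point to its internal data; class and key names are illustrative, not the project's actual design:

# Each site keeps a local cache behind a single proxy entry point; the global
# virtual cache stores no data, only which site can serve which key.
class SiteProxy:
    """Single access point to a site's internal data and local cache."""
    def __init__(self, name, backend):
        self.name, self._backend, self._local = name, backend, {}

    def get(self, key):
        if key not in self._local:                   # local cache miss
            self._local[key] = self._backend(key)    # fetch from internal storage
        return self._local[key]

class GlobalVirtualCache:
    """Keeps only metadata: key -> name of the site that holds it."""
    def __init__(self, sites):
        self._sites = {s.name: s for s in sites}
        self._index = {}                             # key -> site name

    def publish(self, key, site_name):
        self._index[key] = site_name

    def get(self, key):
        site = self._sites[self._index[key]]         # route to the right proxy
        return site.get(key)

hospital = SiteProxy("hospital", lambda k: f"<record {k}>")
lab = SiteProxy("lab", lambda k: f"<dataset {k}>")
grid_cache = GlobalVirtualCache([hospital, lab])
grid_cache.publish("patient-42", "hospital")
print(grid_cache.get("patient-42"))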
Query optimization and execution
• Old wine in new bottles ?
• Yes and no: the problem itself seems unchanged, but the operational context has changed so much that classical heuristics and methods are no longer pertinent
• Key issues :
– Dynamicity
– Unpredictability
– Adaptability
• Very few works have specifically addressed this problem
An application example: GGM
Grille Geno-Medicale (Geno-Medical Grid)
An application example: GGM
Biomedical grids
• Biomedical applications are perfect candidates for gridification:
– Huge volumes of data (a hospital = several TB per year)
– Dissemination of data
– Collaborative work (health networks)
– Very hard requirements (e.g. response time)
• But
– Partially structured semantic data
– Very strong privacy issues
… a perfect playing field for researchers!
An application example: GGM
Motivation (1)
• Dissemination of new "high bandwidth" technologies in genome and proteome research (e.g. micro-arrays)
– huge volumes of structural (gene localization) and functional (gene expression) data
• Generalization of digital patient files and digital medical images
• Implementation of (regional and international) health networks
• All the information is available, people are connected to the network.
• The question is: how can we use it?
An application example: GGM
Motivation (2)
• Need for an information infrastructure to
– index, exchange/share, process all this data
– while preserving their privacy at a very large scale
• That is... just a good grid!
• Application objectives:
– correlation of genomic and medical data: fundamental
research and later medical decision making process
– patient-centered medical data integration: patient monitoring inside and outside the hospital
– epidemiology
– training
An application example: GGM
Motivation (3)
• References: “Synergy between medical informatics and
bioinformatics: facilitating genomic medicines for future
healthcare”,
– BIOINFOMED Working Group, Jan. 2003, European Commission
• Proceedings of the Healthgrid conferences (Lyon 2003, Clermont-Ferrand 2004, Oxford 2005, Valencia 2006)
An application example: GGM Scope
“The goal of the GGM project is, on top of a grid
infrastructure, to propose a software architecture able to
manage heterogeneous and dynamic data stored in
distributed warehouses for intensive analysis and
processing purposes.”
• Distributed Data Warehouses
• Query Optimization
• Data Access [and control]
• Data Mining
An application example: GGM
Coming back: semantic data?
• A medical piece of data (age, image, biological result, salient
object in an image) has a meaning
– conveys information that can be (differently) interpreted
• Meta-data can be attached to medical data… or not
– pre-processing is necessary
• Medical data are often private
– privacy/delegation
• The medical data of a patient are often disseminated over
multiple sites
– access rights/authentication problem, collection/integration of data
into partial views, identification of data/users
• Medical (meta-)data are complex and not yet standardized
– no global structure
An application example: GGM
Architecture
[Architecture diagram: DW-GUI clients on top of the GGM components (NDS, DW, DM, OQS, DA+Cache and wrappers), which sit on the grid middleware; experiments involve GO and medical data]
Virtual data warehouses
on the grid
Virtual data warehouse on the grid (1)
• Almost nothing exists yet…
• Why is it so difficult?
– multiple administrative domains
– very sensitive data => security/privacy issues
– wide distribution
– unpredictability
– relationship with data replicas
– heterogeneity
– dynamicity (permanent production of large volumes of data)
• A centralized data warehouse?
– Not realistic at a large scale, and not acceptable
Virtual data warehouse on the grid (2)
• A possible direction of research: virtual data warehouses on the grid
• Components:
– a federated schema
– a set of partial views ("chunks") materialized at the local system level
• Advantages
– Flexibility w.r.t. users' needs
– Good use of the storage capacity of the grid, and scalability
– Security control at the local level
– Global view of the disseminated data
Virtual data warehouse on the grid (3)
• Drawbacks and open issues
– maintenance protocols
– indexing tools
– access to data and negotiation
– query processing
• Use of mobile agents?
Access to data and collaborative brokers
Access to data and collaborative brokers (1)
• Brokers act as interfaces between data, services and applications
• Possible locations
– at the interface between the grid and the external data repositories
– on the grid storage elements
– at the interface between the grid and the user
– inside the network (e.g. routers)
• Open issues
– caching: computation results, partial query results…
– data indexing
– prefetching
– user customization
– inter-broker collaboration
– a key issue: security and privacy
Access to data and collaborative brokers (2): security and privacy
• Medical data belong to the patient, who should be able to grant access rights to whomever he wants
• To whom do processed (even anonymized) data belong?
• How can one combine privacy and dissemination/replication/caching?
• What about traceability?
Data mining and knowledge extraction on the grid
• Structure of the data: few records, many attributes
• Parallelizing data mining algorithms for the grid (see the sketch below)
– volatility of the resources (data, processing)
– fault tolerance, checkpointing
– distribution of the data: local data exploration + an aggregation function to converge towards a unified model
– incremental production of the data => active data mining techniques
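One possible (assumed) realization of "local exploration + aggregation", sketched with a simple linear model whose coefficients are averaged across sites; the model choice and the weighting are illustrative, not the project's algorithm:

# Each site fits a local linear model on its own data; only the coefficients
# (and sample counts) travel, and they are aggregated into a unified model.
import numpy as np

def local_model(x, y):
    """Least-squares fit of y = a*x + b on one site's local data."""
    a, b = np.polyfit(x, y, deg=1)
    return np.array([a, b]), len(x)

def aggregate(models):
    """Sample-weighted average of the local coefficient vectors."""
    coeffs = np.array([c for c, _ in models])
    weights = np.array([n for _, n in models], dtype=float)
    return (coeffs * weights[:, None]).sum(axis=0) / weights.sum()

rng = np.random.default_rng(0)
sites = []
for n in (50, 200, 80):                      # three sites with different amounts of data
    x = rng.uniform(0, 10, n)
    y = 3.0 * x + 1.0 + rng.normal(0, 0.5, n)
    sites.append(local_model(x, y))

print(aggregate(sites))                      # close to [3.0, 1.0]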
A short overview
of some grid middlewares
The Legion system
• University of Virginia
• Object-oriented approach. Objects = data, applications, sensors,
computing resources, codes… : all is object !
• Loosely coupled codes
• Single naming space
• Reuse of existing OS and protocols ; definition of message
formats and high level protocols
• Core objects: naming, binding, object creation/activation/deactivation/destruction
• Methods : description via an IDL
• Security : in the hands of the users
• Resource allocation : a site can define its own policy
High-Throughput Computing: Condor
• High-throughput computing platform for mapping
many tasks to idle computers
• Since 1986 !
• Major components
– A central manager manages pool(s) of [distributively owned
or dedicated] computers. A CM = scheduler + coordinator
– DAGman manages user task pools
– Matchmaker schedules tasks to computers using classified
ads
– Checkpointing and process migration
– No simple communications
• Parameter studies, data analysis
• Condor married Globus : Condor-G
• Several hundreds of Condor pools in the world ; or on
your machine !
Defining a DAG
• A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
(a diamond-shaped DAG: Job A, then Jobs B and C in parallel, then Job D)
• Each node will run the Condor job specified by its accompanying Condor submit file (a hypothetical example follows below)
From Condor tutorial
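For illustration, a node's submit description file (here a hypothetical a.sub; the executable, arguments and file names are invented) could look like this classic Condor submit file:

# a.sub -- hypothetical submit description for node A (vanilla universe)
universe   = vanilla
executable = analyze_a
arguments  = input_a.dat
output     = a.out
error      = a.err
log        = diamond.log
queue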
Executing jobs => placing data!
The Stork approach
• Stork
• "STORK: Making Data Placement a First Class Citizen in the Grid", Tevfik Kosar and Miron Livny, University of Wisconsin-Madison, ICDCS 2004, Tokyo, Japan
The Globus toolkit
• A set of integrated services for the Grid
• Services
– resource management (GRAM - DUROC)
– communication (NEXUS - MPICH-G2, globus_io)
– information (MDS)
– data management (replica catalog)
– security (GSI)
– monitoring (HBM)
– remote data access (GASS - GridFTP - RIO)
– executable management (GEM)
– execution
– commodity Grid Kits (Java, Python, Corba, Matlab…)
Components in Globus Toolkit 3.0
• Security: GSI, WS-Security
• Data Management: WU GridFTP, RFT (OGSI), RLS
• Resource Management: Pre-WS GRAM, WS GRAM (OGSI)
• Information Services: MDS2, WS-Index (OGSI)
• WS Core: JAVA WS Core (OGSI), OGSI C Bindings
Components in Globus Toolkit 3.2
• Security: GSI, WS-Security, CAS (OGSI), SimpleCA
• Data Management: WU GridFTP, RFT (OGSI), RLS, OGSI-DAI
• Resource Management: Pre-WS GRAM, WS GRAM (OGSI)
• Information Services: MDS2, WS-Index (OGSI)
• WS Core: JAVA WS Core (OGSI), OGSI C Bindings, OGSI Python Bindings (contributed), pyGlobus (contributed), XIO
Planned Components in GT 4.0
• Security: GSI, WS-Security, CAS (WSRF), SimpleCA, Authz Framework
• Data Management: New GridFTP, RFT (WSRF), RLS, OGSI-DAI
• Resource Management: Pre-WS GRAM, WS-GRAM (WSRF), CSF (contribution)
• Information Services: MDS2, WS-Index (WSRF)
• WS Core: JAVA WS Core (WSRF), C WS Core (WSRF), pyGlobus (contributed), XIO
GT2 GRAM
[Diagram: the requestor invokes the Gatekeeper, a server running as root with host credentials and trusted by both server and user; the Gatekeeper starts a JobManager under the user account]
GT3 GRAM
[Diagram: the requestor invokes the MMJFS, which runs under a non-privileged Globus account trusted by the server; a root HostEnv Starter and GRIM (holding host credentials) start the JobManager under the user account with GRIM credentials]
GT4 GRAM
• http://www-unix.globus.org/toolkit/docs/3.2/gram/ws/developer/architecture.html
Conclusion (2005)
• Just a new toy for scientists or a revolution?
• Huge investments
• Classical issues, but in a very complex functional, operational and applicative context
• Complexity from heterogeneity, wide distribution, security, dynamicity
• Functional shift from computing to information
• Data management in grids: not prehistory, but still the middle ages
• Still much work to do!!!
• A global framework for grid computing, pervasive computing and Web services?
Conclusion (2008)
• Just a new toy for scientists or a revolution? Neither of them!
• Huge investments: too much?!
• Classical issues, but in a very complex functional, operational and applicative context
• Complexity from heterogeneity, wide distribution, security, dynamicity
• Functional shift from computing to information
• Data management in grids: no longer the middle ages, but not the 21st century yet => services
• Supercomputing is still alive
• A global framework for grid computing, pervasive computing and Web services… and SOA!
• Some convergence between P2P and grid computing
• Time for industrialization
And in 2009, a new paradigm: cloud computing
• Amazon, Google, Microsoft… even L'Oréal!
• Cloud = Internet
• Resource = service provided by the network
• Behind the scenes: a grid???
• Old wine in a new bottle?
• Cf. the « exposés » (presentations)