The Grid

Grid Computing and Cloud Computing
“Cloud” Computing is 1+ yr old
Michael Sheehan’s GoGrid Blog, July 25, 2008
http://linux.sys-con.com/node/587717
Confused? The terms swirling around "the cloud":
Grid Computing, Virtualization, Cluster Computing, SaaS (Software as a Service), P2P, Utility Computing, Cloud Computing
One can categorize each component:
• Usage model: Utility Computing, Cloud Computing, SaaS
• Infrastructure: Cluster Computing, Virtualization, Grid Computing, P2P
Grid Computing
What is a Grid?
Enable “coordinated resource sharing & problem solving in dynamic, multi-institutional virtual organizations.”
(Source: “The Anatomy of the Grid”)
Virtual Organizations
TeraGrid
What is the TeraGrid?
Technology + Support = Science
– NSF has invested US$246 million.
– In production operation since October 2004; it now integrates 750 teraflops of computing capability, 30 PB of storage, and database resources from more than 100 disciplines over high-performance networks.
TeraGrid’s 3-pronged strategy to further science
• DEEP Science: Enabling Terascale Science
  – Make science more productive through an integrated set of very high-capability resources
  • ASTA projects
• WIDE Impact: Empowering Communities
  – Bring TeraGrid capabilities to the broad science community
  • Science Gateways
• OPEN Infrastructure, OPEN Partnership
  – Provide a coordinated, general-purpose, reliable set of services and resources
  • Grid interoperability working group
TeraGrid Used
[Map: TeraGrid PIs by institution. Blue: 10 or more PIs; Red: 5-9 PIs; Yellow: 2-4 PIs; Green: 1 PI]
TeraGrid Resources
[Per-site resource table, flattened in extraction. Eight resource providers: ANL/UC, IU, NCSA, ORNL, PSC, Purdue, SDSC, TACC.]
• Computational resources: Itanium2, IA-32, SGI SMP, Dell Xeon, IBM p690, Condor flock, Cray XT3, TCS, Marvel SMP, Power4+, and Blue Gene systems, ranging from 0.2 TF to 17.2 TF per system; 100+ TF in total across 8 distinct architectures
• Online storage: roughly 1 TB to 1,400 TB per site, about 3 PB of online disk in total
• Mass storage: 1.2 to 6 PB per site
• Network: 5-30 Gb/s per site to hubs (CHI, LA, ATL, Col.)
• Data collections: >100 collections in total, individual sites hosting from a few TB to >1 PB; access via URL, DB, GridFTP, SRB, portals, OPeNDAP, GFS, and Web services
• Instruments: proteomics X-ray crystallography; SNS and HFIR neutron facilities
• Visualization resources: IA-32 clusters (48 and 96 nodes, GeForce 6600GT and Quadro4 980 XGL graphics), an UltraSPARC IV SMP (512 GB, 16 graphics cards), and an SGI Prism (32 graphics pipes); access modes: RI (Remote Interactive), RC (RI/Collaborative), RB (Remote Batch)
Science Gateways
A new initiative for the TeraGrid
• Increasing investment by communities in their own cyberinfrastructure, but heterogeneous:
  – Resources
  – Users, from expert to K-12
  – Software stacks, policies
• Science Gateways
  – Provide “TeraGrid Inside” capabilities
  – Leverage community investment
• Three common forms:
  – Web-based portals
  – Application programs running on users' machines but accessing services in TeraGrid
  – Coordinated access points enabling users to move seamlessly between TeraGrid and other grids
Workflow Composer
Gateways are growing in numbers
• 10 initial projects as part of the TG proposal
• >20 Gateway projects today
• No limit on how many gateways can use TG resources
  – Prepare services and documentation so developers can work independently
Open Science Grid (OSG)
Special PRiority and Urgent Computing Environment
(SPRUCE)
National Virtual Observatory (NVO)
Linked Environments for Atmospheric Discovery (LEAD)
Computational Chemistry Grid (GridChem)
Computational Science and Engineering Online (CSEOnline)
GEON (GEOsciences Network)
Network for Earthquake Engineering Simulation (NEES)
SCEC Earthworks Project
Network for Computational Nanotechnology and
nanoHUB
GIScience Gateway (GISolve)
Biology and Biomedicine Science Gateway
Open Life Sciences Gateway
The Telescience Project
Grid Analysis Environment (GAE)
Neutron Science Instrument Gateway
TeraGrid Visualization Gateway, ANL
BIRN
Gridblast Bioinformatics Gateway
Earth Systems Grid
Astrophysical Data Repository (Cornell)
• Many others interested, e.g.:
  – SID Grid
  – HASTAC
OSG
(Open Science Grid)
Open Science Grid (OSG)
Origins:
– National grid projects (iVDGL, GriPhyN, PPDG) and the LHC Software & Computing Projects
Current compute resources:
– 61 Open Science Grid sites
– Connected via Internet2, NLR, ... at 10 Gbps down to 622 Mbps
– Compute & Storage Elements
– All are Linux clusters
– Most are shared
  • Campus grids
  • Local non-grid users
– More than 10,000 CPUs
  • A lot of opportunistic usage
  • Total computing capacity difficult to estimate
  • Same with storage
OSG Snapshot
96 resources across production & integration infrastructures, using production & research networks
Snapshot of jobs on OSG:
• Sustained through OSG submissions: 3,000-4,000 simultaneous jobs, ~10K jobs/day, ~50K CPU-hours/day
• Peak test jobs of 15K a day
• 20 Virtual Organizations + 6 operations VOs; includes 25% non-physics
• ~20,000 CPUs (sites range from 30 to 4,000 CPUs)
• ~6 PB of tape, ~4 PB of shared disk
What is the Open Science Grid?
[Map: OSG member institutions across North America, including ALBANY, ANL, BNL, BU, BUFFALO, CALTECH, CLEMSON, CORNELL, FNAL, HARVARD, INDIANA/IU, IOWA STATE, IUPUI, KU, LEHIGH, LSU, LTU, MCGILL, MSU, NERSC, ORNL, OU, PSU, PURDUE, SDSC, SMU, STANFORD, TTU, UCHICAGO, UCLA, UCR, UFL, UIC, UIUC, UMICH, UMISS, UNI, UNL, UNM, UTA, UVA, UWM, VANDERBILT, WISC, and WSU, plus partners in Brazil, Mexico, Taiwan, and the UK]
OSG Applications
• Genome sequence analysis
• Sloan Digital Sky Survey
• Earth System Grid: O(100 TB) online data
• STAR: 5 TB transfer (SRM, GridFTP)
EGEE
(Enabling Grids for E-sciencE)
A European grid initiative
Archeology
Astronomy
Astrophysics
Civil Protection
Comp. Chemistry
Earth Sciences
Finance
Fusion
Geophysics
High Energy Physics
Life Sciences
Multimedia
Material Sciences
…
>250 sites
48 countries
>50,000 CPUs
>20 PetaBytes
>10,000 users
>150 VOs
>150,000 jobs/day
(as of June 2, 2008)
[Map: users and resources distribution]
EGEE workload in 2007
Data: 25 PB stored, 11 PB transferred
CPU: 114 million hours
[Charts: CPU, transfer, and storage; http://gridview.cern.ch/GRIDVIEW/same_index.php]
Equivalent cost estimated with the Amazon calculator (http://calculator.s3.amazonaws.com/calc5.html, 17/05/08): $58,688,679.08
LCG
(LHC Computing Grid)
(GRID Tutorial: How to use LCG, by Federico Calzolari)
LHC: the Large Hadron Collider
• 4 experiments: ATLAS, ALICE, CMS, LHCb
• 27 km long pipe
• 7+7 TeV
LCG - LHC Computing Grid
• Currently integrates 140 computing centres in 33 countries.
• Will execute 100 million computing jobs in 2008.
Proxy certificate
• Get your proxy certificate: a temporary (usually 24 h) certificate
• Depending on the VO:
grid-proxy-init
voms-proxy-init -voms <VO>:/<VO>/Role=<role> -valid 1000:00
Certificate
• Install your certificate on the User Interface:
  – Log in to the User Interface, copy there the file you exported, and create a directory where your certificate + private key will be stored:
mkdir ~/.globus
  – Convert the PKCS12 file (.p12) into the supported standard (.pem). This operation will split your mycert.p12 file into two files: the certificate (usercert.pem) and the private key (userkey.pem):
openssl pkcs12 -nocerts -in <mycert.p12> -out ~/.globus/userkey.pem
openssl pkcs12 -clcerts -nokeys -in <mycert.p12> -out ~/.globus/usercert.pem
chmod 0400 ~/.globus/userkey.pem
chmod 0600 ~/.globus/usercert.pem
  – At the end you should have something like:
[user@userinterface .globus]$ ls -al
-rw------- 1 user user 2008 Nov 13 16:50 usercert.pem
-r-------- 1 user user  963 Nov 13 16:50 userkey.pem
Register to a VO
For generic users: http://grid-it.cnaf.infn.it
JDL: Job Description Language
• Job overview:
  – JDL (job encapsulation)
  – main script
  – executable program
• Job lifecycle: creation, submission, status, retrieval
JDL
• test.jdl
Executable          = "script.sh";
StdOutput           = "std.out";
StdError            = "std.err";
InputSandbox        = {"script.sh","exe.bin"};        # Input
OutputSandbox       = {"std.out","std.err","out"};    # Output
VirtualOrganisation = "<VO>";
DataAccessProtocol  = {"file","gsiftp","rfio","dcap"};
InputData           = {"lfn:/grid/<VO>/<FILE>"};
OutputSE            = "<SE>";
Requirements        = Member("<SITE>", other.GlueHostApplicationSoftwareRunTimeEnvironment)
                      && other.GlueCEName=="<QUEUE>";
Main script
• script.sh
#!/bin/sh
# Environment
date     >> out2
hostname >> out2
# Get data
lcg-cp [-v] --vo <VO> lfn:<file> file://$PWD/data.tgz
# Unpack input [data.tgz: src.cpp,...]
tar -zxvf data.tgz
# Compile source
g++ src.cpp -o exe.bin
chmod u+x exe.bin
# Exec program
./exe.bin > out
# Pack output
tar -zcvf out.tgz out out2
Submit a Job
• Submit a job:
edg-job-submit -o ID <JDL>     # save the job ID in file ID
Selected Virtual Organisation name (from JDL): cms
Connecting to host rb119.cern.ch, port 7772      # Resource Broker
Logging to host rb119.cern.ch, port 9002
*********************************************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
- https://rb119.cern.ch:9000/tG3Xp2jT_58IUeXoY1GoZQ     # JOBid
*********************************************************************************************
• Check job status:
edg-job-status <JOBid>   [e.g. https://rb119.cern.ch:9000/tG3Xp2jT_58IUeXoY1GoZQ]
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://rb119.cern.ch:9000/tG3Xp2jT_58IUeXoY1GoZQ
Current Status:   Waiting / Scheduled / Running / Done (Success/Abort)
Status Reason:    Job successfully submitted to Globus
Destination:      ce0001.m45.ihep.su:2119/jobmanager-lcgpbs-cms
reached on:       Sat Nov 17 22:38:34 2007
*************************************************************
Get the output
• Retrieve the job output:
edg-job-get-output <JOBid>   [e.g. https://rb119.cern.ch:9000/tG3Xp2jT_58IUeXoY1GoZQ]
Retrieving files from host: rb119.cern.ch (for https://rb119.cern.ch:9000/tG3Xp2jT_58IUeXoY1GoZQ)
*********************************************************************************
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
- https://rb119.cern.ch:9000/tG3Xp2jT_58IUeXoY1GoZQ
have been successfully retrieved and stored in the directory:
/tmp/jobOutput/<USER>_tG3Xp2jT_58IUeXoY1GoZQ
*********************************************************************************
ls -al /tmp/jobOutput/calzolar_tG3Xp2jT_58IUeXoY1GoZQ
-rw-r--r-- 1 calzolar cms  11 Nov 17 23:59 out
-rw-r--r-- 1 calzolar cms 133 Nov 17 23:59 std.err
-rw-r--r-- 1 calzolar cms   8 Nov 17 23:59 std.out
Job Requirements
• JDL Requirements
• Run everywhere:
  no Requirements
• Run at Pisa:
Requirements=Member("INFN-PISA",other.GlueHostApplicationSoftwareRunTimeEnvironment);
• Run on a queue allowing at least 1 day of CPU time:
Requirements=(other.GlueCEPolicyMaxCPUTime>60*24);
• Run on a site with at least 20 free CPUs:
Requirements=(other.GlueCEStateFreeCPUs>20);
• Run on a site with at least 1 TB (unit: kB) of local disk available:
Requirements=anyMatch(other.storage.CloseSEs,target.GlueSAStateAvailableSpace > 1000000000);
• Run on a site with a given software package locally installed:
Requirements=Member("VO-<VO>-TAG",other.GlueHostApplicationSoftwareRunTimeEnvironment);
Requirements TAGs
• From SINICA: http://goc.grid.sinica.edu.tw/gstat/<SITE>/
GlueHostOperatingSystemName:    Scientific Linux CERN
GlueHostOperatingSystemRelease: 4.5
GlueHostOperatingSystemVersion: Beryllium
GlueSubClusterPhysicalCPUs:     0
GlueSubClusterLogicalCPUs:      0
GlueHostApplicationSoftwareRunTimeEnvironment:
  LCG-2
  LCG-2_1_0
  LCG-2_1_1
  LCG-2_2_0
  LCG-2_3_0
  LCG-2_3_1
  LCG-2_4_0
  LCG-2_5_0
  LCG-2_6_0
  LCG-2_7_0
  GLITE-3_0_0
  R-GMA
  INFN-PISA
  SI00MeanPerCPU_1800
  SF00MeanPerCPU_2000
  MPICH
  MPI_HOME_NOTSHARED
  AFS
  VO-atlas-cloud-IT
  VO-atlas-production-12.0.5
  VO-atlas-production-12.0.6
  VO-atlas-production-12.0.7
  […]
Resources search
• Query the CPUs / storage available for a VO:
lcg-infosites --vo <VO> ce
#CPU   Free   Total Jobs   Running   Waiting   ComputingElement
----------------------------------------------------------------
165    1      1            0         1         ce.phy.bg.ac.yu:2119/jobmanager-pbs-cms
120    11     0            0         0         fangorn.man.poznan.pl:2119/jobmanager-pbs-cms
192    110    0            0         0         gridce.atlantis.ugent.be:2119/jobmanager-pbs-cms
212    0      529          146       383       gridce.iihe.ac.be:2119/jobmanager-pbs-cms
227    5      312          222       90        ingrid.cism.ucl.ac.be:2119/jobmanager-lcgcondor-cms
15     15     0            0         0         ce002.ipp.acad.bg:2119/jobmanager-lcgpbs-cms
80     43     0            0         0         ce02.grid.acad.bg:2119/jobmanager-pbs-cms
24     13     0            0         0         ce001.grid.uni-sofia.bg:2119/jobmanager-lcgpbs-cms

lcg-infosites --vo <VO> se
Avail Space(Kb)   Used Space(Kb)   Type   SEs
----------------------------------------------------------------
97470000          n.a              n.a    dpm.phy.bg.ac.yu
395467659         779205896        n.a    cmsse01.ihep.ac.cn
27664924          59878772         n.a    se001.grid.uni-sofia.bg
149180000         n.a              n.a    se.hpc.iit.bme.hu
1                 1                n.a    dcsrm.usatlas.bnl.gov
190040000         208              n.a    lxdpm101.cern.ch
1000000000000     500000000000     n.a    castorgrid.cern.ch
1000000000000     500000000000     n.a    srm.cern.ch
Resources search
• Query the sites available for my job:
edg-job-list-match <JDL>
Selected Virtual Organisation name (from JDL): cms
Connecting to host rb119.cern.ch, port 7772
***************************************************************************
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:
*CEId*
a01-004-128.gridka.de:2119/jobmanager-pbspro-cmsS
a01-004-128.gridka.de:2119/jobmanager-pbspro-cmsXS
ares02.cyf-kr.edu.pl:2119/jobmanager-pbs-cms
beagle14.ba.itb.cnr.it:2119/jobmanager-lcgpbs-cms
bogrid5.bo.infn.it:2119/jobmanager-lcgpbs-cms
ce-fzk.gridka.de:2119/jobmanager-pbspro-cmsL
ce-fzk.gridka.de:2119/jobmanager-pbspro-cmsS
ce-fzk.gridka.de:2119/jobmanager-pbspro-cmsXS
ce.bg.ktu.lt:2119/jobmanager-lcgpbs-cms
ce.cc.ncu.edu.tw:2119/jobmanager-lcgpbs-cms
[…]
gridce.ilc.cnr.it:2119/jobmanager-lcgpbs-cms
gridce2.pi.infn.it:2119/jobmanager-lcglsf-cms4
gridce.sns.it:2119/jobmanager-lcgpbs-cms
Grid Monitoring
• GridICE (INFN)
• GOC (Sinica)
[Monitoring portal screenshots]
AOB
Cloud Computing
Cloud Computing
• Definition
“Cloud computing is a concept of using the internet to allow people to access technology-enabled services. It allows users to consume services without knowledge of, expertise with, or control over the technology infrastructure that supports them.”
- Wikipedia
Enterprise IT spending challenge
[Chart: Global annual IT spending, estimated US$B, 1996-2010, split into power and cooling costs, server management and administration costs, and new server spending]
Source: IBM Corporate Strategy analysis of IDC data, Sept. 2007
Dream or Nightmare?
Seasonal Spikes
A Closer Look at Cloud Computing
End users / requestors: government/academics, industry (startups / SMB / enterprise), consumers
• Innovative business models: new combinations of services to form differentiating value propositions at lower costs in shorter time
• Simplified services: cloud applications enable the simplification of complex services
• Public cloud / enterprise cloud:
  – A cloud computing platform combines modular components on a service-oriented architecture with flexible pricing
  – An “elastic” pool of high-performance virtualized compute resources
  – Internet-protocol-based convergence of networks and devices
Source: Corporate Strategy
Examples of Different Types of Services
[Diagram: cloud computing surrounded by example services: web application service, compute service, collaboration services, datacenter infrastructure, database service, job scheduling service, virtual client service, service catalog, storage service, content classification, and storage backup/archive services]
Google and Cloud Computing
User centric:
• Data stored in the “Cloud”
• Data follows you & your devices
• Data accessible anywhere
• Data can be shared with others
[Examples: messages, preferences, news, contacts, calendar, investments, maps, photos, mailing lists, music, e-mails, phone numbers]
Google's three key technologies
Google File System(GFS)
BigTable
MapReduce
Google File System (GFS)
GFS Architecture
[Diagram: many clients talk to a single GFS Master (with replicas) for metadata, and to Chunkservers 1…N that hold the file chunks (C0, C1, C2, C5, …)]
• Files broken into chunks (typically 64 MB)
• Master manages metadata
• Data transfers happen directly between clients/chunkservers
(a small sketch of this idea follows)
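A minimal sketch, assuming nothing beyond the three bullets above, of how a chunk-based design separates metadata from data; the names (GFSMaster, CHUNK_SIZE, cs1…) are illustrative, not Google's actual code:

# Illustrative sketch of the GFS idea, not Google's implementation.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as on the slide

class GFSMaster:
    """Holds only metadata: which chunkservers store which chunk of which file."""
    def __init__(self, chunkservers):
        self.chunkservers = chunkservers          # e.g. ["cs1", "cs2", "cs3"]
        self.chunk_locations = {}                 # (filename, chunk_index) -> [servers]

    def allocate(self, filename, size, replicas=3):
        n_chunks = (size + CHUNK_SIZE - 1) // CHUNK_SIZE
        for i in range(n_chunks):
            # Round-robin placement; real GFS also considers load, rack locality, etc.
            servers = [self.chunkservers[(i + r) % len(self.chunkservers)]
                       for r in range(replicas)]
            self.chunk_locations[(filename, i)] = servers
        return n_chunks

    def locate(self, filename, chunk_index):
        return self.chunk_locations[(filename, chunk_index)]

# A client asks the master where a chunk lives, then reads/writes the
# chunkserver directly; the master never sits on the data path.
master = GFSMaster(["cs1", "cs2", "cs3", "cs4"])
master.allocate("web-crawl.dat", size=200 * 1024 * 1024)
print(master.locate("web-crawl.dat", 0))   # e.g. ['cs1', 'cs2', 'cs3']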
GFS Usage @ Google
• 200+ clusters
• Filesystem clusters of up to 5000+ machines
• Pools of 10000+ clients
• 5+ petabyte filesystems
• All in the presence of frequent HW failure
Google's three key technologies
Google File System(GFS)
BigTable
MapReduce
BigTable
• Data model (illustrated below):
  (row, column, timestamp) → cell contents
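A toy Python illustration of this data model (purely illustrative, not Google's API): a sparse, versioned map keyed by (row, column, timestamp):

# Toy illustration of the BigTable data model: (row, column, timestamp) -> cell contents
table = {}

def put(row, column, value, timestamp):
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    # Return the value with the highest timestamp for this (row, column), if any.
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", "<html>...", timestamp=3)
put("com.cnn.www", "anchor:cnnsi.com", "CNN", timestamp=9)
print(get_latest("com.cnn.www", "contents:"))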
BigTable
• Distributed multi-level sparse map
  – Fault-tolerant, persistent
• Scalable
  – Thousands of servers
  – Terabytes of in-memory data
  – Petabytes of disk-based data
• Self-managing
  – Servers can be added/removed dynamically
  – Servers adjust to load imbalance
Why not just use a commercial DB?
• Scale is too large or cost is too high for most commercial databases
• Low-level storage optimizations help performance significantly
  – Much harder to do when running on top of a database layer
  – Also fun and challenging to build large-scale systems
BigTable Summary
• Data model applicable to a broad range of clients
  – Actively deployed in many of Google’s services
• System provides a high-performance storage system on a large scale
  – Self-managing
  – Thousands of servers
  – Millions of ops/second
  – Multiple GB/s reading/writing
• Largest BigTable cell manages ~3 PB of data spread over several thousand machines
Google's three key technologies
Google File System(GFS)
BigTable
MapReduce
MapReduce
• A simple programming model that applies to many data-intensive computing problems
• Hides the messy details in the MapReduce runtime library:
  – Automatic parallelization
  – Load balancing
  – Network and disk transfer optimization
  – Handling of machine failures
  – Robustness
  – Easy to use
MapReduce Programming Model
• Borrowed from functional programming (a Python sketch follows):
  map(f, [x1,…,xm,…]) = [f(x1),…,f(xm),…]
  reduce(f, x1, [x2, x3,…]) = reduce(f, f(x1, x2), [x3,…]) = … (continue until the list is exhausted)
[Diagram: for map, f is applied independently to each list element; for reduce, f is folded pairwise over the list from an initial value to the returned result]
• Users implement two functions:
  map(in_key, in_value) → (key, value) list
  reduce(key, [value1,…,valuem]) → f_value
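These functional-programming roots can be tried directly in Python; a small sketch (not part of the original slides) using the built-in map and functools.reduce:

from functools import reduce

xs = [1, 2, 3, 4]

# map(f, [x1,...,xm,...]) = [f(x1),...,f(xm),...]
squares = list(map(lambda x: x * x, xs))        # [1, 4, 9, 16]

# reduce(f, x1, [x2, x3, ...]) folds the list with f, pairwise
total = reduce(lambda a, b: a + b, squares)     # 30

print(squares, total)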
MapReduce – A New Model and System
• Two phases of data processing:
  – Map: (in_key, in_value) → {(keyj, valuej) | j = 1…k}
  – Reduce: (key, [value1,…,valuem]) → (key, f_value)
[Diagram: input key/value pairs from data stores 1…n are processed by parallel map tasks, each emitting intermediate (key, values) pairs; a barrier aggregates intermediate values by output key; reduce tasks then produce the final values for key 1, key 2, key 3, …]
MapReduce Version of Pseudo Code
Example – WordCount (1/2)
• Input is files with one document per record
• Specify a map function that takes a key/value pair:
  – key = document URL
  – value = document contents
• Output of the map function is key/value pairs; in our case, output (w, "1") once per word in the document
Example – WordCount (2/2)
• The MapReduce library gathers together all pairs with the same key (shuffle/sort)
• The reduce function combines the values for a key; in our case, it computes the sum
• The output of reduce is paired with the key and saved (see the sketch below)
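A runnable single-machine sketch of the WordCount map and reduce just described; the shuffle/sort that the MapReduce library performs across machines is emulated here with an in-memory dictionary, and the function names are illustrative rather than Google's C++ API:

from collections import defaultdict

def map_fn(doc_url, doc_contents):
    """Map: emit (word, 1) once per word in the document."""
    for word in doc_contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the values for one key."""
    return word, sum(counts)

# The MapReduce library would shuffle/sort by key across machines;
# here we emulate it with a dictionary keyed by word.
documents = {"doc1": "the quick brown fox", "doc2": "the lazy dog the end"}

intermediate = defaultdict(list)
for url, contents in documents.items():
    for word, count in map_fn(url, contents):
        intermediate[word].append(count)

results = dict(reduce_fn(w, counts) for w, counts in intermediate.items())
print(results)   # {'the': 3, 'quick': 1, ...}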
MapReduce Framework
• For certain classes of problems, the MapReduce framework provides:
  – Automatic & efficient parallelization/distribution
  – I/O scheduling: run mappers close to their input data
  – Fault tolerance: restart failed mapper or reducer tasks on the same or different nodes
  – Robustness: tolerate even massive failures, e.g. large-scale network maintenance (once lost 1800 out of 2000 machines)
  – Status/monitoring
Task Granularity and Pipelining
• Fine-granularity tasks: many more map tasks than machines
  – Minimizes time for fault recovery
  – Can pipeline shuffling with map execution
  – Better dynamic load balancing
• Often use 200,000 map / 500 reduce tasks with 2,000 machines
MapReduce: Uses at Google
• Typical configuration: 200,000 mappers, 500 reducers on 2,000 nodes
• Broad applicability has been a pleasant surprise
  – Quality experiments, log analysis, machine translation, ad-hoc data processing
  – Production indexing system: rewritten with MapReduce
    • ~10 MapReduce operations, much simpler than the old code
MapReduce Summary
• MapReduce has proven to be a useful abstraction
• Greatly simplifies large-scale computation at Google
• Fun to use: focus on the problem, let the library deal with the messy details
A Data Playground
• MapReduce + BigTable + GFS = data playground
  – Substantial fraction of the internet available for processing
  – Easy-to-use teraflops/petabytes, quick turn-around
  – Cool problems, great colleagues
Amazon Web Services
Amazon Simple Storage Service (S3)
• Object-based storage
• 1 B to 5 GB per object
• Fast, reliable, scalable
• Redundant, dispersed
• 99.99% availability goal
• Private or public
• Per-object URLs & ACLs
• BitTorrent support
Pricing: $0.15 per GB per month of storage; $0.01 per 1,000 to 10,000 requests; $0.10 to $0.18 per GB of data transfer
Amazon S3 Concepts
Objects:
Opaque data to be stored (1 byte … 5 Gigabytes)
Authentication and access controls
Buckets:
Object container – any number of objects
100 buckets per account / buckets are “owned”
Keys:
Unique object identifier within bucket
Up to 1024 bytes long
Flat object storage model
Standards-Based Interfaces:
REST and SOAP
URL-Addressability – every object has a URL
S3 SOAP/Query API
Service:
ListAllMyBuckets
Buckets:
CreateBucket
DeleteBucket
ListBucket
GetBucketAccessControlPolicy
SetBucketAccessControlPolicy
GetBucketLoggingStatus
SetBucketLoggingStatus
Objects:
PutObject
PutObjectInline
GetObject
GetObjectExtended
DeleteObject
GetObjectAccessControlPolicy
SetObjectAccessControlPolicy
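The operations above are the original SOAP/Query API. As a point of reference only, here is a minimal sketch of the same bucket/key/object model using the later boto3 Python SDK; boto3, the bucket name, and the key are assumptions of this sketch, not part of the original slides, and AWS credentials are assumed to be configured:

# Minimal S3 sketch using the modern boto3 SDK (not the SOAP API listed above).
import boto3

s3 = boto3.client("s3")

s3.create_bucket(Bucket="my-example-bucket")                 # CreateBucket (us-east-1; other regions need a location constraint)
s3.put_object(Bucket="my-example-bucket",                    # PutObject
              Key="notes/hello.txt",
              Body=b"hello from the cloud")
obj = s3.get_object(Bucket="my-example-bucket",              # GetObject
                    Key="notes/hello.txt")
print(obj["Body"].read())
s3.delete_object(Bucket="my-example-bucket", Key="notes/hello.txt")   # DeleteObject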
Amazon Simple Queue Service (SQS)
• Scalable queuing
• Elastic capacity
• Reliable, simple, secure
• Uses: inter-process messaging, data buffering, architecture component
Pricing: $0.10 per 1,000 messages; $0.10 to $0.18 per GB of data transfer
Amazon SQS Concepts
Queues:
Named message container
Persistent
Messages:
Up to 256KB of data per message
Peek / Lock access model
Scalable:
Unlimited number of queues per account
Unlimited number of messages per queue
SQS SOAP/Query API
Queues:
ListQueues
DeleteQueue
SetVisibilityTimeout
GetVisibilityTimeout
Messages:
SendMessage
ReceiveMessage
DeleteMessage
PeekMessage
Security:
AddGrant
ListGrants
RemoveGrant
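Similarly, a minimal hedged sketch of the queue/message model with boto3 (again an assumption that postdates these slides; the queue name is a placeholder):

# Minimal SQS sketch with boto3.
import boto3

sqs = boto3.client("sqs")

queue_url = sqs.create_queue(QueueName="example-queue")["QueueUrl"]
sqs.send_message(QueueUrl=queue_url, MessageBody="job #42 ready")       # SendMessage

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)   # ReceiveMessage
for msg in resp.get("Messages", []):
    print(msg["Body"])
    # Deleting acknowledges the message so no other consumer re-reads it.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])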
Amazon Elastic Compute Cloud (EC2)
• Virtual compute cloud
• Elastic capacity
• 1.7 GHz x86 processor
• 1.7 GB RAM
• 160 GB disk
• 250 Mb/second network
• Network security model
• Uses: time- or traffic-based scaling, load testing, simulation and analysis, rendering, Software-as-a-Service platform, hosting
Pricing: $0.10 per server-hour; $0.10 to $0.18 per GB of data transfer
Amazon EC2 Concepts
Amazon Machine Image (AMI):
Bootable root disk
Pre-defined or user-built
Catalog of user-built AMIs
OS: Fedora, CentOS, Gentoo, Debian,
Ubuntu, Windows Server
App Stack: LAMP, mpiBLAST, Hadoop
Instance:
Running copy of an AMI
Launch in less than 2 minutes
Start/stop programmatically
Network Security Model:
Explicit access control
Security groups
Inter-service bandwidth is free
Amazon EC2 At Work
Startups
Cruxy – Media transcoding
GigaVox Media – Podcast Management
Fortune 500 clients:
High-Impact, Short-Term Projects
Development Host
Science / Research:
Hadoop / MapReduce
mpiBLAST
Load-Management and Load Balancing Tools:
Pound
Weogeo
Rightscale
EC2 SOAP/Query API
Images:
RegisterImage
DescribeImages
DeregisterImage
Instances:
RunInstances
DescribeInstances
TerminateInstances
GetConsoleOutput
RebootInstances
Keypairs:
CreateKeyPair
DescribeKeyPairs
DeleteKeyPair
Image Attributes:
ModifyImageAttribute
DescribeImageAttribute
ResetImageAttribute
Security Groups:
CreateSecurityGroup
DescribeSecurityGroups
DeleteSecurityGroup
AuthorizeSecurityGroupIngress
RevokeSecurityGroupIngress
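And for EC2, a sketch of the launch/inspect/terminate cycle with boto3 (an assumption, not the 2008 SOAP API; the AMI ID and instance type are placeholders):

# Minimal EC2 sketch with boto3.
import boto3

ec2 = boto3.client("ec2")

# RunInstances: start one instance from a (placeholder) machine image
resp = ec2.run_instances(ImageId="ami-xxxxxxxx",
                         InstanceType="t2.micro",
                         MinCount=1, MaxCount=1)
instance_id = resp["Instances"][0]["InstanceId"]

# DescribeInstances: inspect its state
state = ec2.describe_instances(InstanceIds=[instance_id])
print(state["Reservations"][0]["Instances"][0]["State"]["Name"])

# TerminateInstances: shut it down when done (billing is per server-hour)
ec2.terminate_instances(InstanceIds=[instance_id])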
Web-Scale Architecture
GigaVox Economics
• Implemented Amazon S3, Amazon EC2 and Amazon SQS in November 2006
• Created an infinitely scalable infrastructure for less than $100; building the same infrastructure themselves would have cost thousands of dollars
• Reduced staffing requirements: far less responsibility for 24x7 operations
Analysis and Outlook
The explosive growth of networks
• From 1986 to 2000, computing power grew about 500x, while network capacity grew about 340,000x
• The inevitable consequence of network growth…
Comparison of grid computing and cloud computing
Grid computing                      | Cloud computing
• Heterogeneous resources           | • Homogeneous resources
• Multiple institutions             | • A single institution
• Virtual organizations             | • Virtual machines
• Mainly scientific computing       | • Mainly data processing
• High-performance computers        | • Servers / PCs
• Tightly coupled problems          | • Loosely coupled problems
• Free of charge                    | • Pay per use
• Standardized                      | • No standards yet
• The scientific community          | • The commercial world
Cloud computing is a form of the grid in the broad sense
“The Grid is a set of emerging technologies built on the Internet that integrates high-speed networks, high-performance computers, large databases, sensors, and remote instruments, providing scientists and ordinary people alike with more resources, capabilities, and interactive services.”
Ian Foster, The Grid, 1998
Science in the next 10 years: Science 2.0 (grid computing)
Business in the next 10 years: Business 2.0 (cloud computing)
Grid computing books (available for download):
http://www.chinagrid.net
http://www.china-cloud.net