
Early Access to NCI Climate Data
& Analysis Systems
Ben Evans
Ben.Evans@anu.edu.au
NCI: Vision and Role
• Vision
– Provide Australian researchers with a world-class, high-end computing
service
• Aim
– To make a real difference to the conduct, ambition and outcomes of leading
Australian research
• Role
– Sustain / develop Australian capability computing targeted to Earth Systems
Science
– Provide comprehensive computational support for data-intensive research
– Provide specialised support for key research communities
– Develop and maintain a strategic plan to support the advancement of
national high-end computing services
– Provide access through partner shares, national merit allocation scheme
• Partners
– ANU, CSIRO, Geoscience Australia, Bureau of Meteorology (2012) – providing capability in modelling & data-intensive science
• Peak Infrastructure
– Vayu—Sun Constellation
– commissioned 2010
– 140 Tflops, 240K SPEC,
36 TBytes/800 TBytes
– Well balanced; Good performance
– Batch oriented, parallel processing
• Data Storage Cloud
– Large/fast/reliable data storage
– Persistent disk-resident or tape
– Relational databases
– National datasets and support for
Virtual Observatories
– New Storage Infrastructure (Nov, 2011)
– Dual site storage
– Dual site data services
• Data Intensive Computing
– Data analysis
– Data Compute Cloud (2011)
Options for data services
• Data Cloud spans two physical sites
• Internal network completed June 2011
• Redundant 10 GigE network links to AARNet completed Aug 2011
• Data migration from old tape silo to disk mid-Sept 2011
• Floor replacement and additional pod cooling completed end Sept 2011
• Storage architecture physical layer acceptance due end Oct 2011
Options for data services
• HSM (tape) – 1 or 2 copies at 2 locations
• Scratch disk – one location, 2 speed options
• Persistent disk – two locations, 2 speed options
• Persistent data services – two locations, movable/synchronised VMs
• Self-managed backup, synchronised data, or HSM
• Filesystems layered on top of hardware options
• Specialised: Databases, Storage Objects, HDFS, …
• Domain speciality: ESG, THREDDS/OpenDAP, Workflows, Data Archives (OAIS) – see the OPeNDAP sketch below
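As an illustration of the THREDDS/OpenDAP service listed above, the sketch below shows how a remote dataset might be subset over OPeNDAP from Python, assuming a netCDF4 build with DAP support; the server URL and variable name are hypothetical placeholders, not NCI endpoints.

    # Minimal sketch: subsetting a remote dataset over OPeNDAP with the
    # Python netCDF4 library. The URL and variable name are hypothetical
    # placeholders, not NCI endpoints.
    from netCDF4 import Dataset

    URL = "http://example.org/thredds/dodsC/demo/tas_sample.nc"  # hypothetical
    with Dataset(URL) as ds:                 # opened lazily over the network
        tas = ds.variables["tas"]            # a variable advertised by the server
        print(tas.shape, getattr(tas, "units", "unknown"))
        subset = tas[0, :10, :10]            # only this slice is transferred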
[Chart: Total Data Cloud Storage (Tbytes), quarterly from 2001.q2 to 2011.q2; series: Used, Granted, Requested; y-axis 0–2500 Tbytes]
NCI: Cloud Computing for Data Intensive Science
Special Features for the Data-Intensive Cloud
• Integration into NCI Environment
• Fast access to: Storage, Filesystems and Significant Datasets
• Compute (and some GPU)
• Cloud Environments – deployed on demand
• NF software – large repository
• Integrated data analysis tools and visualisation
• Networks
– general (AARNet, CSIRO, GA, …)
– direct (HPC Computing, Telescopes, Gene Sequencers, CT scanners, …, International Federations)
• More features ....
Astronomy Virtual Observatory
• An era of cross-fertilisation - match surveys
that span the electromagnetic spectrum
• ANU’s SkyMapper telescope - providing the
world’s first digital map of the southern sky.
• The measurements of brightness and
position will form the basis of countless
science programs and calibrate other future
surveys
Less time at the telescope - more time in
front of the computer!
[Images: X-ray, Optical, Radio]
Earth Observations Workflow
E.g. National Environmental Satellite Data Backbone at NCI – CSIRO, GA
Earth Observing (EO) sensors carried on space-borne
platforms produce large (multiple TB/year) data sets
serving multiple research and application communities
NCI has established a single national archive of raw (unprocessed) MODIS data for the
Australian region, together with processing software (common to all users) and
specialised tools for applications. LANDSAT is now being processed.
The high-quality historical archive is complemented by data acquired directly from the
spacecraft, in real time, by local reception stations around Australia, which NCI's
network connectivity allows to be downloaded and merged.
Data products and tools are available through web technologies and embedded workflows.
Collaborators: King, Evans, Lewis, Wu, Lymburner
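For orientation only, the sketch below shows one way the MODIS granules described above could be inspected with GDAL's Python bindings; this is an assumption about generic tooling, not NCI's processing pipeline, and the granule filename is a hypothetical example.

    # Minimal sketch (an assumption about tooling, not NCI's pipeline):
    # listing the subdatasets of a MODIS HDF4 granule with GDAL's Python
    # bindings. The granule filename is a hypothetical example.
    from osgeo import gdal

    granule = "MOD021KM.A2011001.0000.005.hdf"   # hypothetical local file
    ds = gdal.Open(granule)
    for name, description in ds.GetSubDatasets():
        print(name, "->", description)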
An Earth Systems Virtual Laboratory
• ESG – internationally significant climate model data
• Data analysis capability
• …
Earth Systems Grid and Access to Analysis
Mission: The DOE SciDAC-2 Earth System Grid Center for Enabling Technologies
(ESG-CET) project provides climate researchers worldwide with access to the data,
information, models, analysis tools, and computational resources required to make
sense of enormous climate simulation datasets.
ESGF – Federation of worldwide sites providing data. Core nodes
are: PCMDI (LLNL), BADC (UK), DKRZ (MPI), NCAR/JPL. NCI joined
to provide an Australian Node.
NCI: Supports the ESG as the Australian node and the subsequent processing
– Support publishing of Australian data as a primary node
– Store/Replicate priority overseas data
– Provide associated compute/storage services through NCI shares.
Methods to get climate data
• Providing two main methods to access climate data:
1. ESG portal web site
2. Filesystem on the NCI data cloud (see the sketch below)
• Providing computational systems for analysing the data:
1. dcc – data compute cloud
2. …
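As a rough illustration of method 2 above (reading directly from the filesystem on the NCI data cloud), the sketch below walks a CMIP5-style directory with Python's netCDF4 library; the mount point and path layout are assumptions, not actual NCI paths.

    # Minimal sketch: reading CMIP5 output straight off a data cloud
    # filesystem with netCDF4. The mount point and DRS-style layout are
    # hypothetical examples, not the actual NCI paths.
    import glob
    from netCDF4 import Dataset

    pattern = "/projects/cmip5/IPSL/IPSL-CM5A-LR/historical/mon/atmos/Amon/r1i1p1/tas/*.nc"
    for path in sorted(glob.glob(pattern)):
        with Dataset(path) as ds:
            tas = ds.variables["tas"]        # surface air temperature files
            print(path, tas.shape)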
Current Status
Status of Publishing - CSIRO-QCCCE Mk3.6
• Models run at QCCCE facility and data transferred to NCI
• Data processed and put through L1 checks
• E.g. 7 Tbytes of final output required 90+ Tbytes of temporary space on Vayu and 200 Tbytes on the data cloud
• Moving data between Vayu and the data cloud was time consuming
• Currently completing the rest of the data processing
• Expect final size to be ~30 Tbytes
Status of Publishing – CAWCR ACCESS
• Preindustrial runs commencing by early October
• 500 years at 250 year intervals
• First publish by end of Dec (20 Tbytes); second release around Feb.
Status of ESG software
• Stable release and serving data
• Federated Identity/authorisation systems are evolving
• Continual updates (data replication not yet working within ESG stack)
CMIP5 Modelling groups and data being generated
CAWCR - Centre for Australian Weather and Climate
Research
CCCMA - Canadian Centre for Climate Modelling and
Analysis
CCSM - Community Climate System Model
CMA-BCC - Beijing Climate Center, China Meteorological
Administration
CMCC - Centro Euro-Mediterraneo per I Cambiamenti
Climatici
CNRM-CERFACS - Centre National de Recherches
Meteorologiques - Centre Europeen de Recherche et
Formation Avancees en Calcul Scientifique.
EC-Earth - Europe
FIO - The First Institute of Oceanography, SOA, China
GCESS - College of Global Change and Earth System
Science, Beijing Normal University
GFDL - Geophysical Fluid Dynamics Laboratory
INM - Russian Institute for Numerical Mathematics
IPSL - Institut Pierre Simon Laplace
LASG - Institute of Atmospheric Physics, Chinese
Academy of Sciences China
MIROC - University of Tokyo, National Institute for
Environmental Studies, and Japan Agency for Marine-Earth Science and Technology
MOHC - UK Met Office Hadley Centre
MPI-M - Max Planck Institute for Meteorology
MRI - Meteorological Research Institute, Japan
NASA GISS- NASA Goddard Institute for Space Studies
USA
NCAR - US National Centre for Atmospheric Research
NCAS - UK National Centre for Atmospheric Science
NCC - Norwegian Climate Centre
NIMR - Korean National Institute for Meteorological
Research
QCCCE-CSIRO - Queensland Climate Change Centre of
Excellence and Commonwealth Scientific and Industrial
Research Organisation
RSMAS - University of Miami - RSMAS
24 modelling groups, 25 platforms being described, 44
models, 65 grids, and 223 simulations
CMIP5 Modelling groups and data being generated
Simulations:
~90,000 years
~60 experiments within CMIP5
~20 modelling centres (from around
the world) using
~several model configurations each
~2 million output “atomic” datasets
~10's of petabytes of output
~2 petabytes of CMIP5 requested
output
~1 petabyte of CMIP5 “replicated”
output
Which will be replicated at a number
of sites (including ours), arriving
now!
Of the replicants:
~ 220 TB decadal
~ 540 TB long term
~ 220 TB atmos-only
~80 TB of 3hourly data
~215 TB of ocean 3d monthly data!
~250 TB for the cloud feedbacks!
~10 TB of land-biochemistry (from the
long term experiments alone).
Slide sourced from Metafor web site
early 2011.
CMIP5 Submission
CMIP5 Archive Status – automatically generated
Summary
Modeling centers: 13
Models: 17
Data nodes: 13
Gateways: 5
Datasets: 15,950
Size: 177.4 TB
Files: 407,792
Last Update: Monday, 19 September 2011 03:21AM (UTC)
Datasets by Access Protocol
Protocol    Centers   Models   Datasets   Size (TB)
HTTP             13       19     15,958       177.4
GridFTP           6        6      1,819        41.6
OPeNDAP           3        3        408        25
Datasets on our data cloud
Modelling Centre   Model                               Capacity (TB)
CSIRO-QCCCE        CSIRO-Mk3.6                         ~15
IPSL               IPSL-CM5A-LR                        9.4
INM                inmcm4                              6.3
CMIP3              All                                 35.4
All                High Priority Variables (Lawson)    7.14
Status of Data Replication
Data Replication
Data replication is supported by two methods:
1. Bulk, fast transfers by ESG nodes
2. User-initiated data transfers at the variable level (see the sketch below)
• The first method is fast but requires coordination between the sites and over
international networks
• The second method can be very slow but is relatively "simple"
• We will provide more details during our session today.
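To make the second, user-initiated method concrete, the sketch below fetches a short list of files over plain HTTP from Python; the URLs are placeholders, and the ESG authentication step that real downloads require is deliberately omitted. The ESG portal can also generate bulk download scripts for selected files.

    # Minimal sketch of the user-initiated, variable-level method: fetching
    # a short list of files over HTTP. URLs are placeholders, and the ESG
    # authentication step required for real downloads is omitted.
    import urllib.request

    FILES = [
        "http://example.org/cmip5/tas_Amon_MODEL_historical_r1i1p1_185001-200512.nc",
    ]
    for url in FILES:
        name = url.rsplit("/", 1)[-1]
        print("fetching", name)
        urllib.request.urlretrieve(url, name)   # saves the file next to the script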
Data Processing Capability
• We have provided a first iteration of data processing
capability – Early Access mode
• dcc provides data processing directly on the data cloud
• Upgrades in place as more hardware becomes available
– Filesystem upgrade – mid Nov
– Data processing – mid Nov (pending order)
– Some software license issues being resolved
• Data processing pipelines are being understood/established, e.g. the CVC data pipeline
Data Replication
Issues:
• Slow links to some sites can get swamped
• ESG software not ready to manage official replicas
• Don’t have a clear view of when model data will be
available
• Data may need to be revoked as errors are found.
• Data capacity being closely managed/prioritised.
• CAWCR, CoE, CSIRO, BoM and shareholders monitoring
for future expansion
Keeping in touch
• ESG status messages / ESG Federation web site
• help@nf.nci.org.au
• Twitter feeds:
– @NCIEarthSystems
– @NCIdatacloud
– @NCIpeaksystems
• Open weekly Townhall Q/A meeting – details to be
established.
• Regular meetings between CAWCR, CoE and NCI on
status.
• Keep your Team leaders advised on your
requirements/issues.
• More developments planned through a VL proposal.
THE END
NCI Data Cloud