Data Processing and Analysis Center with Burst-Capability on Flagship/Leadership
Computing Facility
Participants: Kenneth Read (ORNL), Galen Shipman (ORNL), Adam Simpson (OLCF),
Alexei Klimentov (BNL), Sergey Panitkin (BNL), Eli Dart (ESnet), Shane Canon
(NERSC)
Facilities: OLCF (ASCR), BNL (ASCR), NERSC (ASCR), ESnet (ASCR)
The Worldwide LHC Computing Grid (which includes the Open Science Grid) facilitates the
processing of high priority science in fields such as high energy physics and high energy
nuclear physics by connecting computing centers of varying size across the globe. This science
includes understanding various decay modes of the recently discovered Higgs candidate and
understanding properties of the Quark-Gluon Plasma. Virtually all of this workflow is event-based, with no capability-class scale requirements other than the necessary time-to-completion.
Heretofore, such event-based projects in High Energy and Nuclear Physics have exploited the largest Flagship/Leadership Computing Facilities only in an experimental, cycle-scavenging mode, as opposed to through major awards of dedicated time. Meanwhile, the dramatically increasing data rates
and escalating computational needs of these fields are projected to exceed the available
resources unless Flagship/Leadership supercomputing resources are incorporated as part of
the solution moving forward. The ASCR-funded BigPanDA workflow management software
coordinates much of the job and data availability requirements in this field. The ALICE
Online/Offline Computing Project includes the future need for advanced HPC resources. The
new ORNL CADES (Compute and Data Environment for Science) HPC provides the unique
flexibility of Tier1-level resources with in-house, high-bandwidth access to the Titan Leadership
supercomputer and 30 PB Atlas file system.
We propose a real-time demonstration that marries the flexible data handling and processing
capabilities at ORNL CADES, providing resources comparable to a typical Tier1 LHC/FAIR
Computing Center, with coordinated opportunistic burst-capability running at-scale on Titan at
the OLCF. To demonstrate the promise of federated, real-time distributed computing operations with Flagship/Leadership burst-capability, we propose the following multi-laboratory, multi-project demonstration. Its components will include:
1. Remote federated job coordination from a distant central server located at BNL or CERN
(Geneva).
2. Local job coordination and data availability workflow managed on ORNL CADES by a
PanDA workflow management system, temporarily replicating the full flexibility, predefined and controlled remote connectivity, distributed data management, and promised
quality-of-service of a Tier1 center (with 97% uptime).
3. Real-time burst-mode job submission to OLCF Titan, driven by Titan backfill availability and relying on high-bandwidth, in-house data transfer and visibility (see the first sketch following this list).
4. High statistics, multi-threaded simulated data production using Geant4 and Monte Carlo
simulation using alpgen, PYTHIA, and sherpa generators.
5. Dynamic establishment of GridFTP connections between NERSC, OLCF, and BNL/CERN to transfer generated/processed data from Titan to NERSC (see the second sketch following this list).
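As an illustration of component 3, the following sketch shows one way a submitter could size jobs to idle backfill windows on Titan. It is a minimal sketch, assuming a Moab/PBS environment in which showbf reports available node counts and durations and qsub submits jobs; the partition name, output parsing, and job script (run_geant4.pbs) are hypothetical placeholders rather than the demo's actual mechanism.

#!/usr/bin/env python
# Hedged sketch of backfill-aware burst submission (component 3).
# Assumes a Moab/PBS system where `showbf` lists idle node windows and
# `qsub` submits jobs; the partition name, parsing, and job script are
# illustrative placeholders.
import re
import subprocess

def largest_backfill_window(partition="titan"):
    """Return (nodes, minutes) for the largest reported backfill window."""
    out = subprocess.check_output(["showbf", "-p", partition])
    best = None
    for line in out.decode().splitlines():
        # Illustrative parse; real showbf output formats vary by site.
        m = re.search(r"(\d+)\s+procs.*?(\d+):(\d{2}):\d{2}", line)
        if m:
            nodes = int(m.group(1))
            minutes = int(m.group(2)) * 60 + int(m.group(3))
            if best is None or nodes > best[0]:
                best = (nodes, minutes)
    return best

def submit_burst(nodes, minutes, script="run_geant4.pbs"):
    """Submit a PBS job sized to fit inside the backfill window."""
    walltime = "%02d:%02d:00" % (minutes // 60, minutes % 60)
    subprocess.check_call(["qsub", "-l", "nodes=%d" % nodes,
                           "-l", "walltime=%s" % walltime, script])

if __name__ == "__main__":
    window = largest_backfill_window()
    if window:
        submit_burst(*window)

Sizing submissions to reported backfill windows lets burst jobs consume otherwise-idle cycles without delaying Titan's scheduled capability jobs.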
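For component 5, the sketch below drives a third-party GridFTP transfer with the standard globus-url-copy client, wrapped in Python for consistency with the sketch above. It assumes globus-url-copy is installed and a valid grid credential is in place; the endpoint host names and paths are placeholders, not the demo's actual data transfer nodes.

# Hedged sketch of the GridFTP transfer in component 5.
# Assumes `globus-url-copy` is installed and a valid grid proxy exists;
# endpoint host names and paths are placeholders.
import subprocess

SRC = "gsiftp://dtn.olcf.example.gov/lustre/atlas/proj-shared/demo/output/"
DST = "gsiftp://dtn.nersc.example.gov/project/hep/demo/output/"

subprocess.check_call([
    "globus-url-copy",
    "-p", "8",   # eight parallel TCP streams per transfer
    "-fast",     # reuse data channels between files
    "-r",        # recurse into the source directory
    "-cd",       # create missing destination directories
    SRC, DST,
])

Parallel streams and data-channel reuse are the standard GridFTP levers for filling high-bandwidth wide-area paths such as those ESnet provides between the participating facilities.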