Big Data Processing on the Grid:
Future Research Directions
A. Vaniachine
XXIV International Symposium on Nuclear Electronics & Computing
Varna, Bulgaria, 9-16 September 2013
A Lot Can Be Accomplished in 50 Years:
Nuclear Energy Took 50 Years from Discovery to Use

• 1896: Becquerel discovered radioactivity
• 1951: Reactor at Argonne generated electricity for light bulbs
A Lot Has Happened in 14 Billion Years

• Everything is a remnant of the Big Bang, including the energy we use:
  – Chemical energy: scale is eV
    • Stored millions of years ago
  – Nuclear energy: scale is MeV, a million times higher than chemical
    • Stored billions of years ago
  – Electroweak energy: scale is 100 GeV, or 100,000 times higher than nuclear
    • Stored right after the Big Bang
  – Can this energy be harnessed in some useful way?

[Figure: the electroweak phase transition]
2012: Higgs Boson Discovery

• Meta-stability: a prerequisite for energy use
  – JHEP 08 (2012) 098
Higgs Boson Study Makes LHC a Top Priority

• European Strategy: http://cds.cern.ch/record/1551933
• US Snowmass Study:
  1. How do we understand the Higgs boson? What principle determines its couplings to quarks and leptons? Why does it condense and acquire a vacuum value throughout the universe? Is there one Higgs particle or many? Is the Higgs particle elementary or composite?
  2. What principle determines the masses and mixings of quarks and leptons? Why is the mixing pattern apparently different for quarks and leptons? Why is the CKM CP phase nonzero? Is there CP violation in the lepton sector?
  3. Why are neutrinos so light compared to other particles? Are neutrinos their own antiparticles? Are their small masses connected to the presence of a very high mass scale? Are there new interactions invisible except through their role in neutrino physics?
  4. What mechanism produced the excess of matter over anti-matter that we see in the universe? Why are the interactions of particles and antiparticles not exactly mirror opposites?
• HEPAP (September 5, 2013):
  1. Probe the highest possible energies and smallest distance scales of matter with the existing and upgraded Large Hadron Collider and reach for even higher precision with a lepton collider; study the properties of the Higgs boson in full detail
  2. Develop technologies for the long-term future to build multi-TeV lepton colliders and 100 TeV hadron colliders
  3. Execute a program with the U.S. as host that provides precision tests of the neutrino sector with an underground detector; search for new physics in quark and lepton decays in conjunction with precision measurements of electric dipole and anomalous magnetic moments
  http://science.energy.gov/~/media/hep/hepap/pdf/201309/Hadley_HEPAP_Intro_Sept_2013.pdf
The LHC Roadmap

[Figure: the LHC upgrade and shutdown schedule]
Big Data

• In 2010 the LHC experiments produced 13 PB of data
  – That rate outstripped any other scientific effort going on
• LHC RAW data volumes are inflated by storing derived data products, by replication for safety and efficient access, and by the need to store even more simulated data than RAW data
• Scheduled LHC upgrades will increase RAW data-taking rates tenfold
• A brute-force approach to scaling up Big Data processing on the Grid for the LHC upgrade needs is not an option

[Chart: LHC RAW data per year vs. WLCG data on the Grid]
http://www.wired.com/magazine/2013/04/bigdata
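To make these volumes concrete, here is a minimal back-of-envelope sketch in Python of how a year of RAW data inflates into total Grid storage; the derived-data, simulation, and replication factors are illustrative assumptions for this sketch, not WLCG parameters:

# Back-of-envelope estimate of Grid storage implied by one year of RAW data.
# All factors below are illustrative assumptions, not WLCG parameters.
RAW_PB_PER_YEAR = 13.0   # LHC RAW data produced in 2010 (from the slide)
DERIVED_FACTOR = 1.5     # assumed volume of derived data products per unit of RAW
SIMULATED_FACTOR = 1.5   # assumed: simulated data exceeds RAW ("even more simulated data")
REPLICAS = 2             # assumed replication for safety and efficient access

def total_storage_pb(raw_pb: float) -> float:
    """Total persistent storage implied by one year's RAW data."""
    return raw_pb * (1.0 + DERIVED_FACTOR + SIMULATED_FACTOR) * REPLICAS

print(f"2010-like year: ~{total_storage_pb(RAW_PB_PER_YEAR):.0f} PB")
print(f"After tenfold upgrade: ~{total_storage_pb(10 * RAW_PB_PER_YEAR):.0f} PB")

Even with modest assumed factors, total storage grows linearly with the RAW rate, which is why a brute-force scale-up is not an option.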
Physics Facing Limits

• The demands on computing resources to accommodate the Run 2 physics needs increase
  – HEP now risks compromising physics because of a lack of computing resources
    • This has not been true for ~20 years
  From I. Bird's presentation at the “HPC and super-computing workshop for Future Science Applications” (BNL, June 2013)

• The limits are those of tolerable cost for storage and analysis. Tolerable cost is established in an explicit or implicit optimization of physics dollars for the entire program. The optimum rate of data to persistent storage depends on the capabilities of technology, the size and budget of the total project, and the physics lost by discarding data. There is no simple answer!
  From the US Snowmass Study:
  https://indico.fnal.gov/getFile.py/access?contribId=342&sessionId=100&resId=0&materialId=1&confId=6890

• Physics needs drive future research directions in Big Data processing on the Grid
HEP Data Challenges
US Big Data Research and Development Initiative

• At the time of the “Big Data Research and Development Initiative” announcement, a $200 million investment in tools to handle the huge volumes of digital data needed to spur U.S. science and engineering discoveries, two examples of successful HEP technologies were already in place:
  – Collaborative big data management ventures include the PanDA (Production and Distributed Analysis) Workload Management System and XRootD, high-performance, fault-tolerant software for fast, scalable access to data repositories of many kinds
• Supported by the DOE Office of Advanced Scientific Computing Research, PanDA is now being generalized and packaged, as a Workload Management System already proven at extreme scales, for the wider use of the Big Data community
  – Progress in this project was reported by A. Klimentov earlier in this session
Synergistic Challenges

• As HEP is facing the Big Data processing challenges ahead of other sciences, it is instructive to look for commonalities in the discovery process across the sciences
  – In 2013 the Subcommittee of the US DOE Advanced Scientific Computing Advisory Committee prepared the Summary Report on Synergistic Challenges in Data-Intensive Science and Exascale Computing
Knowledge-Discovery Life-Cycle for Big Data: 1

[Figure (A): the knowledge-discovery life-cycle. Data Generation (transactional, fed by instruments, sensors, experiments, and supercomputers) feeds Data Management; Data Processing/Organization (historical) feeds Data Reduction/Query, Data Visualization, and Data Sharing; Mining, Discovery, and Predictive Modeling (relational), with Historical Learning and Trigger/Predict, close the loop via Act, Refine, Feedback]

• Data may be generated by instruments, experiments, sensors, or supercomputers
Knowledge-Discovery Life-Cycle for Big Data: 2

[Figure: the knowledge-discovery life-cycle diagram, as on the previous slide]

• The Data Processing/Organization phase covers (re)organizing, processing, deriving subsets, reduction, visualization, query analytics, distributing, and other aspects
• In LHC experiments, this includes common operations on and derivations from raw data; the output of data processing is used by thousands of scientists for knowledge discovery
Knowledge-Discovery Life-Cycle for Big Data: 3

[Figure: the knowledge-discovery life-cycle diagram, as on the previous slide]

• Given the size and complexity of data and the need for both top-down and bottom-up discovery, scalable algorithms and software need to be deployed in this phase
• Although the discovery process can be quite specific to the scientific problem under consideration, repeated evaluations, what-if scenarios, predictive modeling, correlations, causality, and other mining operations at scale are common at this phase
Knowledge-Discovery Life-Cycle for Big Data: 4

[Figure: the knowledge-discovery life-cycle diagram, as on the previous slide]

• Insights and discoveries from previous phases help determine new simulations, models, parameters, settings, and observations, thereby closing the loop
• While this represents a common high-level approach to data-driven knowledge discovery, there can be important differences among different sciences as to how data is produced, consumed, stored, processed, and analyzed
Data-Intensive Science Workflow

• The Summary Report identified an urgent need to simplify the workflow for Data-Intensive Science
  – Analysis and visualization of increasingly larger-scale data sets will require integration of the best computational algorithms with the best interactive techniques and interfaces
  – The workflow for data-intensive science is complicated by the need to simultaneously manage large volumes of data as well as large amounts of computation to analyze the data, and this complexity is increasing at an inexorable rate
• These complications can greatly reduce the productivity of the domain scientist if the workflow is not simplified and made more flexible
  – For example, the workflow should be able to transparently support decisions such as when to move data to computation or computation to data, as sketched below
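As a minimal sketch of such a transparent decision, the hypothetical Python cost model below compares shipping the input data to a free remote CPU against queuing the computation where the data already resides; all names and rate parameters here are assumptions for illustration, not the API of any actual workflow system:

# Hypothetical cost model: move data to the computation, or computation to the data?
# Parameters are illustrative assumptions, not measurements.

def transfer_time_s(data_gb: float, wan_gbps: float) -> float:
    """Seconds to ship the input data over the wide-area network."""
    return data_gb * 8.0 / wan_gbps

def placement_decision(data_gb: float, wan_gbps: float,
                       remote_queue_s: float, local_queue_s: float) -> str:
    """Pick the option with the smaller expected start-up latency."""
    move_data = transfer_time_s(data_gb, wan_gbps) + remote_queue_s
    move_compute = local_queue_s  # the job starts where the data already lives
    return "move data" if move_data < move_compute else "move computation"

# Example: 500 GB of input, a 10 Gb/s WAN, an idle remote site, a busy local site.
print(placement_decision(data_gb=500, wan_gbps=10,
                         remote_queue_s=60, local_queue_s=3600))  # -> "move data"

A production workflow system would fold in many more terms (storage I/O rates, transfer reliability, fair-share policies), but the decision structure is the same.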
Lessons Learned

• The distributed computing environment for the LHC has proved to be a formidable resource, giving scientists access to huge resources that are pooled worldwide and largely automatically managed
  – However, the scale of operational effort required is burdensome for the HEP community, and will be hard to replicate in other science communities
    • Could the current HEP distributed environments be used as a distributed-systems laboratory to understand how more robust, self-healing, self-diagnosing systems could be created?
• Indeed, Big Data processing on the Grid must tolerate a continuous stream of failures, errors, and faults
  – Transient job failures on the Grid can be recovered by managed re-tries, as sketched below
    • However, workflow checkpointing at the level of a file or a job delays turnaround times
• Advances in reliability engineering provide a framework for a fundamental understanding of Big Data processing turnaround time
  – Designing fault-tolerance strategies that minimize the duration of Big Data processing on the Grid is an active area of research
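A minimal sketch of such a managed re-try loop follows (Python); the split into transient and permanent error classes and the back-off parameters are assumptions for illustration, not the policy of any particular Grid workload manager:

import random
import time

# Assumed error classification: only transient failures are worth re-trying.
TRANSIENT = {"worker_lost", "storage_timeout", "network_error"}
PERMANENT = {"bad_input_data", "application_crash"}

def run_job(job_id: str) -> str:
    """Stand-in for Grid job submission; outcomes are randomized for the demo."""
    return random.choice(["done", "worker_lost", "storage_timeout", "bad_input_data"])

def run_with_retries(job_id: str, max_retries: int = 3, backoff_s: float = 1.0) -> bool:
    """Managed re-tries: recover transient failures, fail fast on permanent ones."""
    for attempt in range(1 + max_retries):
        outcome = run_job(job_id)
        if outcome == "done":
            return True
        if outcome in PERMANENT:
            return False                       # re-trying cannot fix these
        time.sleep(backoff_s * 2 ** attempt)   # exponential back-off before re-try
    return False                               # retry budget exhausted

Because the re-try is at the granularity of a whole job, a single failed file can still delay the campaign, which is the checkpointing cost noted above.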
Future Research Direction: Workflow Management

• To significantly shorten the time needed to transform scientific data into actionable knowledge, the US DOE Advanced Scientific Computing Research office is preparing a call that will include:
  – Dynamic data and/or information and/or "resource" collection, discovery, allocation and management mechanisms
    • Resource description and understanding
    • Resource = any entity that is part of the system (papers, files, data, documents, people, computing, storage)
    • Federated semantic discovery
  – Rapid information- and knowledge-based response and decision-making mechanisms
    • Steering scientific processes
  – Composition and execution of end-to-end scientific processes across heterogeneous environments
    • Covering dynamic and static
    • Community-based job management and workflows
    • Domain-specific abstractions
    • Flexible, resilient, and rapidly reconfigurable runtime environments

From R. Carlson's presentation at the “HPC and super-computing workshop for Future Science Applications” (BNL, June 2013):
https://indico.bnl.gov/materialDisplay.py?contribId=16&sessionId=8&materialId=slides&confId=612
Maximizing Physics Output through Modeling

• In preparations for LHC data taking, future networking was perceived as a limit
  – The MONARC model serves as an example of how to circumvent a resource limitation
    • WLCG implemented a hierarchical data flow maximizing reliable data transfers
• Today networking is not a limit, and WLCG has abandoned the hierarchy
  – There are no fundamental technical barriers to transporting 10x more traffic within 4 years
• In contrast, future CPU and storage are perceived as a limit
  – HEP now risks compromising physics because of a lack of computing resources
    • As in the days of MONARC, HEP needs comprehensive modeling capabilities that would enable maximizing physics output within the resource constraints

[Picture by I. Bird]
Future Research Direction: Workflow Modeling

Modeling and Simulation Program Elements
• Application workflows should have predictable performance behaviors
  – Modeling the computing, storage, and networking resources
  – Modeling the protocols, services, and applications
  – Simulating the execution environment with enough fidelity to make informed predictions (see the sketch below)
• ASCR is developing a new joint CS/Network modeling and simulation program to address this important area
  – Workshop scheduled for Sept 18-19, 2013
  – http://hpc.pnl.gov/modsim/2013/
  – Position papers due June 17, 2013

From R. Carlson's presentation at the “HPC and super-computing workshop for Future Science Applications” (BNL, June 2013):
https://indico.bnl.gov/materialDisplay.py?contribId=16&sessionId=8&materialId=slides&confId=612
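As a toy illustration of the kind of prediction such modeling enables, the Python sketch below estimates the wall-clock turnaround of a processing campaign given a per-attempt failure probability and a managed re-try limit; the formula and every parameter are assumptions for this sketch, not elements of the ASCR program:

# Toy turnaround model: a campaign of n_jobs drained through a fixed slot pool,
# with each job re-tried up to max_attempts times. Parameters are assumed.

def expected_attempts(p_fail: float, max_attempts: int) -> float:
    """Expected attempts per job: sum over i of P(at least i attempts needed)."""
    return sum(p_fail ** (i - 1) for i in range(1, max_attempts + 1))

def turnaround_hours(n_jobs: int, job_hours: float, slots: int,
                     p_fail: float, max_attempts: int) -> float:
    """Wall-clock hours to drain the campaign through the slot pool."""
    total_work = n_jobs * job_hours * expected_attempts(p_fail, max_attempts)
    return total_work / slots

# Example: 1M four-hour jobs on 25k slots with a 10% transient failure rate:
print(f"~{turnaround_hours(1_000_000, 4.0, 25_000, 0.10, 4):.0f} hours")  # ~178 h

Calibrated against real monitoring data, even a simple model like this lets one ask what-if questions about failure rates, re-try policies, and resource pools before committing a campaign.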
Conclusions

• Study of the Higgs boson properties is a top priority for LHC physics
  – LHC upgrades increase demands for computing resources beyond flat budgets
    • HEP now risks compromising physics because of a lack of computing resources
• A comprehensive end-to-end solution for the composition and execution of Big Data processing workflows within given CPU and storage constraints is necessary
  – Future research in workflow management and modeling is necessary to provide the tools for maximizing scientific output within given resource constraints
• By bringing Nuclear Electronics and Computing experts together, the NEC Symposium continues to be in a unique position to promote HEP progress, as the solution requires optimization cutting across the Trigger and Computing domains
Extra Slides

LHC Increases per Year

[Chart: installed processor capacity by year, 2008-2020 (Ian Fisk, CD/FNAL)]
• LHC Computing adds about 25k processor cores a year

[Chart: installed disk capacity by year, 2008-2020 (Ian Fisk, CD/FNAL)]
• And 34 PB of disk
• The cost and complexity of the storage is much larger than the processing