Big Data Processing on the Grid: Future Research Directions
A. Vaniachine
XXIV International Symposium on Nuclear Electronics & Computing
Varna, Bulgaria, 9-16 September 2013

A Lot Can be Accomplished in 50 Years: Nuclear Energy Took 50 Years from Discovery to Use
– 1896: Becquerel discovered radioactivity
– 1951: A reactor at Argonne generated electricity for light bulbs

A Lot Has Happened in 14 Billion Years
Everything is a remnant of the Big Bang, including the energy we use:
– Chemical energy: the scale is eV
  • Stored millions of years ago
– Nuclear energy: the scale is MeV, a million times higher than chemical
  • Stored billions of years ago
– Electroweak energy: the scale is 100 GeV, another 100,000 times higher than nuclear
  • Stored right after the Big Bang, at the electroweak phase transition
– Can this energy be harnessed in some useful way?

2012: Higgs Boson Discovery
– Meta-stability: a prerequisite for energy use (JHEP 08 (2012) 098)

Higgs Boson Study Makes LHC a Top Priority
Sources: European Strategy (http://cds.cern.ch/record/1551933) and the US Snowmass Study (HEPAP, September 5, 2013: http://science.energy.gov/~/media/hep/hepap/pdf/201309/Hadley_HEPAP_Intro_Sept_2013.pdf)

Open questions:
1. How do we understand the Higgs boson? What principle determines its couplings to quarks and leptons? Why does it condense and acquire a vacuum value throughout the universe? Is there one Higgs particle or many? Is the Higgs particle elementary or composite?
2. What principle determines the masses and mixings of quarks and leptons? Why is the mixing pattern apparently different for quarks and leptons? Why is the CKM CP phase nonzero? Is there CP violation in the lepton sector?
3. Why are neutrinos so light compared to other particles? Are neutrinos their own antiparticles? Are their small masses connected to the presence of a very high mass scale? Are there new interactions invisible except through their role in neutrino physics?
4. What mechanism produced the excess of matter over anti-matter that we see in the universe? Why are the interactions of particles and antiparticles not exactly mirror opposites?

Program priorities:
1. Probe the highest possible energies and smallest distance scales of matter with the existing and upgraded Large Hadron Collider, and reach even higher precision with a lepton collider; study the properties of the Higgs boson in full detail
2. Develop technologies for the long-term future to build multi-TeV lepton colliders and 100 TeV hadron colliders
3. Execute a program with the U.S. as host that provides precision tests of the neutrino sector with an underground detector; search for new physics in quark and lepton decays in conjunction with precision measurements of electric dipole and anomalous magnetic moments

The LHC Roadmap
[Roadmap figure; no text recoverable from the PDF layer]
Big Data
[Charts: LHC RAW data per year; WLCG data on the Grid]
– In 2010 the LHC experiments produced 13 PB of RAW data
  • That rate outstripped any other scientific effort going on (http://www.wired.com/magazine/2013/04/bigdata)
– The total data volume on the Grid is much larger: the RAW data volumes are inflated by storing derived data products, by replication for safety and efficient access, and by the need to store even more simulated data than RAW data
– Scheduled LHC upgrades will increase the RAW data-taking rates tenfold
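As a rough illustration of this inflation, the sketch below multiplies the published RAW rate by inflation factors; the factors are assumptions made up for this example, not the experiments' actual data-management policies.

```python
# Back-of-the-envelope sketch of the inflation from RAW data to total Grid
# storage. The RAW figure is from the slide above; the inflation factors
# are assumptions for illustration, not actual experiment policies.

raw_pb = 13.0               # LHC RAW data produced in 2010 (PB)
derived_fraction = 1.0      # assumed: derived data products comparable to RAW
simulated_over_raw = 1.5    # assumed: more simulated data than RAW
replica_factor = 2.0        # assumed: copies for safety and efficient access

total_pb = raw_pb * (1 + derived_fraction + simulated_over_raw) * replica_factor
print(f"RAW {raw_pb:.0f} PB/year -> about {total_pb:.0f} PB/year on the Grid")

# The scheduled upgrades raise RAW rates tenfold, and everything above
# scales with it:
print(f"after a tenfold RAW increase: about {10 * total_pb:.0f} PB/year")
```

Even with modest assumed factors, the total grows to roughly an order of magnitude above the RAW rate, which motivates the next point.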
Big Data
– A brute-force approach to scaling up Big Data processing on the Grid for the LHC upgrade needs is not an option (http://www.wired.com/magazine/2013/04/bigdata)

Physics Facing Limits
– The demands on computing resources to accommodate the Run 2 physics needs are increasing
  • HEP now risks compromising physics because of a lack of computing resources, which has not been true for ~20 years (from I. Bird's presentation at the "HPC and super-computing workshop for Future Science Applications", BNL, June 2013)
– From the US Snowmass Study: "The limits are those of tolerable cost for storage and analysis. Tolerable cost is established in an explicit or implicit optimization of physics dollars for the entire program. The optimum rate of data to persistent storage depends on the capabilities of technology, the size and budget of the total project, and the physics lost by discarding data. There is no simple answer!" (https://indico.fnal.gov/getFile.py/access?contribId=342&sessionId=100&resId=0&materialId=1&confId=6890)
– Physics needs drive the future research directions in Big Data processing on the Grid

HEP Data Challenges
[Figure; text not recoverable from the PDF layer]

US Big Data Research and Development Initiative
– At the time of the "Big Data Research and Development Initiative" announcement, a $200 million investment in tools to handle the huge volumes of digital data needed to spur U.S. science and engineering discoveries, two examples of successful HEP technologies were already in place:
  • PanDA (Production and Distributed Analysis), a Workload Management System, and XRootD, high-performance, fault-tolerant software for fast, scalable access to data repositories of many kinds
– Supported by the DOE Office of Advanced Scientific Computing Research, PanDA, a Workload Management System already proven at extreme scales, is now being generalized and packaged for the wider use of the Big Data community
  • Progress in this project was reported by A. Klimentov earlier in this session

Synergistic Challenges
– As HEP is facing the Big Data processing challenges ahead of other sciences, it is instructive to look for commonalities in the discovery process across the sciences
– In 2013 a Subcommittee of the US DOE Advanced Scientific Computing Advisory Committee prepared the Summary Report on Synergistic Challenges in Data-Intensive Science and Exascale Computing

Knowledge-Discovery Life-Cycle for Big Data
[Diagram (A): a closed loop over transactional, historical, and relational data — Data Generation (instruments, sensors, supercomputers) → Data Management → Data Processing/Organization → Data Reduction, Query → Data Visualization → Data Sharing → Mining, Discovery, Predictive Modeling → Act, Refine, Feedback → Trigger/Predict]

1. Data Generation: data may be generated by instruments, experiments, sensors, or supercomputers
2. Data Processing/Organization: (re)organizing, processing, deriving subsets, reduction, visualization, query analytics, distributing, and other aspects
  • In LHC experiments, this includes common operations on and derivations from RAW data; the output of data processing is used by thousands of scientists for knowledge discovery
3. Mining, Discovery, Predictive Modeling: given the size and complexity of data and the need for both top-down and bottom-up discovery, scalable algorithms and software need to be deployed in this phase
  • Although the discovery process can be quite specific to the scientific problem under consideration, repeated evaluations, what-if scenarios, predictive modeling, correlations, causality, and other mining operations at scale are common at this phase
4. Act, Refine, Feedback: insights and discoveries from the previous phases help determine new simulations, models, parameters, settings, and observations, thereby closing the loop
  • While this represents a common high-level approach to data-driven knowledge discovery, there can be important differences among sciences as to how data is produced, consumed, stored, processed, and analyzed
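The closed loop can be summarized in a schematic sketch; the function names and bodies below are placeholders standing in for domain-specific implementations, not anything from the Summary Report or experiment software.

```python
# Schematic sketch of the closed-loop knowledge-discovery life-cycle.
# Every function body is a placeholder for a domain-specific stage.

def generate_data(settings):
    """Data Generation: instruments, sensors, or supercomputers."""
    return f"raw data taken with {settings}"

def process_and_organize(raw):
    """Data Processing/Organization: common operations and derivations."""
    return f"derived products from [{raw}]"

def mine_and_model(derived):
    """Mining, Discovery, Predictive Modeling at scale."""
    return {"insight": f"pattern found in {derived}",
            "refined_settings": "updated trigger/selection"}

settings = "initial settings"
for cycle in range(3):
    raw = generate_data(settings)             # phase 1
    derived = process_and_organize(raw)       # phase 2 (managed, reduced, shared)
    discovery = mine_and_model(derived)       # phase 3
    settings = discovery["refined_settings"]  # phase 4: act, refine, feedback
    print(f"cycle {cycle}: {discovery['insight']}")
```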
Data-Intensive Science Workflow
– The Summary Report identified an urgent need to simplify the workflow for Data-Intensive Science
  • Analysis and visualization of increasingly large-scale data sets will require integration of the best computational algorithms with the best interactive techniques and interfaces
  • The workflow for data-intensive science is complicated by the need to simultaneously manage large volumes of data as well as large amounts of computation to analyze the data, and this complexity is increasing at an inexorable rate
– These complications can greatly reduce the productivity of the domain scientist if the workflow is not simplified and made more flexible
  • For example, the workflow should be able to transparently support decisions such as when to move data to computation or computation to data

Lessons Learned
– The distributed computing environment for the LHC has proved to be a formidable resource, giving scientists access to huge resources that are pooled worldwide and largely automatically managed
  • However, the scale of the operational effort required is burdensome for the HEP community and will be hard to replicate in other science communities
  • Could the current HEP distributed environments be used as a distributed-systems laboratory to understand how more robust, self-healing, self-diagnosing systems could be created?
– Indeed, Big Data processing on the Grid must tolerate a continuous stream of failures, errors, and faults
  • Transient job failures on the Grid can be recovered by managed re-tries (see the sketch after this slide)
  • However, workflow checkpointing at the level of a file or a job delays turnaround times
– Advancements in reliability engineering provide a framework for a fundamental understanding of the Big Data processing turnaround time
  • Designing fault-tolerance strategies that minimize the duration of Big Data processing on the Grid is an active area of research
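The interplay between managed re-tries and turnaround time can be illustrated with a small Monte Carlo. This is a minimal sketch under assumed numbers: the failure probability, job length, campaign size, and the simple retry policy are all illustrative, not measurements from any Grid workload.

```python
import random

# Minimal Monte Carlo sketch of how managed re-tries of transient job
# failures interact with job-level checkpointing to set the turnaround
# time of a processing campaign. All numbers are illustrative assumptions.

P_TRANSIENT_FAILURE = 0.1   # assumed chance that one job attempt fails
JOB_HOURS = 6.0             # assumed wall-clock time of one job attempt

def hours_with_retries(rng):
    """Managed re-tries: re-submit a failed job until it succeeds."""
    hours = JOB_HOURS
    while rng.random() < P_TRANSIENT_FAILURE:  # transient failure: retry
        hours += JOB_HOURS                     # a failed attempt still costs time
    return hours

def campaign_turnaround(n_jobs, rng):
    """Jobs run in parallel, so the campaign ends with its slowest job."""
    return max(hours_with_retries(rng) for _ in range(n_jobs))

rng = random.Random(42)
samples = [campaign_turnaround(1000, rng) for _ in range(200)]
print(f"ideal turnaround: {JOB_HOURS:.0f} h; "
      f"mean with retries: {sum(samples) / len(samples):.1f} h")
# On average each job needs only 1/(1 - p) attempts, but the campaign
# waits for the unluckiest of its 1000 jobs -- the checkpoint-granularity
# delay noted in the Lessons Learned slide above.
```

The point of the sketch is the gap between the per-job average and the campaign maximum: fault-tolerance strategies that minimize total duration must target the tail, not the mean.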
Future Research Direction: Workflow Management
To significantly shorten the time needed to transform scientific data into actionable information and knowledge, the US DOE Advanced Scientific Computing Research office is preparing a call that will include:
• Dynamic data and/or information and/or "resource" collection, discovery, allocation and management mechanisms
  – Resource description and understanding
  – Resource = any entity that is part of the system (papers, files, data, documents, people, computing, storage)
  – Federated semantic discovery
• Rapid knowledge-based response and decision-making mechanisms
  – Steering scientific processes
• Composition and execution of end-to-end scientific processes across heterogeneous environments
  – Covering dynamic and static
  – Community based
  – Job management and workflows
  – Domain-specific abstractions
  – Flexible, resilient, and rapidly reconfigurable runtime environments
From R. Carlson's presentation at the "HPC and super-computing workshop for Future Science Applications" (BNL, June 2013): https://indico.bnl.gov/materialDisplay.py?contribId=16&sessionId=8&materialId=slides&confId=612

Maximizing Physics Output through Modeling
– In the preparations for LHC data taking, future networking was perceived as a limit
  • The MONARC model serves as an example of how to circumvent a resource limitation: WLCG implemented a hierarchical data flow maximizing reliable data transfers
– Today networking is not a limit, and WLCG abandoned the hierarchy (picture by I. Bird)
  • There are no fundamental technical barriers to transporting 10x more traffic within 4 years
– In contrast, future CPU and storage are perceived as a limit
  • HEP now risks compromising physics because of a lack of computing resources
  • As in the days of MONARC, HEP needs comprehensive modeling capabilities that would enable maximizing physics output within the resource constraints

Future Research Direction: Workflow Modeling
Modeling and Simulation Program Elements:
• Application workflows should have predictable performance behaviors
  – Modeling the computing, storage, and networking resources
  – Modeling the protocols, services, and applications
  – Simulating the execution environment with enough fidelity to make informed predictions
• ASCR is developing a new joint CS/Network modeling and simulation program to address this important area
  – Workshop scheduled for Sept 18-19, 2013: http://hpc.pnl.gov/modsim/2013/
  – Position papers due June 17, 2013
From R. Carlson's presentation at the "HPC and super-computing workshop for Future Science Applications" (BNL, June 2013): https://indico.bnl.gov/materialDisplay.py?contribId=16&sessionId=8&materialId=slides&confId=612

Conclusions
– Study of the Higgs boson properties is a top priority for LHC physics
  • LHC upgrades increase the demands for computing resources beyond flat budgets
  • HEP now risks compromising physics because of a lack of computing resources
– A comprehensive end-to-end solution for the composition and execution of Big Data processing workflows within given CPU and storage constraints is necessary
  • Future research in workflow management and modeling is necessary to provide the tools for maximizing scientific output within the given resource constraints
– By bringing Nuclear Electronics and Computing experts together, the NEC Symposium continues to be in a unique position to promote HEP progress, as the solution requires optimization cutting across the Trigger and Computing domains

Extra Slides
LHC Increases per Year (charts by Ian Fisk, CD/FNAL; axis text not recoverable from the PDF layer)
– LHC Computing adds about 25k processor cores a year
– And 34 PB of disk
– The cost and complexity of the storage is much larger than the processing
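A rough arithmetic note on the growth figures above: the 25k cores/year addition comes from the Ian Fisk chart, while the current total core count used below (~250k) is an illustrative assumption, not a WLCG figure.

```python
# Rough arithmetic on the growth figures above, with one assumed input:
# the current total core count is an illustrative guess, while the
# 25k cores/year addition comes from the Ian Fisk chart.

cores_now = 250_000          # assumed current Grid capacity (illustrative)
added_per_year = 25_000      # annual additions, from the chart above
target = 10 * cores_now      # naive tenfold processing need after upgrades

years = (target - cores_now) / added_per_year
print(f"linear additions alone would take about {years:.0f} years to reach 10x")
# ~90 years: flat-budget hardware growth cannot close the gap by itself,
# which is why the research directions above focus on workflow management
# and modeling rather than on brute-force scaling.
```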