Open Science Grid: Linking Universities and Laboratories in National Cyberinfrastructure
www.opensciencegrid.org
Physics Colloquium, RIT (Rochester, NY), May 23, 2007
Paul Avery, University of Florida (avery@phys.ufl.edu)

Cyberinfrastructure and Grids
- Grid: geographically distributed computing resources configured for coordinated use
- Fabric: physical resources & networks providing raw capability
- Ownership: resources controlled by owners and shared with others
- Middleware: software tying it all together (tools, services, etc.)
- Goal: enhancing collaboration via transparent resource sharing, e.g. the US-CMS "Virtual Organization"

Motivation: Data Intensive Science
- 21st century scientific discovery
  - Computationally & data intensive
  - Theory + experiment + simulation
  - Internationally distributed resources and collaborations
- Dominant, powerful factor: data growth (1 petabyte = 1000 terabytes); see the growth-rate sketch after this slide
  - 2000: ~0.5 petabyte
  - 2007: ~10 petabytes
  - 2013: ~100 petabytes
  - 2020: ~1000 petabytes
- How to collect, manage, access and interpret this quantity of data? Cyberinfrastructure is needed for:
  - Computation: massive, distributed CPU
  - Data storage & access: large-scale, distributed storage
  - Data movement: international optical networks
  - Data sharing: global collaborations (100s - 1000s of people)
  - Software: managing all of the above
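The growth figures on the slide above imply a steep exponential. As a rough illustration (the petabyte values are from the slide; the derived rate and doubling time are simple arithmetic added here, not part of the talk):

```python
# Implied growth rate of the data volumes quoted on the slide above
# (~0.5 PB in 2000 growing to ~1000 PB by 2020). Inputs are from the
# slide; the derived quantities are illustrative arithmetic only.
import math

start_pb, end_pb = 0.5, 1000.0
years = 2020 - 2000

annual_factor = (end_pb / start_pb) ** (1 / years)      # ~1.46x per year
doubling_time = math.log(2) / math.log(annual_factor)   # ~1.8 years

print(f"Implied growth: ~{(annual_factor - 1) * 100:.0f}% per year")
print(f"Data volume doubles roughly every {doubling_time:.1f} years")
```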
Open Science Grid: July 20, 2005
- Consortium of many organizations (multiple disciplines)
- Production grid cyberinfrastructure
- 80+ sites, 25,000+ CPUs: US, UK, Brazil, Taiwan

The Open Science Grid Consortium
- U.S. grid projects
- University facilities
- Multi-disciplinary facilities
- Science projects & communities
- LHC experiments
- Regional and campus grids
- Education communities
- Computer science
- Laboratory centers
- Technologists (network, HPC, ...)

Open Science Grid Basics
- Who: computational scientists, IT specialists, physicists, biologists, etc.
- What:
  - Shared computing and storage resources
  - High-speed production and research networks
  - A meeting place for research groups, software experts and IT providers
- Vision:
  - Maintain and operate a premier distributed computing facility
  - Provide education and training opportunities in its use
  - Expand reach & capacity to meet the needs of stakeholders
  - Dynamically integrate new resources and applications
- Members and partners:
  - Members: HPC facilities; campus, laboratory & regional grids
  - Partners: interoperation with TeraGrid, EGEE, NorduGrid, etc.

Crucial Ingredients in Building OSG
- Science "push": ATLAS, CMS, LIGO and SDSS foresaw an overwhelming need for distributed cyberinfrastructure
- Early funding (from 1999): the "Trillium" consortium
  - PPDG: $12M (DOE) (1999-2006)
  - GriPhyN: $12M (NSF) (2000-2006)
  - iVDGL: $14M (NSF) (2001-2007)
  - Supplements + new funded projects
- Social networks: ~150 people with many overlaps (universities, labs, SDSC, foreign partners)
- Coordination: pooling resources, developing broad goals
- Common middleware: the Virtual Data Toolkit (VDT), with multiple grid deployments/testbeds using VDT
- A unified entity when collaborating internationally; historically, a strong driver for funding agency collaboration

OSG History in Context
[Timeline, 1999-2009: PPDG (DOE), GriPhyN (NSF) and iVDGL (NSF) form Trillium, leading to Grid3 and then OSG (DOE+NSF); in parallel: LIGO preparation and operation, LHC construction/preparation and operations, the European grid + Worldwide LHC Computing Grid, and campus/regional grids.]

Principal Science Drivers
- High energy and nuclear physics: several petabytes, growing to 100s of petabytes (LHC)
- LIGO (gravity wave search): 0.5 to several petabytes
- Digital astronomy: 10s of terabytes, growing to 10s of petabytes
- Other sciences coming forward: bioinformatics (10s of petabytes), nanoscience, environmental science, chemistry, applied mathematics, materials science?
[Chart: data growth and community growth per driver, roughly 2001-2009.]

OSG Virtual Organizations
- ATLAS (HEP/LHC): HEP experiment at CERN
- CDF (HEP): HEP experiment at Fermilab
- CMS (HEP/LHC): HEP experiment at CERN
- DES (digital astronomy): Dark Energy Survey
- DOSAR (regional grid): regional grid in the Southwest US
- DZero (HEP): HEP experiment at Fermilab
- ENGAGE (engagement effort): a place for new communities
- FermiLab (lab grid): HEP laboratory grid
- fMRI: functional MRI
- GADU (bio): bioinformatics effort at Argonne
- Geant4 (software): simulation project
- GLOW (campus grid): campus grid at U of Wisconsin, Madison
- GRASE (regional grid): regional grid in upstate NY

OSG Virtual Organizations (2)
- GridChem (chemistry): quantum chemistry grid
- GPN: Great Plains Network, www.greatplains.net
- GROW (campus grid): campus grid at U of Iowa
- I2U2 (EOT): education/outreach consortium
- LIGO (gravity waves): gravitational wave experiment
- Mariachi (cosmic rays): ultra-high energy cosmic rays
- nanoHUB (nanotech): nanotechnology grid at Purdue
- NWICG (regional grid): Northwest Indiana regional grid
- NYSGRID: New York State Grid, www.nysgrid.org
- OSGEDU (EOT): OSG education/outreach
- SBGRID (structural biology): structural biology at Harvard
- SDSS (digital astronomy): Sloan Digital Sky Survey
- STAR (nuclear physics): nuclear physics experiment at Brookhaven
- UFGrid (campus grid): campus grid at U of Florida

Partners: Federating with OSG
- Campus and regional:
  - Grid Laboratory of Wisconsin (GLOW)
  - Grid Operations Center at Indiana University (GOC)
  - Grid Research and Education Group at Iowa (GROW)
  - Northwest Indiana Computational Grid (NWICG)
  - New York State Grid (NYSGrid) (in progress)
  - Texas Internet Grid for Research and Education (TIGRE)
  - nanoHUB (Purdue)
  - LONI (Louisiana)
- National:
  - Data Intensive Science University Network (DISUN)
  - TeraGrid
- International:
  - Worldwide LHC Computing Grid Collaboration (WLCG)
  - Enabling Grids for E-SciencE (EGEE)
  - TWGrid (from Academia Sinica Grid Computing)
  - Nordic Data Grid Facility (NorduGrid)
  - Australian Partnerships for Advanced Computing (APAC)
Defining the Scale of OSG: Experiments at the Large Hadron Collider
- LHC @ CERN: 27 km tunnel straddling Switzerland & France; startup 2007?
- Experiments: ATLAS, CMS, ALICE, LHCb, TOTEM
- Physics goals: search for the origin of mass, new fundamental forces, supersymmetry, other new particles

CMS: "Compact" Muon Solenoid
(Detector drawing; note the "inconsequential humans" included for scale.)

Collision Complexity: CPU + Storage
- Each hard collision is accompanied by ~30 minimum bias events (pile-up)
- Compare all charged tracks with pT > 2 GeV to the reconstructed tracks with pT > 25 GeV
- 10^9 collisions/sec; selectivity: 1 in 10^13

LHC Data and CPU Requirements (ATLAS, CMS, LHCb)
- Storage:
  - Raw recording rate 0.2 - 1.5 GB/s
  - Large Monte Carlo data samples
  - 100 PB by ~2013; 1000 PB later in the decade?
- Processing: PetaOps (> 300,000 3 GHz PCs); see the arithmetic sketch after the next slide
- Users: 100s of institutes, 1000s of researchers

OSG and the LHC Global Grid
- 5000 physicists, 60 countries; 10s of petabytes/yr by 2009; CERN / outside resource ratio = 10-20%
- CMS experiment data flow:
  - Online system -> Tier 0 (CERN computer center) at 200 - 1500 MB/s
  - Tier 0 -> Tier 1 centers (FermiLab, Korea, Russia, UK) at 10-40 Gb/s
  - Tier 1 -> Tier 2 centers on OSG (U Florida, Caltech, UCSD) at >10 Gb/s
  - Tier 2 -> Tier 3 physics caches (FIU, Iowa, Maryland) at 2.5-10 Gb/s
  - Tier 3 -> Tier 4 (PCs)
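To put the numbers on the last few slides in perspective, here is a small back-of-the-envelope calculation. The collision rate, selectivity, recording rate and CPU count are taken from the slides; the ~10^7 seconds of effective running per year is a conventional assumption added here, and the outputs are order-of-magnitude estimates only.

```python
# Rough arithmetic behind the LHC figures on the preceding slides.
collision_rate = 1e9        # proton-proton collisions per second (slide)
selectivity = 1e-13         # roughly 1 event of interest in 10^13 (slide)
seconds_per_year = 1e7      # assumed effective running time per year

signal_events = collision_rate * selectivity * seconds_per_year
print(f"Events of interest per running year: ~{signal_events:,.0f}")

recording_rate_gb_s = 1.5   # upper end of the quoted 0.2 - 1.5 GB/s
raw_pb_per_year = recording_rate_gb_s * seconds_per_year / 1e6
print(f"Raw data per running year: ~{raw_pb_per_year:.0f} PB")

pcs, clock_hz = 300_000, 3e9        # "> 300,000 3 GHz PCs"
print(f"Aggregate processing: ~{pcs * clock_hz / 1e15:.1f} PetaOps")
```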
LHC Global Collaborations
- ATLAS and CMS: 2000 - 3000 physicists per experiment
- The USA is 20-31% of the total

LIGO: Search for Gravity Waves
- LIGO Grid: 6 US sites + 3 EU sites (UK & Germany: Birmingham, Cardiff, AEI/Golm)
- LHO, LLO: LIGO observatory sites; LSC: LIGO Scientific Collaboration

Sloan Digital Sky Survey: Mapping the Sky
(Sky-survey imagery.)

Bioinformatics: GADU / GNARE (Genome Analysis Research Environment)
- Public databases: genomic databases available on the web, e.g. NCBI, PIR, KEGG, EMP, InterPro, etc.
- GADU using the grid: applications are executed as workflows on OSG, TeraGrid and the DOE Science Grid, with bidirectional data flow into an integrated database
- GADU performs:
  - Acquisition: acquire genome data from a variety of publicly available databases and store it temporarily on the file system
  - Analysis: run publicly available and in-house tools on the grid, using the acquired data and data from the integrated database
  - Storage: store the parsed data acquired from public databases and the parsed results of the tools and workflows used during analysis
- Integrated database includes:
  - Parsed sequence data and annotation data from public web sources
  - Results of the analysis tools: BLAST, Blocks, TMHMM, ...
- Services to other groups: SEED (data acquisition), Shewanella Consortium (genome analysis), others
- Applications (web interfaces) based on the integrated database:
  - Chisel: protein function analysis tool
  - PATHOS: pathogenic database for bio-defense research
  - PUMA2: evolutionary analysis of metabolism
  - TARGET: targets for structural analysis of proteins
  - Phyloblocks: evolutionary analysis of protein families

Bioinformatics (continued)
- Shewanella oneidensis genome

Nanoscience Simulations: nanoHUB.org
- Real users and real usage: >10,100 users (1,881 simulation users), >53,000 simulations
- Online simulation, courses, tutorials, seminars, learning modules, collaboration

OSG Engagement Effort
- Purpose: bring non-physics applications to OSG
- Led by RENCI (UNC + NC State + Duke)
- Specific targeted opportunities: develop a relationship, with direct assistance on the technical details of connecting to OSG
- Feedback and new requirements for the OSG infrastructure (to facilitate inclusion of new communities): more & better documentation, more automation

OSG and the Virtual Data Toolkit
- VDT: a collection of software
  - Grid software: Condor, Globus, VOMS, dCache, GUMS, Gratia, ...
  - Virtual Data System
  - Utilities
- VDT: the basis for the OSG software stack
  - Goal is easy installation with automatic configuration
  - Now widely used in other projects
  - Has a growing support infrastructure

Why Have the VDT?
- Everyone could download the software from the providers, but the VDT:
  - Figures out dependencies between software packages
  - Works with providers for bug fixes
  - Automatically configures & packages the software
  - Tests everything on 15 platforms (and growing): Debian 3.1; Fedora Core 3; Fedora Core 4 (x86, x86-64); RedHat Enterprise Linux 3 AS (x86, x86-64, ia64); RedHat Enterprise Linux 4 AS (x86, x86-64); ROCKS Linux 3.3; Scientific Linux Fermi 3; Scientific Linux Fermi 4 (x86, x86-64, ia64); SUSE Linux 9 (ia64)

VDT Growth Over 5 Years (vdt.cs.wisc.edu)
[Chart: number of major VDT components vs. time, Jan 2002 - Jan 2007 (y-axis 0-50); software has been both added and removed. Milestones: VDT 1.0 (Globus 2.0b, Condor-G 6.3.1); VDT 1.1.8 (adopted by LCG); VDT 1.1.11 (Grid2003); VDT 1.3.6 (for OSG 0.2); VDT 1.3.9 (for OSG 0.4); VDT 1.6.1 (for OSG 0.6.0); now at 1.6.1i, with many more development releases along the 1.1.x - 1.6.x series.]

Collaboration with Internet2
- www.internet2.edu

Collaboration with National Lambda Rail
- www.nlr.net
- Optical, multi-wavelength, community-owned or leased "dark fiber" (10 GbE) networks for research & education
- Spawning state-wide and regional networks (FLR, SURA, LONI, ...)
- Bulletin: NLR-Internet2 merger announcement

UltraLight: Integrating Advanced Networking in Applications
- http://www.ultralight.org; funded by NSF
- 10 Gb/s+ network
- Partners: Caltech, UF, FIU, UM, MIT; SLAC, FNAL; international partners; Level(3), Cisco, NLR

REDDnet: National Networked Storage
- NSF-funded project led by Vanderbilt; 8 initial sites; Brazil?
- Multiple disciplines: satellite imagery, HEP, Terascale Supernova Initiative, structural biology, bioinformatics
- Storage: 500 TB disk, 200 TB tape
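A quick sense of why 10 Gb/s-class networks (UltraLight, NLR) matter for storage projects like REDDnet and the LHC tiers: moving hundreds of terabytes is a multi-day undertaking even at full wire speed. The link speeds and dataset sizes below echo the slides; the idealized 100%-utilization assumption and the arithmetic are added for illustration.

```python
# Ideal transfer times for large datasets over fast wide-area links,
# assuming a fully utilized link with no protocol overhead (real
# transfers achieve some fraction of this).

def transfer_days(dataset_tb: float, link_gbps: float) -> float:
    bits = dataset_tb * 1e12 * 8          # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)    # wire-speed transfer time
    return seconds / 86400

for tb, gbps in [(500, 10), (1000, 10), (1000, 40)]:
    print(f"{tb} TB over {gbps} Gb/s: ~{transfer_days(tb, gbps):.1f} days")
```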
OSG Jobs Snapshot: 6 Months
- ~5000 simultaneous jobs from multiple VOs (September through March)

OSG Jobs Per Site: 6 Months
- ~5000 simultaneous jobs spread across multiple sites (September through March)

Completed Jobs/Week on OSG
- Up to ~400K completed jobs per week (September through March), with a large peak from the CMS "data challenge"

# Jobs Per VO
- Measured with the new accounting system (Gratia)

Massive 2007 Data Reprocessing by the D0 Experiment @ Fermilab
- ~400M events in total, ~250M of them on OSG (with LCG providing the other major contribution)
- Data handled with SAM

CDF Discovery of Bs Oscillations
- Bs <-> Bs-bar oscillations observed, with frequency f ~ 2.8 THz
- The slide showed the time evolution of the Bs flavor state in terms of the mass eigenstates Bs1 and Bs2: a decaying factor e^(-Gamma t/2) multiplying cos(x_s t/2) and sin(x_s t/2) terms
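For reference, the standard mixing formulae behind that slide, written in textbook form (not necessarily the exact expression shown in the talk); the Delta m_s value is the 2006 CDF measurement of roughly 17.8 ps^-1, which reproduces the quoted ~2.8 THz:

```latex
% Standard B_s mixing probabilities and the implied oscillation frequency.
\begin{align}
P(B_s \to B_s)(t)       &= \tfrac{1}{2}\, e^{-\Gamma_s t}\left[1 + \cos(\Delta m_s t)\right] \\
P(B_s \to \bar{B}_s)(t) &= \tfrac{1}{2}\, e^{-\Gamma_s t}\left[1 - \cos(\Delta m_s t)\right] \\
f_{\mathrm{osc}} &= \frac{\Delta m_s}{2\pi}
  \approx \frac{17.8\ \mathrm{ps}^{-1}}{2\pi} \approx 2.8\ \mathrm{THz}
\end{align}
```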
Communications: International Science Grid This Week
- SGTW, now iSGTW: www.isgtw.org
- Published since April 2005; diverse audience; >1000 subscribers

OSG News: Monthly Newsletter
- 18 issues by April 2007
- www.opensciencegrid.org/osgnews

Grid Summer Schools
- Summers 2004, 2005, 2006: one week at South Padre Island, Texas
  - Lectures plus hands-on exercises for ~40 students
  - Students of differing backgrounds (physics + CS), including minorities
- Reaching a wider audience:
  - Lectures, exercises and video on the web
  - More tutorials, 3-4 per year
  - Students, postdocs, scientists
  - Agency-specific tutorials

Project Challenges
- Technical constraints:
  - Commercial tools fall far short and require (too much) invention
  - Integration of advanced cyberinfrastructure, e.g. networks
- Financial constraints (see the funding timeline below):
  - Fragmented & short-term funding injections (most recently $30M over 5 years)
  - Fragmentation of individual efforts
- Distributed coordination and management:
  - Tighter organization within member projects compared to OSG
  - Coordination of schedules & milestones
  - Many phone/video meetings and much travel
  - Knowledge is dispersed; few people have a broad overview

Funding & Milestones: 1999 - 2007
[Timeline across three strands: grid & networking projects; large experiments; education, outreach and training. Funding: PPDG $9.5M; GriPhyN $12M; iVDGL $14M; UltraLight $2M; CHEPREO $4M; DISUN $10M; OSG $30M. Milestones: grid communications; first US-LHC grid testbeds; LIGO Grid; VDT 1.0; Grid3 start; VDT 1.3; OSG start; LHC start (2007); Grid Summer Schools (NSF, 2004, 2005, 2006); Digital Divide Workshops (2004, 2005, 2006).]

Challenges from Diversity and Growth
- Management of an increasingly diverse enterprise:
  - Science/engineering projects, organizations and disciplines as distinct cultures
  - Accommodating new member communities (expectations?)
- Interoperation with other grids:
  - TeraGrid
  - International partners (EGEE, NorduGrid, etc.)
  - Multiple campus and regional grids
- Education, outreach and training:
  - Training for researchers and students, but also for project PIs and program officers
- Operating a rapidly growing cyberinfrastructure:
  - Growth from ~25K to ~100K CPUs and from ~4 to ~10 PB of disk
  - Management of and access to rapidly increasing data stores (next slide)
  - Monitoring, accounting, achieving high utilization
  - Scalability of the support model (see the operations slide)

Rapid Cyberinfrastructure Growth: LHC
- Meeting LHC service challenges & milestones; participating in worldwide simulation productions
[Chart: projected CPU capacity (MSI2000, roughly 0-350) for 2007-2010, broken down by experiment (ALICE, ATLAS, CMS, LHCb) and by CERN / Tier-1 / Tier-2; the 2008 total corresponds to ~140,000 PCs.]

OSG Operations
- Distributed model (VOs, sites, providers): scalability!
- Rigorous problem tracking & routing
- Security, provisioning, monitoring, reporting
- Partners with EGEE operations

Five Year Project Timeline & Milestones (project start 2006, end of Phase I ~2009, end of Phase II ~2011)
- LHC: contribute to the Worldwide LHC Computing Grid; LHC simulations; event data distribution and analysis; support 1000 users and a 20 PB data archive
- LIGO: contribute to LIGO workflow and data analysis; SC5; LIGO data run; Advanced LIGO; LIGO Data Grid dependent on OSG
- STAR, CDF, D0, astrophysics: CDF simulation and analysis; D0 simulations and reprocessing; STAR data distribution and jobs
- Additional science communities: 10K jobs per day; roughly one new community added per interval through 2011
- Facility security: risk assessments, audits, incident response, management, operations, technical controls (security plan v1, first audit, then yearly risk assessments and audits)
- Facility operations and metrics: increase robustness and scale; operational metrics defined and validated each year
- Interoperate and federate with campus and regional grids
- VDT and OSG software releases: a major release every 6 months, minor updates as needed (VDT 1.4.0, 1.4.1, 1.4.2, ... plus incremental updates; OSG 0.6.0, 0.8.0, 1.0, 2.0, 3.0, ...)
- Extended capabilities and increased scalability and performance for jobs and data to meet stakeholder needs: dCache with accounting and auditing; federated monitoring and role-based information services; VDS with SRM authorization; common software distribution and transparent data/job movement with TeraGrid; EGEE using VDT 1.4.x; transparent data management with EGEE; SRM/dCache extensions; "just in time" workload management; VO services infrastructure; integrated network management; data analysis (batch and interactive) workflow; improved workflow and resource selection; security work with SciDAC-2 CEDS and Open Science

Extra Slides

VDT Release Process (Subway Map), from Alain Roy
- Day 0: gather requirements, then build the software and test it
- The release then moves through the validation test bed to become a VDT release
- An ITB release candidate goes through the integration test bed and, by Day N, becomes an OSG release
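As a toy illustration of that staged flow, the sketch below promotes a candidate only if it clears each stage in order. The stage names follow the subway-map slide; the promotion logic and the version strings are invented for illustration, not how the real VDT infrastructure is implemented.

```python
# Toy model of the staged VDT -> OSG release flow described on the slide.
STAGES = [
    "gather requirements",
    "build software",
    "test",
    "validation test bed",   # gate for the VDT release
    "integration test bed",  # ITB release candidate, gate for the OSG release
]

def promote(candidate, results):
    """Walk the candidate through the stages; stop at the first failure."""
    for stage in STAGES:
        if not results.get(stage, False):
            return f"{candidate}: held at '{stage}'"
    return f"{candidate}: VDT release -> OSG release"

print(promote("VDT 1.6.1", {s: True for s in STAGES}))
print(promote("VDT 1.7.0-pre", {"gather requirements": True,
                                "build software": True, "test": False}))
```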
VDT Challenges
- How should we smoothly update a production service?
  - In-place vs. on-the-side updates
  - Preserve the old configuration while making big changes
  - A full install and setup from scratch still takes hours
- How do we support more platforms?
  - A struggle to keep up with the onslaught of Linux distributions (Fedora Core 3, 4 and 6, RHEL 3 and 4, BCCD, ...)
  - AIX? Mac OS X? Solaris?
- How can we accommodate native packaging formats (RPM, Deb)?
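One common way to do the "on the side" update mentioned above is to install the new version next to the old one and switch a "current" symlink only when it is ready, so the old install and its configuration stay untouched and rollback is a one-line operation. The sketch below shows that generic pattern under a hypothetical /opt/grid layout; it is an illustration of the idea, not a description of how the VDT itself performs updates.

```python
# Generic "install on the side, switch atomically" pattern.
import os

INSTALL_ROOT = "/opt/grid"          # hypothetical layout: /opt/grid/vdt-<version>
CURRENT_LINK = os.path.join(INSTALL_ROOT, "current")

def activate(version):
    """Atomically point the 'current' link at an already-installed version."""
    target = os.path.join(INSTALL_ROOT, f"vdt-{version}")
    if not os.path.isdir(target):
        raise FileNotFoundError(f"{target} is not installed")
    tmp_link = CURRENT_LINK + ".new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, CURRENT_LINK)   # atomic switch; old tree untouched

# activate("1.6.1")   # switch services to the new install
# activate("1.5.2")   # ...or roll back just as easily
```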