Computing & Networking User Group Meeting
Roy Whitney, Andy Kowalski, Sandy Philpott, Chip Watson
17 June 2008

Users and JLab IT
• Ed Brash is the User Group Board of Directors' representative on the IT Steering Committee.
• Physics Computing Committee (Sandy Philpott)
• Helpdesk and CCPR requests and activities
• Challenges
  – Constrained budget
    • Staffing
    • Aging infrastructure
  – Cyber security

Computing and Networking Infrastructure (Andy Kowalski)

CNI Outline
• Helpdesk
• Computing
• Wide Area Network
• Cyber Security
• Networking and Asset Management

Helpdesk
• Hours: 8am-12pm, M-F
  – Submit a CCPR via http://cc.jlab.org/
  – Dial x7155
  – Send email to helpdesk@jlab.org
• Windows XP, Vista, and RHEL5 supported desktops
  – Migrating older desktops
• Mac support?

Computing
• Email servers upgraded
  – Dovecot IMAP server (indexing)
  – New file server and IMAP servers (farm nodes)
• Servers migrating to virtual machines
• Printing
  – Centralized access via jlabprt.jlab.org
  – Accounting coming soon
• Video conferencing (working on EVO)

Wide Area Network
• Bandwidth
  – 10 Gbps WAN and LAN backbone
  – Offsite data transfer servers
    • scigw.jlab.org (bbftp)
    • qcdgw.jlab.org (bbcp)

Cyber Security Challenge
• The threat: the sophistication and volume of attacks continue to increase.
  – Phishing attacks
    • Spear phishing/whaling attacks are now being observed at JLab.
• Federal requirements, including DOE's, for meeting the cyber security challenge call for additional measures.
• JLab uses a risk-based approach that pursues the mission while dealing with the threat.

Cyber Security
• Managed desktops
  – Skype allowed from managed desktops on certain enclaves
• Network scanning
• Intrusion detection
• PII/SUI (CUI) management

Networking and IT Asset Management
• Network segmentation/enclaves
  – Firewalls
• Computer registration
  – https://reggie.jlab.org/user/index.php
• Managing IP addresses
  – DHCP
    • Assigns all IP addresses (most static)
    • Integrated with registration
• Automatic port configuration
  – Rolling out now
  – Uses the registration database

Scientific Computing (Chip Watson & Sandy Philpott)

Farm Evolution Motivation
• Capacity upgrades
  – Re-use of HPC clusters
• Movement to open source
  – O/S upgrade
  – Change from LSF to PBS

Farm Evolution Timetable
• Nov 07: Auger/PBS available (RHEL3, 35 nodes)
• Jan 08: Fedora 8 (F8) available (50 nodes)
• May 08: Friendly-user mode; IFARML4,5
• Jun 08: Production, F8 only; IFARML3 + 60 nodes from the LSF IFARML alias
• Jul 08: IFARML2 + 60 nodes from LSF
• Aug 08: IFARML1 + 60 nodes from LSF
• Sep 08: RHEL3/LSF to F8/PBS migration complete; no renewal of LSF or RHEL for cluster nodes

Farm F8/PBS Differences
• Code must be recompiled
  – 2.6 kernel
  – gcc 4
• Software installed locally via yum
  – CERNLIB
  – MySQL
• Time limits: 1 day default, 3 days maximum (see the sketch after this list)
• stdout/stderr written to ~/farm_out
• Email notification
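To make the new limits concrete, here is a minimal sketch of submitting a job to the new PBS farm. It is illustrative only: production users submit through Auger rather than calling qsub directly, and the submit() helper, job name, and test command below are hypothetical, not a documented JLab interface.

    # Illustration only: a raw PBS submission honoring the new time limits.
    # Farm users normally go through Auger; "qsub" here is plain PBS.
    import subprocess
    import textwrap

    def submit(job_name: str, command: str, walltime: str = "24:00:00") -> str:
        """Submit a job script to PBS via qsub (stdin) and return the job id.

        The walltime default mirrors the farm's 1-day default; requests
        beyond the 3-day maximum would be refused by the scheduler.
        On the new farm, job stdout/stderr end up under ~/farm_out.
        """
        script = textwrap.dedent(f"""\
            #!/bin/sh
            #PBS -N {job_name}
            #PBS -l walltime={walltime}
            {command}
            """)
        result = subprocess.run(["qsub"], input=script, text=True,
                                capture_output=True, check=True)
        return result.stdout.strip()

    if __name__ == "__main__":
        # A quick check of the new build environment (gcc 4, 2.6 kernel).
        print(submit("recompile-check", "gcc --version && uname -r"))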
Farm Future Plans
• Additional nodes
  – From HPC clusters
    • CY08: ~120 4g nodes
    • CY09-10: ~60 6n nodes
  – Purchase as budgets allow
• Support for 64-bit systems when feasible and needed

Storage Evolution
• Deployment of Sun x4500 "thumpers"
• Decommissioning of Panasas (old /work server)
• Planned replacement of old cache nodes

Tape Library
• The current STK "Powderhorn" silo is nearing end of life
  – Reaching capacity and running out of blank tapes
  – Does not support an upgrade to higher-density cartridges
  – Is officially end-of-life in December 2010
• Market trends
  – The LTO (Linear Tape Open) standard has proliferated since 2000
  – LTO-4 offers 4x the density, capacity/$, and bandwidth of the 9940B: 800 GB/tape, $100/TB, 120 MB/s
  – LTO-5, out next year, will double capacity with 1.5x the bandwidth: 1600 GB/tape, 180 MB/s
  – LTO-6 will be out prior to the 12 GeV era: 3200 GB/tape, 270 MB/s

Tape Library Replacement
• Competitive procurement now in progress
  – Replace the old system; support 10x growth over 5 years
• Phase 1 in August
  – System integration, software evolution
  – Begin data transfers, re-using 9940B tapes
• Tape swap through January
• 2 PB capacity by November
• DAQ to LTO-4 in January 2009
• Old silo gone in March 2009
• End result: break even on cost by the end of 2009!

Long Term Planning
• Continue to increase compute and storage capacity in the most cost-effective manner
• Improve processes and planning
  – PAC submission process
  – 12 GeV planning…

E.g.: Hall B Requirements (cross-checked in the sketch at the end of this section)

Event Simulation                  2012      2013      2014      2015      2016
SPECint_rate2006 sec/event        1.8       1.8       1.8       1.8       1.8
Number of events                  1.00E+12  1.00E+12  1.00E+12  1.00E+12  1.00E+12
Event size (KB)                   20        20        20        20        20
% stored long term                10%       25%       25%       25%       25%
Total CPU (SPECint_rate2006)      5.7E+04   5.7E+04   5.7E+04   5.7E+04   5.7E+04
Petabytes / year (PB)             2         5         5         5         5

Data Acquisition                  2012      2013      2014      2015      2016
Average event size (KB)           20        20        20        20        20
Max sustained event rate (kHz)    0         0         10        10        20
Average event rate (kHz)          0         0         10        10        10
Average 24-hour duty factor (%)   0%        0%        50%       60%       65%
Weeks of operation / year         0         0         0         30        30
Network (n x 10GigE)              1         1         1         1         1
Petabytes / year                  0.0       0.0       0.0       2.2       2.4

1st Pass Analysis                 2012      2013      2014      2015      2016
SPECint_rate2006 sec/event        1.5       1.5       1.5       1.5       1.5
Number of analysis passes         0         0         1.5       1.5       1.5
Event size out / event size in    2         2         2         2         2
Total CPU (SPECint_rate2006)      0.0E+00   0.0E+00   0.0E+00   7.8E-03   8.4E-03
Silo bandwidth (MB/s)             0         0         900       900       1800
Petabytes / year                  0.0       0.0       0.0       4.4       4.7

Totals                            2012      2013      2014      2015      2016
Total SPECint_rate2006            5.7E+04   5.7E+04   5.7E+04   5.7E+04   5.7E+04
SPECint_rate2006 / node           600       900       1350      2025      3038
# nodes needed (current year)     95        63        42        28        19
Petabytes / year                  2         5         5         12        12

LQCD Computing
• JLab operates 3 clusters with nearly 1100 nodes, primarily for LQCD plus some accelerator modeling
• National LQCD Computing Project (2006-2009: BNL, FNAL, JLab; USQCD Collaboration)
• The LQCD II proposal (2010-2014) would double the hardware budget to enable key calculations
• JLab experimental physics and LQCD computing share staff (operations and software development) and the tape silo, providing efficiencies for both
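The storage and CPU entries in the Hall B requirements table above follow from straightforward rate arithmetic and can be reproduced independently. Below is a minimal cross-check in Python; the year and week lengths are my assumptions, everything else is read directly off the table.

    # Cross-check three entries of the Hall B requirements table.
    SECONDS_PER_WEEK = 7 * 86400
    SECONDS_PER_YEAR = 365 * 86400            # ~3.15e7 s

    # Event simulation, 2013 column: 1e12 events/year at
    # 1.8 SPECint_rate2006-seconds each, 20 KB/event, 25% stored long term.
    sim_cpu = 1.8 * 1.0e12 / SECONDS_PER_YEAR   # sustained SPECint_rate2006
    sim_pb = 1.0e12 * 20e3 * 0.25 / 1e15        # petabytes stored per year
    print(f"simulation CPU  ~ {sim_cpu:.1e}     (table: 5.7E+04)")
    print(f"simulation data ~ {sim_pb:.0f} PB/yr    (table: 5)")

    # Data acquisition, 2015 column: 10 kHz average event rate, 20 KB/event,
    # 60% duty factor, 30 weeks of operation.
    daq_pb = 10e3 * 20e3 * 0.60 * 30 * SECONDS_PER_WEEK / 1e15
    print(f"DAQ data        ~ {daq_pb:.1f} PB/yr  (table: 2.2)")

Running this prints ~5.7e+04, 5 PB/yr, and 2.2 PB/yr, matching the table's simulation CPU, simulation storage, and 2015 DAQ storage entries.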