Grids and NIBR
Steve Litster, PhD, Manager of the Advanced Computing Group
OGF22, Feb. 26, 2008

Agenda
• Introduction to NIBR
• Brief history of grids at NIBR
• Overview of the Cambridge Campus Grid
• Storage woes and the virtualization solution

NIBR Organization
The Novartis Institutes for BioMedical Research (NIBR) is Novartis' global pharmaceutical research organization. Informed by clinical insights from our translational medicine team, we use modern science and technology to discover new medicines for patients worldwide. Approximately 5,000 people worldwide.

NIBR Locations
Vienna, Cambridge, Horsham, Tsukuba, Basel, Emeryville, Shanghai, E. Hanover

Brief History of Grids at NIBR: 2001-2004 Infrastructure
• PC grid for cycle "scavenging": 2,700 desktop CPUs
  - Applications primarily molecular docking (GOLD and DOCK)
• SMP systems: multiple, departmentally segregated
  - No queuing, multiple interactive sessions
• Linux clusters (3 in US, 2 in EU): departmentally segregated
  - Multiple job schedulers: PBS, SGE, "home-grown"
• Storage
  - Multiple NAS systems and SMP systems serving NFS-based file systems
  - Growth: 100 GB/month

Brief History of Grids at NIBR cont.: 2004-2006
• PC grid "winding" down
  - Reason: application on-boarding was very specialized
• SMP systems: multiple, departmentally segregated systems
• Linux clusters (420 CPUs in US, 200 CPUs in EU)
  - Departmental consolidation (still segregated through "queues")
  - Standardized OS and queuing systems
• Storage
  - Consolidation to a centralized NAS and SAN solution
  - Growth rate: 400 GB/month

Brief History of Grids at NIBR cont.: 2006-Present
• Social grids: community forums, SharePoint, MediaWiki (user driven)
• PC grid: 1,200 CPUs (EU); changing requirements
• SMP systems
  - Multiple 8+ CPU systems, shared resources, part of the compute grid
• Linux clusters: 4 globally available (approx. 800 CPUs)
  - Incorporation of Linux workstations, virtual systems and web portals
  - Workload changing from docking/bioinformatics to statistical and image analysis
• Storage (top priority)
  - Virtualization
  - Growth rate: 2 TB/month (expecting 4-5 TB/month)

Cambridge Site "Campus Grid"

Cambridge Infrastructure

Cambridge Campus Grid cont.: Current Capabilities
• Unified home, data and application directories
• All file systems are accessible via NFS and CIFS (multi-protocol)
• NIS and AD UIDs/GIDs synchronized
• Users can submit jobs to the compute grid from Linux workstations or portal-based applications, e.g. Inquiry integrated with SGE (see the sketch below)
• Extremely scalable NAS solution and virtualized file systems
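To make that submission path concrete, here is a minimal sketch of what a workstation- or portal-side submission to SGE could look like. It is not NIBR's actual integration: the queue name, parallel environment name and job script are placeholder assumptions; qsub and its -N/-q/-cwd/-pe options are standard SGE.

```python
#!/usr/bin/env python
"""Minimal sketch: submit a docking job to an SGE-managed compute grid.

Assumptions (not NIBR's real configuration): queue 'compute.q',
parallel environment 'smp', job script 'run_dock.sh'.
"""
import re
import subprocess


def submit_job(script, queue="compute.q", name="dock_run", slots=1):
    """Submit `script` with qsub and return the numeric SGE job ID."""
    cmd = [
        "qsub",
        "-N", name,                 # job name shown by qstat
        "-q", queue,                # target queue (placeholder name)
        "-cwd",                     # run in the current working directory
        "-pe", "smp", str(slots),   # slot request via an assumed PE name
        script,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # Typical SGE reply: 'Your job 12345 ("dock_run") has been submitted'
    match = re.search(r"Your job (\d+)", out)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print("Submitted SGE job", submit_job("run_dock.sh"))
```

A portal integration such as the Inquiry/SGE example above would typically wrap an equivalent qsub call behind a web form; the command-line flags are the part that carries over.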
Storage Infrastructure cont.
• Storage (top priority)
  - Virtualization
  - Growth rate: 2 TB/month (expecting 4-5 TB/month)
[Chart: Cambridge Storage Growth, Jan 2005 - Jan 2008, on a 0-100 TB scale; SAN and NAS capacity with labeled points at 7 TB, 13 TB and 40 TB and a "70 TB Used" marker]

How did we get there technically?
Original design: a single NAS file system
• 2004: 3 TB (no problems, life was good)
• 2005: 7 TB (problems begin)
  - Backing up file systems >4 TB is problematic
  - Restoring data from a 4 TB file system is even more of a problem
    • Must have a "like device"; a 2 TB restore takes 17 hrs
  - Time to think about NAS virtualization
• 2006: 13 TB (major problems begin)
  - Technically past the limit supported by the storage vendor
  - Journalled file systems do need fsck'ing sometimes

NAS Virtualization
Requirements:
• Multi-protocol file systems (NFS and CIFS)
• No "stubbing" of files
• No downtime due to storage expansion/migration
• Throughput must be as good as the existing solution
• Flexible data management policies
• Must back up within 24 hrs

NAS Virtualization Solution

NAS Virtualization Solution cont.
Pros
• Huge reduction in backup resources (90 -> 30 LTO3 tapes/week)
• Less wear and tear on backup infrastructure (and operator)
• Cost savings: Tier 4 = 3x, Tier 3 = x
• Storage lifecycle: NSX example
• Much less downtime (99.96% uptime)
Cons
• Can be more difficult to restore data
• Single point of failure (Acopia metadata)
• Not built for high throughput (requirements TBD)

Storage Virtualization
Inbox: data repository for "wet lab" data
• 40 TB of unstructured data
• 15 million directories
• 150 million files
• 98% of data not accessed 3 months after creation
• Growth due to HCS, X-ray crystallography, MRI, NMR and mass spectrometry data
• Expecting 4-5 TB/month

Lessons Learned
Can we avoid having to create a technical solution?
• Discuss data management policy before lab instruments are deployed
• Discuss standardizing image formats
• Start taking advantage of metadata
In the meantime:
• Monitor and manage storage growth and backups closely (see the sketch below)
• Scale backup infrastructure with storage
• Scale power and cooling: new storage systems require >7 kW/rack
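As a companion to the monitoring point above, here is a minimal sketch of how one might quantify "cold" data, in the spirit of the Inbox finding that 98% of data is not accessed within 3 months of creation, as input to tiering and backup policy. The root path and the 90-day cutoff are illustrative assumptions, and access-time checks only work where file systems are not mounted noatime.

```python
#!/usr/bin/env python
"""Minimal sketch: report how much of a directory tree has not been
accessed in the last 90 days (the root path and cutoff are placeholders)."""
import os
import time

ROOT = "/nibr/inbox"   # placeholder path, not a real NIBR mount point
CUTOFF_DAYS = 90       # illustrative "cold data" threshold


def cold_data_report(root, cutoff_days=CUTOFF_DAYS):
    """Return (total_files, cold_files, total_bytes, cold_bytes) under root."""
    cutoff = time.time() - cutoff_days * 86400
    total_files = cold_files = total_bytes = cold_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total_files += 1
            total_bytes += st.st_size
            if st.st_atime < cutoff:  # last access older than the cutoff
                cold_files += 1
                cold_bytes += st.st_size
    return total_files, cold_files, total_bytes, cold_bytes


if __name__ == "__main__":
    files, cold, size, cold_size = cold_data_report(ROOT)
    pct = 100.0 * cold_size / size if size else 0.0
    print(f"{cold}/{files} files ({pct:.1f}% of bytes) untouched for {CUTOFF_DAYS}+ days")
```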
NIBR Grid Future (highly dependent on open standards)
• Cambridge model deployed globally
• Tighter integration of the grids (storage, compute, social) with:
  - Workflow tools: Pipeline Pilot, KNIME, ...
  - Home-grown and ISV applications
  - Data repositories: start accessing information, not just data
  - Portals/SOA
  - Authentication and authorization

Acknowledgements
Bob Cohen, John Ehrig, OGF
Novartis ACS Team:
• Jason Calvert, Bob Coates, Mike Derby, James Dubreuil, Chris Harwell, Delvin Leacock, Bob Lopes, Mike Steeves, Mike Gannon
Novartis Management:
• Gerold Furler, Ted Wilson, Pascal Afflard and Remy Evard (CIO)

Additional Slides: NIBR Network
Note: E. Hanover is the North American hub and Basel is the European hub. All data center sites on the global Novartis internal network can reach every other data center site, subject to locally configured routing devices and any firewalls in the path.
• Basel: redundant OC3 links to the BT OneNet cloud
• Cambridge: redundant dual OC12 circuits to E. Hanover; no OneNet connection at this time; NITAS Cambridge also operates a network DMZ with a 100 Mbps/1 Gbps internet circuit
• E. Hanover: redundant OC12s to USCA; OC3 to Basel; T3 into the BT OneNet MPLS cloud with virtual circuits defined to Tsukuba and Vienna
• Emeryville: 20 Mbps OneNet MPLS connection
• Horsham: redundant E3 circuits into the BT OneNet cloud with virtual circuits to E. Hanover and Basel
• Shanghai: 10 Mbps BT OneNet cloud connection, with backup WAN to Pharma Beijing
• Tsukuba: via Tokyo, redundant T3 links into the BT OneNet cloud with virtual circuits defined to E. Hanover and Vienna
• Vienna: redundant dual E3 circuits into the BT OneNet MPLS cloud with virtual circuits to E. Hanover, Tsukuba and Horsham