Data Intensive World
Bob Curran and Kevin Gildea, IBM Systems & Technology Group
© 2006 IBM Corporation

Introduction
- Data size is growing, both in aggregate and in individual elements.
- The rate of data storage and access is growing, both in aggregate and for individual elements.
- The price of storage is making it affordable to store more data, and more elaborate forms of old data.
- Better networking is making it possible to deliver some of this data to users away from the installation that created it.
- New application classes build upon all of this.
- All of this leads to questions about data integrity, data validity, data search, backup, retention, ...

Data is growing
- Individual files are growing:
  - 1 MB of text observations has become 1 TB of observational data in weather over the last two decades.
  - A 4 KB letter in ASCII becomes a 30 KB Word document with no added pictures, signatures or logos.
  - Media applications are generating more, and higher-resolution, data.
- Aggregate data is growing:
  - Science data is growing: CERN is starting a new set of particle acceleration experiments expected to collect 10 PB/year of filtered, retained data.
  - The GPFS file system at LLNL reached 2 PB this year.
  - A Library of Congress RFP in January 2006 called for the ingest of 8 PB/year of video data, retained indefinitely.

Storage is growing
- Anthology Terabyte NAS on sale at Fry's: $699, or $599 with rebate.
- 500 GB SATA drives are available today; FC drives at 300 GB.
- An IBM DS8300 can store 180 TB of data in one controller; a DS4800 mid-range controller can store 89 TB.
- One IBM TS3500 tape library can store over 3 PB of data on 3592 cartridges, each of which can hold 1 TB with typical compression.
- Disk density improvements appear to be slowing down.
- PLUS: you can aggregate these to create file systems of hundreds of TB to a few PB today.

Data Access Rates are growing
- Disks attached by 4 Gb FC, 320 MB/sec SCSI, ... are replacing the 10 MB/sec SCSI of a few years ago.
- Drive speeds of 10K or 15K rpm are common, replacing the 5400 rpm of a few years ago.
- Data transfer rates vary heavily with usage patterns and fragmentation, but 50 MB/sec or more is not uncommon in our experience.
- RAID arrays can multiply this transfer rate by 4-8x for a single object.
- PLUS: you can stripe files or sets of files across arrays, controllers and links to produce even larger data rates for a file or collection of files (a back-of-envelope sketch follows the next slide). GPFS measured 122 GB/sec for one file at LLNL in 3/06.

Network data rates are going up
- My office has gone from 15 Mb (token ring) to 1 Gb over the past 5 years.
- My home has gone from 9 kbps modems to 300 Mbps cable modems.
- The highest-performing clusters have multiple 2 GB/sec proprietary links, and 4X IB or 10 Gb Ethernet.
- Multiple Gb/sec links are common between sites, and some GRID environments aggregate enough to provide 30-40 Gb/sec.
- I'd like any data from anywhere, and I'd like it now, please.
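To make the striping arithmetic on the Data Access Rates slide concrete, here is a minimal back-of-envelope sketch in Python. It simply multiplies a per-disk streaming rate by a RAID factor and an array count under an idealized linear-scaling assumption; only the ~50 MB/sec per-disk figure and the 4-8x RAID multiplier come from the slide, the array count is a hypothetical input.

```python
# Back-of-envelope aggregate bandwidth from wide striping.
# Assumptions (idealized, for illustration only):
#   - each RAID array streams roughly raid_factor * per_disk_mb_s
#   - striping across arrays scales linearly until links or controllers saturate

def aggregate_mb_per_s(per_disk_mb_s: float, raid_factor: int, num_arrays: int) -> float:
    """Idealized aggregate streaming rate when a file is striped over num_arrays RAID arrays."""
    return per_disk_mb_s * raid_factor * num_arrays

if __name__ == "__main__":
    # One array: ~50 MB/s per disk with an 8x RAID multiplier -> ~400 MB/s.
    print(aggregate_mb_per_s(50.0, 8, 1))      # 400.0
    # Stripe the same file across 300 such arrays -> ~120,000 MB/s (~120 GB/s),
    # the same order of magnitude as the 122 GB/s LLNL measurement cited above.
    print(aggregate_mb_per_s(50.0, 8, 300))    # 120000.0
```

Real systems fall well short of linear scaling because of controller, fabric and file-system overheads; the point is only that wide striping, not faster individual disks, is what carries per-disk rates of tens of MB/sec into the tens-to-hundreds of GB/sec range.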
U Penn Digital Mammography Project
- Potential for 28 petabytes/year over 2,000 hospitals in full production.
- Proposed hierarchical layout: 7 regional archives @ 4,000 TB/yr, 20 area archives @ 100 TB/yr, 15 hospitals @ 7 TB/yr.
- [Diagram: proposed hierarchical layout, with hospital archives feeding regional archives.]
- Goal: distribute the storage load and balance network and query loads.

U Penn Digital Mammography Project – Current NDMA Configuration
- Testbed to demonstrate feasibility:
  - Storage and retrieval
  - Infrastructure for access
  - Instant consultation with experts
  - Innovative teaching tools
  - Ensuring privacy and confidentiality
- [Map: current NDMA portal locations and regional archives, connected via CA*net 3, the Chicago GigaPoP, NSCP, Abilene, the N.C. GigaPoP, the Atlanta GigaPoP and ESnet.]
- http://nscp.upenn.edu/NDMA

HPCS (a petaflop project)
- As stated:
  - 1.5-5 TB/sec aggregate bandwidth for big science applications; less for other applications.
  - 32K file creates/sec.
  - A single stream at 30 GB/sec.
  - 1 trillion files in a file system.
- Inferred:
  - Probably requires 50-100 PB for a balanced system at the high end of this project; multi-PB even at the lower end.
  - It would be good to manage these components effectively; there are a lot of components here that need to work in concert.
  - Access beyond the main cluster is desired, beyond basic NFS.

Data increasingly created by events/sensors
- Images collected from NEAT (Near-Earth Asteroid Tracking) telescopes:
  - First year: processed 250,000 images and archived 6 TB of compressed data, randomly accessed by a community of users.
- Life sciences:
  - A mass spectrometer creates 200 GB/day; times 100 mass spectrometers -> 20 TB/day.
  - Mammograms, X-rays and other medical images.
- The ability to create and store data is increasing exponentially; the problem is to extract meaningful information from all the data.
- Often the information can be found only with hindsight.
- Sometimes some analysis can be done at data collection, which requires the ability to access the data as it is collected. High data rates are required.

Collaborative computing – no boundaries
- Emergence of large, multidisciplinary computational science teams:
  - Across geographies.
  - Across disciplines/sciences:
    - Biological systems: computer science, biochemistry, statistics, computational chemistry, fluids, ...
    - Automotive design: changing sheet-metal design with interactive integration.
- Collaboration within an enterprise/university; collaboration anywhere/anytime.
- The ability to share and characterize data is critical.

Data accessible forever
- Size – too large to back up/restore unconditionally.
- Growth rate – constantly changing/increasing.
- Value – always needs to be available for analysis.
- Search – need to be able to find the data.
- Performance – appropriate for the data.
- Collaboration – no way to anticipate users.
- A new storage paradigm is needed.

GPFS: parallel file access
- A parallel cluster file system based on the shared-disk (SAN) model:
  - Cluster – fabric-interconnected nodes (IP, SAN, ...).
  - Shared disk – all data and metadata on fabric-attached disk.
  - Parallel – data and metadata flow from all of the nodes to all of the disks in parallel, under control of a distributed lock manager.
- [Diagram: GPFS file system nodes connected by a switching fabric (system or storage area network) to shared disks (SAN-attached or network block devices).]
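As a minimal sketch of the shared-disk, wide-striping idea above (not GPFS internals: the block size, disk count and round-robin layout are illustrative assumptions), the following shows why a large sequential read naturally spreads over every shared disk:

```python
# Toy illustration of wide striping in a shared-disk file system.
# Assumptions (illustrative only): fixed-size blocks laid out round-robin
# across the shared disks, so a large sequential read touches all disks.

BLOCK_SIZE = 1 << 20          # 1 MiB blocks (hypothetical)
NUM_DISKS = 8                 # shared, fabric-attached disks (hypothetical)

def block_location(offset: int) -> tuple[int, int]:
    """Map a byte offset in a file to (disk index, block number on that disk)."""
    block = offset // BLOCK_SIZE
    return block % NUM_DISKS, block // NUM_DISKS

def disks_touched(length: int) -> set[int]:
    """Which disks serve a sequential read of `length` bytes starting at offset 0."""
    blocks = range((length + BLOCK_SIZE - 1) // BLOCK_SIZE)
    return {b % NUM_DISKS for b in blocks}

if __name__ == "__main__":
    print(block_location(0))                  # (0, 0)
    print(block_location(5 * BLOCK_SIZE))     # (5, 0)
    # A 64 MiB sequential read spans all 8 disks, so its bandwidth is the
    # sum of the individual disk rates rather than that of a single disk.
    print(sorted(disks_touched(64 * BLOCK_SIZE)))
```

Because every node computes the same block-to-disk mapping and talks to the disks directly, the distributed lock manager is what keeps the resulting parallel data and metadata streams consistent.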
Adding Information Lifecycle Management to GPFS
- GPFS adds support for ILM abstractions: filesets, storage pools and policy.
  - Fileset: a subtree of a file system.
  - Storage pool: a group of LUNs.
  - Policy: rules for placing files into storage pools.
- Examples (a toy sketch of rule-driven placement follows the Grid Computing slide below):
  - Place new files on fast, reliable storage; move files to slower storage as they age, and then to tape.
  - Place media files on video-friendly storage (fast, smooth) and other files on cheaper storage.
  - Place related files together, e.g. for failure containment.
- [Diagram: native GPFS clients (Linux, AIX) running applications with GPFS placement policy; a GPFS manager node providing the cluster, lock, quota, allocation and policy managers; the GPFS RPC protocol; and a storage network connecting the system pool and gold/silver/pewter data pools within one GPFS file system (volume group).]

The largest GPFS systems

  System         Year   TF    GB/s   Nodes   Disk size   Storage    Disks
  Blue Pacific   1998   3     3      1464    9 GB        43 TB      5040
  White          2000   12    9      512     19 GB       147 TB     8064
  Purple/C       2005   100   122    1536    250 GB      2000 TB    11000

ASCI Purple supercomputer
- A 1536-node, 100 TF pSeries cluster at Lawrence Livermore National Laboratory.
- A 2 PB GPFS file system.
- 122 GB/s to a single file from all nodes in parallel.

Multi-cluster GPFS and grid storage
- Multi-cluster is supported in GPFS 2.3:
  - Remote mounts are secured with OpenSSL.
  - User IDs are mapped across clusters:
    - Server and remote client clusters can have different userid spaces.
    - File userids on disk may not match credentials on the remote cluster.
    - A pluggable infrastructure allows userids to be mapped across clusters.
- Multi-cluster works within a site or across a WAN:
  - Lawrence Berkeley Labs (NERSC): multiple supercomputer clusters share large GPFS file systems.
  - DEISA (European computing grid): RZG, CINECA, IDRIS, CSC, CSCS, UPC, IBM; pSeries and other clusters interconnected with a multi-gigabit WAN; multi-cluster GPFS in "friendly-user" production 4/2005.
  - Teragrid: SDSC, NCSA, ANL, PSC, CalTech, IU, UT, ...; sites linked via a 30 Gb/sec dedicated WAN; a 500 TB GPFS file system at SDSC shared across sites and 1,500 nodes; multi-cluster GPFS in production 10/2005.

Grid Computing
- Dispersed resources connected via a network: compute, visualization, data, and instruments (telescopes, microscopes, etc.).
- Sample workflow: ingest data at Site A, compute at Site B, visualize the output data at Site C.
- A logistical nightmare!
  - Conventional approach: copy data via ftp. Space, time, bookkeeping.
  - Possible approach: on-demand parallel file access over the grid ... but the scale of the problems demands high performance.
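As promised on the ILM slide above, here is a toy sketch of rule-driven placement and age-based migration. It is deliberately written as plain Python rather than GPFS's policy language; the pool names, the media-file test and the age thresholds are illustrative assumptions.

```python
# Toy illustration of ILM-style placement and migration rules.
# Not GPFS policy syntax; pool names, ages and the media-file test are assumptions.
from dataclasses import dataclass

@dataclass
class FileInfo:
    name: str
    age_days: int             # days since last access

MEDIA_SUFFIXES = (".mpg", ".mov", ".avi")

def placement_pool(f: FileInfo) -> str:
    """Choose a pool for a newly created file."""
    if f.name.lower().endswith(MEDIA_SUFFIXES):
        return "gold"          # fast, smooth storage for media files
    return "silver"            # default pool for everything else

def migration_pool(f: FileInfo, current_pool: str) -> str:
    """Move files to cheaper storage as they age, and eventually to tape."""
    if f.age_days > 365:
        return "tape"
    if f.age_days > 30 and current_pool != "pewter":
        return "pewter"
    return current_pool

if __name__ == "__main__":
    clip = FileInfo("lecture.mpg", age_days=0)
    report = FileInfo("results.txt", age_days=400)
    print(placement_pool(clip))                             # gold
    print(migration_pool(report, placement_pool(report)))   # tape
```

In GPFS the same intent is expressed declaratively as policy rules that the policy manager evaluates, so the administrator states the rules once and the file system applies them as files are created and as they age.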
SDSC-IBM StorCloud Challenge Demo – Grid Storage
- Grid sites:
  - DataStar, a 183-node IBM pSeries cluster at SDSC (compute nodes).
  - Mercury, an IBM IA64 cluster at NCSA (compute nodes).
  - GPFS 2.3 (multi-cluster); a 120-terabyte file system on IBM DS4300 (FAStT 600 Turbo) storage at SC04.
  - Sites connected by the 30 Gb/s Teragrid backbone WAN.
- Workflow:
  - Enzo simulates the evolution of the universe from the big bang to the present. Enzo runs best on the pSeries DataStar nodes at SDSC and writes its output, as it is produced, over the Teragrid to the GPFS StorCloud file system at SC04.
  - VISTA reads the Enzo data from the GPFS StorCloud file system and renders it into images that can be compressed into a QuickTime video. VISTA takes advantage of the cheaper IA64 nodes at NCSA, reading the Enzo results from the StorCloud file system and writing its output images there as well.
  - The resulting QuickTime movie is read from GPFS and displayed in the SDSC booth at the conference.
- [Diagram: GPFS clients in the IBM booth, GPFS NSD servers, and visualization in the SDSC/StorCloud booths, all reaching the 120 TB /gpfs/storcloud GPFS file system (60 DS4300 RAID controllers) over the StorCloud SAN.]

DEISA – European Science Grid (http://www.deisa.org/applications/deci.php)
- DEISA file system, logical view:
  - A global name space with transparent access across sites.
  - "Symmetric" access: equal performance everywhere via the 1 Gb/s GEANT network.
- DEISA file system, physical view:
  - FZJ Jülich (Germany): P690 (32-processor nodes) architecture incorporating 1,312 processors; peak performance 8.9 teraflops.
  - IDRIS-CNRS (France): mixed P690 and P655+ (4-processor nodes) architecture incorporating 1,024 processors; peak performance 6.7 teraflops.
  - RZG Garching (Germany): P690 architecture incorporating 896 processors; peak performance 4.6 teraflops.
  - CINECA (Italy): P690 architecture incorporating 512 processors; peak performance 2.6 teraflops.

DEISA – Management
- Each "core" partner initially contributes 10-15% of its computing capacity to a common resource pool. This pool benefits from the DEISA global file system.
- The sharing model is based on simple exchanges: on average, each partner recovers as much as it contributes. This leaves the partner organizations' different business models unchanged.
- The pool is dynamic: in 2005, computing nodes will be able to join or leave the pool in real time without disrupting the national services. The pool can therefore be reconfigured to match user requirements and application profiles.
- Each DEISA site is a fully independent administration domain with its own AAA policies. The "dedicated" network connects computing nodes, not sites. A network of trust will be established to operate the pool.

Issues
- Data integrity: systems must ensure that data is valid at creation and that its integrity is maintained.
- Data locality: it is difficult to predict where data will be needed and used.
- Data movement: a GRID-like infrastructure to move data as needed. Move it close to the biggest users? Analyze access patterns.
- Data manipulation.
- Data annotation.
- Data availability.

Solution Elements
- Storage/storage controller: redundancy, caching, management.
- A file system: scalable, providing appropriate performance.

Issues (continued)
- Data manipulation: What can you do with a 100 GB file? Common tools to analytically view and understand data; extract information and correlations.
- Data annotation: What is the data? What is its quality? What is its source?
  - Examples: the Mars rover (feet vs. meters), a climate study (temperature bias).
- Data availability: anywhere/anytime.

Solutions
- Global file systems: a first step toward name mapping and distributed access.
- Embedded analysis via the controller: storage controllers are basically specialized computers; analysis locality.
- Data annotation: XML self-describing data, metadata, RDBMS.
- Storage GRIDs: access for collaborators; a global data dictionary.

The 2010 HPC file system
- Wide striping for data rates scaling to 1 TB/sec is basic. Metadata requires a heavy write cache or solid-state disk.
- Access beyond the compute cluster at the speeds delivered by network vendors. GPFS multi-cluster begins this process for IBM; pNFS is also working in this direction.
- The file system will automatically adapt to degradation in the network and the storage.
- The file system will provide improved facilities for HSM, backup and other utility vendors to selectively move files.
- Better search algorithms will be implemented to find the data you need. This will be joint work between the file system and external search capabilities or databases.

pNFS – Parallel NFS
- An extension to NFSv4 to support parallel access; it allows transparent load balancing across multiple servers.
- A metadata server handles namespace operations; data servers handle read/write operations.
- The pNFS metadata server and data servers are layered on top of a GPFS cluster.
- Working with the University of Michigan and others on Linux pNFS on top of GPFS.
- [Diagram: Linux NFSv4 + pNFS clients get a layout from the NFSv4 metadata server, then perform reads and writes either directly to storage devices via a storage protocol or through NFSv4 data servers; a management protocol connects the metadata server, the data servers and the storage.]
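A minimal sketch of the pNFS split described above, with the layout format, stripe size and server names invented for illustration (this is not the NFSv4.1 wire protocol): the client fetches a layout from the metadata server once, then reads byte ranges directly, and in parallel, from the data servers the layout names.

```python
# Toy illustration of the pNFS split between a metadata server (namespace,
# layouts) and data servers (reads/writes). Layout format, server names and
# striping are illustrative assumptions.
from dataclasses import dataclass

STRIPE_SIZE = 1 << 20         # 1 MiB stripes (hypothetical)

@dataclass
class Layout:
    data_servers: list[str]   # servers holding the file's stripes, round-robin

class MetadataServer:
    """Handles namespace operations and hands out layouts."""
    def __init__(self, layouts: dict[str, Layout]):
        self._layouts = layouts

    def get_layout(self, path: str) -> Layout:
        return self._layouts[path]

def servers_for_read(layout: Layout, offset: int, length: int) -> list[str]:
    """Which data servers the client contacts directly for this byte range."""
    first = offset // STRIPE_SIZE
    last = (offset + length - 1) // STRIPE_SIZE
    n = len(layout.data_servers)
    return [layout.data_servers[s % n] for s in range(first, last + 1)]

if __name__ == "__main__":
    mds = MetadataServer({"/gpfs/bigfile": Layout(["ds1", "ds2", "ds3", "ds4"])})
    layout = mds.get_layout("/gpfs/bigfile")   # one namespace round-trip
    # An 8 MiB read is served by all four data servers in parallel,
    # bypassing the metadata server on the data path.
    print(servers_for_read(layout, 0, 8 * STRIPE_SIZE))
```

The value of the split is that namespace traffic stays on one server while the data path fans out across many, which is how a single NFS namespace can sit on top of a striped back end such as GPFS.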
In 2010
- Disk drives in excess of 1 TB; drive transfer rates increase somewhat because of greater density.
- Disk connection networks of 10 Gb/sec or more.
- General networks of 30 Gb/sec or more (12x IB) that extend over greater distances; the network is becoming less of a limitation.
- Storage will centralize for management reasons.
- Processor trends continue.
- Taken together, this:
  - Enables larger and faster file systems.
  - Requires better sharing of data; standard NFS over TCP will be only part of the answer.
  - Makes data-center sharing of data through high-speed networks common.
  - Requires better management of the components and better robustness.
  - Brings new applications that involve the collection and search of large amounts of data.

Data Analytics
- Analytics is the intersection of visualization, analysis, scientific data management, human-computer interfaces, cognitive science, statistical analysis, reasoning, ...
- All sciences need to find, access, store and understand information.
- In some sciences, the data management (and analysis) challenge already exceeds the compute-power challenge.
- The ability to tame a tidal wave of information will distinguish the most successful scientific, commercial and national-security endeavors. It is the limiting, or the enabling, factor for a wide range of sciences.
- Analyzing data to find meaningful information requires substantially more computing power and more intelligent data handling: bioinformatics, financial, climate, materials.

Distributed Data Generators
- Real-time data creation: cameras, telescopes, satellites, sensors, weather stations, simulations.
- Old way: capture all data for later analysis.
- New way: analysis at the data's origin/creation. Embed intelligent systems into sensors and analyze data early in its life cycle.

Summary
- Data is going to keep increasing.
- Smarter methods are needed to extract information from data.
- Full-blown collaborative infrastructures are needed.

© 2006 IBM Corporation