Welcome and Cyberinfrastructure Overview
MSI Cyberinfrastructure Institute, June 26-30, 2006
Anke Kamrath
Division Director, San Diego Supercomputer Center
kamratha@sdsc.edu

The Digital World
Entertainment, shopping, information.

Science is a Team Sport
Data management and mining; modeling and simulation. Example domains and codes: astronomy, geosciences, life sciences, physics, GAMESS, QCD.

Cyberinfrastructure – A Unifying Concept
Cyberinfrastructure = resources (computers, data storage, networks, scientific instruments, experts, etc.) + "glue" (integrating software, systems, and organizations).
NSF's "Atkins Report" provided a compelling vision for integrated Cyberinfrastructure.

A Deluge of Data
• Today data comes from everywhere:
  • "Volunteer" data
  • Scientific instruments
  • Experiments
  • Sensors and sensornets
  • Computer simulations
  • New devices (personal digital devices, computer-enabled clothing, cars, …)
• And is used by everyone:
  • Researchers, educators
  • Consumers
  • Practitioners
  • General public
[Diagram: data flowing from sensors, instruments, simulations, analysis, and volunteers to users.]
Turning the deluge of data into usable information for the research and education community requires an unprecedented level of integration, globalization, scale, and access.

Using Data as a Driver: SDSC Cyberinfrastructure
SDSC's data cyberinfrastructure spans:
• Community databases and data collections; data management, mining, and preservation
• Data-oriented HPC resources, high-end storage, and large-scale data analysis, simulation, and modeling
• Data-oriented tools, software applications, and community codes (e.g., Biology Workbench, SRB)
• Data- and computational science education and training (e.g., the Summer Institute)
• Collaboration, service, and community leadership for data-oriented projects

Impact on Technology: Data and Storage are Integral to Today's Information Infrastructure
• Today's "computer" is a coordinated set of hardware, software, and services providing an "end-to-end" resource.
• Cyberinfrastructure captures how the research and education community has redefined "computer".
[Diagram: wireless sensors, field instruments, field computers, networks, data storage, computers, and visualization linked into one end-to-end system.]
Data and storage are an integral part of today's "computer".

Building a National Data Cyberinfrastructure Center
Goal: SDSC's data cyberinfrastructure should "extend the reach" of the local research and education environment:
• Access to community and reference data collections
• More capable and/or higher-capacity computational resources
• Community codes, middleware, software tools and toolkits
• Multi-disciplinary expertise
• Long-term scientific data preservation

Impact on Applications: Data-oriented Research Driving the Next Generation of Technology Challenges
[Chart: applications plotted by data intensity (more BYTES) versus compute intensity (more FLOPS); data-oriented research applications, traditional HPC applications, and home, lab, campus, and desktop applications occupy different regions of the spectrum.]

Today's Research Applications Span the Spectrum
[Chart: example applications spanning the spectrum from data management and extreme I/O environments to data-oriented, home/lab/campus/desktop, and traditional HPC environments, plotted by data intensity (more BYTES) versus compute intensity (more FLOPS). Examples include SCEC visualization, SCEC climate simulation, EOL, NVO, ENZO simulation and visualization, GridSAT, CiPres, SETI@home, turbulence field, MCell, protein folding/MD, CPMD, QCD, GAMESS, EverQuest, and turbulence reattachment length. Annotations mark which applications lend themselves to the Grid, which could be targeted efficiently on the Grid, and which are difficult to target efficiently on the Grid.]

Working with Compute and Data – Simulation, Analysis, Modeling
Simulation of a magnitude 7.7 earthquake on the southern (lower) San Andreas Fault:
• Physics-based dynamic source model – a simulated mesh of 1.8 billion cubes with a spatial resolution of 200 m
• Builds on 10 years of data and models from the Southern California Earthquake Center
• Simulated the first 3 minutes of a magnitude 7.7 earthquake: 22,728 time steps of 0.011 seconds each
• The simulation generates 45+ TB of data

Resources required:
• Computers and systems
  • 80,000 hours on DataStar
  • 256 GB-memory p690 used for testing, p655s used for the production run, TeraGrid used for porting
  • 30 TB of GPFS (global parallel file system) storage at run time
  • 100 MB/s data transfer from GPFS to SAM-QFS
  • 27,000 hours of postprocessing for high-resolution rendering
• People
  • 20+ people for IT support
  • 20+ people in domain research
• Storage
  • SAM-QFS archival storage
  • HPSS backup
  • SRB collection with 1,000,000 files

Big Data & Big Compute: Simulating an earthquake 1
1. Divide up Southern California into "blocks".
2. For each block, get all the data on ground surface composition, geological structures, fault information, etc.
[Image: the southern San Andreas Fault.]

Big Data & Big Compute: Simulating an earthquake 2
3. Map the blocks onto the processors (brains) of the computer.
[Image: SDSC's DataStar – one of the 25 fastest computers in the world.]

Big Data & Big Compute: Simulating an earthquake 3
4. Run the simulation using current information on fault activity and the physics of earthquakes.

Big Data & Big Compute: Simulating an earthquake 4 – Managing the data
5. The simulation outputs data on seismic wave velocity, earthquake magnitude, and other characteristics.
Where to store the data?
• In HPSS, a tape storage library that can hold 10 PetaBytes (10,000 TeraBytes) -- 500 times the printed materials in the Library of Congress.
How much data was output?
• 47 TeraBytes (see the back-of-envelope sketch below), which is:
  • 2+ times the printed materials in the Library of Congress! or
  • The amount of music in 2000+ iPods! or
  • 47 million copies of a typical DVD movie!
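A quick back-of-envelope check of these figures, using only the numbers quoted on the slides above (22,728 time steps of 0.011 seconds and roughly 45-47 TB of output). The Python below is illustrative arithmetic only, not part of the TeraShake codes.

```python
# Back-of-envelope arithmetic using the TeraShake figures quoted above.
# Illustrative only: these are the slide's numbers, not measured values.

time_steps = 22_728        # number of simulation time steps (from the slide)
dt_seconds = 0.011         # seconds of ground motion per time step (from the slide)
output_bytes = 47e12       # total output; the slides quote 45-47 TB

simulated_seconds = time_steps * dt_seconds
per_step_bytes = output_bytes / time_steps

print(f"Simulated ground motion: ~{simulated_seconds:.0f} seconds")     # ~250 s
print(f"Average output per time step: ~{per_step_bytes / 1e9:.1f} GB")  # ~2.1 GB
```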
How long will TeraShake take on your desktop computer?
• Desktop: 1 processor, ~5.3 billion floating-point (arithmetic) operations per second; can run TeraShake in 72 centuries!
• DataStar at SDSC: 1,024 processors (240 used for TeraShake), ~10.4 trillion floating-point operations per second (approximate); about 5 days.

Better Neurosurgery Through Cyberinfrastructure
Radiologists and neurosurgeons at Brigham and Women's Hospital, Harvard Medical School, are exploring transmission of 30-40 MB brain images (generated during surgery) to SDSC for analysis and alignment.
• PROBLEM: neurosurgeons seek to remove as much tumor tissue as possible while minimizing removal of healthy brain tissue.
• The brain deforms during surgery.
• The preoperative brain image must be aligned with intra-operative images to give surgeons the best opportunity for intra-surgical navigation.
• Transmission is repeated every hour during a 6-8 hour surgery; transmission and output must take on the order of minutes (a rough time-budget sketch follows).
• A finite element simulation on a biomechanical model of volumetric deformation is performed at SDSC; the output results are sent back to BWH, where updated images are shown to the surgeons.
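To make "on the order of minutes" concrete, here is a minimal time-budget sketch. The 30-40 MB image size and the hourly cadence come from the slide; the link throughput and the overall budget are hypothetical assumptions for illustration, not measured properties of the BWH-SDSC connection.

```python
# Rough time budget for one intra-operative update cycle.
# Image size comes from the slide; throughput and budget are ASSUMED values.

image_mb = 40.0           # intra-operative brain image, ~30-40 MB (from the slide)
link_mb_per_s = 10.0      # ASSUMED effective network throughput in MB/s (hypothetical)
budget_s = 5 * 60         # ASSUMED "order of minutes" budget: 5 minutes

transfer_s = image_mb / link_mb_per_s     # one-way image transfer time
round_trip_s = 2 * transfer_s             # image to SDSC plus results back to BWH
solver_s = budget_s - round_trip_s        # time left for the finite element solve

print(f"Round-trip transfer: ~{round_trip_s:.0f} s")
print(f"Remaining for the deformation solve: ~{solver_s:.0f} s of a {budget_s} s budget")
```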
Community Data Repository: SDSC DataCentral
• Provides "data allocations" on SDSC resources to the national science and engineering community:
  • Data collection and database hosting
  • Batch-oriented access
  • Collection management services
• First broad program of its kind to support research and community data collections and databases.
• Comprehensive resources:
  • Disk: 400 TB, accessible via HPC systems, Web, SRB, GridFTP
  • Databases: DB2, Oracle, MySQL
  • SRB: collection management
  • Tape: 6 PB, accessible via file system, HPSS, Web, SRB, GridFTP
  • 24/7 operations, collection specialists
• Example allocated data collections include:
  • Bee Behavior (Behavioral Science)
  • C5 Landscape DB (Art)
  • Molecular Recognition Database (Pharmaceutical Sciences)
  • LIDAR (Geoscience)
  • AMANDA (Physics)
  • SIO_Explorer (Oceanography)
  • Tsunami and Landsat Data (Earthquake Engineering)
  • Terabridge (Structural Engineering)
• DataCentral infrastructure includes: web-based portal, security, networking, UPS systems, web services, and software tools.

Public Data Collections Hosted in SDSC's DataCentral
• Seismology: 3D Ground Motion Collection for the LA Basin
• Atmospheric Sciences: 50 year Downscaling of Global Analysis over California Region
• Earth Sciences: NEXRAD Data in Hydrometeorology and Hydrology
• Life Sciences: Protein Data Bank
• Neurobiology: Salk data
• Geosciences: GEON
• Seismology: SCEC TeraShake
• Geosciences: GEON-LIDAR
• Seismology: SCEC CyberShake
• Geochemistry: Kd
• Oceanography: SIO Explorer
• Biology: Gene Ontology
• Networking: Skitter
• Astronomy: Sloan Digital Sky Survey
• Geochemistry: GERM
• Networking: HPWREN
• Geology: Sensitive Species Map Server
• Ecology: HyperLter
• Geology: SD and Tijuana Watershed data
• Elementary Particle Physics: AMANDA data
• Biology: AfCS Molecule Pages
• Biomedical Neuroscience: BIRN
• Networking: IMDC
• Oceanography: Seamount Catalogue
• Networking: Backbone Header Traces
• Biology: Interpro Mirror
• Oceanography: Seamounts Online
• Networking: Backscatter Data
• Biology: JCSG Data
• Biodiversity: WhyWhere
• Biology: Bee Behavior
• Government: Library of Congress Data
• Ocean Sciences: Southeastern Coastal Ocean Observing and Prediction Data
• Geophysics: Magnetics Information Consortium data
• Structural Engineering: TeraBridge
• Biology: Biocyc (SRI)
• Art: C5 Landscape Database
• Geology: Chronos
• Biology: CKAAPS
• Biology: DigEmbryo
• Education: UC Merced Japanese Art Collections
• Various: TeraGrid data collections
• Geochemistry: NAVDAT
• Biology: Transporter Classification Database
• Earth Science Education: ERESE
• Earthquake Engineering: NEESIT data
• Biology: TreeBase
• Earth Sciences: UCI ESMF
• Art: Tsunami Data
• Education: NSDL
• Education: ArtStor
• Astronomy: NVO
• Biology: Yeast regulatory network
• Earth Sciences: EarthRef.org
• Earth Sciences: ERDA
• Earth Sciences: ERR
• Government: NARA
• Biology: Apoptosis Database
• Biology: Encyclopedia of Life
• Anthropology: GAPP
• Cosmology: LUSciD

Data Cyberinfrastructure Requires a Coordinated Approach
Layers (top to bottom), which must be integrated and interoperable: applications (medical informatics, biosciences, ecoinformatics, …); visualization; data mining, simulation modeling, analysis, and data fusion; knowledge-based integration and advanced query processing; grid storage, filesystems, and database systems; high-speed networking, networked storage (SAN), HPC, storage hardware, sensornets, and instruments.
The questions at each level:
• How do we represent data, information, and knowledge to the user?
• How do we detect trends and relationships in data?
• How do we combine data, knowledge, and information management with simulation and modeling?
• How do we obtain usable information from data?
• How do we collect, access, and organize data?
• How do we configure computer architectures to optimally support data-oriented computing?

Working with Data: Data Integration for New Discovery
Data integration in the biosciences: users rely on software to access data and software to federate data across disciplinary databases spanning anatomy, physiology, cell biology, proteomics, genomics, and medicinal chemistry (organisms, organs, cells, organelles, biopolymers, atoms), a complex "multiple-worlds" mediation problem.
Data integration in the geosciences: combining geologic, geochemical, geophysical, geochronologic, and foliation maps to answer questions such as: Where can we most safely build a nuclear waste dump? Where should we drill for oil? What is the distribution and U/Pb zircon ages of A-type plutons in Virginia, and how does it relate to host rock structures?

Preserving Data over the Long-Term

Data Preservation
• Many science, cultural, and official collections must be sustained for the foreseeable future.
• Critical collections must be preserved:
  • community reference data collections (e.g., the Protein Data Bank)
  • irreplaceable collections (e.g., field data such as tsunami reconnaissance)
  • longitudinal data (e.g., PSID – the Panel Study of Income Dynamics)
• No plan for preservation often means that data is lost or damaged.
"…the progress of science and useful arts … depends on the reliable preservation of knowledge and information for generations to come." ("Preserving Our Digital Heritage", Library of Congress)

How much Digital Data? (rough/average estimates)
Scale: Kilo = 10^3, Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18.
• 1 low-resolution photo = 100 KiloBytes
• 1 novel = 1 MegaByte
• iPod Shuffle (up to 120 songs) = 512 MegaBytes
• Printed materials in the Library of Congress = 10 TeraBytes
• 1 human brain at the micron level = 1 PetaByte
• SDSC HPSS tape archive = 6 PetaBytes
• All worldwide information in one year = 2 ExaBytes

Key Challenges for Digital Preservation
• What should we preserve?
  • What materials must be "rescued"?
  • How do we plan for preservation of materials by design?
• How should we preserve it? (see the fixity-check sketch below)
  • Formats
  • Storage media
  • Stewardship – who is responsible?
  • Print media provides easy access for long periods of time but is hard to data-mine; digital media is easier to data-mine but requires management of the evolution of media and resource planning over time.
• Who should pay for preservation? The content generators? The government? The users?
• Who should have access?
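One practical piece of the "how should we preserve it?" question is detecting the silent corruption catalogued on the next slide. Below is a minimal fixity-check sketch: record SHA-256 checksums for a collection, then re-verify them later to flag corrupted or missing files. It is a generic illustration, not the mechanism SRB or HPSS actually uses, and the directory and manifest paths are hypothetical.

```python
# Minimal fixity-checking sketch: record checksums for a collection, then
# re-verify them later to detect corrupted or missing files. Illustrative only;
# this is not the actual SRB/HPSS mechanism, and the paths below are hypothetical.

import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_manifest(collection: Path, manifest: Path) -> None:
    """Write a {relative_path: digest} manifest covering every file in the collection."""
    digests = {str(p.relative_to(collection)): sha256_of(p)
               for p in collection.rglob("*") if p.is_file()}
    manifest.write_text(json.dumps(digests, indent=2))

def verify_manifest(collection: Path, manifest: Path) -> list[str]:
    """Return the files whose current digest no longer matches the manifest."""
    expected = json.loads(manifest.read_text())
    return [name for name, digest in expected.items()
            if not (collection / name).is_file()
            or sha256_of(collection / name) != digest]

if __name__ == "__main__":
    collection = Path("collection_data")       # hypothetical collection directory
    manifest = Path("fixity_manifest.json")    # hypothetical manifest location
    record_manifest(collection, manifest)
    print("Changed or missing files:", verify_manifest(collection, manifest))
```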
What can go wrong
Entity at risk, problem, and approximate frequency:
• File: corrupted media, disk failure (roughly every year)
• Tape (+): simultaneous failure of 2 copies (roughly every 5 years)
• System (+): systemic errors in vendor software, a malicious user, or operator error that deletes multiple copies (roughly every 15 years)
• Archive (+): natural disaster, obsolescence of standards (every 50-100 years)

SDSC Cyberinfrastructure Community Resources

COMPUTE SYSTEMS
• DataStar
  • 2,396 Power4+ processors in IBM p655 and p690 nodes
  • 10 TB total memory
  • Up to 2 GB/s I/O to disk
• TeraGrid Cluster
  • 512 Itanium2 IA-64 processors
  • 1 TB total memory
• Intimidata
  • Only academic IBM Blue Gene system
  • 2,048 PowerPC processors
  • 128 I/O nodes
• http://www.sdsc.edu/user_services/

DATA ENVIRONMENT
• 1 PB Storage Area Network (SAN)
• 10 PB StorageTek tape library
• DB2, Oracle, MySQL
• Storage Resource Broker (SRB)
• HPSS
• 72-CPU Sun Fire 15K
• 96-CPU IBM p690s
• Support for 60+ community data collections and databases
• Data management, mining, analysis, and preservation
• http://datacentral.sdsc.edu/

SCIENCE AND TECHNOLOGY STAFF, SOFTWARE, SERVICES
• User services
• Application/community collaborations
• Education and training
• SDSC Synthesis Center
• Community software, toolkits, portals, codes
• http://www.sdsc.edu/

Thank You
kamratha@sdsc.edu
www.sdsc.edu