A Grand Challenge for the Information Age
Dr. Francine Berman
Director, San Diego Supercomputer Center
Professor and High Performance Computing Endowed Chair, UC San Diego

The Fundamental Driver of the Information Age is Digital Data
• Education
• Entertainment
• Shopping
• Health Information
• Business

Digital Data Critical for Research and Education
• Data from multiple sources in the Geosciences
• Data at multiple scales in the Biosciences
[Figure: geosciences data integration – users access disciplinary databases (geologic map, geochemical, geophysical, geochronologic, and foliation map data) through data integration to answer questions such as "Where should we drill for oil?", "What is the impact of global warming?", and "How are the continents shifting?"]
[Figure: biosciences data spans multiple scales – atoms, biopolymers, organelles, cells, organs, organisms – and disciplines from genomics, proteomics, and medicinal chemistry to cell biology, physiology, and anatomy, requiring complex "multiple-worlds" mediation to answer questions such as "What genes are associated with cancer?" and "What parts of the brain are responsible for Alzheimer's?"]

Today's Presentation
• Data Cyberinfrastructure Today – designing and developing infrastructure to enable today's data-oriented applications
• Challenges in Building and Delivering Capable Data Infrastructure
• Sustainable Digital Preservation – a grand challenge for the Information Age

Data Cyberinfrastructure Today – Designing and Developing Infrastructure for Today's Data-Oriented Applications

Today's Data-oriented Applications Span the Spectrum
• Data and High Performance Computing
• Data and Grids
• Data Grid Applications
• Data and Cyberinfrastructure Services
[Figure: designing infrastructure for data – applications plotted against DATA (more bytes) and COMPUTE (more FLOPS), with NETWORK (more bandwidth) as a third dimension. Home, lab, campus, and desktop applications sit low on both axes; compute-intensive HPC applications sit high on compute; data-intensive applications and grid applications sit high on data; data-intensive and compute-intensive HPC applications sit high on both.]

Data and High Performance Computing
• For many applications, "balanced systems" are needed to support codes that are both data-intensive and compute-intensive. For such codes:
  • Grid platforms are not a strong option
  • I/O rates exceed WAN capabilities
  • Continuous and frequent I/O is latency intolerant
  • Data must be local to the computation
  • Scalability is key
  • High-bandwidth, large-capacity local parallel file systems and archival storage are needed

Earthquake Simulation at Petascale – better prediction accuracy creates greater data-intensive demands
Estimated figures for a simulated 240-second period, 100-hour run time:

                               TeraShake domain            PetaShake domain
                               (600x300x80 km^3)           (800x400x100 km^3)
  Fault system interaction     No                          Yes
  Inner scale                  200 m                       25 m
  Resolution of terrain grid   1.8 billion mesh points     2.0 trillion mesh points
  Magnitude of earthquake      7.7                         8.1
  Time steps                   20,000 (0.012 sec/step)     160,000 (0.0015 sec/step)
  Surface data                 1.1 TB                      1.2 PB
  Volume data                  43 TB                       4.9 PB

(A back-of-envelope check of the TeraShake figures follows below.)
Information courtesy of the Southern California Earthquake Center
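As a quick sanity check on the TeraShake column, the sketch below recomputes the mesh size and the surface-data volume from the domain dimensions, grid spacing, and time-step count. The assumption that the surface output is three velocity components stored as 4-byte floats per grid point per time step is mine, not from the slide; with it, the estimate lands close to the quoted 1.1 TB.

```python
# Back-of-envelope check of the TeraShake figures above. Assumption (mine, not
# from the slide): surface output is 3 velocity components per grid point per
# time step, stored as 4-byte floats.

dx = 200.0                        # inner scale / grid spacing, meters
lx, ly, lz = 600e3, 300e3, 80e3   # TeraShake domain (600 x 300 x 80 km), meters
steps = 20_000                    # time steps over the 240 s simulated period

nx, ny, nz = int(lx / dx), int(ly / dx), int(lz / dx)
print(f"volume mesh points: {nx * ny * nz:.3e}")          # ~1.8e9, matches "1.8 billion"

surface_bytes = nx * ny * steps * 3 * 4                   # points x steps x components x bytes
print(f"surface output: {surface_bytes / 1e12:.2f} TB")   # ~1.08 TB vs. quoted 1.1 TB
```

The volume-mesh count falls out of the domain size and inner scale alone, which is a useful cross-check that the quoted grid resolution and domain dimensions are consistent.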
Data and HPC: What You See Is What You've Measured
• FLOPS alone are not enough. Three systems using the same processor and number of processors (AMD Opteron, 64 processors, 2.2 GHz) differ only in the way the processors are interconnected:
  • Cray XD1 – custom interconnect
  • Dalco Linux Cluster – Quadrics interconnect
  • Sun Fire Cluster – Gigabit Ethernet interconnect
• HPC Challenge benchmarks measure different machine characteristics:
  • Linpack and matrix multiply are computationally intensive
  • PTRANS (matrix transpose), RandomAccess, bandwidth/latency tests, and other tests begin to reflect stress on the memory system
• Appropriate benchmarks are needed to rank, and bring visibility to, the more balanced machines that are critical for today's applications
Information courtesy of Jack Dongarra

Data and Grids
• Data applications were some of the first applications that
  • required Grid environments
  • could naturally tolerate longer latencies
• The Grid model supports key data application profiles:
  • Compute at site A with data from site B
  • Store a data collection at site A with copies at sites B and C
  • Operate an instrument at site A, move data to site B for storage, post-processing, etc.
• CERN data is providing a key driver for grid technologies

Data Services Key for TeraGrid Science Gateways
• Science Gateways provide a common application interface for science communities on the TeraGrid (e.g. NVO, LEAD, GridChem)
• Data services are key for Gateway communities:
  • Analysis
  • Visualization
  • Management
  • Remote access, etc.
Information and images courtesy of Nancy Wilkins-Diehr

Unifying Data over the Grid – the TeraGrid GPFS-WAN Effort
• User wish list:
  • Unlimited data capacity (everyone's aggregate storage almost looks like this)
  • Transparent, high-speed access anywhere on the Grid
  • Automatic archiving and retrieval
  • No latency
• The TeraGrid GPFS-WAN effort focuses on providing "infinite" (SDSC) storage over the grid:
  • Looks like local disk to grid sites
  • Uses automatic migration with a large cache to keep files always "online" and accessible (a conceptual sketch of this pattern follows below)
  • Data automatically archived without user intervention
Information courtesy of Phil Andrews
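The "looks like local disk, migrates automatically, stays online through a large cache" behavior described above is a hierarchical storage management pattern. The sketch below is a minimal conceptual illustration of that pattern only, not the actual GPFS-WAN or HPSS implementation; the MigratingCache and Archive classes and their methods are hypothetical.

```python
from collections import OrderedDict

class Archive:
    """Stand-in for the archival (tape) tier; hypothetical, in-memory."""
    def __init__(self):
        self._store = {}
    def put(self, name, data):
        self._store[name] = data
    def get(self, name):
        return self._store[name]

class MigratingCache:
    """Bounded 'online' disk cache in front of an archive. Evicted files stay
    visible to users: they are simply recalled from the archive on the next read."""
    def __init__(self, capacity_bytes, archive):
        self.capacity = capacity_bytes
        self.archive = archive
        self.cache = OrderedDict()          # name -> bytes, least recently used first

    def write(self, name, data):
        self.archive.put(name, data)        # every file archived without user action
        self._cache_put(name, data)

    def read(self, name):
        if name in self.cache:              # hit: served at local-disk speed
            self.cache.move_to_end(name)
            return self.cache[name]
        data = self.archive.get(name)       # miss: transparent recall from archive
        self._cache_put(name, data)
        return data

    def _cache_put(self, name, data):
        self.cache[name] = data
        self.cache.move_to_end(name)
        while sum(len(v) for v in self.cache.values()) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used file

# Example: a 1 MB cache can hold two of three 400 KB files at a time.
fs = MigratingCache(1_000_000, Archive())
for i in range(3):
    fs.write(f"file{i}", b"x" * 400_000)
print(len(fs.read("file0")))                # still readable; recalled from the archive
```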
Data Services – Beyond Storage to Use
What services do users want?
• How do I make sure that my data will be there when I want it?
• How should I organize my data?
• How should I display my data?
• How can I combine my data with my colleague's data?
• What are the trends, and what is the noise, in my data?
• My data is confidential; how do I make sure that it is seen and used only by the right people?
• How can I make my data accessible to my collaborators?

Services: Integrated Environment Key to Usability
[Figure: integrated infrastructure layers – many data sources (instruments, sensor nets, computers) feed data storage; data management (file systems, database systems, collection management, data integration); data manipulation; and data access for analysis, modeling, simulation, and visualization.]
• Integrated infrastructure services include:
  • Database selection and schema design
  • Portal creation and collection publication
  • Data analysis and data mining
  • Data hosting
  • Preservation services
  • Domain-specific tools (e.g. Biology Workbench, Montage for astronomy mosaicking, Kepler for workflow management)
  • Data visualization
  • Data anonymization, etc.

Data Hosting: SDSC DataCentral – A Comprehensive Facility for Research Data
• Broad program to support research and community data collections and databases (e.g. PDB – 28 TB)
• DataCentral services include:
  • Public data collection and database hosting
  • Long-term storage and preservation (tape and disk)
  • Remote data management and access (SRB, portals)
  • Data analysis, visualization, and data mining
  • Web-based portal access
  • Professional, qualified 24/7 support
• DataCentral resources include:
  • 1 PB of on-line disk
  • 25 PB of StorageTek tape library capacity
  • 540 TB storage-area network (SAN)
  • DB2, Oracle, MySQL
  • Storage Resource Broker
  • GPFS-WAN with 700 TB

DataCentral Allocated Collections include
Seismology – 3D Ground Motion Collection for the LA Basin; Atmospheric Sciences – 50-year Downscaling of Global Analysis over the California Region; Earth Sciences – NEXRAD Data in Hydrometeorology and Hydrology; Life Sciences – Protein Data Bank; Neurobiology – Salk data; Geosciences – GEON; Geosciences – GEON-LIDAR; Seismology – SCEC TeraShake; Seismology – SCEC CyberShake; Geochemistry – Kd; Geochemistry – GERM; Geochemistry – NAVDAT; Oceanography – SIO Explorer; Oceanography – Seamount Catalogue; Oceanography – Seamounts Online; Biology – Gene Ontology; Biology – AfCS Molecule Pages; Biology – Interpro Mirror; Biology – JCSG Data; Biology – Bee Behavior; Biology – Biocyc (SRI); Biology – CKAAPS; Biology – DigEmbryo; Biology – Transporter Classification Database; Biology – TreeBase; Biology – Yeast regulatory network; Biology – Apoptosis Database; Biology – Encyclopedia of Life; Networking – Skitter; Networking – HPWREN; Networking – IMDC; Networking – Backbone Header Traces; Networking – Backscatter Data; Astronomy – Sloan Digital Sky Survey; Astronomy – NVO; Geology – Sensitive Species Map Server; Geology – SD and Tijuana Watershed data; Geology – Chronos; Ecology – HyperLter; Biodiversity – WhyWhere; Elementary Particle Physics – AMANDA data; Biomedical Neuroscience – BIRN; Government – Library of Congress Data; Government – NARA; Ocean Sciences – Southeastern Coastal Ocean Observing and Prediction Data; Geophysics – Magnetics Information Consortium data; Structural Engineering – TeraBridge; Art – C5 Landscape Database; Art – Tsunami Data; Education – UC Merced Japanese Art Collections; Education – NSDL; Education – ArtStor; Earth Science Education – ERESE; Earthquake Engineering – NEESIT data; Earth Sciences – UCI ESMF; Earth Sciences – EarthRef.org; Earth Sciences – ERDA; Earth Sciences – ERR; Anthropology – GAPP; Cosmology – LUSciD; various TeraGrid data collections

Data Visualization is Key
• Visualization of cancer tumors
• SCEC earthquake simulations
• Prokudin-Gorskii historical images
Information and images courtesy of Amit Chourasia, SCEC, Steve Cutchin, Moores Cancer Center UCSD, David Minor, and the U.S. Library of Congress

Building and Delivering Capable Data Cyberinfrastructure

Infrastructure Should Be Non-memorable
• Good infrastructure should be:
  • Predictable
  • Pervasive
  • Cost-effective
  • Easy to use
  • Reliable
  • Unsurprising
• What's required to build and provide useful, usable, and capable data cyberinfrastructure?
Building Capable Data Cyberinfrastructure: Incorporating the "ilities"
• Scalability
• Interoperability
• Reliability
• Capability
• Sustainability
• Predictability
• Accessibility
• Responsibility
• Accountability
• …

Reliability
• How can we maximize data reliability?
  • Replication, UPS systems, heterogeneity, etc.
• How can we measure data reliability?
  • Network availability is quoted as 99.999% uptime ("5 nines")
  • What is the equivalent number of "9's" for data reliability?

Reliability: what can go wrong
  Entity at risk   What can go wrong                                             Frequency
  File             Corrupted media, disk failure                                 1 year
  Tape             + simultaneous failure of 2 copies                            5 years
  System           + systemic errors in vendor software, a malicious user, or
                   an operator error that deletes multiple copies                15 years
  Archive          + natural disaster, obsolescence of standards                 50-100 years
Information courtesy of Reagan Moore

Responsibility and Accountability
• What are reasonable expectations between users and repositories?
• What are reasonable expectations between federated partner repositories?
• Who owns the data? Who takes care of the data? Who pays for the data? Who can access the data?
• What are appropriate models for evaluating repositories?
• What incentives promote good stewardship?
• What should happen if/when the system fails?

Good Data Infrastructure Incurs Real Costs
• Capacity costs
  • SDSC research collections have been doubling every 15 months; SDSC storage is 25 PB and counting
  • Data comes from supercomputer simulations, digital library collections, etc.
  • The most valuable data must be replicated
  [Figure: SDSC archival storage in TB (stored and planned capacity), June 1997 – June 2009, on a log scale; the growth model assumes capacity doubles roughly every 15.2 months over 8 years.]
• Capability costs
  • Reliability is increased by up-to-date, robust hardware and software for:
    • Replication (to disk, to tape, and geographically)
    • Backups, updates, syncing
    • Audit trails
    • Verification through checksums across physical media, network transfers, copies, etc. (a minimal fixity-check sketch follows below)
  • Data professionals are needed for:
    • Infrastructure maintenance
    • Long-term planning
    • Restoration and recovery
    • Access, analysis, preservation, and other services
    • Reporting, documentation, etc.
Information courtesy of Richard Moore
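One concrete piece of the "verification through checksums" bullet is a periodic fixity audit: recompute each replica's checksum and compare it with the digest recorded at ingest. The sketch below is a minimal, generic version of such an audit, assuming SHA-256 and a simple site-to-path mapping; it is not the tooling SDSC or SRB actually uses, and the file paths and function names are hypothetical.

```python
import hashlib
from pathlib import Path

def sha256(path, chunk_size=1 << 20):
    """Stream a file from disk and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fixity_audit(replicas, expected_digest):
    """Check every replica of one logical file against the digest recorded at
    ingest. `replicas` maps a site name to that site's local path (hypothetical)."""
    report = {}
    for site, path in replicas.items():
        if not Path(path).exists():
            report[site] = "MISSING - re-replicate from a verified copy"
        elif sha256(path) != expected_digest:
            report[site] = "CORRUPT - re-replicate from a verified copy"
        else:
            report[site] = "ok"
    return report

# Hypothetical usage for one file replicated at two sites:
# report = fixity_audit({"SDSC": "/gpfs/coll/img_0001.tif",
#                        "NCAR": "/archive/coll/img_0001.tif"},
#                       expected_digest="9f2c...")
```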
Economic Sustainability
• Making infinite funding finite: it is difficult to support infrastructure for data preservation as an infinite, increasing mortgage
• Creative partnerships help create sustainable economic models:
  • Relay funding
  • User fees, recharges
  • Consortium support
  • Endowments
  • Hybrid solutions
[Image: Geisel Library at UCSD]

Preserving Digital Information Over the Long Term

How Much Digital Data Is There?
(Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12, Peta 10^15, Exa 10^18, Zetta 10^21)
• 5 exabytes of digital information produced in 2003
• 161 exabytes of digital information produced in 2006
  • 25% of the 2006 digital universe is "born digital" (digital pictures, keystrokes, phone calls, etc.)
  • 75% is replicated (e-mails forwarded, backed-up transaction records, movies in DVD format)
• 1 zettabyte of aggregate digital information projected for 2010
• For scale: 1 novel = 1 megabyte; an iPod (up to 20K songs) = 80 GB; the SDSC HPSS tape archive = 25+ petabytes; the U.S. Library of Congress manages 295 TB of digital data, 230 TB of which is "born digital"
Source: "The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010," IDC whitepaper, March 2007

How Much Storage Is There?
• 2007 is the "crossover year" in which the amount of digital information produced exceeds the amount of available storage
• Given the projected rates of growth, we will never again have enough space for all digital information
Source: "The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010," IDC whitepaper, March 2007

Focus for Preservation: the "Most Valuable" Data
• What is "valuable"?
  • Community reference data collections (e.g. UniProt, PDB)
  • Irreplaceable collections
  • Official collections (e.g. census data, electronic federal records)
  • Collections that are very expensive to replicate (e.g. CERN data)
  • Longitudinal and historical data
  • and others…
[Figure: the value and cost of a collection plotted over time]

A Framework for Digital Stewardship
• Preservation efforts should focus on collections deemed "most valuable"
• Key issues: What do we preserve? How do we guard against data loss? Who is responsible? Who pays? Etc.
The Data Pyramid – digital data collections and the repositories/facilities that hold them; value, risk/responsibility, trust, stability, and infrastructure all increase toward the top:
• National/international scale: reference, nationally important, and irreplaceable data collections, held by national- and international-scale repositories, archives, and libraries
• "Regional" scale: key research and community data collections, held by "regional"-scale libraries and targeted data centers
• Local scale: personal data collections and private repositories

Digital Collections of Community Value
• Key techniques for preservation: replication and heterogeneous support across the levels of the Data Pyramid (national/international, "regional", and local scale)

A Conceptual Model for Preservation Data Grids: The Chronopolis Model
• A geographically distributed preservation data grid that supports long-term management of, stewardship of, and access to digital collections
• Implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure
• Integrates targeted technology forecasting and migration to support long-term life-cycle management and preservation
[Figure: the Chronopolis model – a distributed production preservation environment for digital information of long-term value, supported by technology forecasting and migration and by administration, policy, and outreach.]

Chronopolis Focus Areas and Demonstration Project Partners
• Chronopolis R&D, policy, and infrastructure focus areas:
  • Assessment of the needs of potential user communities and development of appropriate service models
  • Development of formal roles and responsibilities of providers, partners, and users
  • Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.
  • Development of appropriate cost and risk models for long-term preservation
  • Development of appropriate success metrics to evaluate the usefulness, reliability, and usability of the infrastructure
• Two prototypes: a National Demonstration Project and a Library of Congress Pilot Project
• Partners include SDSC/UCSD, the UCSD Libraries, U Maryland, NCAR, NARA, the Library of Congress, NSF, ICPSR, the Internet Archive, and NVO
Demonstration Project information courtesy of Robert McDonald

National Demonstration Project – Large-scale Replication and Distribution
• Focus on supporting multiple, geographically distributed copies of preservation collections within the Chronopolis Federation architecture:
  • "Bright copy" – the Chronopolis site supports ingestion, collection management, and user access
  • "Dim copy" – the Chronopolis site supports a remote replica of the bright copy and supports user access
  • "Dark copy" – the Chronopolis site supports a reference copy that may be used for disaster recovery, with no user access
• Each site may play different roles for different collections (an illustrative sketch of this role assignment follows below)
[Figure: for collection C1, SDSC holds the bright copy, NCAR the dim copy, and U Md the dark copy; for collection C2, U Md holds the bright copy, SDSC the dim copy, and NCAR the dark copy.]
• Demonstration collections included:
  • National Virtual Observatory (NVO) [1 TB Digital Palomar Observatory Sky Survey]
  • A copy of Interuniversity Consortium for Political and Social Research (ICPSR) data [1 TB of web-accessible data]
  • NCAR observational data [3 TB of observational and re-analysis data]
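The bright/dim/dark arrangement above can be thought of as a small table mapping each (collection, site) pair to a role, and each role to the access it allows. The sketch below illustrates that bookkeeping for the two demonstration collections; the role-to-capability mapping and function names are illustrative assumptions, not Chronopolis software.

```python
# Illustrative bookkeeping only; capability sets and names are assumptions.

ROLE_CAPABILITIES = {
    "bright": {"ingest", "collection_management", "user_access"},
    "dim":    {"user_access"},                # remote replica, still readable
    "dark":   {"disaster_recovery"},          # reference copy, no user access
}

# Role assignment from the figure above: each collection has one bright, one
# dim, and one dark copy, spread across the three demonstration sites.
ASSIGNMENT = {
    "C1": {"SDSC": "bright", "NCAR": "dim", "UMd": "dark"},
    "C2": {"UMd": "bright", "SDSC": "dim", "NCAR": "dark"},
}

def sites_offering(collection, capability):
    """Return the sites where `capability` is available for `collection`."""
    return sorted(site for site, role in ASSIGNMENT[collection].items()
                  if capability in ROLE_CAPABILITIES[role])

print(sites_offering("C1", "user_access"))        # ['NCAR', 'SDSC']
print(sites_offering("C2", "disaster_recovery"))  # ['NCAR']
```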
SDSC/UCSD Libraries Pilot Project with the U.S. Library of Congress
• Goal: to "… demonstrate the feasibility and performance of current approaches for a production digital Data Center to support the Library of Congress' requirements."
• Historically important 600 GB Library of Congress image collection: the Prokudin-Gorskii Photographs (Library of Congress Prints and Photographs Division), http://www.loc.gov/exhibits/empire/ (also a collection of web crawls from the Internet Archive)
  • Images over 100 years old, with red, blue, and green components kept as separate digital files
• SDSC stores 5 copies, with a dark archival copy at NCAR
• The infrastructure must support an idiosyncratic file structure; special logging and monitoring software was developed so that both SDSC and the Library of Congress could access the information
Library of Congress Pilot Project information courtesy of David Minor

Pilot Projects Provided Invaluable Experience with Key Issues
• Infrastructure issues: What kinds of resources (servers, storage, networks) are required? How should they operate?
• Technical issues: How to address integrity, verification, provenance, authentication, etc.?
• Legal/policy issues: Who is responsible? Who is liable?
• Evaluation issues: What is reliable? What is successful?
• Social issues: What formats and standards are acceptable to the community? How do we formalize trust?
• Cost issues: What is cost-effective? How can support be sustained over time?

It's Hard to Be Successful in the Information Age without Reliable, Persistent Information
• Inadequate/unrealistic general solution: "Let X do it," where X is:
  • The government
  • The libraries
  • The archivists
  • Google
  • The private sector
  • Data owners
  • Data generators, etc.
• Creative partnerships are needed to provide preservation solutions with:
  • Trusted stewards
  • Feasible costs for users
  • Sustainable costs for infrastructure
  • Very low risk of data loss, etc.

Blue Ribbon Task Force to Focus on Economic Sustainability
• An international Blue Ribbon Task Force (BRTF-SDPA) will begin in 2008 to study issues of economic sustainability of digital preservation and access
• Support from:
  • National Science Foundation
  • Library of Congress
  • Mellon Foundation
  • Joint Information Systems Committee
  • National Archives and Records Administration
  • Council on Library and Information Resources
[Image: digital preservation stakeholders surrounding the user – university, college, state, federal, local, non-profit, and commercial. Image courtesy of Chris Greer, Office of Cyberinfrastructure]

BRTF-SDPA
Charge to the Task Force:
1. To conduct a comprehensive analysis of previous and current efforts to develop and/or implement models for sustainable digital information preservation (first-year report)
2. To identify and evaluate best practices regarding sustainable digital preservation among existing collections, repositories, and analogous enterprises
3. To make specific recommendations for actions that will catalyze the development of sustainable resource strategies for the reliable preservation of digital information (second-year report)
4. To provide a research agenda to organize and motivate future work
How you can be involved:
• Contribute your ideas (oral and written "testimony")
• Suggest readings (the website will serve as a community bibliography)
• Write an article on the issues for a new community (an important component will be educating decision makers and the public about digital preservation)
The website will be launched this fall and will be linked from www.sdsc.edu

Many Thanks
• Phil Andrews, Reagan Moore, Ian Foster, Jack Dongarra, the authors of the IDC report, Ben Tolo, Richard Moore, David Moore, Robert McDonald, the Southern California Earthquake Center, David Minor, Amit Chourasia, the U.S. Library of Congress, Moores Cancer Center, the National Archives and Records Administration, NSF, Chris Greer, Nancy Wilkins-Diehr, and many others…
www.sdsc.edu
berman@sdsc.edu