The Future of Scientific Computing at Harvard
Alyssa A. Goodman, Professor of Astronomy; Director, Initiative in Innovative Computing

“The Heavy Red Bag”
How can computers advance (my) science?
A new collaborative scientific initiative at Harvard.

Computational challenges are common across scientific disciplines
How to:
• Acquire, transmit, organize, and query new kinds of data?
• Apply distributed computing resources to solve complex problems?
• Derive meaningful insight from large datasets?
• Share, integrate, and analyze knowledge across geographically dispersed researchers?
• Visually represent scientific results so as to maximize understanding?
Opportunity to collaborate and apply insights from one field to another

Filling the “Gap” between Science and Computer Science
Scientific disciplines:
• Increasingly, core problems in science require computational solutions
• Typically hire/“home grow” computationalists, but often lack the expertise or funding to go beyond the immediate pressing need
Computer Science departments:
• Focused on finding elegant solutions to basic computer science challenges
• Often see specific, “applied” problems as outside their interests

“Workflow” & “Continuum”

Workflow Examples
• “Collect”: Astronomy: Telescope; Public Health: Microscope, Stethoscope, Survey
• COLLECT: Astronomy: “National Virtual Observatory”/COMPLETE; Public Health: CDC Wonder
• “Analyze”: Astronomy: Study the density structure of a star-forming glob of gas; Public Health: Find a link between one factory’s chlorine runoff & disease
• ANALYZE: Astronomy: Study the density structure of all star-forming gas in…; Public Health: Study the toxic effects of chlorine runoff in the U.S.
• “Collaborate”: Work with your student
• COLLABORATE: Work with 20 people in 5 countries, in real-time
• “Respond”: Write a paper for a journal.
• RESPOND: Write a paper, the quantitative results of which are shared globally, digitally.

Workflow
IIC contact: AG, FAS

Workflow a.k.a.
The Scientific Method (in the Age of High-Speed Networks, Fast Processors, Mass Storage, and Miniature Devices)
IIC contact: Matt Welsh, FAS

Workflow: The Harvard Virtual Brain
Establishing a Harvard-wide Neuroscience Infrastructure
• Data Acquisition: MRI, PET, Microscopy, etc.
• Distributed Data Storage
• Data Processing: Analysis, Visualization, Integration, etc.
• Information Access: Query, Statistical Analysis, Knowledge Management, etc.
Participants: Faculty of Arts and Sciences (Harvard College, Division of Engineering); Harvard School of Public Health; Faculty of Medicine (Harvard Medical School, Affiliated Teaching Hospitals)
[Figure: BWH/MGH and UCSD Data. Left Hippocampal Volume vs. CVLT Discriminability Score.]
IIC contact: David Kennedy, HMS/MGH

Harvard IIC

New technologies for measurement and simulation are transforming the “workflow.”
Biomedicine: pre-genomics
• Manual/low throughput
• Solitary
• Limited by two hands
• Analog
Biomedicine: genomics era
• High throughput
• Automated/networked
• Highly scalable
• Digital

Continuum
“Computational Science” Missing at Most Universities
“Pure” Discipline Science (e.g. Galileo) … “Pure” Computer Science (e.g. Turing)

Workflow & Continuum
For any particular scientific investigation: Where does, and could, “computational science” make improvements in this cycle?

Harvard Public Health “NOW” (Oct. 2004)
"In the past, experiments did not involve such large data sets," observed Dyann Wirth, professor of infectious diseases in the Department of Immunology and Infectious Diseases and member of the advisory group for the core. "There has been a dramatic change in the past five to 10 years in the amount and availability of genomic data [or the DNA sequences themselves] and functional genomic data, [or the sequences’ purpose]."
In the past five years alone, the genomes of humans, rats, and the malaria parasite Plasmodium falciparum have been published, for example.
"One of the purposes of bioinformatics is to reduce the number of experiments that need to be done to achieve reliable information," said L.J. Wei, professor of biostatistics in the Department of Biostatistics and member of the advisory group for the core. "However, an issue right now is that there are huge data sets that can be run through different kinds of software programs, ending up with many data points. Unless we understand and use bioinformatics well, we may not even know which of those data points are important."

Filling the “computational science” gap: IIC
• Problem-driven approach …focusing effort on solving problems that will have the greatest impact & educational value
• Collaborative projects …combining disciplinary knowledge with computer science expertise
• Interdisciplinary effort …to ensure that best practices are shared across fields and that new tools and methodologies will be broadly applicable
• Links with industry …to draw on and learn from experience in applied computation
• Institutional funding …to ensure effort is directed towards key needs and not driven solely by narrow priorities of funding agencies

IIC at Harvard

Numerical Simulation of Star Formation
• MHD turbulence gives “t=0” conditions; Jeans mass = 1 Msun
• 50 Msun, 0.38 pc, navg = 3 × 10^5 ptcls/cc
• forms ~50 objects
• T = 10 K
• SPH, no B or L, G
• movie = 1.4 free-fall times
Bate, Bonnell & Bromm 2002 (UKAFF)
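The “1.4 free-fall times” quoted for the star-formation movie can be put in years with a short calculation. The sketch below is illustrative only and is not taken from the simulation code: it assumes the cloud is a uniform sphere, that the quoted 0.38 pc is its diameter, and it uses the standard free-fall time t_ff = sqrt(3π / (32 G ρ)). The function names and rounded constants are placeholders chosen for this example.

```python
import math

# Physical constants in SI units (rounded values).
G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30     # solar mass, kg
PARSEC = 3.0857e16   # parsec, m
YEAR = 3.15576e7     # Julian year, s


def mean_density(mass_msun, diameter_pc):
    """Mean mass density (kg/m^3) of a uniform sphere.

    Assumption: the slide's size is treated as the cloud diameter.
    """
    radius_m = 0.5 * diameter_pc * PARSEC
    volume = (4.0 / 3.0) * math.pi * radius_m ** 3
    return mass_msun * M_SUN / volume


def free_fall_time_years(density):
    """Free-fall time t_ff = sqrt(3*pi / (32*G*rho)), returned in years."""
    return math.sqrt(3.0 * math.pi / (32.0 * G * density)) / YEAR


rho = mean_density(50.0, 0.38)       # 50 Msun, 0.38 pc from the slide
t_ff = free_fall_time_years(rho)
print(f"mean density   ~ {rho:.2e} kg/m^3")
print(f"free-fall time ~ {t_ff:.2e} yr")
print(f"movie duration ~ {1.4 * t_ff:.2e} yr (1.4 free-fall times)")
```

Under these assumptions the free-fall time comes out at roughly two hundred thousand years. A different geometry, or reading 0.38 pc as a radius, would change the number.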
Simulations & Public Health

Goal: Statistical Comparison of “Real” and “Synthesized” Star Formation
Figure based on work of Padoan, Nordlund, Juvela, et al. Excerpt from realization used in Padoan & Goodman 2002.

Measuring Motions: Molecular Line Maps

Spectral Line Observations

Radio Spectral-line Observations of Interstellar Clouds

Radio Spectral-Line Survey
Alves, Lada & Lada 1999

Velocity from Spectroscopy
[Figure: Observed Spectrum (Telescope, Spectrometer). Intensity vs. “Velocity”.]
All thanks to Doppler

[Figure: COMPLETE/FCRAO W(13CO); Barnard’s Perseus]

“Astronomical Medicine”
Excerpts from Junior Thesis of Michelle Borkin (Harvard College); IIC contacts: AG (FAS) & Michael Halle (HMS/BWH/SPL)
[Figures: IC 348, Before “Medical Treatment” and After “Medical Treatment”]

3D Slicer Demo (available after talk)
IIC contacts: Michael Halle & Ron Kikinis

IIC: Five Research Branches
• Visualization: Physically meaningful combination of diverse data types.
• Distributed Computing: e-Science aspects of large collaborations. Sharing of data and computational resources and tools in real-time.
• Databases/Provenance: Management, and rapid retrieval, of data. “Research reproducibility” …where did the data come from? How?
• Analysis & Simulations: Development of efficient algorithms. Cross-disciplinary comparative tools (e.g. statistical).
• Instrumentation: Improved data acquisition. Novel hardware approaches (e.g. GPUs, sensors).
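The “Velocity from Spectroscopy” slides earlier in the talk rest on the Doppler shift: a spectral line observed away from its rest frequency implies motion along the line of sight. The sketch below shows the conversion in the radio convention, v = c (ν0 − ν) / ν0. It is an illustration, not the survey's actual pipeline. The function name is a placeholder, and the rest frequency used (about 110.201 GHz, the 13CO J=1-0 line) is an assumption chosen to match the 13CO map mentioned above.

```python
C_KM_S = 299_792.458  # speed of light, km/s

# Assumed rest frequency of the 13CO J=1-0 line, in Hz (illustrative value).
REST_13CO_HZ = 110.201354e9


def radio_velocity_km_s(observed_hz, rest_hz=REST_13CO_HZ):
    """Line-of-sight velocity in the radio convention.

    Positive values mean the gas is receding (line shifted to lower frequency).
    """
    return C_KM_S * (rest_hz - observed_hz) / rest_hz


def observed_frequency_hz(velocity_km_s, rest_hz=REST_13CO_HZ):
    """Inverse of radio_velocity_km_s: frequency at which the line appears."""
    return rest_hz * (1.0 - velocity_km_s / C_KM_S)


# Each spectrometer channel maps to one velocity; a few hypothetical channels:
for offset_khz in (-500.0, 0.0, 500.0):
    freq = REST_13CO_HZ + offset_khz * 1e3
    print(f"offset {offset_khz:+7.1f} kHz -> v = {radio_velocity_km_s(freq):+.2f} km/s")
```

Applying the conversion to every channel of every spectrum in a map is what turns a stack of spectra into the position-position-velocity cubes shown in the molecular line maps.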
IIC: Innovative Organizational Model
• Staffing: Highly accomplished academics and senior experts whose careers have been primarily in industry, working together
• Promotion/career path: Criteria for promotion will give equal weight to scholarly activities and to technological invention
• Culture: No “class” distinctions made between teaching and nonteaching faculty, scientists and engineers, artists and designers working in the visualization program

How IIC will Function: Overview
IIC Objectives
• Project selection: Identify and fund projects that are likeliest to have the greatest and broadest impact
• Project execution: Pursue projects in a way that will yield the best outcome, enable shared learning, etc.
• Dissemination of knowledge: Enable new research for a specific scientific discipline. Generate new computational tools for broader application.

Project Selection
• Project proposals. Who participates: Any Harvard researcher (e.g., in genomics, fluid dynamics, epidemiology, neuroscience, nanoscience, comp bio, chemical biology, optics, geology, astronomy, quantum mechanics, etc.). Role: Submit proposal in response to call for ideas.
• Program Advisory Committee. Who participates: Harvard researchers representing broad interests of IIC stakeholders, plus IIC Director & Dir. of Research. Role: Evaluate/rank proposals for scientific merit: should this be a priority for IIC?
• IIC Management Team. Who participates: IIC Director; Dirs. of Res. & Adm/Ops; Heads of IIC branches. Role: Evaluate/prioritize proposals according to technical feasibility, assess resource needs.

Project Execution
IIC Project Team C, etc.
IIC Project Team B
IIC Project Team A
• Project Manager: Responsible for project execution and metrics for tracking progress/performance; interfaces with IIC branch heads
• Discipline scientists: Scientists who “own” the problem and are committed to working with IIC staff to tackle it
• IIC staff: IIC staff scientists assigned to work on project by relevant IIC branch heads. The same IIC staff member may serve on multiple IIC project teams

Dissemination of Knowledge
• Communities of practice: Internal... External…
• Seminars/colloquia
• Publications: Scientific journals, IIC white papers
• Knowledge management system: New tools, IIC process

Education is central to IIC’s mission
At Harvard:
• Undergraduate & graduate courses focused on “data-intensive science”
• New graduate certificate program, within existing Ph.D. programs
• Research opportunities at undergraduate, graduate, and postdoctoral levels
Beyond Harvard:
• New museum, highlighting the kind of science done at the IIC

IIC organization: research and education
[Organization chart: Provost; Dean, Physical Sciences; Assoc Provost; IIC Director; Dir of Admin & Operations; Dir of Research; Dir of Education & Outreach; Assoc Dir, Instrumentation; Assoc Dir, Visualization; Assoc Dir, Databases/Data Provenance; Assoc Dir, Analysis & Simulation; Assoc Dir, Distributed Computing; Project 1 (Proj Mgr 1), Project 2 (Proj Mgr 2), Project 3 (Proj Mgr 3), etc.; CIO (systems); Knowledge mgmt; Education & Outreach staff]

IIC: Examples
• Visualization: Physically meaningful combination of diverse data types.
• Distributed Computing: e-Science aspects of large collaborations. Sharing of data and computational resources and tools in real-time.
• Databases/Provenance: Management, and rapid retrieval, of data. “Research reproducibility” …where did the data come from? How?
• Analysis & Simulations: Development of efficient algorithms. Cross-disciplinary comparative tools (e.g. statistical).
• Instrumentation: Improved data acquisition. Novel hardware approaches (e.g. GPUs, sensors).

Visualization: 3D Slicer (BWH Surgical Planning Lab)
IIC contacts: Michael Halle & Ron Kikinis

“Image and Meaning” (Visualization)
IIC contact: Felice Frankel (MIT)
Work: Garstecki/Whitesides (FAS)

Distributed Computing: Semantics, Ontologies
IIC contact: Tim Clark (HMS/MGH)

Distributed Computing & Large Databases: Large Synoptic Survey Telescope
• Optimized for time domain: scan mode, deep mode
• 7 square degree field
• 6.5m effective aperture
• 24th mag in 20 sec
• > 5 Tbyte/night
• Real-time analysis
• Simultaneous multiple science goals
IIC contact: Christopher Stubbs (FAS)

[Table: data and computing requirements. Astronomy: LSST, SDSS, 2MASS, MACHO, DLS. High Energy Physics: BaBar, Atlas, RHIC.]
• First year of operation: LSST 2011; SDSS 1998; 2MASS 2001; MACHO 1992; DLS 1999; BaBar 1998; Atlas 2007; RHIC 1999
• Run-time data rate to storage (MB/sec): LSST 5000 peak, 500 avg; SDSS 8.3; 2MASS 1; MACHO 1; DLS 2.7; BaBar 60 (zero-suppressed), 6*; Atlas 540*; RHIC 120* (’03), 250* (’04)
• Daily average data rate (TB/day): LSST 20; SDSS 0.02; 2MASS 0.016; MACHO 0.008; DLS 0.012; BaBar 0.6; Atlas 60.0; RHIC 3 (’03), 10 (’04)
• Annual data store (TB): LSST 2000; SDSS 3.6; 2MASS 6; MACHO 1; DLS 0.25; BaBar 300; Atlas 7000; RHIC 200 (’03), 500 (’04)
• Total data store capacity (TB): LSST 20,000 (10 yrs); SDSS 200; 2MASS 24.5; MACHO 8; DLS 2; BaBar 10,000; Atlas 100,000 (10 yrs); RHIC 10,000 (10 yrs)
• Peak computational load (GFLOPS): LSST 140,000; SDSS 100; 2MASS 11; MACHO 1.00; DLS 0.600; BaBar 2,000; Atlas 100,000; RHIC 3,000
• Average computational load (GFLOPS): LSST 140,000; SDSS 10; 2MASS 2; MACHO 0.700; DLS 0.030; BaBar 2,000; Atlas 100,000; RHIC 3,000
• Data release delay acceptable: LSST 1 day (moving), 3 months (static); SDSS 2 months; 2MASS 6 months; MACHO 1 year; DLS 6 hrs (trans), 1 yr (static); BaBar 1 day (max), <1 hr (typ); Atlas few days; RHIC 100 days
• Real-time alert of event: LSST 30 sec; SDSS none; 2MASS none; MACHO <1 hour; DLS 1 hr; BaBar none; Atlas none; RHIC none
• Type/number of processors: LSST TBD; SDSS 1GHz Xeon, 18; 2MASS 450MHz Sparc, 28; MACHO 60-70MHz Sparc, 10; DLS 500MHz Pentium, 5; BaBar Mixed, 5000; Atlas 20GHz, 10,000; RHIC Pentium, 2500

Analysis & Simulations
Figure based on work of Padoan, Nordlund, Juvela, et al. Excerpt from realization used in Padoan & Goodman 2002.
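To give a feel for the “> 5 Tbyte/night” figure on the Large Synoptic Survey Telescope slide, the sketch below converts a nightly data volume into a sustained transfer rate and a yearly total. It is back-of-the-envelope arithmetic only. The 10-hour observing night, the 300 observing nights per year, and the decimal definition of a terabyte are assumptions made for this example, not numbers from the talk, and since the slide gives 5 TB as a lower bound the results are lower bounds too.

```python
TB = 1e12  # bytes per terabyte (decimal definition, an assumption)
MB = 1e6   # bytes per megabyte


def sustained_rate_mb_per_s(tb_per_night, hours_per_night):
    """Average rate (MB/s) needed to move one night's data within the night."""
    seconds = hours_per_night * 3600.0
    return tb_per_night * TB / seconds / MB


def yearly_volume_tb(tb_per_night, nights_per_year):
    """Total volume (TB) accumulated over a year of observing nights."""
    return tb_per_night * nights_per_year


# Slide value: 5 TB/night. The night length and night count are hypothetical.
rate = sustained_rate_mb_per_s(5.0, hours_per_night=10.0)
volume = yearly_volume_tb(5.0, nights_per_year=300)
print(f"sustained rate >= {rate:.1f} MB/s")
print(f"yearly volume  >= {volume:.0f} TB")
```

Even under these mild assumptions the telescope has to write well over a hundred megabytes every second, all night, which is why the slide pairs the data volume with real-time analysis.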
Analysis & Simulations: Neural Net Models of Intelligence
Does Speed of Convergence in Neural Nets Predict Scores on Measures of “General Intelligence”?
Network Architecture
• (Asymmetric) Fully Connected Networks
  • Every node is connected to every other node
  • Connection may be excitatory (positive), inhibitory (negative), or irrelevant (≈ 0)
  • Most general
  • Symmetric fully connected nets: weights are symmetric (wij = wji)
• Input nodes: receive input from the environment
• Output nodes: send signals to the environment
• Hidden nodes: no direct interaction with the environment
Select from the lower 8 the one that completes the pattern in the top 9
IIC contact: Stephen Kosslyn (Psychology)

(Easier) Analysis of Large Data Sets: Mendelian Disease Genes
Large data files: reformat, merge, and filter
Can a biologist get from here to there? Without programming?
[Figure: OMIM on the genome. Chromosome vs. Position (MB). Location of every known disease gene on the human genome.]
IIC contact: Eitan Rubin (FAS/CGR)

Instrumentation
IIC contact: Matt Welsh, FAS

IIC: Mission
The Initiative in Innovative Computing (IIC) will make Harvard a world leader in the innovative and creative use of computational resources to address forefront scientific problems. We will focus on developing capabilities that are applicable to multiple disciplines, by undertaking specific, well-defined projects, thereby developing tools and approaches that can be generalized and shared. We will foster the flow of ideas and inventions along the continuum from basic science to scientific computation to computational science to computer science. We will train the next generation of creative and computationally capable scientists, build linkages to industry, and communicate with the public at large.

Why Here?
• Diverse group of senior faculty and accomplished scientists…
• …spanning a wide range of relevant disciplines, e.g., Computer science; Physics, Chemistry, Astronomy, Statistics, Biology, Medicine, etc.; Psychology, Graphic Design
• …with backgrounds in both academia and industry…
• …deeply committed to the vision of a collaborative approach to solving the most compelling computing challenges facing scientists today