An Overview of PTI at Indiana University
Beth Plale, PTI Managing Director
William K. Barnett, Director, Science Community Tools, Research Technologies
Robert H. McDonald, Associate Dean for Library Technologies
Mayfield Visit, 12.12.12

PTI Fact Sheet
• The Pervasive Technology Institute employs about 120 full-time employees
• At any one time, PTI has over 70 graduate research assistants engaged in research in one of the PTI centers
• The total value of PTI's active grants from external sources, as of 30 November 2012, is $72,641,407
• PTI outreach activities, which number over 100 every year, reach 10,000 people, the majority of them in Indiana

What Is Making Big Data?
• Some driving forces:
– Patient records growing fast (70 PB of pathology data)
– Network graphs from the Internet, leading to community detection
– Large Hadron Collider (Switzerland, physics): analysis is done mainly by creating histograms
– Commercial: Google and Bing run the largest data analytics operations in the world
– Time series: earthquakes, Twitter tweets, the stock market
– Image processing: from climate simulations to NASA to DoD to radiology
– Financial decision support: marketing; fraud detection; automatic preference detection (mapping users to books and films)
– From Professor Geoffrey Fox, Director of the Digital Science Center in PTI

PTI Big Data Strategy
• Developing new technology
– New system implementations
– New software and technology
• Training a 21st-century workforce
– People with strong analytical and technical skills in statistics and machine learning, who can analyze large volumes of data to derive business (and other) insights
– Data-savvy managers and analysts who have the skills to be effective consumers of big data insights, and who are capable of posing the right questions for analysis, interpreting and challenging the results, and making appropriate decisions
– Technology personnel who develop, implement, and maintain the hardware and software tools needed to make use of big data
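The histogram-style analysis mentioned above (the Large Hadron Collider bullet) amounts to binning a large stream of measurements into counts. A minimal sketch, using synthetic data rather than real detector output (real LHC analyses run in dedicated frameworks; this only illustrates the idea):

```python
# Bin a stream of event "energies" (synthetic Gaussian data here)
# into fixed-width counts -- the core of histogram-based analysis.
import random

def histogram(values, lo, hi, nbins):
    """Count how many values fall into each of nbins equal-width bins on [lo, hi)."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts

random.seed(42)
events = [random.gauss(100.0, 15.0) for _ in range(10_000)]  # fake measurements
counts = histogram(events, 50.0, 150.0, 10)
print(counts)  # counts cluster in the bins around the mean of 100
```

Reducing billions of events to a handful of bin counts is what makes this style of analysis tractable at LHC scale: the histogram, not the raw event stream, is what physicists inspect.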
Big Data Needs Big Storage
• Spreading data over many machines and servers lets them share the work
– Data storage, like computing, can take advantage of parallelism
• IU has operated the Data Capacitor since 2006, providing high-end performance for Big Data and Big Science
– IU has just announced an upgrade to 5 petabytes of storage; on CDs, that much data would make a stack 5 miles high

Big Data Needs Big Computation
• Big Red II is the first university-funded, university-owned supercomputer capable of 1 petaFLOPS (a thousand trillion mathematical operations per second)
– It would take one person, doing one calculation per second with a calculator, roughly 31 million years to do what Big Red II will be able to do in a second
• Big Red II and Data Capacitor II provide the system resources to address Big Data challenges

Networked Data Access
• Monon100: a 100-gigabit-per-second connection to Internet2 (Indiana was the first state to announce one!). It moves data FAST
• IU has leveraged its significant network expertise to make the Data Capacitor available to users at IU, in Indiana, nationally, and internationally
– High-performance networks combine with high-performance data storage to form a "data cloud" for Big Science
• To keep pace with the tremendous growth in data, we must stay on the cutting edge of computing, storage, and network technologies
– We can't sit still or we will be crushed by the data deluge!
• PTI is a leader in developing and integrating the latest approaches across these technology domains

Dealing with Big Data – the NSF DataNet Program
Motivation: "… one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams."
Response: DataNet creates "a set of exemplar national and global data research infrastructure organizations" to address this challenge.
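The scale claims in the storage and computation slides above can be sanity-checked with back-of-the-envelope arithmetic. The CD capacity (700 MB) and disc thickness (1.2 mm) are assumed values not stated in the deck:

```python
# Sanity-check the Data Capacitor II and Big Red II scale claims.
# Assumptions: 700 MB per CD, 1.2 mm per disc in a stack,
# one hand calculation per second, decimal petabytes.
PB = 10**15  # bytes

# 5 PB burned to CDs, stacked.
n_cds = 5 * PB / (700 * 10**6)
stack_miles = n_cds * 1.2e-3 / 1609.34  # mm thickness -> metres -> miles
# -> roughly 5.3 miles, matching the "stack 5 miles high" claim

# 1 petaFLOPS versus one calculation per second by hand.
seconds_per_year = 365.25 * 24 * 3600
years_by_hand = 10**15 / seconds_per_year
# -> roughly 31.7 million years to match one second of Big Red II

print(f"{stack_miles:.1f} miles, {years_by_hand / 1e6:.1f} million years")
```

The second figure is why the hand-calculator comparison works out to tens of millions of years: 10^15 operations divided by about 3.16 x 10^7 seconds per year.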
SEAD Approach to DataNet Challenges
SEAD Partners – http://sead-data.net
• Contribute infrastructure to the NSF DataNet vision that supports data access, sharing, reuse, and preservation for the long tail
• Develop a data access and preservation environment that supports the research, technical, and economic requirements for data management in the long tail
• Enable active and social curation
• Utilize emerging preservation and access infrastructures

SEAD Social Networking/Virtual Archive at IU
[Architecture diagram: a curator previews data in the Active Curation Repository (ACR UI), marks it for publication (accepting licensing terms), and ingests it into the Virtual Archive via a SWORD endpoint; users query the Virtual Archive (VA UI) for DOIs and metadata through a SPARQL query endpoint, which returns metadata and supports metadata update and view.]

RoCE [rok-ee] Demonstration at SC12
• At SC'12 we demonstrated a data system capable of moving enough data to stream ~1,000 high-definition Blu-ray movies at once
– This was possible previously, but our approach reduced the required server stack from 6 feet tall to about 9 inches tall, reducing power consumption and increasing efficiency
• We deployed this system in collaboration with Orange Telecom (the French telephone company), which offers service worldwide and fields a research office in San Francisco
– Many of their clients do video distribution, and our solution eliminates much of the customized, expensive, and power-hungry hardware they currently use

RoCE [rok-ee] Technology
• Our approach integrated years of experience tuning networks and filesystems with an emerging protocol called RoCE (pronounced "Rocky")
• RoCE eliminates many sources of overhead and inefficiency in the venerable Internet protocols
– If the Internet is like a highway full of cars and trucks, RoCE is like an Indy car pulling a semi trailer!
• The expertise required to tune and operate a system with Lustre and RoCE is significant, and we at IU were the first to demonstrate it working over a long distance
– We focus today on making it work for Big Data and Big Science, and tomorrow on automating it for a wider audience

NATIONAL CENTER FOR GENOME ANALYSIS SUPPORT (NCGAS)
• Sequencing a human genome cost $95M in 2001; now it costs $5,000
• Genomics is now part of most biology research and all disease research
• Sequences are huge – each one is 250 gigabytes (IU has the storage) – and they need supercomputers to analyze (IU has the supercomputers)
• Researchers don't know how to use supercomputers – we help them
• The National Science Foundation has provided $1.5M for us to support genomics analysis

FutureGrid
Motivation: FutureGrid will make it possible for researchers to conduct experiments by submitting an experiment plan that is then executed by a sophisticated workflow engine, preserving the provenance and state information necessary to allow reproducibility.
Response: The FutureGrid project provides a distributed test bed of networked HPC resources that makes it possible for researchers to tackle complex research challenges in computer science related to the use and security of grids and clouds.

What Does FutureGrid Offer?
• Traditional HPC and Grid computing support
• Cloud platforms – Nimbus, Eucalyptus, OpenStack
• GPU computing
• Dynamic provisioning through RAIN and RAIN-MOVE
– Image generation and registration
– Generic image repository
– Image deployment
• Experiment management
• Information services and performance tools
• Network device and virtual network tools
• A convenient portal for easy account and project management
• Help and support via a ticket system

Big Data & the 21st-Century Economy – PTI Creates High-Quality Jobs
• 1,108 person-years of employment supported by grants & contracts since 1999

Q & A on PTI

PTI Impact at IU