Infrastructure for Sharing Very Large Data Sets

Antonio M. Ferreira, PhD
Executive Director, Center for Simulation and Modeling
Research Associate Professor, Departments of Chemistry and Computational & Systems Biology
University of Pittsburgh
http://www.sam.pitt.edu

PARTS OF THE INFRASTRUCTURE PUZZLE
• Hardware
  • Networking
  • Storage
  • Compute
• Software
  • Beyond scp/rsync
  • Globus, gtdownload, bbcp, etc. (see the sketch below)
• Policies
  • Not all data is “free”
  • Access controls
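What separates these tools from scp/rsync is the model: a transfer becomes a managed, restartable task rather than an attended copy. Below is a minimal sketch of that model using the Globus Python SDK (globus_sdk). The client ID, endpoint UUIDs, and paths are placeholders, and the SDK itself postdates some of the tools named above, so read it as an illustration of the workflow rather than the specific tooling described in this talk.

```python
# Minimal sketch of a managed bulk transfer with the Globus Python SDK
# (pip install globus-sdk). CLIENT_ID, the endpoint UUIDs, and the paths
# are placeholders -- substitute a registered app and real endpoints.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"   # placeholder
SRC = "SOURCE-ENDPOINT-UUID"              # placeholder, e.g. a campus DTN
DST = "DESTINATION-ENDPOINT-UUID"         # placeholder, e.g. archive storage

# One-time interactive login via the native-app OAuth2 flow.
auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth.oauth2_start_flow()
print("Log in at:", auth.oauth2_get_authorize_url())
tokens = auth.oauth2_exchange_code_for_tokens(input("Auth code: ").strip())
access_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

# Submit the transfer as a task: the service handles parallel streams,
# retries, and checksum verification, which scp/rsync leave to the user.
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(access_token)
)
task = globus_sdk.TransferData(tc, SRC, DST,
                               label="bulk data set", sync_level="checksum")
task.add_item("/data/setA/", "/archive/setA/", recursive=True)
print("Task ID:", tc.submit_transfer(task)["task_id"])
```

Once submitted, the task survives client disconnects and endpoint restarts; that property, more than raw speed, is why such tools displace scp/rsync at the petabyte scale.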
THE “OLD” MODEL
[Diagram: the traditional memory hierarchy: CPU core; L1/L1i, L2, and L3 caches; the bus; main memory; and disk at the far end.]

NETWORK IS THE NEW BUS
[Diagram: the same hierarchy with the network inserted between main memory and disk, so remote data sits where local disk used to.]

DATA SOURCES AT PITT
• TCGA
  • Currently 1.1 PB, growing by ~50 TB/mo.
  • Pitt is the largest single contributor
• UPMC Hospital System
  • 27 individual hospitals generating clinical and genomic data
  • ~30,000 patients in BRCA alone
• LHC
  • Generates more than 10 PB/year
  • Pitt is a Tier 3 site

TCGA DATA BREAKDOWN

Cancer | Pitt Contribution | All Universities | Pitt's Percentage
Mesothelioma (MESO) | 9 | 37 | 24.32
Prostate adenocarcinoma (PRAD) | 95 | 427 | 22.25
Kidney renal clear cell carcinoma (KIRC) | 107 | 536 | 19.96
Head and Neck squamous cell carcinoma (HNSC) | 74 | 517 | 14.31
Breast Invasive Carcinoma (BRCA) | 149 | 1061 | 14.04
Ovarian serous cystadenocarcinoma (OV) | 63 | 597 | 10.55
Uterine Carcinosarcoma (UCS) | 6 | 57 | 10.53
Thyroid carcinoma (THCA) | 49 | 500 | 9.80
Skin Cutaneous Melanoma (SKCM) | 41 | 431 | 9.51
Bladder Urothelial Carcinoma (BLCA) | 23 | 268 | 8.58
Uterine Corpus Endometrial Carcinoma (UCEC) | 44 | 556 | 7.91
Lung adenocarcinoma (LUAD) | 31 | 500 | 6.20
Pancreatic adenocarcinoma (PAAD) | 7 | 113 | 6.19
Colon adenocarcinoma (COAD) | 21 | 449 | 4.68
Lung squamous cell carcinoma (LUSC) | 21 | 493 | 4.26
Stomach adenocarcinoma (STAD) | 15 | 373 | 4.02
Kidney renal papillary cell carcinoma (KIRP) | 9 | 227 | 3.96
Rectum adenocarcinoma (READ) | 6 | 169 | 3.55
Sarcoma (SARC) | 7 | 199 | 3.52
Pheochromocytoma and Paraganglioma (PCPG) | 4 | 179 | 2.23
Liver hepatocellular carcinoma (LIHC) | 3 | 240 | 1.25
Cervical Squamous cell carcinoma and endocervical adenocarcinoma (CESC) | 3 | 242 | 1.24
Esophageal carcinoma (ESCA) | 2 | 165 | 1.21
Adrenocortical Carcinoma (ACC) | 0 | 92 | 0.00
Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (DLBC) | 0 | 38 | 0.00
Glioblastoma multiforme (GBM) | 0 | 661 | 0.00
Kidney chromophobe (KICH) | 0 | 113 | 0.00
Acute Myeloid Leukemia (LAML) | 0 | 200 | 0.00
Brain Lower Grade Glioma (LGG) | 0 | 516 | 0.00

HOW QUICKLY DO YOU NEED YOUR DATA?
http://fasterdata.es.net/home/requirements-and-expectations
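The fasterdata page above tabulates how long common data-set sizes take to move at common network speeds. The same back-of-the-envelope arithmetic, applied to sizes quoted earlier in this talk, makes the requirement concrete (the helper function below is illustrative, not taken from fasterdata):

```python
# Sustained rate needed to move a data set of a given size within a
# given time window, in decimal units as used on fasterdata.es.net.
def required_gbps(size_tb: float, hours: float) -> float:
    bits = size_tb * 1e12 * 8            # terabytes -> bits
    return bits / (hours * 3600) / 1e9   # bits/s -> Gbit/s

# TCGA's ~50 TB monthly growth, fetched in a single day:
print(f"{required_gbps(50, 24):.2f} Gbps")        # 4.63 Gbps
# Re-replicating a 1 PB store within one week:
print(f"{required_gbps(1000, 7 * 24):.2f} Gbps")  # 13.23 Gbps
```

Protocol overhead and disk bottlenecks only push these numbers up, which is why ordinary routed campus networking cannot carry data sets at this scale.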
HOW DO WE LEVERAGE THIS ON CAMPUS?
http://noc.net.internet2.edu/i2network/maps-documentation/maps.html

SCIENCE DMZ
http://fasterdata.es.net/science-dmz/science-dmz-architecture/

AFTER THE DMZ
• Now that you have a DMZ, what’s next? It’s the last mile:
  • Relatively easy to bring 100 Gbps to the data center
  • It’s another thing entirely to deliver such speeds to clients (disk, compute, etc.)
• How do we address the challenge?
  • DCE (Data Center Ethernet) and IB (InfiniBand) are converging
  • Right now, a high-bandwidth network to storage is probably the best we can do
  • Select users and laboratories get 10 GE to their systems

CAMPUS 100GE NETWORKING

PITT/UPMC NETWORKING

BEYOND THE CAMPUS: XSEDE
• A single virtual system that scientists can use to interactively share computing resources, data, and expertise …
• The most advanced, powerful, and robust collection of integrated digital resources and services in the world
• 11 supercomputers and 3 dedicated visualization servers; over 2 PFLOPs of peak computational power
• Online training for XSEDE and general HPC topics
• Annual XSEDE conference
• Learn more at http://www.xsede.org

PSC/PITT STORAGE
http://www.psc.edu/index.php/research-programs/advanced-systems/data-exacell

SLASH2 ARCHITECTURE

AFTER THE DMZ (CONT.)
• Need the right file systems to back-end a DMZ
  • Lustre/GPFS
  • How do you pull data from the high-speed network?
  • Where will it land?
  • The DMZ explicitly avoids certain security restrictions
• Access controls
  • Genomics/bioinformatics is growing enormously
  • The DMZ is likely not HIPAA-compliant
  • Is it EPHI (electronic protected health information)?
  • Can we let it live with non-EPHI data?

CURRENT FILE SYSTEMS
• /home directories are traditional NFS
• SLASH2 filesystem for long-term storage
  • 1 PB of total storage
  • Accessible from both PSC and Pitt compute hardware
• Lustre for “active” data
  • 5 GB/s total throughput
  • 800 MB/s single-stream performance
• InfiniBand connectivity
  • Important for both compute and I/O
• Computing on distributed genomes
  • How do we make this work once we get the data?
  • Need the APIs
• Genomic data from UPMC
  • UPMC has the data collection
  • UPMC lacks HPC systems for analysis

INSTITUTE FOR PERSONALIZED MEDICINE
• Pitt/UPMC joint venture:
  • Drug Discovery Institute
  • Pitt Cancer Institute
  • UPMC Cancer Institute
  • UPMC Enterprise Analytics
• Improve patient care
• Discover novel uses for existing therapeutics
• Develop novel therapeutics
• Enable genomics-based research and treatment

WHAT IS PGRR?

What PGRR IS…
1. A common information technology framework for accessing deidentified national big data datasets that are important for Personalized Medicine
2. A portal that allows you to use this data easily with tools and resources provided by the Simulation and Modeling Center (SaM), Pittsburgh Supercomputing Center (PSC), and UPMC Enterprise Analytics (EA)
3. A managed environment to help you meet the information security and regulatory requirements for using this data
4. A process for helping you stay current about updates and modifications made to these datasets

What PGRR is not…
1. A place to store your individual research results
2. A system to access UPMC clinical data
3. A service for analyzing data on your behalf

PITTSBURGH GENOME RESOURCE REPOSITORY
[Architecture diagram. Recoverable labels: a TCGA source (e.g., NCI, CGHub) feeds BAM, non-BAM, and metadata streams into SLASH2 storage replicated across Pitt (IPM, UPCI) and PSC systems: Bl1 local (75 TB), Bl2 local (100 TB), Data Exacell (~100 TB*), Brashear (290 TB), database nodes, Blacklight (~8 TB*), Sherlock, Xyratex (240 TB), Panasas (40 TB), supercell (100 TB), and Frank; replication nodes n0-n3 plus an MDS; interconnects of InfiniBand, 1 Gbit (assumed), and 10 Gbit (throttled to 2 Gbit); pipeline codes, GO, the IPM Portal, and Pitt Virtuoso in front. *Growing to ~1 PB of BAM data and 33 TB of non-BAM data.]

HOW DO WE PROTECT DATA?
• Genomic data (~424 TB)
  • Deidentified genomic data
  • Patient genomic data from the UPMC system
• DUAs (Data Use Agreements)
  • Umbrella document signed by all Pitt/UPMC researchers
  • Required training for all users
  • Access restricted to DUA users only
• dbGaP (not HIPAA)
  • We host, but the user (via the DUA) is ultimately responsible for data protection

TCGA ACCESS RULES

CONTROLLING ACCESS

PGRR DATA NOTIFICATIONS

ACKNOWLEDGEMENTS
• Albert DeFusco (Pitt/SaM)
• Brian Stengel (Pitt/CSSD)
• Rebecca Jacobson (Pitt/DBMI)
• Adrian Lee (Pitt/Cancer Institute)
• J. Ray Scott (PSC)
• Jared Yanovich (PSC)
• Phil Blood (PSC)

CENTER FOR SIMULATION AND MODELING
Center for Simulation and Modeling (SaM)
326 Eberly
(412) 648-3094
http://www.sam.pitt.edu
• Co-directors: Ken Jordan & Karl Johnson
• Associate Director: Michael Barmada
• Executive Director: Antonio Ferreira
• Administrative Coordinator: Wendy Janocha
• Consultants: Albert DeFusco, Esteban Meneses, Patrick Pisciuneri, Kim Wong

Network Operations Center (NOC)
• RIDC Park
• Lou Passarello
• Jeff Raymond, Jeff White

Swanson School of Engineering (SSoE)
• Jeremy Dennis