Infrastructure for Sharing Very Large Data Sets

Antonio M. Ferreira, PhD
Executive Director, Center for Simulation and Modeling
Research Associate Professor Departments of Chemistry and
Computational & Systems Biology
University of Pittsburgh
http://www.sam.pitt.edu
PARTS OF THE INFRASTRUCTURE PUZZLE
• Hardware
• Networking
• Storage
• Compute
• Software
• Beyond scp/rsync
• Globus, gtdownload, bbcp, etc. (see the sketch below)
• Policies
• Not all data is “free”
• Access controls
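Globus (and similar tools such as bbcp) handles large transfers far better than scp or rsync: parallel streams, checksums, and automatic restart. A minimal sketch using the globus_sdk Python package; the token, endpoint UUIDs, and paths are hypothetical placeholders:

    import globus_sdk

    # All identifiers below are hypothetical placeholders: supply your own
    # OAuth2 transfer token and the UUIDs of your Globus endpoints.
    TRANSFER_TOKEN = "REPLACE_ME"
    SRC_ENDPOINT = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
    DST_ENDPOINT = "ffffffff-0000-1111-2222-333333333333"

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

    # Checksummed, restartable, recursive directory transfer
    tdata = globus_sdk.TransferData(
        tc, SRC_ENDPOINT, DST_ENDPOINT,
        label="TCGA sync", sync_level="checksum")
    tdata.add_item("/project/tcga/", "/archive/tcga/", recursive=True)

    task = tc.submit_transfer(tdata)
    print("Submitted transfer, task id:", task["task_id"])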
THE “OLD” MODEL
[Figure: the traditional storage hierarchy — CPU core → L1/L1i caches → L2 cache → L3 cache → bus → main memory → disk]
NETWORK IS THE NEW BUS
[Figure: the same hierarchy with the network inserted between main memory and disk — CPU core → L1/L1i caches → L2 cache → L3 cache → bus → main memory → network → disk]
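To make the point concrete, here is a small Python sketch comparing approximate access latencies across the tiers; the figures are typical published ballpark numbers, not measurements from these systems:

    # Typical published order-of-magnitude access latencies (nanoseconds).
    # Not measurements from these systems; exact values vary by hardware.
    latencies_ns = {
        "L1 cache":                     1,
        "L2 cache":                     4,
        "L3 cache":                    30,
        "main memory":                100,
        "datacenter network RTT": 500_000,     # ~0.5 ms
        "spinning disk seek":  10_000_000,     # ~10 ms
    }
    for tier, ns in latencies_ns.items():
        print(f"{tier:>24}: {ns / 1e3:>12,.1f} µs")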
DATA SOURCES AT PITT
• TCGA
• Currently 1.1 PB, growing by ~50 TB/mo. (projected below)
• Pitt is the largest single contributor
• UPMC Hospital System
• 27 individual hospitals generating clinical and genomic data
• ~30,000 patients in BRCA alone
• LHC
• Generates more than 10 PB/year
• Pitt is a Tier 3 site
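A quick linear extrapolation from the TCGA figures above shows why capacity planning is part of the puzzle; this is an illustration, not an official forecast:

    # Linear extrapolation of the TCGA mirror from the figures above:
    # 1.1 PB today, growing ~50 TB/month. Illustrative only.
    current_pb = 1.1
    growth_tb_per_month = 50

    for months in (6, 12, 24):
        total_pb = current_pb + growth_tb_per_month * months / 1000
        print(f"in {months:2d} months: ~{total_pb:.2f} PB")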
TCGA DATA BREAKDOWN
Cancer | Pitt contribution | All universities' contribution | Pitt's percentage
Mesothelioma (MESO) | 9 | 37 | 24.32
Prostate adenocarcinoma (PRAD) | 95 | 427 | 22.25
Kidney renal clear cell carcinoma (KIRC) | 107 | 536 | 19.96
Head and Neck squamous cell carcinoma (HNSC) | 74 | 517 | 14.31
Breast Invasive Carcinoma (BRCA) | 149 | 1061 | 14.04
Ovarian serous cystadenocarcinoma (OV) | 63 | 597 | 10.55
Uterine Carcinosarcoma (UCS) | 6 | 57 | 10.53
Thyroid carcinoma (THCA) | 49 | 500 | 9.80
Skin Cutaneous Melanoma (SKCM) | 41 | 431 | 9.51
Bladder Urothelial Carcinoma (BLCA) | 23 | 268 | 8.58
Uterine Corpus Endometrial Carcinoma (UCEC) | 44 | 556 | 7.91
Lung adenocarcinoma (LUAD) | 31 | 500 | 6.20
Pancreatic adenocarcinoma (PAAD) | 7 | 113 | 6.19
Colon adenocarcinoma (COAD) | 21 | 449 | 4.68
Lung squamous cell carcinoma (LUSC) | 21 | 493 | 4.26
Stomach adenocarcinoma (STAD) | 15 | 373 | 4.02
Kidney renal papillary cell carcinoma (KIRP) | 9 | 227 | 3.96
Rectum adenocarcinoma (READ) | 6 | 169 | 3.55
Sarcoma (SARC) | 7 | 199 | 3.52
Pheochromocytoma and Paraganglioma (PCPG) | 4 | 179 | 2.23
Liver hepatocellular carcinoma (LIHC) | 3 | 240 | 1.25
Cervical Squamous cell carcinoma and endocervical adenocarcinoma (CESC) | 3 | 242 | 1.24
Esophageal carcinoma (ESCA) | 2 | 165 | 1.21
Adrenocortical Carcinoma (ACC) | 0 | 92 | 0.00
Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (DLBC) | 0 | 38 | 0.00
Glioblastoma multiforme (GBM) | 0 | 661 | 0.00
Kidney chromophobe (KICH) | 0 | 113 | 0.00
Acute Myeloid Leukemia (LAML) | 0 | 200 | 0.00
Brain Lower Grade Glioma (LGG) | 0 | 516 | 0.00
HOW QUICKLY DO YOU NEED YOUR DATA?
http://fasterdata.es.net/home/requirements-and-expectations
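In the spirit of the ESnet expectations table linked above, a few lines of Python give back-of-the-envelope transfer times at common link speeds; these assume perfectly sustained line rate, which real transfers rarely achieve:

    # Idealized transfer times assuming perfectly sustained line rate;
    # real transfers see less due to protocol overhead and disk I/O.
    SIZES_TB = [1, 10, 100, 1000]      # up to 1 PB
    LINKS_GBPS = [1, 10, 40, 100]

    for size_tb in SIZES_TB:
        bits = size_tb * 1e12 * 8
        row = ", ".join(
            f"{bits / (gbps * 1e9) / 3600:8.1f} h @ {gbps:3d} Gbps"
            for gbps in LINKS_GBPS)
        print(f"{size_tb:5d} TB: {row}")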
HOW DO WE LEVERAGE THIS ON CAMPUS?
http://noc.net.internet2.edu/i2network/maps-documentation/maps.html
SCIENCE DMZ
http://fasterdata.es.net/science-dmz/science-dmz-architecture/
AFTER THE DMZ
• Now that you have a DMZ, what’s next?
• It’s the last mile
• Relatively easy to bring 100 Gbps to the data center
• It’s another thing entirely to deliver such speeds to clients (disk, compute, etc.)
• How do we address the challenge?
• DCE (Data Center Ethernet) and IB (InfiniBand) are converging
• Right now, a high-bandwidth network path to storage is probably the best we can do
• Select users and laboratories get 10 GE to their systems
CAMPUS 100GE NETWORKING
PITT/UPMC NETWORKING
BEYOND THE CAMPUS: XSEDE
Single virtual system that scientists can use to interactively share computing resources, data, and expertise…
• The most advanced, powerful, and robust collection of integrated digital resources and services in the world
• 11 supercomputers and 3 dedicated visualization servers; over 2 PFLOPS of peak computational power
• Online training for XSEDE and general HPC topics
• Annual XSEDE conference
Learn more at http://www.xsede.org
PSC/PITT STORAGE
http://www.psc.edu/index.php/research-programs/advanced-systems/data-exacell
SLASH2 ARCHITECTURE
AFTER THE DMZ (CONT.)
• Need the right file systems to back a DMZ
• Lustre/GPFS
• How do you pull data from the high-speed network?
• Where will it land?
• The DMZ explicitly avoids certain security restrictions
• Access Controls
• Genomics/bioinformatics is growing enormously
• The DMZ is likely not HIPAA-compliant
• Is it EPHI?
• Can we let it live with non-EPHI data?
CURRENT FILE SYSTEMS
• /home directories are traditional NFS
• SLASH2 filesystem for long-term storage
• 1 PB of total storage
• Accessible from both PSC and Pitt compute hardware
• Lustre for “active” data
• 5 GB/s total throughput
• 800 MB/s single-stream performance (see the sketch below)
• InfiniBand connectivity
• Important for both compute and I/O
• Computing on Distributed Genomes
• How do we make this work once we get the data?
• Need the APIs
• Genomic data from UPMC
• UPMC has the data collection
• UPMC lacks HPC systems for analysis
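The gap between single-stream and aggregate Lustre throughput matters when staging genomes. A minimal sketch of the arithmetic, using the figures quoted above:

    # Staging times implied by the Lustre numbers above:
    # 800 MB/s single stream vs. 5 GB/s aggregate throughput.
    SINGLE_STREAM = 0.8e9    # bytes/s
    AGGREGATE = 5.0e9        # bytes/s

    for size_tb in (1, 10, 50):
        size_bytes = size_tb * 1e12
        hours_single = size_bytes / SINGLE_STREAM / 3600
        hours_aggregate = size_bytes / AGGREGATE / 3600
        print(f"{size_tb:3d} TB: {hours_single:6.1f} h single stream, "
              f"{hours_aggregate:5.1f} h at full aggregate")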
INSTITUTE FOR PERSONALIZED MEDICINE
• Pitt/UPMC joint venture
• Drug Discovery Institute
• Pitt Cancer Institute
• UPMC Cancer Institute
• UPMC Enterprise Analytics
• Improve patient care
• Discover novel uses for existing therapeutics
• Develop novel therapeutics
• Enable genomics-based research and treatment
WHAT IS PGRR?
What PGRR is…
1. A common information technology framework for accessing deidentified national big data datasets that are important for Personalized Medicine
2. A portal that allows you to use this data easily with tools and resources provided by the Simulation and Modeling Center (SaM), Pittsburgh Supercomputing Center (PSC), and UPMC Enterprise Analytics (EA)
3. A managed environment to help you meet the information security and regulatory requirements for using this data
4. A process for helping you stay current about updates and modifications made to these datasets

What PGRR is not…
1. A place to store your individual research results
2. A system to access UPMC clinical data
3. A service for analyzing data on your behalf
PITTSBURGH GENOME RESOURCE REPOSITORY
[Architecture diagram: TCGA BAM, non-BAM, and metadata flow from sources (e.g., NCI, CGHub) via GO (Globus Online) to PSC and Pitt (IPM, UPCI). SLASH2-replicated storage with MDS and database nodes spans the Data Exacell (~100 TB), Brashear (290 TB), Blacklight (~8 TB, plus 75 TB and 100 TB local scratch), Xyratex (240 TB, BAM), Panasas (40 TB, non-BAM), and supercell (100 TB, non-BAM), alongside the pipeline codes. Connectivity is InfiniBand internally, a 1 Gbit link (assumed), and a 10 Gbit link throttled to 2 Gbit toward Pitt resources including Frank, Sherlock, the IPM Portal, and Virtuoso. *Growing to ~1 PB of BAM data and 33 TB of non-BAM data.]
HOW DO WE PROTECT DATA?
• Genomic data (~424 TB)
• Deidentified genomic data
• Patient genomic data from the UPMC system
• DUAs (Data Use Agreements)
• Umbrella document signed by all Pitt/UPMC researchers
• Required training for all users
• Access restricted to DUA users only (sketched below)
• dbGaP (not HIPAA)
• We host, but the user (via the DUA) is ultimately responsible for data protection
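Enforcing the DUA restriction can be as simple as POSIX group membership on the file systems. A minimal sketch, assuming a hypothetical pgrr-dua group that is populated only for trained DUA signatories:

    import getpass
    import grp
    import pwd

    # Hypothetical POSIX group populated only after a researcher has
    # signed the DUA and completed the required training.
    DUA_GROUP = "pgrr-dua"

    def user_may_access(username: str) -> bool:
        """Return True if the user belongs to the DUA group."""
        group = grp.getgrnam(DUA_GROUP)
        if username in group.gr_mem:                          # supplementary group
            return True
        return pwd.getpwnam(username).pw_gid == group.gr_gid  # primary group

    if __name__ == "__main__":
        user = getpass.getuser()
        print(f"{user} may access PGRR data: {user_may_access(user)}")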
TCGA ACCESS RULES
CONTROLLING ACCESS
PGRR DATA NOTIFICATIONS
ACKNOWLEDGEMENTS
• Albert DeFusco (Pitt/SaM)
• Brian Stengel (Pitt/CSSD)
• Rebecca Jacobson (Pitt/DBMI)
• Adrian Lee (Pitt/Cancer Institute)
• J. Ray Scott (PSC)
• Jared Yanovich (PSC)
• Phil Blood (PSC)
CENTER FOR SIMULATION AND MODELING
Center for Simulation and Modeling (SaM)
326 Eberly (412) 648-3094
http://www.sam.pitt.edu
• Co-directors: Ken Jordan & Karl Johnson
• Associate Director: Michael Barmada
• Executive Director: Antonio Ferreira
• Administrative Coordinator: Wendy Janocha
• Consultants: Albert DeFusco, Esteban Meneses, Patrick Pisciuneri, Kim Wong
Network Operations Center (NOC)
• RIDC Park
• Lou Passarello
• Jeff Raymond, Jeff White
Swanson School of Engineering (SSoE)
• Jeremy Dennis