Data - Indiana University

Empowering Bioinformatics Workflows Using the Lustre
Wide Area File System across a 100 Gigabit Network
Stephen Simms
Manager, High Performance File Systems
Indiana University
ssimms@indiana.edu
Today’s talk brought to you by NCGAS
• Funded by National Science Foundation
• Large memory clusters for assembly
• Bioinformatics consulting for biologists
• Optimized software for better efficiency
• Open for business at: http://ncgas.org
Data
• In the 21st Century everything is data
– Patient data
– Nutritional data
– Musical data
• Raw material for
– Scientific advancement
– Technological development
Better Technology = More Data
Better Telescopes
• ODI – One Degree Imager
– WIYN (Wisconsin, Indiana, Yale, NOAO) telescope in Arizona
– ODI will provide 1 billion pixels/image
One Degree Imager
32k x 32k CCD
• Pan-STARRS
– Providing 1.4 billion pixels/image
– Currently has over 1 Petabyte of images stored
Better Televisions
• Ultra High Definition Television (UHDTV)
– 16 times more pixels than HDTV
– Last month LG began sales of 84” UHDTV
– Tested at the 2012 Summer Olympics
– Storage media lags behind
Genomics
• Next Gen sequencers are generating more
data and getting cheaper
• Sequencing is:
– Becoming commoditized at large centers and
– Multiplying at individual labs
• Analytical capacity has not kept up
– Storage support
– Computational support (thousand points solution)
– Bioinformatics support
Data Capacitor
NSF Funded in 2005
535 Terabytes Lustre storage
(currently 1.1 PB)
24 Servers with 10Gb NICs
Short to mid-term storage
http://www.flickr.com/photos/shadowstorm/404158384/
http://www.flickr.com/photos/dvd5/163647219/
http://www.flickr.com/photos/vidiot/431357888/
The Lustre Filesystem
• Open Source
• Supports many thousands of client systems
• Supports petabytes of storage
• Over 240 GB/s measured throughput at ORNL
• Scalable
– aggregates separate servers for performance
– user specified “stripes”
• Standard POSIX interface
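How user-specified stripes spread one file across several servers can be sketched with a little arithmetic. The snippet below is a minimal illustration, assuming simple round-robin (RAID-0 style) striping; the function name `locate` and the example stripe size and count are hypothetical choices for this sketch, not values from the talk.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (stripe index, offset inside that stripe's object).

    Assumes plain round-robin striping: consecutive chunks of `stripe_size`
    bytes go to stripe 0, 1, ..., stripe_count - 1, then wrap around.
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe_size and stripe_count must be > 0")
    chunk = offset // stripe_size              # which chunk of the file holds this byte
    stripe_index = chunk % stripe_count        # which stripe (server object) holds that chunk
    full_rounds = chunk // stripe_count        # complete passes over all stripes so far
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return stripe_index, object_offset


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Hypothetical layout: 4 stripes of 1 MiB each.
    for off in (0, MIB, 4 * MIB + 5):
        print(off, locate(off, MIB, 4))
```

Because neighbouring chunks land on different servers, a large sequential read or write can keep several servers busy at once, which is where the aggregate performance comes from. On a real Lustre system the layout is chosen per file or directory with the `lfs setstripe` command.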
Lustre
Scalable Object Storage
MDS
metadata server
Client
OSS
object storage
server
Computation
Workflow - The Data Lifecycle
http://www.flickr.com/photos/davesag/4307240/in/set-799526/
Data Lifecycle – Centralized Storage
Compute
Resource #1
Data Source
Researcher's
Computer
Data
Capacitor
Compute
Resource #2
Tape
Archive
Visualization
Resource
NCGAS Cyberinfrastructure at IU
• Mason large memory cluster (512 GB/node)
• Quarry cluster (16 GB/node)
• Data Capacitor (1.1 PB)
• Research File System (RFS)
• Research Database Cluster for structured data
• Bioinformaticians and software engineers
Galaxy: Make it easier for Biologists
• Galaxy interface provides a
“user friendly” window to
NCGAS resources
• Supports many
bioinformatics tools
• Available for both research
and instruction.
[Figure: Computational Skills; labels: Common, Rare, LOW, HIGH]
GALAXY.IU.EDU Model
Virtual box hosting
Galaxy.IU.edu
Individual labs can get
duplicate boxes – provided
they support it themselves.
The host for each tool is
configured to meet IU needs
Quarry
A custom Galaxy
tool can be made
to import data
from the RFS to
the DC.
UITS/NCGAS
establishes tools,
hardens them, and
moves them into
production.
RFS
Mason
Data Capacitor
Policies on the DC
guarantee that untouched
data is removed with time.
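A purge policy of this kind can be sketched as a scan for files nobody has read or written recently. This is only an illustration: the slides do not give the retention window or the mechanism used on the DC, so the 60-day value and the names `is_stale` and `find_stale` are assumptions, and the sketch reports candidates without deleting anything.

```python
import os
import time

SECONDS_PER_DAY = 86400


def is_stale(last_touched, now, max_age_days):
    """True if the last access or modification is older than the window."""
    return (now - last_touched) > max_age_days * SECONDS_PER_DAY


def find_stale(root, max_age_days, now=None):
    """Return a sorted list of files under `root` untouched for `max_age_days`."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # "Touched" here means the later of last access and last modification.
            if is_stale(max(st.st_atime, st.st_mtime), now, max_age_days):
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical 60-day window applied to the current directory.
    for path in find_stale(".", max_age_days=60):
        print(path)
```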
Increasing DC’s Utility
• If we’re getting high speed performance across
campuses
– What could we do across longer distances?
• Empower geographically distributed workflows
• Facilitate data sharing among colleagues
• Provide data everywhere all the time
2006 - 10 Gb Lustre WAN
977 MB/s between ORNL and IU
Using a single Dell 2950 client
Across 10Gb TeraGrid connection
2007 Bandwidth Challenge Win:
Five Applications Simultaneously
• Acquisition and Visualization
– Live Instrument Data
• Chemistry
– Rare Archival Material
• Humanities
• Acquisition, Analysis, and Visualization
– Trace Data
• Computer Science
– Simulation Data
• Life Science
• High Energy Physics
Beyond a Demo
• To make Lustre across the Wide Area Network
useful and more than a demo we needed to be
able to span heterogeneous name spaces
– In Unix each user has a UID
– It could differ from system to system
– To preserve file ownership across systems, we
developed a way to map UIDs between them
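The UID problem can be illustrated with a small lookup table. This is a toy sketch, not IU's implementation: the site names, the UID numbers, the fallback to an unprivileged "nobody" UID, and the functions `build_uid_map` and `map_uid` are all hypothetical.

```python
NOBODY_UID = 65534  # assumed fallback for users with no mapping entry


def build_uid_map(entries):
    """Build a lookup from (site, local UID) to the UID used on the file system.

    `entries` is an iterable of (site, local_uid, canonical_uid) tuples.
    """
    return {(site, local_uid): canonical for site, local_uid, canonical in entries}


def map_uid(uid_map, site, local_uid):
    """Translate a remote site's UID; unknown users map to NOBODY_UID."""
    return uid_map.get((site, local_uid), NOBODY_UID)


if __name__ == "__main__":
    # The same researcher has UID 1001 at site-a and UID 52044 at site-b
    # (made-up values), but should own the same files in both cases.
    uid_map = build_uid_map([
        ("site-a", 1001, 7000),
        ("site-b", 52044, 7000),
    ])
    print(map_uid(uid_map, "site-a", 1001))   # 7000
    print(map_uid(uid_map, "site-b", 52044))  # 7000
    print(map_uid(uid_map, "site-b", 1001))   # 65534: no entry for this user
```

Keying the table on the site as well as the UID matters because the same number can belong to different people on different systems.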
IU’s Data Capacitor WAN Filesystem
• Funded by Indiana University in 2008
• Put into production in April of 2008
• 360TB of storage available as production
service
• Centralized short-term storage for resources
nationwide:
– Simplifies use of distributed resources
– Projects space exists for mid-term storage
Gas Giant Planet Research
Visualization
Resource
PSC
Pittsburgh, PA
410 miles
Data
Capacitor
WAN
NCSA
Urbana, IL
147 miles
Tape
Archive
MSU
Starkville, MS
607 miles
2010: Lustre WAN at 100Gb
100 Gbit Testbed – Full Duplex Results
Writing to
Freiberg
10.8 GB/s
5*40 Gbit/s QDR IB
16*8 Gbit/s
16*20 Gbit/s DDR IB
100GbE
Writing to
Dresden
11.1 GB/s
16*8 Gbit/s
100 Gbit Testbed – Uni-Directional Efficiency
Unidirectional Lustre: 11.79 GByte/s (94.4%)
TCP/IP: 98.5 Gbit/s (98.5%)
Link: 100 Gbit/s (100.0%)
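The efficiency percentages follow from converting file system throughput in GByte/s to Gbit/s and dividing by the link rate. A minimal sketch, assuming decimal units (1 GByte = 8 Gbit) and a hypothetical helper named `link_efficiency`:

```python
def link_efficiency(gbytes_per_s, link_gbit_per_s=100.0):
    """Return throughput as a percentage of the link rate."""
    if link_gbit_per_s <= 0:
        raise ValueError("link rate must be positive")
    return gbytes_per_s * 8 / link_gbit_per_s * 100


if __name__ == "__main__":
    # Throughput figures quoted on the testbed slides.
    for label, rate in [("unidirectional", 11.79),
                        ("writing to Freiberg", 10.8),
                        ("writing to Dresden", 11.1)]:
        print(f"{label}: {link_efficiency(rate):.1f}% of 100 Gbit/s")
```

With these assumptions the unidirectional figure works out to roughly 94% of the link, in line with the percentage on the slide.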
2011: SCinet Research Sandbox
• Supercomputing 2011, Seattle
– Joint effort of SCinet and Technical Program
• Software Defined Networking and 100 Gbps
– From Seattle to Indianapolis (2,300 miles)
• Demonstrations using Lustre WAN
– network
– benchmark
– applications
Network, Hardware and Software
• Internet2 and ESnet, 50.5 ms RTT
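A 50.5 ms round trip on a 100 Gbps path means a great deal of data must be in flight to keep the link full. The sketch below computes the bandwidth-delay product; the helper name `bdp_bytes` is hypothetical, and the calculation ignores protocol overhead.

```python
def bdp_bytes(link_gbit_per_s, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    if link_gbit_per_s <= 0 or rtt_ms < 0:
        raise ValueError("link rate must be positive and RTT non-negative")
    bits_in_flight = link_gbit_per_s * 1e9 * (rtt_ms / 1000.0)
    return bits_in_flight / 8


if __name__ == "__main__":
    bdp = bdp_bytes(100, 50.5)
    print(f"{bdp / 1e6:.1f} MB must be in flight to fill the link")
```

At these values the product is about 631 MB, which is why a wide-area file system client needs many outstanding requests (or very large buffers) rather than one small request at a time.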
Application Results
• Applications
– Peak: 6.2 GB/s
– Sustained: 5.6 GB/s
NCGAS Workflow Demo at SC 11
Bloomington, IN
• STEP 1: data preprocessing, to evaluate and improve the quality of the input sequence
• STEP 2: sequence alignment to a known reference genome
• STEP 3: SNP detection to scan the alignment result for new polymorphisms
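The three steps can be illustrated with a deliberately tiny toy pipeline. This is not the software used in the demo: real workflows use dedicated tools and gapped alignment, and every function name, threshold, and sequence below is a hypothetical stand-in chosen to show the shape of the data flow.

```python
def mean_quality(qual):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - 33 for c in qual) / len(qual)


def preprocess(reads, min_quality=20):
    """STEP 1: keep only reads whose mean quality meets the threshold."""
    return [seq for seq, qual in reads if qual and mean_quality(qual) >= min_quality]


def align(read, reference):
    """STEP 2: ungapped alignment; return the position with the fewest mismatches."""
    best_pos, best_mismatches = None, None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if best_mismatches is None or mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos


def call_snps(reads, reference, min_support=2):
    """STEP 3: report (position, reference base, read base) seen in enough reads."""
    counts = {}
    for read in reads:
        pos = align(read, reference)
        if pos is None:
            continue  # read is longer than the reference
        for i, base in enumerate(read):
            ref_base = reference[pos + i]
            if base != ref_base:
                key = (pos + i, ref_base, base)
                counts[key] = counts.get(key, 0) + 1
    return sorted(key for key, n in counts.items() if n >= min_support)


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    raw_reads = [
        ("ACGTAGGT", "IIIIIIII"),  # high quality, differs from the reference at position 5
        ("GTAGGTTA", "IIIIIIII"),  # high quality, covers the same difference
        ("ACGTACGT", "!!!!!!!!"),  # low quality, dropped in STEP 1
    ]
    clean = preprocess(raw_reads)
    print(call_snps(clean, reference))
```

Each step consumes the previous step's output, which is why a shared file system lets the stages run at different sites without copying intermediate files by hand.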
Seattle, WA
Monon 100
• Provides 100Gb connectivity between IU and
Chicago
• Internet2 deploying 100Gb networks nationally
• New opportunities for sharing Big Data
• New opportunities for moving Big Data
Bandwidth comparison (scale: 0 to 100 Gbps)
• Internet2 (100 Gbps)
• DDR3 SDRAM (51.2 Gbps, 6.4 GBps)
• IU Data Capacitor WAN (20 Gbps throughput)
• NLR to Sequencing Centers (10 Gbps/link)
• Ultra SCSI 160 Disk (1.2 Gbps, 160 MBps)
• Commodity Internet (1 Gbps but highly variable)
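To make these rates concrete, the sketch below estimates how long a dataset would take to move at each network rate in the comparison. The 1 TB dataset size is a hypothetical example, and the estimate assumes the full nominal rate is sustained with no protocol overhead, so real transfers will be slower.

```python
# Network rates in Gbit/s, taken from the comparison above.
RATES_GBPS = {
    "Internet2": 100.0,
    "IU Data Capacitor WAN": 20.0,
    "NLR to sequencing centers (per link)": 10.0,
    "Commodity Internet (nominal)": 1.0,
}


def transfer_seconds(size_bytes, rate_gbps):
    """Ideal transfer time in seconds at a sustained rate in Gbit/s."""
    if rate_gbps <= 0:
        raise ValueError("rate must be positive")
    return size_bytes * 8 / (rate_gbps * 1e9)


if __name__ == "__main__":
    one_tb = 1e12  # hypothetical dataset size in bytes
    for name, rate in RATES_GBPS.items():
        minutes = transfer_seconds(one_tb, rate) / 60
        print(f"{name}: {minutes:.1f} minutes per TB")
```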
NCGAS Logical Model
Your Friendly
National
Sequencing Center
10 Gbps
Lustre WAN
File System
NCGAS Mason
(Free for
NSF users)
Data Capacitor
Your Friendly
Regional
Sequencing Lab
IU POD
(12 cents
per core hour)
NO data storage Charges
100 Gbps
Amazon EC2
(20 cents
per core hour)
Your Friendly
Neighborhood
Sequencer
Amazon Cloud Storage
$80 – 120 per TB per month
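The prices in this model can be compared with simple arithmetic. The sketch below uses the per-core-hour and per-TB figures from the slide; the job size (10,000 core hours, 5 TB held for one month) is a hypothetical example, and it assumes the "no data storage charges" note applies to the IU side.

```python
def compute_cost_dollars(core_hours, cents_per_core_hour):
    """Compute charge in dollars, given a price in cents per core hour."""
    return core_hours * cents_per_core_hour / 100


def storage_cost_range_dollars(terabytes, months, low_per_tb=80, high_per_tb=120):
    """Low and high storage charge in dollars for a per-TB-per-month price range."""
    return terabytes * months * low_per_tb, terabytes * months * high_per_tb


if __name__ == "__main__":
    core_hours, terabytes, months = 10_000, 5, 1  # hypothetical job
    pod = compute_cost_dollars(core_hours, 12)    # IU POD: 12 cents per core hour
    ec2 = compute_cost_dollars(core_hours, 20)    # Amazon EC2: 20 cents per core hour
    low, high = storage_cost_range_dollars(terabytes, months)
    print(f"IU POD compute: ${pod:,.0f} (no storage charge assumed)")
    print(f"EC2 compute: ${ec2:,.0f} plus storage ${low:,.0f} to ${high:,.0f}")
```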
National Center for Genome Analysis
Support (NCGAS)
• Using high speed networks like I2 and the
Monon 100, the DC-WAN facility will be
ingesting data from Laboratories with next
generation sequencers and serving reference
data sets from sources like NCBI.
• Data will be processed using IU’s
Cyberinfrastructure
Special Thanks To
• NCGAS – Bill Barnett and Rich LeDuc
• IU’s High Performance Systems Group
• Application owners and IU’s HPA Team
• IU’s Data Capacitor Team
• Matt Davy, Tom Johnson, Ed Balas, Jeff Ambern, Martin Swany
• Andrew Lee, Chris Robb, Matthew Zekauskas and Internet2
• Evangelos Chaniotakis, Patrick Dorn and ESnet
• Brocade – 10Gb Cards, 100Gb Cards, and optics
• Ciena – 100 Gb optics
• DDN – 2 SFA 10K
• IBM – iDataPlex nodes
• Internet2, ESnet – Network link and equipment
• Whamcloud – Lustre support
Thank you!
Stephen Simms
ssimms@iu.edu
High Performance File Systems
hpfs-admin@iu.edu