ITS – Research Services
Sequencing changes faster than IT
Understand the data you will produce
Understand the data you will keep
Understand how the data will move
Understand the sizes of the data each instrument produces
◦ How often will you collect this data?
◦ What IT resources are needed for each data set?
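The sizing questions above come down to simple arithmetic. A minimal sketch in Python follows. The instrument names, run sizes, and run frequencies are made-up placeholders, not measurements from any real instrument.

```python
# Hypothetical instruments: (GB produced per run, runs per month).
# These numbers are placeholders; substitute your own measurements.
INSTRUMENTS = {
    "sequencer_a": (600, 4),
    "sequencer_b": (15, 20),
}


def yearly_volume_tb(gb_per_run, runs_per_month):
    """Return the raw data volume for one year in TB (1 TB = 1000 GB)."""
    return gb_per_run * runs_per_month * 12 / 1000


for name, (gb_per_run, runs_per_month) in INSTRUMENTS.items():
    volume = yearly_volume_tb(gb_per_run, runs_per_month)
    print(f"{name}: {volume:.1f} TB/year of raw data")
```

Intermediate and derived data add to this total, so the raw figure is a lower bound.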
How will you handle:
◦ Raw Data
◦ Intermediate Data
◦ Derived Data
Must decide what data to keep
◦ How long?
◦ How will it be stored?
Is it cheaper to:
◦ Rerun the experiment
◦ Rerun the analysis
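One way to frame the keep-or-regenerate question is to compare the cost of storing a data set for its retention period with the cost of producing it again. The sketch below assumes the data will be needed exactly once more. Every price and size in it is a hypothetical placeholder.

```python
def storage_cost(size_tb, cost_per_tb_year, years):
    """Cost of keeping a data set for the whole retention period."""
    return size_tb * cost_per_tb_year * years


def cheaper_to_keep(size_tb, cost_per_tb_year, years, regenerate_cost):
    """True if storing the data costs no more than regenerating it once."""
    return storage_cost(size_tb, cost_per_tb_year, years) <= regenerate_cost


# Hypothetical example: 5 TB of intermediate data kept for 3 years at
# 100 (currency units) per TB per year, versus 400 to rerun the analysis.
keep = storage_cost(5, 100, 3)
print(f"Storage cost: {keep}")
print("Keep the data" if cheaper_to_keep(5, 100, 3, 400) else "Rerun the analysis")
```

With these placeholder numbers, rerunning the analysis is cheaper. Raw data that can only be regenerated by rerunning the experiment will usually tip the other way.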
Data captured by the instrument must be moved
Terabytes of data may be involved
Moving terabytes of data across networks is non-trivial
◦ The network is not always the bottleneck
Common Data Movements
◦ Instrument to local capture storage
◦ Capture storage to shared storage
◦ Shared storage to HPC resource
◦ Shared storage to desktop
◦ Shared storage to backup/replication
Globus Online – Fastest for big files but requires GridFTP
scp – Fetch (Mac), WS_FTP (Windows)
Network Drive File Copy – Slowest but simplest
External Hard Drive – Reasonably fast but requires physical movement
[Chart: transfer speed comparison for External Hard Drive, Typical Desktop Hard Drive, Typical Desktop SSD, GridFTP over 1Gb, CIFS over 1Gb, and scp over 1Gb]
[Chart annotations: 100MB/second read + 100MB/second write + Walking; Up to 120MB/second; Fastest network filesystem on campus, 600MB/second single copy]
Moving 1TB can easily take 3 hours or more!
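The 3-hour figure follows directly from the transfer rates quoted on the slides. A minimal sketch of the arithmetic follows, assuming the rate is sustained for the whole transfer and taking 1 TB as 1,000,000 MB. The function name is illustrative only.

```python
def transfer_hours(size_tb, rate_mb_per_s):
    """Hours needed to move size_tb at a sustained rate in MB/second."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / rate_mb_per_s / 3600


# Rates taken from the figures quoted in this deck.
for rate in (100, 120, 600):
    print(f"1 TB at {rate} MB/s: {transfer_hours(1, rate):.2f} hours")
```

At 100 MB/second the transfer takes just under 3 hours, and any slowdown from disks, protocol overhead, or a shared network pushes it past that.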
There are many solutions
◦ Wiki, spreadsheet, database, etc
◦ Campus Options
Make sure you have backups!
Cheap storage is easy
◦ 2TB External USB Drive
Big storage is harder
◦ 50TB Storage Server
Big, fast, cheap, safe storage is much harder
◦ 50TB Storage Server Pair
High Performance Network
Cost of storage does not scale linearly
[Chart: storage cost at 1TB, 50TB, and 500TB]
Where you store the data can impact how fast you can analyze your data.
On Helium during testing we saw over 100% difference in analysis time for BWA depending on where we stored the data.
If you do analysis on your desktop, fast storage will likely improve analysis time for NGS.
Galaxy is being optimized to take advantage of this.
If running directly on a cluster, ask for recommendations.
◦ /nfsscratch – 110TB
◦ /glusterscratch – 146TB
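To get a rough feel for how storage locations differ, you can time a sequential write in each candidate directory. The sketch below is an illustration, not the method used for the Helium tests. The function name and the default sizes are assumptions, and a single small write is only a coarse indicator of how a real analysis will behave.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=64, chunk_mb=4):
    """Write a temporary file in `directory` and return MB/second.

    The file is flushed and fsync'ed before the clock stops, so the
    result reflects data handed to the storage, not just the OS cache.
    """
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    n_chunks = size_mb // chunk_mb
    with tempfile.NamedTemporaryFile(dir=directory, delete=False) as f:
        path = f.name
        try:
            start = time.perf_counter()
            for _ in range(n_chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
            # Guard against a zero interval on very coarse clocks.
            elapsed = max(time.perf_counter() - start, 1e-9)
        finally:
            f.close()
            os.remove(path)
    return n_chunks * chunk_mb / elapsed


if __name__ == "__main__":
    target = tempfile.gettempdir()  # replace with the directory to test
    print(f"{target}: {write_throughput_mb_s(target):.0f} MB/s")
```

On a cluster, the same function could be pointed at a scratch directory you have write access to, with a larger `size_mb` and several repeats for a steadier number.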
ITS Research Data Storage Service Pilot
Lab/Shared ZFS Systems
◦ All data in Galaxy should be considered as transient
Deleted after 30 days
◦ Data processing platform only
◦ Please back up all data that is valuable to you!
◦ Solutions to allow longer term storage of data
Increased availability of 10Gb Networking
Research Data Storage Service
Galaxy Data Libraries
Tom Bair – Economy of Scale Reversed
Safe Photo http://americanbestlocksmith.com/wpcontent/uploads/2010/09/safe-installation.jpg
http://www.bioteam.net/wpcontent/uploads/2010/03/cdagxgenstorageForNGS_v3.pdf