Ben Rogers
ITS – Research Services
1
Data Awareness
Data Management
Data Storage
Campus Resources
Questions
2
Sequencing changes faster than IT
Understand the data you will produce
Understand the data you will keep
Understand how the data will move
3
Understand the sizes of the data each instrument produces
◦ How often will you collect this data?
◦ What IT resources are needed for each data set?
How will you handle:
◦ Raw Data
◦ Intermediate Data
◦ Derived Data
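The sizing questions above can be turned into a rough yearly estimate. The sketch below is only an illustration: the function name and the example figures (250 GB per run, 2 runs per week) are hypothetical and are not taken from any instrument described here.

```python
def yearly_volume_tb(run_size_gb: float, runs_per_week: float,
                     weeks_per_year: int = 52) -> float:
    """Estimate data produced per year, in terabytes (1 TB = 1000 GB)."""
    return run_size_gb * runs_per_week * weeks_per_year / 1000.0


if __name__ == "__main__":
    # Hypothetical instrument: 250 GB of raw data per run, 2 runs per week.
    raw_tb = yearly_volume_tb(run_size_gb=250, runs_per_week=2)
    print(f"Estimated raw data: {raw_tb:.1f} TB/year")
```

Running the same estimate separately for raw, intermediate, and derived data gives a first idea of how much storage each category needs.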
4
Must decide what data to keep
◦ How long?
◦ How will it be stored?
Is it cheaper to:
◦ Rerun the experiment
◦ Rerun the analysis
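One way to frame the keep-or-regenerate question is to compare the cost of storing a data set for the retention period with the cost of producing it again. This is a minimal sketch with made-up prices; every number and name in it is an assumption for illustration only.

```python
def storage_cost(size_tb: float, cost_per_tb_year: float, years: float) -> float:
    """Total cost of keeping a data set for the given number of years."""
    return size_tb * cost_per_tb_year * years


def cheaper_to_keep(size_tb: float, cost_per_tb_year: float, years: float,
                    regenerate_cost: float) -> bool:
    """True if storing the data costs no more than regenerating it."""
    return storage_cost(size_tb, cost_per_tb_year, years) <= regenerate_cost


if __name__ == "__main__":
    # Hypothetical: 5 TB kept for 3 years at $100 per TB per year.
    keep = storage_cost(5, 100, 3)
    for label, redo in [("rerun the analysis", 400), ("rerun the experiment", 10_000)]:
        choice = "keep the data" if cheaper_to_keep(5, 100, 3, redo) else label
        print(f"Storage ${keep:,.0f} vs {label} ${redo:,.0f}: {choice}")
```

A real comparison would also need to account for staff time, sample availability, and whether the experiment can be repeated at all.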
5
Data captured by the instrument must be moved
Terabytes of data may be involved
Moving terabytes of data across networks is non-trivial
◦ The network is not always the bottleneck
6
Common Data Movements
◦ Instrument to local capture storage
◦ Capture storage to shared storage
◦ Shared storage to HPC resource
◦ Shared storage to desktop
◦ Shared storage to backup/replication
7
Globus Online – Fastest for big files but requires GridFTP
scp – Fetch (Mac), WS_FTP (Windows)
Network Drive File Copy – Slowest but simplest
External Hard Drive – Reasonably fast but requires physical movement
8
Transfer Mechanism            Transfer Speed
External Hard Drive           100MB/second read + 100MB/second write + walking time
Gigabit Ethernet              Up to 120MB/second
Typical Desktop Hard Drive    100MB/second
Typical Desktop SSD           300MB/second
GridFTP over 1Gb              120MB/second
CIFS over 1Gb                 60-80MB/second
scp over 1Gb                  60-100MB/second
Fastest network filesystem on campus: 600MB/second single copy, 6GB/second aggregate
Moving 1TB can easily take 3 hours or more!
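The 3-hour figure follows from the transfer speeds listed above. As a rough sketch, assuming decimal units (1 TB = 1,000,000 MB) and a sustained rate with no protocol overhead, the time to move a data set can be estimated like this; the function name is hypothetical.

```python
def transfer_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained rate in MB/second."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / rate_mb_per_s / 3600


if __name__ == "__main__":
    # Rates taken from the speeds listed on this slide.
    rates = [
        ("CIFS over 1Gb (low end)", 60),
        ("Typical desktop hard drive", 100),
        ("GridFTP over 1Gb", 120),
    ]
    for label, rate in rates:
        print(f"{label}: {transfer_hours(1, rate):.1f} hours per TB")
```

At 100MB/second, 1TB takes about 2.8 hours, and the slowest link or disk in the path sets the effective rate.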
9
Very important
There are many solutions
◦ Wiki, spreadsheet, database, etc.
◦ Campus Options
Campus Wiki
SharePoint
REDCap
Galaxy
Make sure you have backups!
10
Cheap storage is easy
◦ 2TB External USB Drive
Big storage is harder
◦ 50TB Storage Server
Big, fast, cheap, safe storage is much harder
◦ 50TB Storage Server Pair
Checksum
High Performance Network
Backups
Cost of storage does not scale linearly
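The checksum item above refers to verifying that a stored or copied file is bit-for-bit identical to the original. The following is a minimal sketch of that idea using SHA-256 from the Python standard library; it is an illustration, not the mechanism any particular storage server uses, and the function names are hypothetical.

```python
import hashlib
import sys


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source: str, copy: str) -> bool:
    """True if both files have the same SHA-256 digest."""
    return sha256_of(source) == sha256_of(copy)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python verify_copy.py ORIGINAL COPY
    print("match" if copies_match(sys.argv[1], sys.argv[2]) else "MISMATCH")
```

Recording the digest when the data is first captured makes it possible to check later copies and backups against it.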
11
[Chart: storage cost in dollars at 1TB, 50TB, and 500TB, comparing AWS, USB, and USBx3]
12
Where you store the data can impact how fast you can analyze your data.
During testing on Helium, BWA analysis time differed by more than 100% depending on where the data was stored.
If you do analysis on your desktop, fast storage will likely improve analysis time for NGS.
Galaxy is being optimized to take advantage of this.
If running directly on a cluster, ask for recommendations.
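To see how much the storage location matters for your own workflow, one option is to time a sequential read of the same file from each candidate location. This is a rough sketch only: the function name is hypothetical, and operating system caching can make a second read of the same file look much faster than the storage really is.

```python
import sys
import time


def read_throughput_mb_s(path: str, chunk_size: int = 4 * 1024 * 1024) -> float:
    """Read a file sequentially and return the observed rate in MB/second."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return total_bytes / 1_000_000 / elapsed


if __name__ == "__main__":
    # Usage: python read_speed.py FILE [FILE ...]
    for path in sys.argv[1:]:
        print(f"{path}: {read_throughput_mb_s(path):.0f} MB/second")
```

Use a file large enough to be representative of your data, and compare the result with the transfer speeds that the storage is expected to deliver.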
13
Galaxy
REDCap
Helium
◦ Colocation
◦ /nfsscratch – 110TB
◦ /glusterscratch – 146TB
R Drive
ITS Research Data Storage Service Pilot
Lab/Shared ZFS Systems
14
Today
◦ All data in Galaxy should be considered as transient
Deleted after 30 days
◦ Data processing platform only
◦ Please back up all data that is valuable to you!
Future
◦ Solutions to allow longer term storage of data
15
Increased availability of 10Gb Networking
Research Data Storage Service
Backup Service
Cloud Storage
Galaxy Data Libraries
16
Ben-rogers@uiowa.edu
17
Tom Bair – Economy of Scale Reversed
Safe Photo http://americanbestlocksmith.com/wpcontent/uploads/2010/09/safe-installation.jpg
BioTeam http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CF4QFjAA&url=http%3A%2F%2Fwww.bioteam.net%2Fwpcontent%2Fuploads%2F2010%2F03%2FcdagxgenstorageForNGS_v3.pdf&ei=0cwWUJPJG4WHqQGihoHoDw&usg=AFQjCNFrzHSvQ8y4Ze3igsXd9mFV_EWb_Q
18