Data Awareness
Data Management
Data Storage
Campus Resources
Questions
Sequencing changes faster than IT
Understand the data you will produce
Understand the data you will keep
Understand how the data will move
Understand the sizes of the data each instrument produces
◦ How often will you collect this data?
◦ What IT resources are needed for each data set?
How will you handle each type of data?
◦ Raw Data
◦ Intermediate Data
◦ Derived Data
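The sizing questions above come down to simple arithmetic. A minimal sketch, assuming hypothetical per-run sizes and run counts (the `DataSet` class and every number below are placeholders for illustration, not figures from any instrument):

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    """One category of data produced per instrument run (hypothetical values)."""
    name: str
    gb_per_run: float
    runs_per_year: int

    def tb_per_year(self) -> float:
        # Decimal units: 1 TB = 1000 GB
        return self.gb_per_run * self.runs_per_year / 1000.0


def total_tb_per_year(data_sets) -> float:
    """Sum the yearly volume across all data categories."""
    return sum(d.tb_per_year() for d in data_sets)


# Placeholder numbers only; substitute what your instrument actually produces.
example = [
    DataSet("raw", gb_per_run=500.0, runs_per_year=50),
    DataSet("intermediate", gb_per_run=200.0, runs_per_year=50),
    DataSet("derived", gb_per_run=20.0, runs_per_year=50),
]

for d in example:
    print(f"{d.name}: {d.tb_per_year():.1f} TB/year")
print(f"total: {total_tb_per_year(example):.1f} TB/year")
```

Running the estimate separately for raw, intermediate, and derived data shows which category drives the storage requirement.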
Must decide what data to keep
◦ How long?
◦ How will it be stored?
Is it cheaper to:
◦ Rerun the experiment
◦ Rerun the analysis
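One way to frame the keep-or-rerun question is to compare the cost of storing a data set for its retention period with the cost of regenerating it. A rough sketch, assuming a flat per-terabyte yearly storage price and a single regeneration cost (both function names and all inputs are hypothetical):

```python
def storage_cost(size_tb: float, cost_per_tb_year: float, years: float) -> float:
    """Cost of keeping size_tb terabytes for the given number of years."""
    return size_tb * cost_per_tb_year * years


def cheaper_to_keep(size_tb: float, cost_per_tb_year: float, years: float,
                    regenerate_cost: float) -> bool:
    """True if storing the data costs no more than regenerating it once.

    regenerate_cost is whatever it takes to rerun the experiment or the
    analysis (reagents, instrument time, compute, staff time).
    """
    return storage_cost(size_tb, cost_per_tb_year, years) <= regenerate_cost


# Placeholder inputs: 10 TB kept for 3 years at 100 per TB per year.
print(storage_cost(10, 100, 3))
print(cheaper_to_keep(10, 100, 3, regenerate_cost=5000))
print(cheaper_to_keep(10, 100, 3, regenerate_cost=2000))
```

This ignores whether the experiment can be repeated at all; irreplaceable samples change the answer regardless of cost.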
Data captured by the instrument must be moved
◦ Terabytes of data may be involved
Moving terabytes of data across networks is non-trivial
The network is not always the bottleneck
Common Data Movements
◦ Instrument to local capture storage
◦ Capture storage to shared storage
◦ Shared storage to HPC resource
◦ Shared storage to desktop
◦ Shared storage to backup/replication
Globus Online – Fastest for big files but requires GridFTP
scp – Fetch (Mac), WS_FTP (Windows)
Network Drive File Copy – Slowest but simplest
External Hard Drive – Reasonably fast but requires physical movement
Transfer Mechanism            Transfer Speed
External Hard Drive           100MB/second read + 100MB/second write + walking time
Gigabit Ethernet              Up to 120MB/second
Typical Desktop Hard Drive    100MB/second
Typical Desktop SSD           300MB/second
GridFTP over 1Gb              120MB/second
CIFS over 1Gb                 60-80MB/second
scp over 1Gb                  60-100MB/second

Fastest network filesystem on campus: 600MB/second single copy, 6GB/second aggregate
Moving 1TB can easily take 3 hours or more!
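The transfer times implied by these rates can be checked with a short calculation. A minimal sketch, assuming decimal units (1TB = 1,000,000MB) and a sustained rate for the whole transfer; the helper names are illustrative, and treating the external drive as two back-to-back copies is an interpretation of the "read + write + walking time" entry:

```python
MB_PER_TB = 1_000_000  # decimal units: 1 TB = 1,000,000 MB


def transfer_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained rate in MB/second."""
    return size_tb * MB_PER_TB / rate_mb_per_s / 3600.0


def external_drive_hours(size_tb: float, rate_mb_per_s: float = 100.0,
                         walking_hours: float = 0.0) -> float:
    """Copy onto the drive, carry it, then copy off again (assumed sequential)."""
    return 2 * transfer_hours(size_tb, rate_mb_per_s) + walking_hours


# Sustained rates in MB/second, taken from the table above.
RATES = {
    "GridFTP over 1Gb": 120.0,
    "CIFS over 1Gb (low end)": 60.0,
    "scp over 1Gb (high end)": 100.0,
}

for name, rate in RATES.items():
    print(f"{name}: {transfer_hours(1.0, rate):.1f} hours per TB")
print(f"External drive: {external_drive_hours(1.0):.1f} hours per TB plus walking time")
```

At 100MB/second, 1TB takes about 2.8 hours, which matches the "3 hours or more" figure; at 60MB/second it is closer to 4.6 hours.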
Very important
There are many solutions
◦ Wiki, spreadsheet, database, etc.
Campus Options
◦ Campus Wiki, Sharepoint, Redcap, Galaxy
Make sure you have backups!
Cheap storage is easy
◦ 2TB External USB Drive
Big storage is harder
◦ 50TB Storage Server
Big, fast, cheap, safe storage is much harder
◦ 50TB Storage Server Pair, Checksum, High Performance Network, Backups
Cost of storage does not scale linearly
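The checksum item is what makes a second copy trustworthy: a copy is only safe if it can be shown to match the original. A minimal sketch using Python's standard `hashlib`; the choice of SHA-256 and the function names are assumptions for illustration, not a statement of what any campus system uses:

```python
import hashlib


def sha256_of(path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that very large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copy) -> bool:
    """True if both files have identical contents according to SHA-256."""
    return sha256_of(original) == sha256_of(copy)
```

Recording the digest alongside each file when it is first captured allows later copies and backups to be verified without access to the original.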
[Chart: storage cost in dollars at 1TB, 50TB, and 500TB; series: AWS, USB, USBx3]
Where you store the data can affect how fast you can analyze it.
During testing on Helium, BWA analysis time differed by more than 100% depending on where the data was stored.
If you are doing analysis on your desktop, fast storage will likely improve analysis time for NGS.
Galaxy is being optimized to take advantage of this.
If you are running directly on a cluster, ask for recommendations.
Galaxy
Redcap
Helium
Colocation
/nfsscratch – 110TB
/glusterscratch – 146TB
R Drive
ITS Research Data Storage Service Pilot
Lab/Shared ZFS Systems
Today
◦ All data in Galaxy should be considered transient
◦ Deleted after 30 days
◦ Data processing platform only
◦ Please back up all data that is valuable to you!
Future
◦ Solutions to allow longer-term storage of data
Increased availability of 10Gb Networking
Research Data Storage Service
Backup Service
Cloud Storage
Galaxy Data Libraries
Tom Bair – Economy of Scale Reversed
Safe Photo: http://americanbestlocksmith.com/wp content/uploads/2010/09/safe-installation.jpg
BioTeam: http://www.google.com/url?sa=t&rct=j&q=&esrc =s&source=web&cd=1&ved=0CF4QFjAA&url=htt p%3A%2F%2Fwww.bioteam.net%2Fwp content%2Fuploads%2F2010%2F03%2Fcdag xgen storageForNGS_v3.pdf&ei=0cwWUJPJG4WHqQGih oHoDw&usg=AFQjCNFrzHSvQ8y4Ze3igsXd9mFV_ EWb_Q