ITS Spotlight Workgroup Edition Enterprise


Ben Rogers

ITS – Research Services

1

Data Awareness

Data Management

Data Storage

Campus Resources

Questions

2

Sequencing changes faster than IT

Understand the data you will produce

Understand the data you will keep

Understand how the data will move

3

Understand the sizes of the data each instrument produces

◦ How often will you collect this data?

◦ What IT resources are needed for each data set?

How will you handle?

◦ Raw Data

◦ Intermediate Data

◦ Derived Data
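The sizing questions above can be turned into a rough yearly estimate. The sketch below multiplies a per-run size by the number of runs per year for raw, intermediate, and derived data. The per-run sizes and the run count are hypothetical placeholders, not measurements from any instrument, and the function name `annual_volume_tb` is made up for illustration.

```python
def annual_volume_tb(per_run_gb, runs_per_year):
    """Return yearly volume in TB for each data category (1 TB = 1000 GB)."""
    return {kind: gb * runs_per_year / 1000.0 for kind, gb in per_run_gb.items()}


# Hypothetical per-run sizes in GB; replace with measurements from your instrument.
per_run_gb = {"raw": 200, "intermediate": 400, "derived": 20}

volumes = annual_volume_tb(per_run_gb, runs_per_year=50)
for kind, tb in volumes.items():
    print(f"{kind}: {tb:.1f} TB/year")
print(f"total: {sum(volumes.values()):.1f} TB/year")
```

Splitting the estimate by category makes it easier to see which kind of data dominates the storage need.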

4

Must decide what data to keep

◦ How long?

◦ How will it be stored?

Is it cheaper to:

◦ Rerun the experiment

◦ Rerun the analysis
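One way to frame the keep-or-regenerate question is to compare the cost of storing a data set for the retention period against the cost of producing it again. This is a minimal sketch under simplifying assumptions: a flat storage price per TB per year, a single known regeneration cost, and no allowance for how likely it is that the data will be needed. All numbers and the name `cheaper_to_keep` are hypothetical.

```python
def cheaper_to_keep(size_tb, years, storage_cost_per_tb_year, regenerate_cost):
    """Return True if storing the data costs no more than regenerating it once."""
    storage_cost = size_tb * years * storage_cost_per_tb_year
    return storage_cost <= regenerate_cost


# Hypothetical figures: 10 TB kept for 3 years at $100 per TB per year,
# compared with $5000 to rerun the experiment or the analysis.
if cheaper_to_keep(10, 3, 100, 5000):
    print("Keeping the data is cheaper under these assumptions")
else:
    print("Regenerating the data is cheaper under these assumptions")
```

The same comparison can be run twice, once with the cost of rerunning the experiment and once with the cost of rerunning the analysis.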

5

Data captured by the instrument must be moved

Terabytes of data may be involved

Moving terabytes of data across networks is non-trivial

◦ The network is not always the bottleneck
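A transfer runs only as fast as its slowest stage, which is why the network is not always the limit. The sketch below picks the slowest stage from a set of sustained rates. The stage names and rates are illustrative assumptions, and `bottleneck` is a hypothetical helper.

```python
def bottleneck(stage_rates_mb_s):
    """Return (stage name, rate) for the slowest stage of a transfer."""
    name = min(stage_rates_mb_s, key=stage_rates_mb_s.get)
    return name, stage_rates_mb_s[name]


# Illustrative sustained rates in MB/second for one copy between two desktops.
stages = {
    "source disk read": 100,
    "gigabit ethernet": 120,
    "destination disk write": 100,
}

stage, rate = bottleneck(stages)
print(f"Slowest stage: {stage} at {rate} MB/second")
```

In this example a disk, not the network, sets the effective transfer rate.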

6

Common Data Movements

◦ Instrument to local capture storage

◦ Capture storage to shared storage

◦ Shared storage to HPC resource

◦ Shared storage to desktop

◦ Shared storage to backup/replication

7

Globus Online – Fastest for big files but requires GridFTP

scp – Fetch (Mac), WS_FTP (Windows)

Network Drive File Copy – Slowest but simplest

External Hard Drive – Reasonably fast but requires physical movement

8

Transfer Mechanism – Transfer Speed

External Hard Drive – 100MB/second read + 100MB/second write + walking time

Gigabit Ethernet – Up to 120MB/second

Typical Desktop Hard Drive – 100MB/second

Typical Desktop SSD – 300MB/second

GridFTP over 1Gb – 120MB/second

CIFS over 1Gb – 60-80MB/second

scp over 1Gb – 60-100MB/second

Fastest network filesystem on campus – 600MB/second single copy, 6GB/second aggregate

Moving 1TB can easily take 3 hours or more!
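The arithmetic behind that estimate is size divided by sustained rate. The sketch below works it out for a few of the rates quoted above, assuming decimal units (1 TB = 1,000,000 MB) and a rate that holds for the whole transfer, which real transfers rarely achieve. The function name `transfer_hours` is made up for illustration.

```python
def transfer_hours(size_tb, rate_mb_s):
    """Hours needed to move size_tb terabytes at a sustained rate in MB/second."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / rate_mb_s / 3600


# Sustained rates in MB/second, taken from the figures quoted above.
rates_mb_s = {
    "CIFS over 1Gb (low end)": 60,
    "Typical desktop hard drive": 100,
    "GridFTP over 1Gb": 120,
    "Fastest campus filesystem, single copy": 600,
}

for name, rate in rates_mb_s.items():
    print(f"{name}: {transfer_hours(1, rate):.1f} hours per TB")
```

At 100MB/second, 1TB takes about 2.8 hours, and at 60MB/second it takes more than 4.5 hours, which is consistent with the 3-hour figure.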

9

Very important

There are many solutions

◦ Wiki, spreadsheet, database, etc

◦ Campus Options

 Campus Wiki

 Sharepoint

 Redcap

 Galaxy

Make sure you have backups!

10

Cheap storage is easy

◦ 2TB External USB Drive

Big storage is harder

◦ 50TB Storage Server

Big, fast, cheap, safe storage is much harder

◦ 50TB Storage Server Pair

 Checksum

 High Performance Network

 Backups

Cost of storage does not scale linearly
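The checksum item in the list above is what lets you confirm that a copy on a second server or a backup still matches the original. A minimal sketch, assuming SHA-256 is an acceptable checksum and that both copies are reachable as local paths. The helper names `sha256_of` and `copies_match` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path, copy_path):
    """Return True if both files have the same SHA-256 digest."""
    return sha256_of(source_path) == sha256_of(copy_path)
```

Reading in chunks matters for sequencing files, which are often too large to load into memory at once. Storing the digest next to the file also lets you recheck a copy later without access to the original.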

11

[Chart: storage cost for AWS, USB, and USBx3 at 1TB, 50TB, and 500TB; vertical axis from $0,0 to $1,8]

12

Where you store the data can impact how fast you can analyze your data.

On Helium, testing showed more than a 100% difference in BWA analysis time depending on where the data was stored.

If you analyze data on your desktop, fast storage will likely improve analysis time for NGS.

Galaxy is being optimized to take advantage of this.

If you run directly on a cluster, ask for recommendations.

13

Galaxy

Redcap

Helium

◦ Colocation

◦ /nfsscratch – 110TB

◦ /glusterscratch – 146TB

R Drive

ITS Research Data Storage Service Pilot

Lab/Shared ZFS Systems

14

Today

◦ All data in Galaxy should be considered transient

 Deleted after 30 days

◦ Data processing platform only

◦ Please back up all data that is valuable to you!

Future

◦ Solutions to allow longer term storage of data

15

Increased availability of 10Gb Networking

Research Data Storage Service

Backup Service

Cloud Storage

Galaxy Data Libraries

16

Ben-rogers@uiowa.edu

17

Tom Bair – Economy of Scale Reversed

Safe Photo http://americanbestlocksmith.com/wpcontent/uploads/2010/09/safe-installation.jpg

BioTeam http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CF4QFjAA&url=http%3A%2F%2Fwww.bioteam.net%2Fwpcontent%2Fuploads%2F2010%2F03%2FcdagxgenstorageForNGS_v3.pdf&ei=0cwWUJPJG4WHqQGihoHoDw&usg=AFQjCNFrzHSvQ8y4Ze3igsXd9mFV_EWb_Q

18
