Cancer Genome ucsd2015


Enabling Genomic BIG DATA with Content Centric Networking

J.J. Garcia-Luna-Aceves

UC Santa Cruz jj@soe.ucsc.edu

Example Today

• Cancer Genomics Hub (CGHub)

• CGHub’s purpose was to store the genomes sequenced as part of The Cancer Genome Atlas (TCGA) project.

• At about 300 GB per genome, this amounted to about 17,000 genomes over the 44-month lifetime of the project.

• Transmission requirements of the archiving effort reached a sustained rate of 17 Gbps by the end of the 44-month project.
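As a rough check on the figures above, the sketch below multiplies out the archive size and asks how long one full copy would take at the quoted sustained rate. It assumes decimal units (1 GB = 10^9 bytes) and a perfectly utilized link; the function names are illustrative only.

```python
GB = 10**9  # assumption: decimal gigabytes, in bytes


def archive_size_bytes(genomes, gb_per_genome):
    """Total archive size in bytes."""
    return genomes * gb_per_genome * GB


def transfer_days(size_bytes, rate_gbps):
    """Days needed to move size_bytes at a constant rate_gbps (no overhead)."""
    seconds = size_bytes * 8 / (rate_gbps * 1e9)
    return seconds / 86400


total = archive_size_bytes(17_000, 300)
print(f"archive size: {total / 1e15:.1f} PB")  # 5.1 PB
print(f"one full copy at 17 Gbps: {transfer_days(total, 17):.1f} days")  # 27.8 days
```

Under these assumptions the archive is about 5.1 PB, and a single end-to-end copy at 17 Gbps takes roughly four weeks.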

Cancer Genomics Hub (UCSC) is Housed in SDSC CoLo: Large Data Flows to End Users

[Figure: Cumulative TBs of CGH Files Downloaded; labels: 1G, 8G, 15G, 30 PB]

Data Source: David Haussler, Brad Smith, UCSC

Example Today

CGHub had to use current technology:

Data organization and search: XML schema definitions

Security: Existing techniques, HTTPS

Big data transfer: Modified BitTorrent (GeneTorrent, or GT)

• HTTPS and BitTorrent are problematic

• No caching with HTTPS

• TCP limitations percolate to the multiple connections used under BT or GT

• A potential playground for DDoS?

The Future of Genomic BIG DATA

Is the Internet ready to support personalized medicine?

Is the future of genomic data really different?

• If not, what technology would be limiting progress?

First:

• Genomic data are really BIG DATA.

Personalized medicine will make genomic data volumes explode, and many other applications of genomic data will develop

Even if one site or a few mirrors are used for a personal genome, it has to be uploaded.

Is Technology Ready in 5-10 Years?

• Communication, storage, and computing technologies are not the problem:

– Production optical transport @ 1 Tbps http://www.lightreading.com/document.asp?doc_id=188442&

– Individual hosts able to transmit at 100 Gbps

– I/O throughput can keep up with the network speeds (i.e., disk will be able to handle 100 Gbps = 12.5 GBps).

– Memory and processing costs will continue to decline.
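The unit conversion in the list above (100 Gbps = 12.5 GBps) can be checked with a short sketch, which also estimates how long a single genome of about 300 GB would take at host line rate. It assumes 8 bits per byte, decimal units, and no protocol overhead; the helper names are illustrative.

```python
def gbps_to_gbytes_per_s(rate_gbps):
    """Convert gigabits per second to gigabytes per second."""
    return rate_gbps / 8


def seconds_to_send(size_gb, rate_gbps):
    """Ideal transfer time for size_gb gigabytes at rate_gbps."""
    return size_gb / gbps_to_gbytes_per_s(rate_gbps)


print(gbps_to_gbytes_per_s(100))   # 12.5
print(seconds_to_send(300, 100))   # 24.0
```

At an ideal 100 Gbps, a 300 GB genome moves in 24 seconds, which is why the hosts and disks are not the bottleneck in this argument.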

Networking is The BIG PROBLEM for Genomic BIG DATA

• The speed of light will not increase, but the number of genomic data repositories or the distance between them will

• The Internet protocol stack was not designed for BIG DATA transfer over paths with large bandwidth-delay products:

– TCP throughput

– DDoS vulnerabilities (e.g., SYN flooding)

– Caching vs privacy (e.g., HTTPS)

– Static directory services (e.g., DNS vs content directories).
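To illustrate the TCP throughput item, the sketch below uses the standard window bound: a single connection cannot send faster than one window per round trip. The 50 ms round-trip time is an illustrative assumption, not a figure from the slides; 65,535 bytes is the largest TCP window without window scaling.

```python
def window_limited_gbps(window_bytes, rtt_s):
    """Upper bound on one TCP connection's rate: one window per RTT."""
    return window_bytes * 8 / rtt_s / 1e9


def window_needed_bytes(rate_gbps, rtt_s):
    """Window (the bandwidth-delay product) needed to keep the path full."""
    return rate_gbps * 1e9 * rtt_s / 8


rtt = 0.050  # assumed 50 ms round-trip time, for illustration only
print(f"{window_limited_gbps(65_535, rtt):.4f} Gbps with a 64 KB window")
print(f"{window_needed_bytes(40, rtt) / 1e6:.0f} MB window to fill 40 Gbps")
```

With these assumed numbers, an unscaled window yields about 0.01 Gbps, while filling a 40 Gbps path needs roughly 250 MB in flight, and any loss forces that large window to be rebuilt.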

Sobering Results for Today’s Internet

TCP and variations (e.g., BT) cannot be the baseline to support big data genomics

Storage must be used to reduce bandwidth-delay products

Simulation results

- 4-day simulation

- 20 locations

- 40 Gbps links with 5 to 25 ms latency

- Average degree of 5

[Plot panels: TCP (client/server); Content centric approach]
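One way to read the claim that storage reduces bandwidth-delay products is sketched below using the simulation's link parameters. It assumes the quoted latency is one-way (so RTT is twice the latency) and uses a hypothetical 4-hop path of 25 ms links; neither assumption comes from the slides.

```python
def bdp_bytes(rate_gbps, one_way_latencies_ms):
    """Bandwidth-delay product over a path, taking RTT = 2 x summed one-way latency."""
    rtt_s = 2 * sum(one_way_latencies_ms) / 1000
    return rate_gbps * 1e9 * rtt_s / 8


path_ms = [25, 25, 25, 25]  # hypothetical 4-hop path, 25 ms per link

# Without in-network storage, one control loop spans the whole path.
end_to_end = bdp_bytes(40, path_ms)

# With storage at every hop, each control loop spans a single link.
per_hop = [bdp_bytes(40, [latency]) for latency in path_ms]

print(f"end-to-end: {end_to_end / 1e6:.0f} MB in flight")
print(f"per hop:    {max(per_hop) / 1e6:.0f} MB in flight")
```

Under these assumptions the end-to-end loop must keep about 1,000 MB in flight, while each hop-by-hop loop needs about 250 MB, so in-network storage shortens the feedback loop that a transport protocol has to manage.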

Internetworking BeND

• TCP/IP architecture must change for BIG DATA, but how?

• Content Centric Network architectures (CCN) such as NDN and CCNx have been proposed

• The main advantage of CCN solutions is caching

• But… NDN and CCNx are still at early stages of development

• Big Data Networking is all about bandwidth-delay product, not replacing IP addresses with names
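The caching advantage can be pictured with a toy content store that serves repeated requests for the same name locally. This is a minimal sketch with an LRU eviction policy chosen for illustration; it is not the NDN or CCNx API, and the class, the `fetch_from_origin` callback, and the content names are all hypothetical.

```python
from collections import OrderedDict


class ContentStore:
    """Toy name-keyed cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, name, fetch_from_origin):
        """Return content for name, fetching upstream only on a miss."""
        if name in self._items:
            self._items.move_to_end(name)  # mark as most recently used
            self.hits += 1
            return self._items[name]
        self.misses += 1
        data = fetch_from_origin(name)
        self._items[name] = data
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
        return data


def origin(name):
    # Stand-in for a distant repository; the payload is a placeholder.
    return f"payload for {name}".encode()


store = ContentStore(capacity=2)
store.get("/genomes/sample-001/chunk/0", origin)  # miss, fetched upstream
store.get("/genomes/sample-001/chunk/0", origin)  # hit, served locally
store.get("/genomes/sample-002/chunk/0", origin)  # miss
print(store.hits, store.misses)  # 1 2
```

Only the two misses cross the long path to the origin; the repeated request is answered from nearby storage.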
