J.J. Garcia-Luna-Aceves
Cancer Genomics Hub (CGHub)
CGHub’s purpose was to store the genomes sequenced as part of The Cancer Genome Atlas (TCGA) project.
At about 300 GB per genome, the roughly 17,000 genomes collected over the 44-month lifetime of the project amounted to about 5 PB of raw data.
The transmission requirements of the archiving effort reached a sustained rate of 17 Gbps by the end of the 44-month project.
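A back-of-the-envelope check of these numbers (a sketch; decimal units, 30-day months, and uniform ingest are my assumptions):

```python
# Back-of-the-envelope arithmetic for the CGHub archive, using the
# round numbers quoted above (300 GB/genome, ~17,000 genomes, 44 months).
GENOME_SIZE_GB = 300
NUM_GENOMES = 17_000
PROJECT_MONTHS = 44

total_gb = GENOME_SIZE_GB * NUM_GENOMES
seconds = PROJECT_MONTHS * 30 * 24 * 3600     # ~44 months, 30-day months

print(f"raw archive: ~{total_gb / 1e6:.1f} PB")            # ~5.1 PB
print(f"avg ingest:  ~{total_gb * 8 / seconds:.2f} Gbps")  # ~0.36 Gbps
# The sustained 17 Gbps quoted above is dominated by downloads:
# each genome is served many times (~30 PB downloaded in total,
# per the figure that follows).
```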
The Cancer Genomics Hub (UCSC) is housed in the SDSC CoLo:
[Figure: "Large Data Flows to End Users" (download rates growing from ~1 Gbps to ~15 Gbps) and "Cumulative TBs of CGHub Files Downloaded" (reaching ~30 PB). Data source: David Haussler and Brad Smith, UCSC.]
CGHub had to use the technology available at the time:
– Data organization and search: XML schema definitions
– Security: existing techniques (HTTPS)
– Big-data transfer: modified BitTorrent (GeneTorrent, or GT)
• HTTPS and BitTorrent are problematic:
• No caching with HTTPS
• TCP limitations percolate to multiple connections under BT or GT (see the sketch below)
• A potential playground for DDoS?
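To see how TCP limitations percolate into multi-connection transfers, consider the Mathis et al. steady-state bound on a single TCP flow: throughput ≤ (MSS/RTT) · (C/√p), with C ≈ 1.22. A minimal sketch, assuming illustrative values for MSS, RTT, and loss rate (these are not CGHub measurements):

```python
from math import sqrt

# Mathis et al. steady-state bound for one TCP (Reno) flow:
#   throughput <= (MSS / RTT) * (C / sqrt(p)),  C ~ 1.22
MSS_BYTES = 1460      # typical Ethernet MSS
RTT_S = 0.05          # 50 ms wide-area RTT (assumed)
LOSS_RATE = 1e-4      # one loss per 10,000 packets (assumed, optimistic)
C = 1.22

per_flow_bps = (MSS_BYTES * 8 / RTT_S) * (C / sqrt(LOSS_RATE))
print(f"one flow: ~{per_flow_bps / 1e6:.0f} Mbps")      # ~28 Mbps

# Filling a 10 Gbps pipe then takes hundreds of parallel flows,
# each with its own TCP connection state -- which is exactly what
# a GeneTorrent-style transfer ends up doing:
print(f"flows to fill 10 Gbps: ~{10e9 / per_flow_bps:.0f}")   # ~350
```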
Is the Internet ready to support personalized medicine?
Is the future of genomic data really different?
If not, what technology would limit progress?
First:
Genomic data are really BIG DATA.
Personalized medicine will make genomic data volumes explode, and many other applications of genomic data will develop
Even if only one site or a few mirrors hold a personal genome, it still has to be uploaded.
Communication, storage, and computing technologies are not the problem:
– Production optical transport @ 1 Tbps (http://www.lightreading.com/document.asp?doc_id=188442&)
– Individual hosts able to transmit at 100 Gbps
– I/O throughput can keep up with network speeds (i.e., disks will be able to handle 100 Gbps = 12.5 GB/s).
– Memory and processing costs will continue to decline.
The speed of light will not increase, but the number of genomic data repositories and the distances between them will.
The Internet protocol stack was not designed for BIG DATA transfer over paths with large bandwidth-delay products:
– TCP throughput collapses on such paths (see the sketch after this list)
– DDoS vulnerabilities (e.g., SYN flooding)
– Caching vs. privacy (e.g., HTTPS)
– Static directory services (e.g., DNS vs. content directories).
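The bandwidth-delay product makes the TCP-throughput point concrete: on a long fat path, a single flow must keep an enormous window in flight, and standard AIMD needs hours to regrow it after one loss. A sketch with assumed path parameters (100 Gbps, 50 ms RTT):

```python
# Bandwidth-delay product and loss recovery for one TCP flow on an
# assumed long fat path (100 Gbps, 50 ms RTT, 1460-byte MSS).
LINK_BPS = 100e9
RTT_S = 0.05
MSS_BYTES = 1460

bdp_bytes = LINK_BPS * RTT_S / 8
window_segments = bdp_bytes / MSS_BYTES
print(f"in-flight data: ~{bdp_bytes / 1e6:.0f} MB")         # ~625 MB
print(f"window:         ~{window_segments:.0f} segments")   # ~428,000

# After one loss, AIMD halves the window and grows it back by one
# segment per RTT, so recovery takes ~window/2 RTTs:
recovery_s = (window_segments / 2) * RTT_S
print(f"recovery time:  ~{recovery_s / 3600:.1f} hours")    # ~3 hours
```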
TCP and its variations (e.g., BitTorrent) cannot be the baseline for supporting big-data genomics.
Storage must be used to reduce bandwidth-delay products
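A rough sense of why storage helps: under the same Mathis-style bound sketched above, answering a flow from a nearby replica instead of a distant origin shrinks the RTT and raises the per-flow ceiling proportionally. The 100 ms and 5 ms RTTs below are assumptions for illustration:

```python
from math import sqrt

# Same Mathis-style bound as above; only the RTT changes when a
# nearby cache answers instead of the distant origin.
MSS_BITS = 1460 * 8
LOSS_RATE = 1e-4      # assumed
C = 1.22

def mathis_bps(rtt_s):
    return (MSS_BITS / rtt_s) * (C / sqrt(LOSS_RATE))

print(f"origin (100 ms RTT): ~{mathis_bps(0.100) / 1e6:.0f} Mbps")  # ~14
print(f"cache    (5 ms RTT): ~{mathis_bps(0.005) / 1e6:.0f} Mbps")  # ~285
```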
Simulation results:
– 4-day simulation
– 20 locations
– 40 Gbps links with 5 to 25 ms latency
– average node degree of 5
[Figure: simulated transfer performance, TCP (client/server) vs. a content-centric approach.]
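The setup above can be approximated with a toy model (my sketch, not the authors' simulator; the cache placement and the networkx dependency are assumptions): clients that can reach a cached copy see a much smaller effective delay than clients that must go to the origin, which is what separates the two curves.

```python
import random
import networkx as nx   # assumed available; any shortest-path code would do

# Toy model of the setup above (NOT the authors' simulator):
# 20 locations, every node with degree 5, per-link latency 5-25 ms.
seed = 1
while True:
    g = nx.random_regular_graph(5, 20, seed=seed)
    if nx.is_connected(g):
        break
    seed += 1

rng = random.Random(42)
for u, v in g.edges:
    g[u][v]["ms"] = rng.uniform(5, 25)

origin = 0
caches = {0, 7, 13}    # assumed: a few nodes hold a cached copy

for client in (5, 11, 19):
    to_origin = nx.dijkstra_path_length(g, client, origin, weight="ms")
    to_copy = min(nx.dijkstra_path_length(g, client, c, weight="ms")
                  for c in caches)
    print(f"node {client}: origin {to_origin:.0f} ms, nearest copy {to_copy:.0f} ms")
```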
TCP/IP architecture must change for BIG DATA, but how?
Content-centric networking (CCN) architectures such as NDN and CCNx have been proposed.
The main advantage of CCN solutions is in-network caching.
But NDN and CCNx are still at early stages of development.
Big data networking is all about the bandwidth-delay product, not about replacing IP addresses with names.
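For flavor, a minimal sketch of the CCN idea (retrieval by name, with opportunistic caching at every hop). This is a conceptual toy, not the NDN/CCNx wire protocol:

```python
# Toy content-centric retrieval: a consumer asks for a *name*; any
# node holding the Data answers, and every node on the return path
# caches it. Conceptual sketch only, not the NDN/CCNx protocol.

class Node:
    def __init__(self, name, upstream=None):
        self.name = name
        self.upstream = upstream     # next hop toward the producer
        self.content_store = {}      # name -> data (the in-network cache)

    def interest(self, content_name):
        if content_name in self.content_store:        # cache hit
            print(f"{self.name}: hit for {content_name}")
            return self.content_store[content_name]
        print(f"{self.name}: miss, forwarding {content_name}")
        data = self.upstream.interest(content_name)   # forward upstream
        self.content_store[content_name] = data       # cache on return path
        return data

# A producer holds a (hypothetical) genome chunk; a router sits in front.
producer = Node("producer")
producer.content_store["/tcga/genome42/chunk0"] = b"...chunk bytes..."
router = Node("router", upstream=producer)
router.interest("/tcga/genome42/chunk0")   # fetched from producer, cached
router.interest("/tcga/genome42/chunk0")   # now served from the router
```

The second request never leaves the router: that in-network hit is the caching advantage referred to above, and it is what HTTPS-over-TCP rules out.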