Data storage

Storage Solutions for
Bioinformatics
Li Yan
Director of FlexLab, Bioinformatics core technology laboratory
liyan3@genomics.cn
http://www.genomics.cn/FlexLab/index.html
Science and Technology Division, BGI-Shenzhen
OUTLINE
• Background
• Hardware Infrastructure of Data Storage
• Data Management
• Data Storage Architecture In BGI
• Distributed Computing on Storage Server
Background:
Fast Growing Big Data
• From small genomes to large complex genomes





E. coli Genome: 4.9M
Caenorhabditis elegans Genome: 100M
Human Genome: 3G
Wheat Genome: 16G
Salamander: 45G
• From one sample to populations



Human Genome: 3 billion DNA subunits (A,T,C,G)
80~100X Sequencing: 600GB Raw data for individual study
1000 Genome Project: 600TB Raw data for population study
• From the first generation sequencing to the second generation sequencing
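The raw-data figures quoted on this slide can be reproduced with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes roughly 2 bytes of uncompressed raw data per sequenced base (one base call plus one quality character) and 1000 samples sequenced at the same depth. Neither assumption is stated in the slides, and the function names are illustrative.

```python
# Back-of-the-envelope estimate of raw sequencing data volume.
GENOME_SIZE_BASES = 3_000_000_000   # human genome size, from the slide
BYTES_PER_BASE = 2                  # ASSUMPTION: base call + quality character


def raw_data_bytes(genome_size: int, coverage: int,
                   bytes_per_base: int = BYTES_PER_BASE) -> int:
    """Approximate uncompressed raw data size for one sample."""
    return genome_size * coverage * bytes_per_base


def human_readable(n_bytes: int) -> str:
    """Format a byte count with decimal (SI) units."""
    value = float(n_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if value < 1000 or unit == "PB":
            return f"{value:.0f} {unit}"
        value /= 1000


one_sample = raw_data_bytes(GENOME_SIZE_BASES, coverage=100)
population = one_sample * 1000      # ASSUMPTION: 1000 samples, same depth

print(human_readable(one_sample))   # 600 GB
print(human_readable(population))   # 600 TB
```

Under these assumptions, 100X coverage of one human genome gives 600 GB and a 1000-sample study gives 600 TB, which matches the numbers on the slide.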
Long-Term Data Storage Needs
• Properly secure the data
  • Plan for data redundancy, which generally means we mirror data with two or more copies
• Available (24x7x365) for all kinds of uses
  • Readily accessible and in the right format
• Fast data transfer for collaborations
  • Fast network server (Aspera) instead of mailing a hard drive
• Scalable, easy to scale up
  • Choosing reliable file systems
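A minimal sketch of the "two or more copies" idea: copy a file into several replica directories and verify each copy against a checksum of the original. The helper names (`sha256_of`, `mirror`), the directory names, and the choice of SHA-256 are assumptions made for illustration; the slides do not say how mirroring is implemented.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(source: Path, replica_dirs: list[Path]) -> list[Path]:
    """Copy `source` into every replica directory and verify each copy."""
    expected = sha256_of(source)
    copies = []
    for directory in replica_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies


# Demo in a temporary directory (hypothetical file and replica names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "sample.fq"
    source.write_text("@read1\nACGT\n+\nIIII\n")
    copies = mirror(source, [root / "copy_a", root / "copy_b"])
    checksums = {sha256_of(path) for path in [source, *copies]}
    print(len(copies), len(checksums))
```

In practice the replica directories would sit on separate storage systems, so that one hardware failure cannot destroy every copy.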
Hardware infrastructure
of data storage
Type of Storage infrastructure
• Disk library
  • A high-capacity storage system that holds a quantity of CD-ROM, DVD, or magneto-optical (MO) disks in a storage rack and feeds them to one or more drives for reading and writing.
• Magnetic tape
  • A high-capacity data storage system for storing, retrieving, reading, and writing multiple magnetic tape cartridges.
• Redundant array of independent disks (RAID)
  • A storage technology that combines multiple disk drive components into a logical unit.
• Direct-attached storage (DAS)
  • A digital storage system directly attached to a server or workstation, without a storage network in between.
• Network-attached storage (NAS)
  • File-level computer data storage connected to a computer network, providing data access to heterogeneous clients.
• Storage area network (SAN)
  • A dedicated network that provides access to consolidated, block-level data storage.
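To make "combines multiple disk drive components into a logical unit" more concrete, here is a small sketch of XOR parity, the mechanism used by parity-based RAID levels such as RAID 5. The slides do not say which RAID level is meant, so this is one illustrative option, and the disk contents are made up.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "disks" holding one stripe each (hypothetical contents).
data_disks = [b"ACGT", b"TTGA", b"CCAG"]
parity_disk = xor_blocks(data_disks)

# Simulate losing disk 1 and rebuilding it from the survivors plus parity.
survivors = [data_disks[0], data_disks[2], parity_disk]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])
```

Because XOR is its own inverse, any single missing block can be recomputed from the remaining blocks and the parity block.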
Type of Storage
• Disk library
  • Pros: Fast; High storage capacity; High data availability
  • Cons: Not as easily accessible as DAS; Intended for write once, read rarely info
  • General use: Disk-to-disk backup; Archiving; Near line storage
• Magnetic tape
  • Pros: Low cost per megabyte; Portable; Unlimited capacity (with multiple tapes)
  • Cons: Inconvenient for fast recovery of individual or group files
  • General use: Archiving; Limited-budget businesses; Offsite storage
• Redundant array of independent disks (RAID)
  • Pros: Fast; High storage capacity; High data availability; Reliable; Security; Fault tolerance
  • Cons: Possible false sense of security; Some recovery difficulty on some systems; High cost for optimum systems
  • General use: Swap files; Internet service providers; Redundant storage
Type of Storage (continued)
• Direct-attached storage (DAS)
  • Pros: Simple; Low starting cost; Easy to use
  • Cons: Needs separate storage for each server; Not easy to transfer data in network; Server takes application processing load
  • General use: Data and application sharing; Data backup; Archiving
• Network-attached storage (NAS)
  • Pros: Fast file access for multiple clients; Ease of data sharing; High storage capacity; Redundancy; Ease of drive mirroring; Consolidated resources
  • Cons: Less convenient than SAN for moving large blocks of data
  • General use: Backup; Archiving; Redundant storage
• Storage area network (SAN)
  • Pros: Excellent for moving large blocks of data; Exceptional reliability; Easily available; Fault tolerance; Scalability
  • Cons: Expensive; Lack of standardization; Management complexity
  • General use: Large databases; Bandwidth-intensive applications; Mission-critical applications
Software Level of Data storage
Data flow of NGS
[Figure: a sequencer produces raw data, which goes into the data store; a complex workflow (alignment, assembly, association) turns it into meaningful biology data]
Meaningful Biology Data
• Annotation of features
• Variations/Mutations
• Protein Structural
• Gene Expressions
• Function Networks
Data Management
• Classify the data into different levels
  • First level of storage: dynamic, fast, temporary
  • Second level of storage: slower than the first level, but durable and safe
  • Third level of storage: high-capacity medium for backups and archives
• Choosing file systems
  • Current popular distributed file systems include Lustre, HDFS, MogileFS, FreeNAS, FastDFS, OpenAFS, MooseFS, pNFS, and GoogleFS.
Classify the data into different levels
• First level of storage: dynamic, fast, temporary
  • Intermediate results of data analysis
  • Reference data
  • …
• Second level of storage: slower than the first level, but durable and safe
  • Sequencing raw data
  • Meaningful data
• Third level of storage: high-capacity medium for backups and archives
  • Backups and archives of raw data and meaningful data
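The classification above can be written down as a small lookup. This is only a sketch: the tier assignments come from the slide, while the category keys (`intermediate_result`, `raw_data`, and so on) and the `placement` helper are hypothetical names introduced for illustration.

```python
from enum import Enum


class Tier(Enum):
    FIRST = "first level: dynamic, fast, temporary"
    SECOND = "second level: slower, durable and safe"
    THIRD = "third level: high-capacity backups and archives"


# Tier assignments follow the slide; the category keys are hypothetical labels.
TIER_BY_CATEGORY = {
    "intermediate_result": Tier.FIRST,
    "reference_data": Tier.FIRST,
    "raw_data": Tier.SECOND,
    "meaningful_data": Tier.SECOND,
}

# Raw data and meaningful data are also backed up and archived on the third level.
ARCHIVABLE = {"raw_data", "meaningful_data"}


def placement(category: str) -> list[Tier]:
    """Return every storage tier on which a data category should live."""
    tiers = [TIER_BY_CATEGORY[category]]
    if category in ARCHIVABLE:
        tiers.append(Tier.THIRD)
    return tiers


print(placement("raw_data"))
print(placement("intermediate_result"))
```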
Distributed File systems
• Lustre
  Lustre is a large-scale, reliable, highly available cluster file system developed and maintained by Sun. It can support more than 10,000 nodes and petabyte-scale storage systems.
• Hadoop (HDFS)
  Hadoop is not just a distributed file system (HDFS) for storage; it is a framework for running large-scale distributed applications on clusters of general-purpose computers.
• OneFS
  OneFS lets a single cluster scale to more than 1.6 petabytes of capacity and up to 10 GB/s (gigabytes per second) of throughput.
Distributed File systems
• MogileFS (www.danga.com)
• FreeNAS (www.openqrm.org)
• FastDFS (code.google.com/p/fastdfs)
• OpenAFS (www.openafs.org)
• MooseFS (derf.homelinux.org)
• pNFS (www.pnfs.com)
• GoogleFS
Data compression & Data security
• Data compression
  • Commonly used: Lempel-Ziv, BWT
  • Used exclusively for DNA sequences: Biocompress, GenCompress, CTW-LZ, GeNML, fqzcomp, sam_comp
• Data security
  • RAID system failure / redundancy
  • File system
  • Network
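As a rough illustration of the difference between general-purpose and sequence-specific compression, the sketch below compresses a DNA string with `zlib` (a Lempel-Ziv based format) and `bz2` (a BWT-based format) from the Python standard library, and compares them with a naive 2-bit-per-base packing. The 2-bit packer is a toy written for this example, not the method of any tool named on the slide, and it assumes the sequence contains only A, C, G, and T.

```python
import bz2
import zlib

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(sequence: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))   # left-align a short final chunk
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Inverse of pack_2bit; `length` is the original number of bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


sequence = "ACGTTGCA" * 1000 + "ACG"   # made-up, highly repetitive test input
raw = sequence.encode("ascii")

sizes = {
    "raw": len(raw),
    "zlib": len(zlib.compress(raw)),
    "bz2": len(bz2.compress(raw)),
    "2bit": len(pack_2bit(sequence)),
}
print(sizes)
```

The 2-bit packing always achieves exactly a 4:1 ratio on pure A/C/G/T input, whereas the general-purpose compressors depend on how repetitive the data is. Real reads also carry quality scores and ambiguous bases, which is where the specialized tools differ most.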
Data Storage Architecture
In BGI
[Figure: storage architecture diagram, shown in successive builds. Labels: Sequencers, Compute Nodes, Write, Read, Two Copies, Tape Library. The builds highlight First Level Storage, Second Level Storage, and Third Level Storage in turn.]
Distributed Computing
on Storage Server
Traditional Genome Assembly
• Costly, unscalable
• [Figure labels: Users, Storage, NGS read file, Sequence Assembly, Large memory server (>500GB)]
Distributed Genome Assembly
• Cost-effective, scalable
• Assembly on several storage servers (IBM3630 × 16 for human genome)
• Hecate [figure labels: Constructing de Bruijn Graph, Solving Tiny Repeats, Scaffolding, Merging Bubbles, Merging Contigs, Reads]
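The "Constructing de Bruijn Graph" step can be sketched on a single machine as follows. This is a textbook construction for illustration only; it is not Hecate's distributed implementation, which the slides do not describe, and the reads and `k` value are made up.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


reads = ["ACGTA", "CGTAC"]   # hypothetical overlapping reads
graph = de_bruijn_graph(reads, k=3)
print(graph)
```

In a distributed setting the k-mers would be partitioned across servers, for example by hashing, so that each server holds only part of the graph.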
Gaea 2.1
• Input: reference genome
• Preprocessing: distributed indexing for load balancing
• Locating: flexible splitting tolerates more mismatches
• Aligning: dynamic programming for robust gap alignment
• SNP calling: standard mapping quality for SNP calling
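The slides do not say which dynamic programming algorithm Gaea uses for gapped alignment. As a stand-in, here is a minimal Smith-Waterman local alignment score with a linear gap penalty; the scoring parameters and the function name are assumptions chosen for the example.

```python
def local_alignment_score(query: str, target: str,
                          match: int = 2, mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    previous = [0] * (len(target) + 1)   # previous row of the DP matrix
    best = 0
    for q in query:
        current = [0]
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            up = previous[j] + gap        # query base aligned to a gap
            left = current[j - 1] + gap   # target base aligned to a gap
            cell = max(0, diagonal, up, left)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(local_alignment_score("ACGT", "ACGT"))   # 8: four matches
print(local_alignment_score("ACGT", "ACT"))    # 4
```

Only two rows of the matrix are kept in memory, which is enough to compute the score; recovering the alignment itself would require storing the full matrix or traceback pointers.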
Q&A