Storage Solutions for Bioinformatics

Li Yan
Director of FlexLab, Bioinformatics core technology laboratory
liyan3@genomics.cn
http://www.genomics.cn/FlexLab/index.html
Science and Technology Division, BGI-Shenzhen

OUTLINE
• Background
• Hardware Infrastructure of Data Storage
• Data Management
• Data Storage Architecture In BGI
• Distributed Computing on Storage Server

Background: Fast Growing Big Data
• From small genomes to large complex genomes
  E. coli Genome: 4.9 Mbp
  Caenorhabditis elegans Genome: 100 Mbp
  Human Genome: 3 Gbp
  Wheat Genome: 16 Gbp
  Salamander: 45 Gbp
• From one sample to populations
  Human Genome: 3 billion DNA subunits (A, T, C, G)
  80~100X Sequencing: 600 GB of raw data for an individual study
  1000 Genomes Project: 600 TB of raw data for a population study
• From first-generation sequencing to second-generation sequencing

Long-Term Data Storage Needs
• Properly secure the data
  Plan for data redundancy, which generally means mirroring data in two or more copies
• Available (24x7x365) for all kinds of uses
  Readily accessible and in the right format
• Fast data transfer for collaborations
  Fast network server (Aspera) instead of mailing a hard drive
• Scalable, easy to scale up
  Choosing reliable file systems

Hardware infrastructure of data storage

Type of Storage infrastructure
• Disk library
  • A high-capacity storage system that holds a quantity of CD-ROM, DVD or magneto-optic (MO) disks in a storage rack and feeds them to one or more drives for reading and writing.
• Magnetic tape
  • A high-capacity data storage system for storing, retrieving, reading and writing multiple magnetic tape cartridges.
• Redundant array of independent disks (RAID)
  • RAID is a storage technology that combines multiple disk drive components into a logical unit.
• Direct-attached storage (DAS)
  • A digital storage system directly attached to a server or workstation, without a storage network in between.
• Network-attached storage (NAS)
  • Network-attached storage (NAS) is file-level computer data storage connected to a computer network, providing data access to heterogeneous clients.
• Storage area network (SAN)
  • A storage area network (SAN) is a dedicated network that provides access to consolidated, block-level data storage.

Type of Storage: Pros / Cons / General use
• Disk library
  Pros: Fast; High storage capacity; High data availability
  Cons: Not as easily accessible as DAS; Intended for write-once, read-rarely information
  General use: Disk-to-disk backup; Archiving; Near-line storage
• Magnetic tape
  Pros: Low cost per megabyte; Portable; Unlimited capacity (with multiple tapes)
  Cons: Inconvenient for fast recovery of individual or group files
  General use: Archiving; Limited-budget businesses; Offsite storage
• Redundant array of independent disks (RAID)
  Pros: Fast; High storage capacity; High data availability; Reliable; Security; Fault tolerance
  Cons: Possible false sense of security; Some recovery difficulty on some systems; High cost for optimum systems
  General use: Swap files; Internet service providers; Redundant storage
• Direct-attached storage (DAS)
  Pros: Simple; Low starting cost; Easy to use
  Cons: Needs separate storage for each server; Not easy to transfer data in network; Server takes application processing load
  General use: Data and application sharing; Data backup; Archiving
• Network-attached storage (NAS)
  Pros: Fast file access for multiple clients; Ease of data sharing; High storage capacity; Redundancy; Ease of drive mirroring; Consolidated resources
  Cons: Less convenient than SAN for moving large blocks of data
  General use: Backup; Archiving; Redundant storage
• Storage area network (SAN)
  Pros: Excellent for moving large blocks of data; Exceptional reliability; Easily available; Fault tolerance; Scalability
  Cons: Expensive; Lack of standardization; Management complexity
  General use: Large databases; Bandwidth-intensive applications; Mission-critical applications

Software Level of Data storage

Data flow of NGS
[Diagram labels: Sequencer, Raw Data, Data Store, Alignment, Assembly, Association, Complex workflow, Meaningful Biology Data]
• Annotation of features
• Variations/Mutations
• Protein Structural
• Gene Expressions
• Function Networks

Data Management
• Classify the data into different levels
  First Level of Storage: Dynamic, fast, temporary
  Second Level of Storage: Slower than the first level, but durable and safe
  Third Level of Storage: High-capacity medium for backups and archives
• Choosing file systems
  Current popular distributed file systems include: Lustre, HDFS, MogileFS, FreeNAS, FastDFS, OpenAFS, MooseFS, pNFS, and GoogleFS.

Classify the data into different levels
• First Level of Storage: Dynamic, fast, temporary
  • Intermediate results of data analysis
  • Reference data
  • …
• Second Level of Storage: Slower than the first level, but durable and safe
  • Sequencing raw data
  • Meaningful data
• Third Level of Storage: High-capacity medium for backups and archives
  • Backups and archives of raw data and meaningful data

Distributed File systems
• Lustre
  Lustre is a large-scale, safe, reliable and highly available cluster file system, which was developed and maintained by Sun Microsystems. Lustre can support more than 10,000 nodes and storage systems of petabyte (PB) scale.
• Hadoop (HDFS)
  Hadoop is not just a distributed file system (HDFS) for storage; it is a framework designed for running large-scale distributed applications on clusters of general-purpose computers.
• OneFS
  OneFS scales a single cluster to more than 1.6 petabytes of capacity and up to 10 GB/s (gigabytes per second) of throughput.
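The three storage levels described under Data Management can be read as a simple placement rule. The sketch below is only an illustration of that rule, not a description of any BGI tooling: the category names (`intermediate_result`, `reference_data`, `raw_sequencing_data`, `meaningful_data`) and the `placement` function are hypothetical names chosen for this example.

```python
from enum import Enum


class StorageLevel(Enum):
    FIRST = "first level: dynamic, fast, temporary"
    SECOND = "second level: slower, durable and safe"
    THIRD = "third level: high-capacity backups and archives"


# Hypothetical category names; the mapping follows the slide's examples.
PRIMARY_LEVEL = {
    "intermediate_result": StorageLevel.FIRST,
    "reference_data": StorageLevel.FIRST,
    "raw_sequencing_data": StorageLevel.SECOND,
    "meaningful_data": StorageLevel.SECOND,
}

# Per the slide, raw data and meaningful data are also backed up and archived.
ARCHIVED = {"raw_sequencing_data", "meaningful_data"}


def placement(category):
    """Return the storage levels on which a data category should be kept."""
    levels = [PRIMARY_LEVEL[category]]
    if category in ARCHIVED:
        levels.append(StorageLevel.THIRD)
    return levels
```

For example, `placement("reference_data")` yields only the first level, while `placement("raw_sequencing_data")` yields the second level plus the third-level archive copy.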

Distributed File systems
[Diagram labels: Distributed file systems, Storage Server]
• MogileFS (www.danga.com)
• FreeNAS (www.openqrm.org)
• FastDFS (code.google.com / p / fastdfs)
• OpenAFS (www.openafs.org)
• MooseFS (derf.homelinux.org)
• pNFS (www.pnfs.com)
• GoogleFS

Data compression & Data security
• Data compression
  Commonly used: Lempel-Ziv, BWT
  Used exclusively for DNA sequences: Biocompress, GeneCompress, CTW-LZ, GeNML, fqzcomp, sam_comp
• Data security
  RAID system failure / Redundancy
  File system
  Network

Data Storage Architecture In BGI
[Diagram labels: Sequencers, Compute Nodes, Two Copies, Tape Library, with Write and Read arrows. Successive builds of the slide highlight First Level Storage, Second Level Storage and Third Level Storage.]

Distributed Computing on Storage Server

Traditional Genome Assembly
• Costly, unscalable
[Diagram labels: NGS read file, Sequence Assembly, Large memory server >500GB, Storage, Users]

Distributed Genome Assembly
• Cost-effective, scalable
• Several storage servers (IBM3630*16 for human genome)
[Diagram labels: Assembly, ……]

Hecate
[Diagram labels: Reads]
• Constructing de Bruijn Graph
• Solving Tiny Repeats
• Scaffolding
• Merging Bubbles
• Merging Contigs

Gaea 2.1
[Diagram labels: Reference genome]
• Preprocessing
• Distributed Indexing for load balancing
• Flexible splitting tolerates more mismatches
• Locating
• Aligning
• Dynamic Programming for robust gap alignment
• SNP calling
• Standard mapping quality for SNP calling

Q&A
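
Appendix note on the "Constructing de Bruijn Graph" step listed under Hecate: the slides name the step but do not show how it works, so the following is a minimal single-process sketch of the general technique, not Hecate's distributed implementation. The function name `build_de_bruijn` and the choice of a dictionary of sets are assumptions made for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer found in a read adds an edge from
    its (k-1)-length prefix to its (k-1)-length suffix.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k across the read; reads shorter
        # than k contribute no k-mers.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)
```

With the toy reads `["ACGT", "CGTA"]` and `k=3`, the overlapping k-mer `CGT` is stored once, which is how the graph collapses redundant coverage. A distributed assembler would additionally have to partition the k-mers across storage servers, which this sketch does not attempt.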