Hardware: parallel I/O


I/O performance, as well as CPU performance, is one of the limiting factors in overall system performance. But while over the past decade CPU performance has risen by more than 55% per year, disk performance has improved by only 4% to 6% per year, creating an increasingly severe gap between CPU and disk I/O. Moreover, in parallel and distributed systems many CPUs are employed simultaneously, which further widens the speed mismatch. That is why I/O parallelism has become an attractive approach to bridging this performance gap.

There are three main kinds of data-intensive applications which require high-performance I/O: scientific simulations, multimedia applications and database management systems. Solutions to the I/O bottleneck are possible at all levels of system design: applications, algorithms, language and compiler support, run-time libraries, operating systems and architectures. But the main challenges lie in communication strategies, file system policies and data storage layout, and these are discussed in this paper.

Parallel and distributed systems are collections of nodes connected via a network. Let’s first have a closer look at the bus and storage technologies.

Storage technologies.

Over the past three decades storage technology has achieved a huge decrease in price per megabyte. For example, during the 1990s the capacity of PC hard-disk drives increased more than 100 times, while the prices of subsystems fell dramatically. Maintaining such growth poses new challenges, especially in the field of very high storage densities. The following storage solutions are available nowadays: tapes, probe storage and disks.

Tape systems have unique characteristics that make them attractive to developers:

- interchangeability of the employed storage medium;

- variability in tape speed;

- simultaneous write/read functionality to and from a number of parallel tracks.

But these features create new challenges for developers, among which are:

- signal attenuation caused by electronics and media noise;

- symbol timing recovery;

- signal dropout effects.

An alternative solution is probe storage, which is characterized by:

- ultrahigh storage densities of up to 1 TB/inch²;

- a small form factor;

- high data rates achieved by the parallel operation of large 2D arrays with thousands of micro/nano-mechanical cantilevers/tips that can be batch-fabricated by silicon surface-micromachining techniques.

However, matching the data rates of 1 GB/sec achieved by magnetic recording poses a significant challenge, because the mechanical resonant frequencies of AFM (atomic force microscope) cantilevers limit the data rate of a single cantilever to a few MB/sec. The solution here is to use MEMS (micro-electro-mechanical systems) based arrays of cantilevers operating in parallel, with each cantilever performing write/read/erase operations in an individual storage field (e.g. IBM's Millipede).

Disk arrays are actively used today as a means of improving the aggregate I/O performance of storage systems. Modern disk-array-based data storage systems employ two main architectural techniques:

1) data striping across disks, to improve performance by balancing the I/O load across the disks comprising an array, and

2) data redundancy, to improve reliability by enabling recovery of user data should the disk containing it fail.

These two ideas lie behind RAID systems, where RAID stands for Redundant Arrays of Inexpensive Disks. The basic idea of RAID is to combine multiple small, inexpensive disk drives into an array of disk drives which exceeds the performance of a single large expensive drive and appears to the computer as a single logical storage unit.

Several levels of RAID architecture are defined:

RAID 0 (striping):

This level doesn’t truly fit the acronym, because it is not redundant. In level 0, data is split across drives, resulting in higher data throughput. Performance is very good, but a disk failure anywhere in the array results in data loss. RAID level 0 is used in applications which require very high speed but do not need redundancy, e.g. Photoshop temporary files.
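To make the striping idea concrete, here is a minimal Python sketch (the 4-drive array and 64 KB stripe unit are assumed values, not anything prescribed by RAID) that maps a logical byte offset to a drive and to an offset on that drive:

    # RAID 0 block striping: consecutive stripe units go to consecutive drives.
    STRIPE_UNIT = 64 * 1024   # bytes per stripe unit (assumed)
    NUM_DRIVES = 4            # drives in the array (assumed)

    def locate(logical_offset):
        """Map a logical byte offset to (drive index, offset on that drive)."""
        unit = logical_offset // STRIPE_UNIT        # which stripe unit
        drive = unit % NUM_DRIVES                   # round-robin across drives
        stripe = unit // NUM_DRIVES                 # stripe row on each drive
        return drive, stripe * STRIPE_UNIT + logical_offset % STRIPE_UNIT

    # locate(0) -> (0, 0); locate(65536) -> (1, 0); locate(131072) -> (2, 0).
    # Consecutive units land on different drives, so large reads proceed in parallel.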

RAID 1 (mirroring):

This level provides redundancy by duplicating all data from one drive onto another, which slightly improves performance compared to a single drive; if one of the drives fails, no data is lost. This is a good entry-level redundant system, since only two drives are required, but it comes at a high cost per megabyte. RAID level 1 is used in applications which require redundancy along with fast random writes and where only two drives are available, e.g. small file servers.

RAID 1+0 (also known as RAID 10) is a combination of RAID levels 0 and 1.

RAID 2:

It uses Hamming error-correcting codes and is intended for use with drives which do not have built-in error detection, so this level is of little use with SCSI (Small Computer System Interface) drives.

RAID 3:

It stripes data at the byte level across several drives, with parity stored on one drive. Byte-level striping requires hardware support for efficient use.

RAID 4:

It stripes data at the block level across several drives, with parity stored on one drive, which allows recovery from the failure of any single drive. Performance of reads and of large or sequential writes is very high, whereas small random writes are slow, because each of them requires the parity data to be updated. Since only one drive in the array stores redundant data, the cost per megabyte is fairly low. RAID level 4 is used in applications needing low-cost redundancy and high-speed reads, and is particularly good for archival storage, e.g. larger file servers.

RAID 5:

RAID 5 is similar to RAID 4, but distributes the parity among the drives. Small writes are faster, but reads are slower compared to level 4, while the cost per megabyte is the same. RAID level 5 is used for the same purposes as RAID level 4, but it may provide higher performance if most I/O is random and in small chunks, e.g. database servers.
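The parity mechanism behind RAID 4 and 5 can be illustrated with a small Python sketch (block contents are made up): the parity block is the bytewise XOR of the data blocks in a stripe, so any single lost block can be rebuilt by XOR-ing the survivors:

    from functools import reduce

    def xor_blocks(blocks):
        # Bytewise XOR of equally sized blocks.
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    data = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks of one stripe (illustrative)
    parity = xor_blocks(data)            # stored on the parity drive

    # Drive 1 fails: rebuild its block from the surviving blocks plus parity.
    assert xor_blocks([data[0], data[2], parity]) == data[1]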

Of all the RAID levels described above, only two are of commercial significance today: RAID 1 and RAID 5.

There are two main points to be made about RAID systems:

1. Currently, most hard-disk drive manufacturers guarantee an uncorrectable bit error rate of 10^-14 to 10^-15. RAID does not eliminate the possibility of data loss due to disk failure, but it improves the odds greatly.

2. RAID is not a substitute for proper data center management.
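A rough back-of-the-envelope calculation (with an assumed array size) shows why point 1 matters: at a bit error rate of 10^-14, simply reading an entire 2 TB array, as a rebuild after a disk failure must do, has a noticeable chance of hitting an unrecoverable error:

    UBER = 1e-14                      # uncorrectable bit error rate (vendor spec)
    bits = 2e12 * 8                   # bits read to rebuild a 2 TB array (assumed)
    p_clean = (1 - UBER) ** bits      # probability the whole read succeeds
    print(f"P(unrecoverable error) = {1 - p_clean:.1%}")   # roughly 15%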

Bus technologies.

When we want to build a network we are looking for maximum bandwidth and minimum latency. There are several bus technologies available on the market now.

Historically, the first technology was the ISA card, which is still widely used for low-performance cards.

The higher-performance and now most common option is the PCI bus interface card. PCI stands for Peripheral Component Interconnect. It was originally developed as a local-bus expansion of the PC bus and was called the PCI Local Bus. The PCI spec defines the electrical requirements for the interface. No bus terminations are specified; the bus relies on signal reflection to reach the level threshold. It operates either synchronously or asynchronously with the motherboard bus rate. Unlike earlier PC buses, the PCI bus is processor independent. PCI uses either 32 or 64 bits of parallel data, depending on the version, so with each clock tick 32 or 64 bits of data are transferred over the bus.
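The peak throughput of such a parallel bus is simply width times clock rate, as this small sketch shows (the 33.33/66.66 MHz clocks are the classic PCI figures, used here for illustration):

    def peak_bandwidth_mb_s(width_bits, clock_mhz):
        # bytes transferred per tick * million ticks per second
        return width_bits / 8 * clock_mhz

    print(peak_bandwidth_mb_s(32, 33.33))   # 32-bit/33 MHz PCI: ~133 MB/s
    print(peak_bandwidth_mb_s(64, 66.66))   # 64-bit/66 MHz PCI: ~533 MB/s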

Transferring 64 bits at a time translates into a very large parallel bus, using a minimum of 64 lines in addition to all the required control and signal lines. A newer version of the PCI bus has been released, using a differential serial bus instead of a parallel one. The new serial PCI bus is called PCI Express. The PCI Express bus offers a reduced-cost solution because it requires only a few sets of differential lines, freeing up board space and requiring a smaller connector.

The PCI Express bus is not compatible with the standard PCI bus, because its connectors, signal voltage levels and signal formats are different from PCI’s. However, the two cards have the same physical dimensions. The main physical difference between the two bus formats is in the connectors, and the electrical difference is the use of a differential serial bus instead of a single-ended parallel one. The main point here is that, although parallel PCI is not yet obsolete, PCI Express is its state-of-the-art replacement.

The latest technology is the HyperTransport bus, which provides high speed, high performance and, moreover, the lowest possible latency for chip-to-chip links.

It was designed to provide a flexible, scalable interconnection architecture and reduce the number of buses within the system; it provides a high-performance link for applications ranging from embedded systems to personal computers, servers, network equipment and supercomputers. HyperTransport technology’s aggregate bandwidth of 22.4 GB/sec represents a substantial increase in data throughput over PCI buses, whose bandwidth is 2.5 GB/sec.

A great challenge in cluster computing is how to effectively manage the huge amounts of data generated by large-scale data-intensive applications and provide an efficient I/O system to the user. The main challenges lie in appropriate network organization and in the data transfer protocol, i.e. middleware which supports parallel I/O.

Traditionally, in client-server systems data was stored on devices either inside or directly attached to the server. Next in the evolutionary scale came Network Attached Storage (NAS), which took the storage devices away from the server and connected them directly to the network. Storage Area Networks (SAN) take the principle one step further by allowing storage devices to exist on their own separate network and communicate directly with each other over fast media. Users gain access to these storage devices through server systems which are connected to both the LAN and the SAN.

There are numerous advantages of SANs (e.g. serverless backup) which make them attractive to use, but the main disadvantage is cost (a SAN uses a special kind of switch to interconnect devices). That’s why it is not always the best solution for a particular application. Currently, different projects are running to find new solutions to the problem of parallel I/O. They fall into two major fields:

1. building ad hoc network architectures (e.g. DPFS, HPSS), or

2. developing a middleware layer between computational nodes and the applications to be run (e.g. Logistical Computing and Internetworking, Multi-Storage Resource Architecture).

Let’s have a closer look at some of these projects.

DPFS: Distributed Parallel File System

DPFS combines features of distributed file systems and parallel file systems:

1. It collects distributed storage resources from the network.

2. It adopts parallel I/O techniques to achieve high performance. Besides general striping methods it also proposes novel striping methods such as multidimensional striping and array striping (a sketch below illustrates the idea).

3. The file system is designed and implemented as a general file system. It provides an API and storage-location transparency.

4. A database is what distinguishes DPFS from other parallel file systems. It is used to store the file system's metadata, which makes metadata management easy and reliable in a distributed environment.

DPFS is built on top of the local file system of each storage resource, so there is no need to change the underlying system software, and DPFS can take advantage of the local file system's I/O optimizations (caching and prefetching).
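The flavor of multidimensional striping can be suggested with a speculative sketch (DPFS's actual placement algorithm may differ; tile size and server count are assumed): a 2D dataset is cut into tiles and the tiles are distributed over storage servers so that both row-wise and column-wise accesses hit several servers:

    TILE = (256, 256)   # tile shape in elements (assumed)
    SERVERS = 4         # number of storage servers (assumed)

    def server_for(row, col):
        tile_r, tile_c = row // TILE[0], col // TILE[1]
        return (tile_r + tile_c) % SERVERS   # diagonal round-robin placement

    # A row of tiles touches servers 0, 1, 2, 3, 0, ... and so does a column,
    # so neither access direction is bottlenecked on a single server.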

HPSS: High Performance Storage System

HPSS is a network-centered file system which separates data movement and control functions. It offers a secure, global file space with characteristics normally associated with both LAN- and SAN-based architectures.

+ It provides very high bandwidth.

+ It supports parallel, striped data transfers across multiple disk/tape devices (85%-95% of the best possible device data transfer rates achievable at the block I/O level).

+ It provides the flexibility to reconfigure disk and tape devices.

- But though it is inexpensive for lower-I/O-rate devices, it is relatively expensive for high-throughput disk environments (100 MB/sec per Mover).

To address this issue a SAN is used: a single file system that manages and grants access to data in the shared storage at high bandwidth. The objective is to eliminate file servers between clients and storage with minimal or no impact on the controlling applications. Control information is typically separated from data traffic (in some architectures they are isolated on completely separate networks).

The idea of the Mover is to attach SCSI disks and tape drives to a low-cost computer running the lightweight HPSS Mover Protocol. A data Mover and the disks attached to it thus form the equivalent of an intelligent third-party device, which gives scalability.

Current research is under way on improving this idea:

1. ´I/O Redirect Movers´: devices are assigned to a single Mover as in the current system, but in the case of I/O between a SAN-attached disk device and a SAN-attached client, the Mover redirects the I/O to the client, which in turn can perform the I/O operation directly with the SAN disk (see the sketch below).

2. ´Multiple Dynamic Movers´: instead of a static mapping, dynamic Mover mapping is proposed. This approach is especially useful for failure recovery and load balancing.
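The redirect idea in point 1 amounts to a simple dispatch rule; the sketch below is illustrative only (the names and objects are invented, not the real HPSS interfaces):

    def handle_request(mover, client, device, request):
        if device.san_attached and client.san_attached:
            # Redirect: the client transfers data directly with the SAN disk.
            return client.direct_io(device, request)
        # Classic path: the Mover moves the data itself.
        return mover.do_io(device, request)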

Security aspect: let trusted users access the storage through the SAN, while other users access the storage through the LAN using built-in security mechanisms, e.g. authorization. This increases security and still allows direct I/O operations between the SAN and trusted users and applications.

LoCI: Logistical Computing and Internetworking

The Logistical Network Computing model (LNC) combines:

1. predictable networking and computation, which move and transform data;

2. storage that is accessible from the network.

It uses the idea of scheduling program execution, which leads to the highest possible performance.

LNC differs from the traditional model in that it is based on global scheduling expressed at the programming interface but implemented by local allocation throughout the network (Figure 4).

IBP (the Internet Backplane Protocol) allows logistical data movement to be expressed. It:

- serves up both writable and readable storage to anonymous clients as a wide-area network resource;

- allows remote control of storage activities;

- decouples the notion of user identification from storage.

Clients initially allocate storage through a request to an IBP server. On success the server returns three capabilities to the client: read, write and manage. Applications can therefore pass and copy capabilities among themselves without coordinating through IBP, which yields higher performance.
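The allocation pattern can be sketched in a few lines of Python (the server class and token format are assumptions for illustration, not the real IBP wire protocol):

    import secrets

    class IBPServer:
        def __init__(self):
            self.areas = {}   # capability token -> (role, storage buffer)

        def allocate(self, size):
            buf = bytearray(size)
            caps = {role: secrets.token_hex(16)
                    for role in ("read", "write", "manage")}
            for role, cap in caps.items():
                self.areas[cap] = (role, buf)
            return caps

    server = IBPServer()
    caps = server.allocate(1024)   # returns read/write/manage capabilities
    # caps["read"] can now be handed to another application, which can use the
    # storage without any further coordination through IBP.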

The main issue: resource usage must be carefully scheduled or application performance will suffer. The scheduler must predict the future performance of a set of resources. For this the Network Weather Service (NWS) is used.

The NWS periodically monitors available resource performance by passively or actively querying each resource, forecasts future performance levels by statistical analysis, and reports both up-to-date performance monitoring data and performance forecasts via a set of well-defined interfaces.

Currently implemented monitors include TCP/IP latency and bandwidth, and they work for short-term predictions only.
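As an illustration of the kind of short-term statistical forecasting NWS performs, here is a minimal sliding-window mean predictor (NWS actually runs several competing predictors; the measurements below are made up):

    from collections import deque

    class SlidingMeanForecaster:
        def __init__(self, window=10):
            self.samples = deque(maxlen=window)

        def observe(self, value):
            self.samples.append(value)    # e.g. a bandwidth probe in MB/s

        def forecast(self):
            return sum(self.samples) / len(self.samples)

    f = SlidingMeanForecaster()
    for bw in (92.0, 88.5, 95.2, 90.1):   # periodic bandwidth measurements
        f.observe(bw)
    print(f.forecast())                   # predicted next-step bandwidth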

Multi-Storage Resource Architecture

It increases the logical storage capacity of the system, provides a flexible and reliable computing environment, and creates new opportunities for further performance improvement (Figure 5).

MS I/O is a middleware layer that optimizes the data flow between user applications and storage resources:

1. Basic I/O routines provide basic and native I/O interfaces to the various storage types. These native interfaces are not optimized for parallel and distributed data access.

2. Optimization candidates provide optimization schemes, e.g. collective I/O, prefetching, subfiles, superfiles, asynchronous I/O, data location selection, data replication, data access history, etc.

3. The optimization decision maker (ODM) decides which optimization candidate(s) should be applied to a data access. To make an accurate decision the ODM uses information about:

   - the user's access pattern, which describes current and future data usage;

   - the database, which keeps the data access history. Note that in other systems users have to deal with optimization manually.

4. The database provides an easy way to manage and use the multiple resources in the system (e.g. file name and location management). It also keeps the user's data access history:

   - basic information on each application run, e.g. location of the experiment, problem size, I/O frequency, date, time, etc.;

   - the dataset names involved in each run and the user access pattern for each dataset;

   - shared data attributes for datasets with the same associations, e.g. data dimension, data size, I/O mode (read/write), data type, etc.;

   - I/O activities, e.g. I/O storage location, file name, path, offset, optimization policy, etc.;

   - timings of each experiment, e.g. total execution time, total I/O time, total compute time, local disk I/O time, remote disk I/O time, remote tape I/O time, etc.

The user is involved in decision making at a high level: he or she only describes the features of the data usage, i.e. when, how frequently and at what size the data will be used. The main issue: how does one define frequent/seldom or large/small? The user decides this based on his or her own experience; there are no absolute values.
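How such user hints might feed an ODM-style decision can be sketched as follows (all names, thresholds and rules here are invented for illustration; as noted above, the actual cut-offs are left to the user's experience):

    from dataclasses import dataclass

    @dataclass
    class AccessHint:
        size_bytes: int         # expected request size
        frequency_per_s: float  # expected access frequency
        sequential: bool        # sequential vs. random access

    def choose_optimization(hint):
        if hint.sequential and hint.size_bytes >= 1 << 20:
            return "collective I/O"   # large sequential: batch requests
        if hint.frequency_per_s > 10:
            return "prefetching"      # hot data: fetch ahead of use
        return "asynchronous I/O"     # default: overlap I/O with compute

    print(choose_optimization(AccessHint(4 << 20, 2.0, True)))   # collective I/O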
