
Swap-Space Management
• Swap space: virtual memory uses disk space as an extension of main memory
• Swap space can be carved out of the normal file system or, more commonly, placed in a separate disk partition
• Swap-space management
  – Allocate swap space when a process starts; it holds the text segment (the program) and the data segment
  – The kernel uses swap maps to track swap-space use
Data Structures for Swapping on Linux Systems
Mass-Storage Systems
UCSB CS170
Tao Yang
Mass-Storage Systems: What to Learn
• Structure of mass-storage devices and the
resulting effects on the uses of the devices
 Hard Disk Drive
 SSD
 Hybrid Disk
• Performance characteristics and management of
mass-storage devices
 Disk Scheduling in HDD
• RAID – improve performance/reliability
• Textbook: Chapter 12 and Section 14.2
Mass Storage: HDD and SSD
• Most popular: Magnetic hard disk drives
• Solid state drives (SSD)
Magnetic Tape
• Relatively permanent and holds large quantities of
data
• Random access ~1000 times slower than disk
• Mainly used for backup, storage of infrequently used data, and as a transfer medium between systems
• 20 GB-1.5 TB typical capacity
• Common technologies are 4 mm, 8 mm, 19 mm, LTO-2, and SDLT
Disk Attachment
• Drive attached to computer
via I/O bus
• USB
• SATA (replacing ATA, PATA, EIDE)
• SCSI
  – SCSI itself is a bus: up to 16 devices on one cable; a SCSI initiator requests operations and SCSI targets perform the tasks
• FC (Fibre Channel) is a high-speed serial architecture
 Can be switched fabric with 24-bit address space –
the basis of storage area networks (SANs) in which
many hosts attach to many storage units
 Can be arbitrated loop (FC-AL) of 126 devices
[Figures: SATA connectors; SCSI cabling; FC with a SAN switch]
Network-Attached Storage
• Network-attached storage (NAS) is storage made
available over a network rather than over a local
connection (such as a bus)
• NFS and CIFS are common protocols
• Implemented via remote procedure calls (RPCs)
between host and storage
• The iSCSI protocol uses an IP network to carry the SCSI protocol
Storage Area Network (SAN)
• Special/dedicated network for accessing block
level data storage
• Multiple hosts attached to multiple storage arrays
- flexible
Performance characteristics of disks
• Drives rotate at 60 to 200 times per second
• Positioning time is
 time to move disk arm to
desired cylinder (seek time)
 plus time for desired sector to rotate
under the disk head (rotational latency)
• Effective bandwidth: the average data transfer rate during a transfer, that is, the number of bytes transferred divided by the total transfer time
  – This data rate includes positioning overhead
Moving-head Disk Mechanism
Disk Performance
Disk latency = seek time + rotation time + transfer time
• Seek time: time to move the disk arm over the track (1-20 ms)
  – Fine-grained position adjustment is necessary for the head to "settle"
  – Head switch time ~ track switch time (on modern disks)
• Rotation time: time to wait for the disk to rotate under the disk head
  – Full rotation takes 4-15 ms (depending on the price of the disk)
  – On average, only need to wait half a rotation
• Transfer time: time to transfer data onto/off of the disk
  – Disk head transfer rate: 50-100 MB/s (5-10 µs/sector)
  – Host transfer rate depends on the I/O connector (USB, SATA, …)
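A minimal sketch of this latency model; the 10.5 ms seek, 7200 RPM, and 54 MB/s figures are the example-disk numbers used in the questions that follow:

```python
# Latency = seek + average rotational delay + transfer time, in ms.
def disk_latency_ms(seek_ms, rpm, transfer_mb_s, request_bytes):
    rotation_ms = (60_000 / rpm) / 2               # wait half a rotation on average
    transfer_ms = request_bytes / (transfer_mb_s * 1_000_000) * 1000
    return seek_ms + rotation_ms + transfer_ms

# One 512-byte sector, 7200 RPM, 10.5 ms seek, 54 MB/s inner-track rate:
print(disk_latency_ms(seek_ms=10.5, rpm=7200, transfer_mb_s=54, request_bytes=512))
# ~14.7 ms, dominated by seek + rotation rather than the transfer itself
```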
Toshiba Disk (2008)
[Figure: Toshiba drive datasheet and moving-head disk mechanism: sustained transfer rate 128 MB/s on outer tracks, 54 MB/s on inner tracks]
Question
• How long to complete 500 random disk reads, in FIFO order? Each read is one sector (512 bytes).
  – Seek: average 10.5 ms
  – Rotation: average 4.17 ms
    – The disk spins 120 times per second (7200 RPM / 60)
    – Average rotational cost is the time to travel half a track: (1/120 s) × 50% ≈ 4.17 ms
  – Transfer: ~0.01 ms
    – At 54 MB/s, one 512-byte sector takes 0.5 KB / 54 MB/s ≈ 0.01 ms
• Total: 500 × (10.5 + 4.17 + 0.01) / 1000 ≈ 7.3 seconds
• Effective bandwidth:
  – 500 sectors × 512 bytes / 7.3 s ≈ 0.035 MB/s
  – Copying 1 GB of data at this rate takes about 8 hours
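The same arithmetic, scripted (per-read costs taken from above):

```python
# 500 random single-sector reads, serviced FIFO.
seek_ms, rotation_ms, transfer_ms = 10.5, 4.17, 0.01
total_s = 500 * (seek_ms + rotation_ms + transfer_ms) / 1000
data_mb = 500 * 512 / 1e6                                     # 0.256 MB in total
print(f"total time:          {total_s:.1f} s")                # ~7.3 s
print(f"effective bandwidth: {data_mb / total_s:.3f} MB/s")   # ~0.035 MB/s
print(f"1 GB at this rate:   {1000 / (data_mb / total_s) / 3600:.1f} hours")
```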
Question
• How long to complete 500 sequential disk reads?
  – Seek time: 10.5 ms (to reach the first sector)
  – Rotation time: 4.17 ms (to reach the first sector)
  – Transfer time (outer track): 500 sectors × 512 bytes / 128 MB/s ≈ 2 ms
  – Total: 10.5 + 4.17 + 2 ≈ 16.7 ms
• Effective bandwidth:
  – 500 sectors × 512 bytes / 16.7 ms ≈ 15 MB/s
  – Only about 12% of the maximum transfer rate, even though 250 KB moves in a single transfer
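And the sequential case, a sketch using the outer-track rate from above:

```python
# Pay seek + rotation once, then stream 500 sectors at the outer-track rate.
seek_ms, rotation_ms = 10.5, 4.17
transfer_ms = 500 * 512 / 128e6 * 1000          # ~2 ms for 256 KB at 128 MB/s
total_ms = seek_ms + rotation_ms + transfer_ms
print(f"total time:          {total_ms:.1f} ms")              # ~16.7 ms
bw = 500 * 512 / 1e6 / (total_ms / 1000)
print(f"effective bandwidth: {bw:.1f} MB/s ({bw / 128:.0%} of max)")  # ~15 MB/s, ~12%
```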
Question
• How large a transfer is needed to achieve 80% of
the max disk transfer rate?
• Assume x full rotations of data are transferred (≈8.5 ms of transfer per rotation plus ≈1 ms per track switch), and require the transfer to be 80% of the total time; then solve for x:
  – 0.8 × (10.5 ms + (1 ms + 8.5 ms) × x) = 8.5 ms × x  =>  x ≈ 9.3 rotations ≈ 10 MB (at 2100 sectors/track, ≈1.07 MB per track)
• A simplified approximation computes the effective bandwidth directly, with overhead = seek + rotation ≈ 14.7 ms:
  – x / (14.7 ms + x / 128 MB/s) = 0.8 × 128 MB/s  =>  x ≈ 7.5 MB
  – Copying 1 GB of data then takes only 10 seconds!
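A sketch checking both solutions; the symbols follow the slide (x rotations at ~8.5 ms of transfer each, 1 ms track switch, 10.5 ms seek, or x MB against the ~14.7 ms seek-plus-rotation overhead):

```python
target = 0.8

# Rotation form: target * (10.5 + (1 + 8.5) * x) = 8.5 * x
x_rot = target * 10.5 / (8.5 - target * 9.5)
print(f"{x_rot:.1f} rotations ~= {x_rot * 2100 * 512 / 1e6:.1f} MB")  # ~9.3, ~10 MB

# Simplified form: x / (overhead + x / rate) = target * rate
rate_mb_s, overhead_s = 128.0, 14.7e-3
x_mb = target * rate_mb_s * overhead_s / (1 - target)
print(f"x = {x_mb:.1f} MB")                                           # ~7.5 MB
print(f"1 GB at 0.8 * {rate_mb_s:.0f} MB/s takes "
      f"{1000 / (target * rate_mb_s):.0f} s")                         # ~10 s
```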
Disk Scheduling: Objective
• Given a set of I/O requests, coordinate the disk accesses of multiple requests for faster performance and reduced seek time
  – Seek time increases with seek distance
  – Measured by total head movement, in cylinders, from one request to the next
FCFS (First Come, First Served)
• Total head movement: 640 cylinders for executing all requests
[Figure: FCFS schedule across cylinders 0-199]
SSTF (Shortest Seek Time First)
• Selects the request with the minimum seek time
from the current head position
• Total head movement: 236 cylinders
Question
• Consider the following sequence of requests (2, 4, 1, 8), and assume the head position is on track 9. The order in which SSTF services the requests is: (8, 4, 2, 1)
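A minimal SSTF sketch confirming that order (greedy nearest-track selection):

```python
# Repeatedly service the pending request closest to the current head position.
def sstf(head, requests):
    order, pending = [], list(requests)
    while pending:
        nxt = min(pending, key=lambda t: abs(t - head))
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
    return order

print(sstf(9, [2, 4, 1, 8]))   # [8, 4, 2, 1]
```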
SCAN Algorithm for Disk Scheduling
• SCAN: move disk arm
in one direction, until
all requests satisfied,
then reverse direction
• Also called “elevator
scheduling”
SCAN: Elevator Algorithm
• Total head movement: 208 cylinders
C-SCAN for Disk Scheduling
• C-SCAN: move the disk arm in one direction until all requests are satisfied, then start again from the farthest request
• Provides a more uniform wait time than SCAN by treating the cylinders as a circular list
• The head moves from one end of the disk to the other, servicing requests as it goes; when it reaches the other end, it immediately returns to the beginning of the disk without servicing any requests on the return trip
C-SCAN (Circular-SCAN)
Scheduling Algorithms

Algorithm         Description
FCFS              First-come, first-served
SSTF              Shortest seek time first; service the request that minimizes the next seek time
SCAN (elevator)   Move the head from end to end (has a current direction)
C-SCAN            Circular SCAN: only service requests while moving in one direction
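A sketch comparing the policies. The 640/236/208 totals quoted on the earlier slides match the textbook's example queue (98, 183, 37, 122, 14, 124, 65, 67) with the head at cylinder 53, so that queue is assumed here; the elevator variant below reverses at the last pending request in each direction (LOOK-style), which is what reproduces the 208 figure:

```python
def fcfs(head, reqs):
    return list(reqs)

def sstf(head, reqs):
    order, pending = [], list(reqs)
    while pending:
        nxt = min(pending, key=lambda t: abs(t - head))
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
    return order

def elevator(head, reqs, moving_down=True):
    # Service everything below the head (descending), then everything above.
    lower = sorted(r for r in reqs if r <= head)
    upper = sorted(r for r in reqs if r > head)
    return lower[::-1] + upper if moving_down else upper + lower[::-1]

def total_movement(head, order):
    return sum(abs(b - a) for a, b in zip([head] + order, order))

queue, head = [98, 183, 37, 122, 14, 124, 65, 67], 53
for name, policy in [("FCFS", fcfs), ("SSTF", sstf), ("SCAN/elevator", elevator)]:
    print(f"{name:14s} {total_movement(head, policy(head, queue))} cylinders")
# FCFS 640, SSTF 236, SCAN/elevator 208
```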
Selecting a Disk-Scheduling Algorithm
• SSTF is common and has natural appeal (but it may lead to starvation)
• C-LOOK, a C-SCAN variant that reverses at the last request rather than at the disk edge, is fair and efficient
• SCAN and C-SCAN perform better for
systems that place a heavy load on the disk
• Performance depends on the number and
types of requests
Solid State Disks (SSDs)
• Use NAND multi-level cell (MLC, 2 bits/cell) flash memory
 Non-volatile storage technology
 Sector (4 KB page) addressable, but stores 4-64 “pages”
per memory block
 No moving parts (no rotate/seek motors)
 Very low power and lightweight
SSD Logic Components
• Transfer time: transfer a 4 KB page
  – Limited by the controller and disk interface (SATA: 300-600 MB/s)
• Latency = queuing time + controller time + transfer time
SSD Architecture – Writes (I)
• Writing data is complex! (~200 µs – 1.7 ms)
  – Can only write to empty pages in a block
  – Erasing a block takes ~1.5 ms
  – The controller maintains a pool of empty blocks by coalescing used pages (read, erase, write), and also reserves some percentage of capacity
https://en.wikipedia.org/wiki/Solid-state_drive
SSD Architecture – Writes (II)
• Write A, B, C, D
• Write E, F, G, H and A', B', C', D'
  – Record A, B, C, D as obsolete
• The controller garbage-collects obsolete pages by copying valid pages to a new (erased) block
• Typical steady-state behavior when the SSD is almost full
  – One erase every 64 or 128 writes
SSD Architecture – Writes (III)
• Write and erase cycles require "high" voltage
  – Damages memory cells; limits SSD lifespan
  – The controller uses ECC and performs wear leveling
• The result is very workload-dependent performance
  – Latency = queuing time + controller time (find a free block) + transfer time
  – Highest bandwidth: sequential OR random writes (limited by empty pages)
• Rule of thumb: writes are 10x more expensive than reads, and erases are 10x more expensive than writes
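A toy sketch of the write path described on these slides, under assumed sizes (4 pages per block, 3 blocks): pages are written only when empty, erase works on whole blocks, and the controller garbage-collects a block by relocating its live pages. The class and sizes are illustrative, not a real FTL:

```python
PAGES_PER_BLOCK = 4

class ToySSD:
    def __init__(self, nblocks=3):
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(nblocks)]
        self.obsolete = set()    # (block, page) slots holding stale data
        self.map = {}            # logical key -> (block, page): the FTL map

    def _free_page(self):
        for b, blk in enumerate(self.blocks):
            for p, slot in enumerate(blk):
                if slot is None:
                    return b, p
        return None

    def write(self, key, data):
        if key in self.map:                   # overwrite marks the old page stale
            self.obsolete.add(self.map[key])
        loc = self._free_page()
        if loc is None:                       # no empty page: must garbage-collect
            self._garbage_collect()
            loc = self._free_page()
        b, p = loc
        self.blocks[b][p] = (key, data)
        self.map[key] = (b, p)

    def _garbage_collect(self):
        # Pick the block with the most stale pages, relocate its live pages,
        # then erase it (the expensive, ~1.5 ms step).
        victim = max(range(len(self.blocks)),
                     key=lambda b: sum((b, p) in self.obsolete
                                       for p in range(PAGES_PER_BLOCK)))
        live = [self.blocks[victim][p] for p in range(PAGES_PER_BLOCK)
                if self.blocks[victim][p] is not None
                and (victim, p) not in self.obsolete]
        self.obsolete -= {(victim, p) for p in range(PAGES_PER_BLOCK)}
        self.blocks[victim] = [None] * PAGES_PER_BLOCK       # erase
        for key, data in live:                # rewrite live pages elsewhere
            del self.map[key]
            self.write(key, data)

ssd = ToySSD()
for k in "ABCD": ssd.write(k, k)              # fill block 0
for k in "EFGH": ssd.write(k, k)              # fill block 1
for k in "ABCD": ssd.write(k, k + "'")        # A'..D': old pages become obsolete
ssd.write("I", "I")                           # no empty page left -> triggers GC
```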
Flash Drive (2011)
Storage Performance & Price

          Bandwidth (sequential R/W)         Cost/GB          Size
HDD [2]   50-100 MB/s                        $0.03-0.07/GB    2-4 TB
SSD [1,2] 200-550 MB/s (SATA);               $0.87-1.13/GB    200 GB-1 TB
          6 GB/s read, 4.4 GB/s write (PCI)
DRAM [2]  10-16 GB/s                         $4-14/GB*        64-256 GB

* Price spike after the SK Hynix fire of 9/4/13
[1] http://www.fastestssd.com/featured/ssd-rankings-the-fastest-solid-state-drives/
[2] http://www.extremetech.com/computing/164677-storage-pricewatch-hard-drive-and-ssd-prices-drop-making-for-a-good-time-to-buy
• Bandwidth: SSD up to 10x HDD; DRAM more than 10x SSD
• Price: HDD about 20x cheaper than SSD; SSD about 5x cheaper than DRAM
SSD Summary
• Pros (vs. hard disk drives):
 Low latency, high throughput (eliminate seek/rotational
delay)
 No moving parts:
– Very light weight, low power, silent, very shock insensitive
 Read at memory speeds (limited by controller and I/O
bus)
• Cons
  – Small storage (0.1-0.5x of a disk), very expensive (~20x disk)
    – Hybrid alternative: combine a small SSD with a large HDD
  – Asymmetric block-write performance: read page / erase block / write page
  – Limited drive lifetime
    – Average time to failure is ~6 years; life expectancy is 9-11 years
Questions: HDDs and SSDs
• Q1: False. The block is not the smallest addressable unit on a disk; the sector is
• Q2: True. An SSD has zero seek time
• Q3: True. For an HDD, the read and write latencies are similar
• Q4: False. For an SSD, the read and write latencies are not similar (writes can trigger erases)
Hybrid Disk Drive
• A hybrid disk adds a nonvolatile (NV) cache alongside the drive's DRAM cache and ATA interface: a small SSD used as a buffer for a larger drive
• All dirty blocks can be flushed to the actual hard drive based on:
  – Time, a threshold, or loss of power/computer shutdown
Hybrid Disk Drive Benefits
• Up to 90% power saving while the spindle is powered down
• Reads and writes are served instantly from the NV cache while the spindle is stopped
How often do disk drives fail?
• Schroeder and Gibson. "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?" USENIX FAST 2007
  – Typical drive replacement rate is 2-4% annually
  – In 2011, spinning disks advertised ~0.5% annual failure rates (MTTF = 1.7×10^6 hours)
• 1000 drives
  – 2% × 1000 means 20 failed drives per year
  – A failure every 2-3 weeks!
• 1000 machines, each with 4 drives
  – 2% × 4000 = 80 drive failures per year
  – A failure every 4-5 days!
Fault Tolerance: Measurement
• Mean time to failure (MTTF)
  – Inverse of the annual failure rate
  – In 2011, advertised failure rates of spinning disks:
    – 0.5% (MTTF = 1.7×10^6 hours)
    – 0.9% (MTTF = 10^6 hours)
    – Actual failure rates are often reported as 2-4%
• Mean time to repair (MTTR) is a basic measure of the maintainability of repairable items: the average time required to repair a failed component or device
  – Typically hours to days
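A sketch of the conversions used on these two slides (hours-per-year divided by AFR for MTTF; fleet size times AFR for the failure interval):

```python
HOURS_PER_YEAR = 24 * 365

def mttf_hours(afr):
    """MTTF as the inverse of the annual failure rate (AFR)."""
    return HOURS_PER_YEAR / afr

def days_between_failures(n_drives, afr):
    """Mean interval between failures across a fleet of drives."""
    return 365 / (n_drives * afr)

print(f"{mttf_hours(0.005):.2e} hours")             # AFR 0.5% -> ~1.75e6 hours
print(f"{mttf_hours(0.009):.2e} hours")             # AFR 0.9% -> ~1e6 hours
print(f"{days_between_failures(1000, 0.02):.1f} days")  # 1000 drives -> ~18 days
print(f"{days_between_failures(4000, 0.02):.1f} days")  # 4000 drives -> ~4.6 days
```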
High Availability System Classes

Availability               Downtime/year    Downtime/month   Downtime/week
90%       ("one nine")     36.5 days        72 hours         16.8 hours
99%       ("two nines")    3.65 days        7.20 hours       1.68 hours
99.9%     ("three nines")  8.76 hours       43.2 minutes     10.1 minutes
99.99%    ("four nines")   52.56 minutes    4.32 minutes     1.01 minutes
99.999%   ("five nines")   5.26 minutes     25.9 seconds     6.05 seconds
99.9999%  ("six nines")    31.5 seconds     2.59 seconds     0.605 seconds

• Gmail and Hosted Exchange target three nines (unscheduled)
  – 2010: Gmail 99.984%, Exchange >99.9%
• Unavailability ≈ MTTR / MTBF
  – Can cut it by reducing MTTR or increasing MTBF
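The table rows follow directly from the availability fraction (assuming a 365-day year, 30-day month, and 7-day week, which matches the figures above); a sketch:

```python
# Downtime implied by an availability target.
for a in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    frac = 1 - a / 100                       # fraction of time unavailable
    print(f"{a:>8}%: {frac * 365 * 24:9.3f} h/yr "
          f"{frac * 30 * 24 * 60:9.2f} min/mo {frac * 7 * 24 * 60:8.2f} min/wk")

def unavailability(mttr_hours, mtbf_hours):
    # Rule of thumb above: cut downtime by reducing MTTR or increasing MTBF.
    return mttr_hours / mtbf_hours
```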
RAID (Redundant Array of Inexpensive Disks)
• Multiple disk drives provide reliability via redundancy, increasing the mean time to failure
• Hardware RAID (with a RAID controller) vs. software RAID
RAID (Cont.)
• RAID
 multiple disks work cooperatively
 Improve reliability by storing redundant data
 Improve performance with disk striping (use a
group of disks as one storage unit)
• RAID is arranged into six different levels
 Mirroring (RAID 1) keeps duplicate of each disk
 Striped mirrors (RAID 1+0) or mirrored stripes
(RAID 0+1) provides high performance and high
reliability
 Block interleaved parity (RAID 4, 5, 6) uses
much less redundancy
RAID Level 0
• Level 0 is a nonredundant disk array
• Files are striped across disks with no redundant info
• High read throughput
• Best write throughput (no redundant info to write)
• Any disk failure results in data loss
RAID Level 1
• Mirrored disks
• Data is written to two places
  – On failure, just use the surviving disk; easy to rebuild
• On a read, choose the faster disk to read from
  – Write performance is the same as a single drive; read performance is up to 2x better
• Expensive (high space overhead)
RAID 1: Mirroring
• Replicate writes to
both disks
• Reads can go to
either disk
RAID 5
• Files are striped as blocks
• Blocks are distributed among the disks
• Parity blocks are added
Parity Block for Failure Recovery
• Parity block = block1 xor block2 xor block3 xor …
    block1: 10001101
    block2: 01101100
    block3: 11000110
    ----------------
    parity: 00100111
• Can reconstruct any missing block from the others
  – Assume block 3 is lost:
    block3 = block1 xor block2 xor parity
           = 10001101 xor 01101100 xor 00100111 = 11000110
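The same parity arithmetic, executable (Python ints standing in for 8-bit blocks):

```python
b1, b2, b3 = 0b10001101, 0b01101100, 0b11000110
parity = b1 ^ b2 ^ b3
print(f"parity  = {parity:08b}")       # 00100111

# "Lose" block 3 and rebuild it from the survivors plus parity:
rebuilt = b1 ^ b2 ^ parity
print(f"block 3 = {rebuilt:08b}")      # 11000110
assert rebuilt == b3
```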
RAID 5: Rotating Parity
RAID Update
• Mirroring
  – Write every mirror
• RAID 5: to write one block
  – Read the old data block
  – Read the old parity block
  – Write the new data block
  – Write the new parity block: new parity = old data xor old parity xor new data
• RAID 5: to write an entire stripe
  – Write the data blocks and the new parity (computed from the new data, no reads needed)
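A sketch of the small-write update; old_data and old_parity reuse the example above, and new_data is an arbitrary illustrative value:

```python
old_data, old_parity = 0b10001101, 0b00100111   # block1 and parity from above
new_data = 0b11111111                           # hypothetical replacement block
new_parity = old_data ^ old_parity ^ new_data   # two reads, two writes total
print(f"{new_parity:08b}")
# Equivalent to recomputing parity over the full stripe with block1 replaced,
# without having to read the other data blocks.
```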
RAID Summary
• Replicate data for availability
 RAID 0: no replication
 RAID 1: mirror data across two or more disks
– Google File System replicated its data on three disks,
spread across multiple racks
 RAID 5: split data across disks, with redundancy to
recover from a single disk failure
 RAID 6: RAID 5, with extra redundancy to recover
from two disk failures
6 RAID Levels
RAID Level 0+1
• Stripe on a set of disks
• Then a mirror of the data blocks is striped on a second set
  Data disks:
    disk 1: Stripe 0, Stripe 4, Stripe 8
    disk 2: Stripe 1, Stripe 5, Stripe 9
    disk 3: Stripe 2, Stripe 6
    disk 4: Stripe 3, Stripe 7
  Mirror copies:
    disk 5: Stripe 0, Stripe 4, Stripe 8
    disk 6: Stripe 1, Stripe 5, Stripe 9
    disk 7: Stripe 2, Stripe 6
    disk 8: Stripe 3, Stripe 7
RAID Level 1+0
• Pair disks into mirrors first
• Then stripe on the set of mirrored pairs
• Better reliability than RAID 0+1
  Mirror pair 1 (2 disks): Stripe 0, Stripe 4, Stripe 8
  Mirror pair 2 (2 disks): Stripe 1, Stripe 5, Stripe 9
  Mirror pair 3 (2 disks): Stripe 2, Stripe 6, Stripe 10
  Mirror pair 4 (2 disks): Stripe 3, Stripe 7, Stripe 11
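A minimal sketch of the address mapping the two layouts imply (which physical disks hold logical stripe i, using the four-disk / four-pair example sizes above):

```python
# RAID 0+1: stripe across one set of disks, mirror the whole striped set.
def raid01_disks(stripe, ndisks=4):
    col = stripe % ndisks
    return [("data set", col), ("mirror set", col)]

# RAID 1+0: pair disks into mirrors first, then stripe across the pairs.
def raid10_disks(stripe, npairs=4):
    pair = stripe % npairs
    return [(pair, "left disk"), (pair, "right disk")]

print(raid01_disks(5))   # stripe 5 -> disk 2 of the data set and its mirror
print(raid10_disks(5))   # stripe 5 -> both disks of mirror pair 2
```

The reliability difference shows up on a second failure: in RAID 1+0 it is fatal only if it hits the partner of an already-failed disk, while in RAID 0+1 the first failure already takes an entire striped set offline.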
Mass-Storage Systems: Summary
• Structure of mass-storage devices and the
resulting effects on the uses of the devices
 Hard Disk Drive: high seek cost
 SSD: fast seek time, but more expensive
 Hybrid Disk
• Performance characteristics and management of
mass-storage devices
 Disk Scheduling in HDD
• RAID – improve performance/reliability for higher
availability
 MTTF, MTTR
 RAID 0, RAID 1, RAID 1+0, RAID 5