
Transactions and Reliability

Andy Wang

Operating Systems

COP 4610 / CGS 5765

Motivation

 File systems have lots of metadata:

 Free blocks, directories, file headers, indirect blocks

 Metadata is heavily cached for performance

Problem

 System crashes

 OS needs to ensure that the file system does not reach an inconsistent state

 Example: move a file between directories

 Remove a file from the old directory

 Add a file to the new directory

 What happens when a crash occurs in the middle?

UNIX File System (Ad Hoc Failure Recovery)

 Metadata handling:

 Uses a synchronous write-through caching policy

 A call to update metadata does not return until the changes are propagated to disk

 Updates are ordered

 When crashes occur, run fsck to repair in-progress operations
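The write-through policy is enforced inside the kernel, but a user-level program can approximate the same guarantee with fsync. A minimal C sketch, assuming POSIX open/fsync and illustrative path names ("dir", "dir/newfile"):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        /* Create a file, then force it to disk before going on */
        int fd = open("dir/newfile", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); exit(1); }
        if (fsync(fd) < 0) perror("fsync file");
        close(fd);

        /* Also flush the directory so the new entry itself is on disk */
        int dirfd = open("dir", O_RDONLY | O_DIRECTORY);
        if (dirfd < 0) { perror("open dir"); exit(1); }
        if (fsync(dirfd) < 0) perror("fsync dir");
        close(dirfd);
        return 0;
    }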

Some Examples of Metadata Handling

 Undo effects not yet visible to users

 If a new file is created, but not yet added to the directory

 Delete the file

 Continue effects that are visible to users

 If file blocks are already allocated, but not recorded in the bitmap

 Update the bitmap

UFS User Data Handling

 Uses a write-back policy

 Modified blocks are written to disk at 30-second intervals

 Unless a user issues the sync system call

 Data updates are not ordered

 In many cases, consistent metadata is good enough

Example: Vi

 Vi saves changes by doing the following

1. Writes the new version to a temp file

 Now we have old_file and new_temp

2. Moves the old version to a different temp file

 Now we have new_temp and old_temp

3. Moves the new version into the real file

 Now we have new_file and old_temp

4. Removes the old version

 Now we have new_file

Example: Vi

 When crashes occur

 Looks for the leftover files

 Moves forward or backward depending on the integrity of files
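A C sketch of this save sequence, assuming POSIX rename/unlink and the illustrative names file, file.new, and file.old; after a crash, whichever files are left tell the recovery code how far the save got:

    #include <stdio.h>
    #include <unistd.h>

    int save(const char *contents) {
        /* 1. Write the new version to a temp file: file + file.new */
        FILE *f = fopen("file.new", "w");
        if (f == NULL) return -1;
        fputs(contents, f);
        fflush(f);
        fsync(fileno(f));        /* make sure the new version is on disk */
        fclose(f);

        /* 2. Move the old version aside: file.new + file.old */
        if (rename("file", "file.old") != 0) return -1;

        /* 3. Move the new version into the real file: file + file.old */
        if (rename("file.new", "file") != 0) return -1;

        /* 4. Remove the old version: just file */
        return unlink("file.old");
    }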

Transaction Approach

 A transaction groups operations as a unit, with the following characteristics:

 Atomic: all operations either happen or they do not (no partial operations)

 Serializable: transactions appear to happen one after the other

 Durable: once a transaction happens, it is recoverable and can survive crashes

More on Transactions

 A transaction is not done until it is committed

 Once committed, a transaction is durable

 If a transaction fails to complete, it must roll back as if it never happened at all

 Critical sections are atomic and serializable, but not durable

Transaction Implementation (One Thread)

 Example: money transfer

begin transaction
  x = x - 1;
  y = y + 1;
commit

Transaction Implementation (One Thread)

 Common implementations involve the use of a log, a journal that is never erased

 A file system uses a write-ahead log to track all transactions

Transaction Implementation (One Thread)

 Once the old and new values of accounts x and y are in the log, the log is committed to disk in a single write

 Actual changes to those accounts are done later

Transaction Illustrated

 Initially, memory and disk agree: x = 1; y = 1

 The transaction updates the cached copies: x = 0; y = 2; the disk still holds x = 1; y = 1

 The log records the transaction: begin transaction; old x: 1, new x: 0; old y: 1, new y: 2; commit

 Commit the log to disk before updating the actual values on disk

Transaction Steps

 Mark the beginning of the transaction

 Log the changes in account x

 Log the changes in account y

 Commit

 Modify account x on disk

 Modify account y on disk
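A minimal C sketch of these steps, assuming an illustrative log file wal.log and a text record format; the one-line commit record stands in for the single atomic disk write described below:

    #include <stdio.h>
    #include <unistd.h>

    int x = 1, y = 1;   /* the two accounts */

    void transfer(void) {
        /* Steps 1-4: write and commit the log before touching the data */
        FILE *log = fopen("wal.log", "a");
        if (log == NULL) return;
        fprintf(log, "begin transaction\n");
        fprintf(log, "old x: %d new x: %d\n", x, x - 1);
        fprintf(log, "old y: %d new y: %d\n", y, y + 1);
        fprintf(log, "commit\n");
        fflush(log);
        fsync(fileno(log));   /* the log reaches disk first */
        fclose(log);

        /* Steps 5-6: only now modify the actual accounts */
        x = x - 1;
        y = y + 1;
    }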

Scenarios of Crashes

 If a crash occurs after the commit

 Replay the log to update the accounts

 If a crash occurs before the commit

 Roll back and discard the transaction

 A crash during the commit cannot leave it half-done

 Commit is built as an atomic operation

 e.g., writing a single sector on disk

Two-Phase Locking (Multiple Threads)

 Logging alone is not enough to prevent multiple transactions from trashing one another (not serializable)

 Solution: two-phase locking

1. Acquire all locks

2. Perform updates and release all locks

 Thread A cannot see thread B’s changes until thread A commits and releases locks
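A sketch of two-phase locking with pthreads (account names are illustrative): all locks are acquired before any update, and released only afterwards; a fixed lock order avoids deadlock.

    #include <pthread.h>

    pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t y_lock = PTHREAD_MUTEX_INITIALIZER;
    int x = 1, y = 1;

    void transfer(void) {
        /* Phase 1: acquire all locks (always x before y) */
        pthread_mutex_lock(&x_lock);
        pthread_mutex_lock(&y_lock);

        /* Perform updates; no other thread can see partial state */
        x = x - 1;
        y = y + 1;

        /* Phase 2: release all locks */
        pthread_mutex_unlock(&y_lock);
        pthread_mutex_unlock(&x_lock);
    }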

Transactions in File Systems

 Almost all file systems built since 1985 use write-ahead logging

 Windows NT, Solaris, OSF, etc.

+ Eliminates running fsck after a crash

+ Write-ahead logging provides reliability

- All modifications need to be written twice

Log-Structured File System (LFS)

 If logging is so great, why don’t we treat everything as log entries?

 Log-structured file system

 Everything is a log entry (file headers, directories, data blocks)

 Write the log only once

 Use version stamps to distinguish between old and new entries

More on LFS

 New log entries are always appended to the end of the existing log

 All writes are sequential

 Seeks occur only during reads

 Not so bad due to temporal locality and caching

 Problem:

 Need to create contiguous free space all the time
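A toy C sketch of the version-stamp idea, with assumed in-memory types standing in for on-disk log segments; writes always append, and reads pick the entry with the newest version:

    #include <string.h>

    struct entry {            /* every update is a log entry */
        int block_num;
        int version;          /* version stamp: newer wins */
        char data[512];
    };

    struct entry log_seg[1024];
    int log_end = 0;          /* appends are always sequential */

    void lfs_write(int block_num, int version, const char *data) {
        log_seg[log_end].block_num = block_num;
        log_seg[log_end].version = version;
        memcpy(log_seg[log_end].data, data, 512);   /* caller supplies 512-byte blocks */
        log_end++;            /* never overwrite; just append */
    }

    struct entry *lfs_read(int block_num) {
        struct entry *latest = NULL;
        for (int i = 0; i < log_end; i++)   /* a real LFS uses an inode map, not a scan */
            if (log_seg[i].block_num == block_num &&
                (latest == NULL || log_seg[i].version > latest->version))
                latest = &log_seg[i];
        return latest;
    }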

RAID and Reliability

 So far, we have assumed a single disk

 What if we have multiple disks?

 The chance that some disk fails increases

 RAID: redundant array of independent disks

 Standard way of organizing disks and classifying the reliability of multi-disk systems

 General methods: data duplication, parity, and error-correcting codes (ECC)

RAID 0

 No redundancy

 Uses block-level striping across disks

 i.e., 1st block stored on disk 1, 2nd block stored on disk 2

 Failure causes data loss
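The striping arithmetic as a C sketch (blocks and disks numbered from 0; NUM_DISKS is an assumed array size):

    #define NUM_DISKS 4

    /* Logical block b lives on disk (b mod N), at offset (b div N) */
    void locate(int block, int *disk, int *offset) {
        *disk   = block % NUM_DISKS;
        *offset = block / NUM_DISKS;
    }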

Non-Redundant Disk Array Diagram (RAID Level 0)

[Diagram: a file system issuing open(foo), read(bar), write(zoo) across an array of striped, non-redundant disks]

Mirrored Disks (RAID Level 1)

 Each disk has a second disk that mirrors its contents

 Writes go to both disks

+ Reliability is doubled

+ Read access faster

- Write access slower

- Expensive and inefficient

Mirrored Disk Diagram (RAID Level 1)

[Diagram: a file system issuing open(foo), read(bar), write(zoo); each write goes to a disk and its mirror]

Memory-Style ECC (RAID Level 2)

 Some disks in array are used to hold ECC

+ More efficient than mirroring

+ Can correct, not just detect, errors

- Still fairly inefficient

 e.g., 4 data disks require 3 ECC disks

Memory-Style ECC Diagram (RAID Level 2)

[Diagram: a file system issuing open(foo), read(bar), write(zoo) to data disks plus ECC disks]

Bit-Interleaved Parity (RAID Level 3)

 Uses bit-level striping across disks

 i.e., 1st bit stored on disk 1, 2nd bit stored on disk 2

 One disk in the array stores parity for the other disks

+ More efficient than Levels 1 and 2

- Parity disk doesn’t add bandwidth

Parity Method

 Disk 1: 1001

 Disk 2: 0101

 Disk 3: 1000

 Parity: 0100 = 1001 xor 0101 xor 1000

 To recover disk 2

 Disk 2: 0101 = 1001 xor 1000 xor 0100
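The same example in C; 0x9, 0x5, and 0x8 are the slide's bit patterns 1001, 0101, and 1000:

    #include <stdio.h>

    int main(void) {
        int disk1 = 0x9, disk2 = 0x5, disk3 = 0x8;  /* 1001, 0101, 1000 */
        int parity = disk1 ^ disk2 ^ disk3;         /* 0100 */
        int rebuilt = disk1 ^ disk3 ^ parity;       /* recover disk 2 */
        printf("parity = %x, rebuilt disk 2 = %x\n", parity, rebuilt);
        return 0;                                   /* prints 4 and 5 */
    }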

Bit-Interleaved RAID Diagram (Level 3)

[Diagram: a file system issuing open(foo), read(bar), write(zoo) to bit-striped data disks plus one parity disk]

Block-Interleaved Parity (RAID Level 4)

 Like bit-interleaved, but data is interleaved in blocks

+ More efficient data access than level 3

- Parity disk can be a bottleneck

Small Writes

 To update just one block, do we need to read in the entire stripe?

 No; observe:

 old_parity = old_block1 xor old_block2 xor old_block3

 new_parity = new_block1 xor old_block2 xor old_block3

 old_parity xor new_parity = old_block1 xor new_block1

 new_parity = new_block1 xor old_block1 xor old_parity

 So two reads (old block, old parity) and two writes (new block, new parity) suffice, as sketched below
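A C sketch of this small-write rule; read_disk and write_disk are assumed device helpers, and one byte stands in for a block:

    unsigned char read_disk(int disk);              /* assumed helpers */
    void write_disk(int disk, unsigned char val);

    void small_write(int data_disk, int parity_disk, unsigned char new_block) {
        unsigned char old_block  = read_disk(data_disk);    /* I/O 1 */
        unsigned char old_parity = read_disk(parity_disk);  /* I/O 2 */
        unsigned char new_parity = new_block ^ old_block ^ old_parity;
        write_disk(data_disk, new_block);                   /* I/O 3 */
        write_disk(parity_disk, new_parity);                /* I/O 4 */
    }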

Block-Interleaved Parity Diagram (RAID Level 4)

[Diagram: a file system issuing open(foo), read(bar), write(zoo) to block-striped data disks plus one parity disk]

Block-Interleaved Distributed-Parity (RAID Level 5)

 Arguably the most general level of RAID

 Spreads the parity out over all disks

+ No parity disk bottleneck

+ All disks contribute read bandwidth

- Requires 4 I/Os for small writes

Block-Interleaved Distributed-Parity Diagram (RAID Level 5)

[Diagram: a file system issuing open(foo), read(bar), write(zoo) to disks with parity blocks spread across all disks]
