CS 7500   Dr. Cannon   Fault-tolerant Systems   Spring '05

Introduction to data coding

One common source of faults in a distributed system is data corruption. This can occur during transmission or during storage in disk or memory. Data corruption arises from two failure modes: random and systematic. A random error is just that: data is corrupted in a random and unpredictable way. This might occur due to noise on a transmission line, conflicts with other software in a mutual-exclusion violation, interrupted communications, and so on; the possibilities are endless. A systematic error occurs in a predictable way. For example, a communications system might erroneously drop the last bit of every message, toggle the Nth bit of every message, or introduce an error whenever a specific sequence of bits is encountered.

To address these problems, data is often coded. Data coding simply means replacing a data word with a code word. 'Word' in this case may mean a bit sequence of some length N rather than 4 or 8 bytes. While coding is also used for security, here we use coding for error detection and possible recovery.

There is no magic in error detection and recovery codes. In general, the larger the code word with respect to the original data word, the more information is present and the more error detection and recovery can occur. In other words, there is information replication. There is a trade-off: longer messages have a higher probability of errors.

Data codes are either separable or non-separable. A separable code appends check bits to the unchanged data word, so the code can simply be stripped off the message to recover the data. A non-separable code transforms the data word itself, so the original data must be reconstructed by decoding.

Data coding is utilized in communications, memory systems, storage media, and anywhere data corruption is possible. In general, the type of coding utilized depends upon the nature of the expected errors. No one coding scheme is practical and efficient for all types and modes of faults.

Simple replication

Data words are replicated, usually with some spatial separation between copies. For example, a data message might be duplicated: 128 bytes of data become 256 message bytes, the 128 original bytes followed by the 128 replicated bytes. This allows detection of any error that makes the two copies disagree, but no recovery. If a data message is replicated twice, 128 bytes become 384 bytes (one original 128-byte sequence followed by two replications), and any error in any one data byte can be detected and repaired as long as the other two copies are valid. To avoid systematic errors, the replications might be reversed or scrambled in some recoverable way (a code sketch of this majority-vote repair appears after the Hamming-distance discussion below).

Hamming distance coding

The Hamming distance is the number of bit positions in which any two code words differ. The larger the Hamming distance, the greater the ability to detect and possibly recover from bit errors:

    Hamming distance    Errors detected
    1                   None
    2                   Single bit
    3                   Single bit with recovery

For example, a 4-bit nibble contains 8 distance-2 code words: 0000, 0011, 1001, 1100, 1010, 0101, 0110, and 1111. Since there are 8 codes, we can replace every 3 bits of data with an associated 4-bit code (2^3 = 8). For example, 000 => 0000, 001 => 0011, 010 => 1001, etc. Since this is a distance-2 code, any single-bit error in the coded data produces an invalid code word, allowing detection.

A 3-bit word contains two distance-3 codes: 000 and 111. Since 2^1 = 2, we replace each single bit of data with one of these 3-bit codes. For example, 0 => 000 and 1 => 111. Now any single-bit error in the coded data produces an invalid code that is only 1 bit away from a valid code, allowing both detection and replacement.

Example: 010110 (original data) becomes 000 111 000 111 111 000 (coded data). Suppose this coded data becomes corrupted:

    001 101 000 110 111 010

We can detect these errors and recover by replacing each invalid code word with the closest valid code word, so 001 becomes 000, 101 becomes 111, and so on.
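The majority-vote repair described under Simple replication can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the message layout (original bytes followed by two copies) and byte-wise voting are assumptions chosen to match the 128/384-byte example above.

    def encode_triple(data: bytes) -> bytes:
        # Replicate the message twice: the original followed by two copies.
        return data + data + data

    def decode_triple(msg: bytes) -> bytes:
        # Split the received message into its three copies and take a
        # byte-wise majority vote; a corrupted byte is repaired as long as
        # the corresponding byte in the other two copies is still valid.
        n = len(msg) // 3
        a, b, c = msg[:n], msg[n:2 * n], msg[2 * n:]
        out = bytearray()
        for x, y, z in zip(a, b, c):
            if x == y or x == z:
                out.append(x)
            elif y == z:
                out.append(y)
            else:
                raise ValueError("all three copies disagree; unrecoverable")
        return bytes(out)

    # 128 data bytes become 384 message bytes; one corrupted byte is repaired.
    data = bytes(range(128))
    msg = bytearray(encode_triple(data))
    msg[5] ^= 0xFF                      # corrupt one byte of the first copy
    assert decode_triple(bytes(msg)) == data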
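The distance-3 repetition code just worked through (0 => 000, 1 => 111, with correction to the closest valid code word) can be sketched the same way; again, this fragment is illustrative and not from the notes.

    def encode_rep3(bits: str) -> str:
        # Replace each data bit with the 3-bit distance-3 code word 000 or 111.
        return " ".join(b * 3 for b in bits)

    def decode_rep3(coded: str) -> str:
        # Replace each received 3-bit group with the closest valid code word:
        # two or more 1s decode to 1, otherwise 0 (a majority vote).
        return "".join("1" if group.count("1") >= 2 else "0"
                       for group in coded.split())

    assert encode_rep3("010110") == "000 111 000 111 111 000"
    assert decode_rep3("001 101 000 110 111 010") == "010110"   # corrupted example above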
Parity codes

In practice, parity generally means the evenness or oddness of the number of 1 bits in a byte. A single bit is added to the end of each byte to represent even or odd. In an EVEN parity system, the parity bit is set when the byte contains an even number of 1 bits. If a single-bit error occurs within the data, the data byte will no longer match the attached parity bit. No recovery is possible with a single parity bit.

Notice that only a single-bit error can be detected. A corruption of the parity bit itself is indistinguishable from a data corruption, and if two bits change, the evenness or oddness of the byte does not change. There are other, less obvious limitations to parity codes. For example, the most common corruption of data is having all bits go to zero (loss of power, failure of a memory chip, loss of a communication channel). In this case, no error is detected in an ODD parity system: an all-zero byte has an even number of 1 bits, so its parity bit is also 0 and the all-zero word appears valid.

Parity codes are easy to implement in hardware and only slightly enlarge a message. If you think about it, this approach is really a distance-2 Hamming code!

Expansion of the parity concept to more parity bits gives more detection power. Parity is often calculated in more than one dimension. For example, each byte is given a parity bit and each block of N bytes is given a parity byte. Think of this as parity in rows and in columns. The parity bit at the end of each byte is set according to the evenness of that row of 8 bits; the bits of the parity byte are set according to the evenness of the corresponding columns of bits in the block. In this strategy, it may be possible in some cases to detect and recover from some bit errors. When you read about error-correction coding (ECC), this is typically what is being referred to (a code sketch of row-and-column parity follows the checksum discussion below).

Other strategies are also possible. Parity bits can be assigned so that no two adjacent bits in the data message come under the same parity bit, improving detection of adjacent multi-bit errors. Parity bits can be overlapped so that each bit is covered by more than one parity bit, supporting some error correction. Such schemes are usually labeled "Hamming error-correcting coding" and are commercially available in memory systems.

Checksums

A checksum is simply a modular sum of the message data treated as integer words. Typically, each byte is treated as an unsigned integer and the bytes are summed modulo 2^N to produce an N-bit checksum, which is appended to the end of the data message. When a message is validated, the checksum is simply recomputed and compared with the one appended to the message. The larger the value of N, the less likely it is that a corruption of the data message will produce the same checksum. No recovery is possible with checksums.
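The row-and-column parity scheme above can be sketched as follows. This is a hypothetical illustration, not from the notes; it uses the common convention that a check bit is the exclusive-OR of the bits it covers (it records whether the count of 1 bits is odd), but the detection and correction logic is the same under the convention described above, since only mismatches matter.

    def parity_bit(byte: int) -> int:
        # Check bit for one byte: the XOR of its bits, i.e. 1 if the byte
        # contains an odd number of 1 bits.
        return bin(byte).count("1") & 1

    def block_parities(block: bytes):
        # Row parities: one check bit per byte.
        # Column parity byte: bit i is the parity of bit i across all bytes.
        rows = [parity_bit(b) for b in block]
        col = 0
        for b in block:
            col ^= b
        return rows, col

    def locate_single_error(block: bytes, rows: list, col: int):
        # Recompute the parities; a single flipped bit shows up as exactly one
        # bad row parity and one bad column bit, which pinpoints the error so
        # it can be repaired by flipping that bit back.
        new_rows, new_col = block_parities(block)
        bad_rows = [i for i, (a, b) in enumerate(zip(rows, new_rows)) if a != b]
        bad_cols = [i for i in range(8) if (col ^ new_col) >> i & 1]
        if len(bad_rows) == 1 and len(bad_cols) == 1:
            return bad_rows[0], bad_cols[0]        # (byte index, bit index)
        return None                                # no error, or not correctable

    block = bytearray(b"fault tolerance!")
    rows, col = block_parities(bytes(block))
    block[3] ^= 0b00010000                         # flip one bit in byte 3
    assert locate_single_error(bytes(block), rows, col) == (3, 4)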
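A minimal checksum sketch, assuming N = 8 (a one-byte checksum) purely for illustration; the function names are made up for this example.

    def checksum8(data: bytes) -> int:
        # Treat each byte as an unsigned integer and sum modulo 2^8.
        return sum(data) % 256

    def append_checksum(data: bytes) -> bytes:
        # The checksum is appended to the end of the data message.
        return data + bytes([checksum8(data)])

    def validate(msg: bytes) -> bool:
        # Recompute the checksum and compare it with the appended one.
        data, check = msg[:-1], msg[-1]
        return checksum8(data) == check

    msg = append_checksum(b"hello, world")
    assert validate(msg)
    corrupted = bytes([msg[0] ^ 0x01]) + msg[1:]   # flip one bit of the data
    assert not validate(corrupted)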
Cyclic Redundancy Checkword (CRC)

CRC coding is a variation of the simple checksum. The bits of a data word are treated as the coefficients of a polynomial. For example, 1011 represents 1 + 0x + 1x^2 + 1x^3, that is, 1 + x^2 + x^3. Each data polynomial is multiplied by a chosen generator polynomial, for example 1101 (1 + x + x^3). During the multiplication, the corresponding coefficients are summed modulo 2 to produce the resulting polynomial, the CRC code word. The product of the two polynomials above is 1 + x + x^2 + 3x^3 + x^4 + x^5 + x^6; taking these coefficients modulo 2 produces the CRC code word 1111111.

This may seem like a lot of work, but the polynomial multiplication and summing can be done easily and quickly using shift registers. The CRC technique is capable of detecting bit errors like a simple checksum, but it is also effective in detecting burst errors, that is, errors in short sequences of adjacent bits. Unlike a simple checksum, a single-bit error is unlikely to be cancelled by a complementary single-bit error elsewhere that produces the same checkword. (A code sketch of the modulo-2 polynomial multiplication follows the Gray-coding section below.)

Gray coding

Gray coding is used to ensure that any two adjacent codes differ by only one bit. For example, consider the following sequence of codes: 0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100, etc.

Gray coding is useful when data is expected to follow a natural progression. For example, suppose an input to a system is a code representing the position of an aircraft control surface. One would expect changes to the input to progress through a natural sequence. (It is not possible for an aircraft rudder to go instantaneously from full-left to full-right without going through all intermediate steps, however quickly.) Gray coding allows easy detection of an erroneous input or a lost data value.
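The modulo-2 polynomial multiplication behind the CRC section maps directly onto integer arithmetic with exclusive-OR (a "carry-less" multiply). The sketch below reproduces the example from the notes (data 1011, generator 1101, leftmost bit being the x^0 coefficient); it is illustrative only and does not show the shift-register implementation.

    def bits_to_int(bits: str) -> int:
        # Leftmost character is the x^0 coefficient, matching the notes.
        return sum(int(b) << i for i, b in enumerate(bits))

    def int_to_bits(n: int, width: int) -> str:
        return "".join(str(n >> i & 1) for i in range(width))

    def gf2_multiply(a: int, b: int) -> int:
        # Polynomial multiplication with coefficients summed modulo 2:
        # partial products are combined with XOR instead of addition.
        result = 0
        while b:
            if b & 1:
                result ^= a
            a <<= 1
            b >>= 1
        return result

    data = bits_to_int("1011")          # 1 + x^2 + x^3
    gen = bits_to_int("1101")           # 1 + x + x^3
    code = gf2_multiply(data, gen)
    assert int_to_bits(code, 7) == "1111111"   # the CRC code word in the notes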
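A short sketch of Gray coding and its adjacency property. The conversion formula used here (n XOR n shifted right by one) is the standard binary-to-Gray conversion; it is not given in the notes and is included only for illustration.

    def binary_to_gray(n: int) -> int:
        # Standard conversion: the Gray code of n is n XOR (n >> 1).
        return n ^ (n >> 1)

    # Successive Gray codes differ in exactly one bit position, so a valid
    # input sequence never changes by more than one bit at a time; a
    # multi-bit jump flags an erroneous input or a lost data value.
    codes = [binary_to_gray(i) for i in range(16)]
    for prev, cur in zip(codes, codes[1:]):
        assert bin(prev ^ cur).count("1") == 1

    print([format(c, "04b") for c in codes[:8]])
    # ['0000', '0001', '0011', '0010', '0110', '0111', '0101', '0100']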