Introduction to data coding

CS 7500
Dr. Cannon
Fault-tolerant Systems
Spring ‘05
One common source of faults in a distributed system is data corruption. This can occur
during transmission or during storage on disk or in memory. Data corruption can occur
as a result of two failure modes – random and systematic. A random error is just that –
data is corrupted in a random and unpredictable way. This might occur due to noise on a
transmission line, conflicts with other software through a mutual-exclusion violation,
interrupted communications – the possibilities are endless.
A systematic error occurs in a predictable way. For example, a communications system
might erroneously drop the last bit of every message. Or perhaps the Nth bit of every
message is toggled. Perhaps any time a specific sequence of bits is encountered, an error
is introduced.
To address these problems, data is often coded. Data coding simply means replacing a
data word with a code word. 'Word' in this case means a bit sequence of some length N –
not necessarily 4 or 8 bytes. While coding is also used for security, here we use coding
for error detection and possible recovery.
There is no magic in error detection and recovery codes. In general, the larger the code
word with respect to the original data word, the more information is present and the more
error detection and recovery can occur. In other words, there is information replication.
There is a trade-off; longer messages have a higher probability of errors.
Data codes are either separable or non-separable. A separable code appends check bits
that can easily be stripped off a data message, leaving the original data intact. A
non-separable code transforms the data itself, so the original data message must be
reconstructed from the code word.
Data coding is utilized in communications, memory systems, storage media, and
anywhere data corruption is possible. In general, the type of coding utilized depends upon
the nature of expected errors. No one coding scheme is practical and efficient for all types
and modes of faults.
Simple replication
Data words are replicated. Usually, some spatial separation is used. For example, a data
message might be duplicated: 128 bytes of data become 256 message bytes – the 128
original bytes followed by the 128 replicated bytes. This allows detection of any error
that changes the two copies differently, but no recovery.
If a data message is replicated twice – 128 bytes become 384 bytes (one original 128-byte
sequence followed by two replications) – then any error in one copy of a byte can be
detected and repaired by majority vote, as long as the other two copies agree.
To avoid systematic errors, replications might be reversed or scrambled in some
recoverable way.
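The byte-wise majority vote described above can be sketched as follows (a minimal illustration; the function names and the choice of three contiguous copies are assumptions, not part of the original notes):

```python
def encode(data: bytes, copies: int = 3) -> bytes:
    """Replicate the data block: the original followed by (copies - 1) replicas."""
    return data * copies

def decode(message: bytes, copies: int = 3) -> bytes:
    """Recover each byte by majority vote across the spatially separated copies."""
    n = len(message) // copies
    out = bytearray()
    for i in range(n):
        votes = [message[i + k * n] for k in range(copies)]
        out.append(max(set(votes), key=votes.count))  # most common value wins
    return bytes(out)
```

With three copies, a corrupted byte in any single copy is outvoted by the other two; with only two copies, a mismatch can be detected but not repaired.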
Hamming distance coding
The Hamming distance is the number of bit positions in which any 2 code words differ.
The larger the Hamming distance, the greater the ability to detect and possibly recover
from bit errors;

Hamming distance    Errors detected
1                   None
2                   Single bit
3                   Single bit, with recovery
For example, the set of 4-bit nibbles contains 8 code words with pairwise distance 2;
0000, 0011, 1001, 1100, 1010, 0101, 0110, and 1111
Since there are 8 codes, we can replace every 3 bits of data with an associated 4-bit
code (2^3 = 8). For example 000 => 0000, 001 => 0011, 010 => 1001, etc. Now, since this
is a distance-2 code, any single-bit error in the coded data will produce an invalid code
word – allowing detection.
The set of 3-bit words contains two code words at distance 3;
000 and 111
Since 2^1 = 2, we replace every single bit in the data with one of these 3-bit codes.
For example 0 => 000 and 1 => 111. Now, any single-bit error in the coded data will
produce an invalid code word that is only 1 bit away from a valid code word – allowing
both detection and replacement.
Example: 010110 (original data) becomes 000 111 000 111 111 000 (coded data)
Suppose this coded data becomes corrupted; 001 101 000 110 111 010
We can detect these errors and recover by replacing each invalid code word with the
closest valid code word. So 001 becomes 000, 101 becomes 111, etc.
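The example above can be sketched directly (an illustrative encoder/decoder; nearest-valid-code-word decoding here reduces to a majority vote within each triple):

```python
def encode(bits: str) -> str:
    # replace each data bit with its 3-bit code word: 0 -> 000, 1 -> 111
    return " ".join("000" if b == "0" else "111" for b in bits)

def decode(coded: str) -> str:
    # map each (possibly corrupted) triple to the closest valid code word:
    # two or more 1s is nearer to 111, otherwise nearer to 000
    return "".join("1" if w.count("1") >= 2 else "0" for w in coded.split())
```

Running the decoder on the corrupted stream from the example recovers the original 010110, because every corrupted triple is still within distance 1 of its valid code word.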
Parity codes
In practice, parity generally refers to the evenness or oddness of the number of 1 bits in
a byte. A single bit is added to the end of each byte to make the total count even or odd.
In an EVEN parity system, the parity bit is chosen so that the total number of 1 bits –
data plus parity – is even. If a single-bit error occurs within the data, then the data
byte will no longer match the attached parity bit.
No recovery is possible with a single parity bit. Notice that only errors affecting an
odd number of bits can be detected: if two bits change, the evenness or oddness of the
byte won't change. A corruption of the parity bit itself is indistinguishable from a
data corruption.
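A minimal sketch of even parity, assuming the check is done in software (hardware normally does this with a tree of XOR gates; the function names are illustrative):

```python
def even_parity_bit(byte: int) -> int:
    # choose the bit that makes the total count of 1 bits (data + parity) even
    return bin(byte).count("1") % 2

def parity_check(byte: int, parity: int) -> bool:
    # True when the stored parity bit still matches the data byte
    return even_parity_bit(byte) == parity
```

Flipping any single bit of the data (or the parity bit) makes the check fail; flipping two bits leaves the parity unchanged and the error is missed.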
There are other less-obvious limitations to parity codes. For example, the most common
corruption of data is having all bits go to zero (loss of power, failure of a memory chip,
loss of a communication channel). The all-zeros word contains an even number of 1 bits,
so in an EVEN parity system this error goes undetected; an ODD parity system catches it.
Parity codes are easy to implement in hardware and only slightly enlarge a message. If
you think about it, adding a single parity bit turns the data into a distance-2 code: any
two valid code words differ in at least two bit positions.
Expansion of the parity concept to more parity bits gives more detection power. Parity is
often calculated in more than one dimension. For example, each byte is given a parity bit
and each block of N bytes is given a parity byte. Think of this as parity in rows and in
columns. The parity bit (at the end of each byte) is set according to the evenness of its
row of 8 bits. The parity byte's bits are set according to the evenness of the corresponding columns of
bits in the block. In this strategy, it may be possible in some cases to detect and recover
from some bit errors. When you read about error-correction coding (ECC), this is
typically what is being referred to.
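A rough sketch of the row-and-column idea with single-bit correction (illustrative names; this assumes even parity and at most one flipped bit in the data block):

```python
def encode_block(block: list[int]) -> tuple[list[int], list[int], int]:
    # one parity bit per row (byte), plus one column-parity byte for the block
    rows = [bin(b).count("1") % 2 for b in block]
    col = 0
    for b in block:
        col ^= b  # XOR of all bytes: bit i is the parity of column i
    return block, rows, col

def correct_single_error(block: list[int], rows: list[int], col: int) -> list[int]:
    # the row whose parity no longer matches gives the row index; the
    # column-parity byte gives the column; flip the bit at the intersection
    bad_row = next((i for i, b in enumerate(block)
                    if bin(b).count("1") % 2 != rows[i]), None)
    if bad_row is None:
        return block  # every row still checks out: no single-bit data error
    actual_col = 0
    for b in block:
        actual_col ^= b
    block[bad_row] ^= actual_col ^ col  # one set bit marks the bad column
    return block
```

The intersection of the failing row parity and the failing column parity pinpoints a single flipped bit, which is exactly why this two-dimensional scheme can correct what a lone parity bit can only detect.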
Other strategies are also possible: Parity bits can be assigned so that no two adjacent bits
in the data message come under the same parity bit to improve detection of adjacent
multi-bit errors. Parity bits can be overlapped so that a bit is covered by more than one
parity bit to support some error correction. Such schemes are usually labeled “Hamming
error correcting coding” and are commercially available in memory systems.
Checksums
A checksum is simply a modular sum of the message data treated as integer words.
Typically, each byte is treated as an unsigned integer and summed modulo 2^N to produce
an N-bit checksum, which is appended to the end of a data message. When a message is
validated, the checksum is simply recomputed and compared with the one appended to
the message. The larger the value of N, the less likely it is that a corruption of the
data message would produce the same checksum.
No recovery is possible with checksums.
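The compute-and-compare cycle above can be sketched in a few lines (function names are illustrative; N = 8 gives the one-byte checksum commonly appended to small messages):

```python
def checksum(data: bytes, n_bits: int = 8) -> int:
    # sum the bytes as unsigned integers, keeping the result modulo 2**n_bits
    return sum(data) % (1 << n_bits)

def verify(data: bytes, stored: int, n_bits: int = 8) -> bool:
    # recompute the checksum and compare it with the appended one
    return checksum(data, n_bits) == stored
```

Note the weakness this simple sum shares with all checksums: two errors that cancel in the sum (for example, +1 in one byte and -1 in another) go undetected.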
Cyclic Redundancy Checkword (CRC)
CRC coding is a variation of the simple checksum. Data word bits are treated as
coefficients of a polynomial. For example: 1011 represents 1 + 0x + 1x^2 + 1x^3 =
1 + x^2 + x^3.
Each data polynomial is multiplied by a chosen generator polynomial – for example, 1101,
which represents 1 + x + x^3. During multiplication, the corresponding coefficients are
summed modulo 2 to produce the resulting polynomial CRC code word. For example, the
product of the above two polynomials is: 1 + x + x^2 + 3x^3 + x^4 + x^5 + x^6. Taking
these coefficients modulo 2 produces the CRC of 1111111.
This may seem like a lot of work, but the polynomial multiplication and summing can be
easily and quickly done using shift registers. The CRC technique is capable of detecting
bit errors like a simple checksum, but it is also effective in detecting burst errors – errors
in short sequences of adjacent bits.
Unlike a simple checksum, a single-bit error is unlikely to have a complementary
single-bit error that would produce the same check word.
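The multiply-with-coefficients-mod-2 step described above is exactly a carry-less multiplication, sketched below. This follows the text's multiplication formulation; practical CRC hardware and libraries instead append the remainder of a polynomial division, so treat this as an illustration of the arithmetic, not a production CRC. Note the bit order: the notes write coefficients left to right starting from x^0, so the string 1011 becomes the integer 0b1101.

```python
def poly_mul_mod2(a: int, b: int) -> int:
    # carry-less multiplication over GF(2): bit i of each integer is the
    # coefficient of x**i, and coefficient sums are taken modulo 2 (XOR)
    result = 0
    while b:
        if b & 1:
            result ^= a  # add (XOR in) a shifted copy of the data polynomial
        a <<= 1
        b >>= 1
    return result
```

Multiplying 0b1101 (the data 1011, i.e. 1 + x^2 + x^3) by 0b1011 (the generator 1101, i.e. 1 + x + x^3) yields 0b1111111, matching the worked example. The shift-and-XOR loop is also why shift registers implement this so cheaply in hardware.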
Gray coding
Gray coding is used to ensure that any two adjacent codes differ by only one bit. For
example, consider the following sequence of codes;
0000, 0001, 0011, 0010, 0110, 0100, etc.
Gray coding is useful when data is expected to follow a natural progression. For example,
suppose an input to a system is a code representing an aircraft control surface. One would
expect changes to the input to progress through a natural sequence. (It is not possible for
an aircraft rudder to instantaneously go from full-left to full-right without going through
all intermediate steps – however quickly.) Gray coding allows the easy detection of an
erroneous input or lost data value.
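One standard way to build such a sequence is the reflected binary Gray code, whose conversion is a single XOR (the inverse shown here is one common formulation):

```python
def binary_to_gray(n: int) -> int:
    # successive integers map to code words that differ in exactly one bit
    return n ^ (n >> 1)

def gray_to_binary(g: int) -> int:
    # invert the transform by XOR-folding successive shifts of the code word
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n
```

A receiver tracking a control-surface position can thus check that each new Gray code differs from the previous one in exactly one bit; any larger jump flags an erroneous or lost value.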