Basic Concepts in Data Compression In this section the basic concepts of data compression are shown:- The Unary Code The unary code of the non-negative integer n is defined as n-1 ones followed by one zero or, alternatively, as n-1 zeros followed by a single one. Table : Some Unary Codes n 1 2 3 4 5 Code 0 10 110 1110 11110 Alt. Code 1 01 001 0001 00001 Entropy Coding We can define the entropy of a signal symbol ai as –Pi log2Pi. This is the smallest number of bits needed, on the average, to represent the symbol. The amount of information contained in one, base-n symbol is: n 1 H Pi log 2 Pi i 0 This quantity is called the entropy of the data being transited. The entropy of the data depends on the individual probabilities Pi, and is smallest when all n probabilities are equal. Data Compression Strategies There are different ways that data compression techniques can be categorized, smith gives a compression classification as below: 1 a. Lossless or Lossy Lossless Lossy RLE JPEG Huffman MPEG Arithmetic Vector quantization Quadtree b. Fixed or variable group size Huffman Group Size Input Fixed Output Variable Arithmetic Variable Variable RLE, LZW Variable Fixed Method Most data compression programs operate by taking a group of data from the original file and compressed it in some way, and then writing the compressed group to the output file. 1.1.2 Irreversible Text Compression This method used to compress text by throwing away some information .the decompressed text will not be identical to the original, so such methods are not general purpose, they can be used in special cases. Ex1//A run of consecutive blank spaces may be replaced by a single space. Ex2// In extreme cases all text characters except letters and spaces may be thrown away, and the letters may be case flattened (converted to all lower- 2 or all uppercase). This will leave just 27 symbols, so a symbol can be encoded in 5 instead of the usual 8 bits. The compression ratio is 5/8 = .625, not bad, but the loss may normally be too great. Ad Hoc Text Compression If the text contains many spaces but they are not clustered, they may be removed and their positions indicated by a bit-string that contains a 0 for each text character that is not a space and a 1 for each space. Thus, the text Here are some ideas, is encoded as the bit-string “0000100010000100000” followed by the text Herearesomeideas. If the number of blank spaces is small, the bit-string will be sparse. Run-Length Encoding The idea behind this approach to data compression is this: If a data item d occurs n consecutive times in the input stream, replace the n occurrences with the single pair nd. The n consecutive occurrences of a data item are called a run length of n, and this approach to data compression is called run length encoding or RLE. We apply this idea first to text compression then to image compression. a) RLE Text Compression Just replacing “2._all_is_too_well” with “2._a2_is_t2_we2” will not work. Clearly, the decompressor should have a way to tell that the first “2” is part of the text while the others are repetition factors for the letters “o” and “l”. Even the string “2._a2l_is t2o_we2l” does not solve this problem (and 3 also does not provide any compression). One way to solve this problem is to precede each repetition with a special escape character. If we use the character “@” as the escape character, then the string “2._a@2l_is_t@2o we@2l” can be decompressed unambiguously. However, it is longer than the original string, since it replaces two consecutive letters with three characters. We have to adopt the convention that only three or more repetitions of the same character will be replaced with a repetition factor. Figure 1.3a is a flowchart for such a simple run-length text compressor. After reading the first character, the count is 1 and the character is saved. Subsequent characters are compared with the one already saved and, if they are identical to it, the repeat-count is incremented. When a different character is read, the operation depends on the value of the repeat count. If it is small, the saved character is written on the compressed file and the newly read character is saved. Otherwise, an “@” is written, followed by the repeat-count and the saved character. Decompression is also straightforward. It is shown in Figure 1.3b. When an “@” is read, the repetition count n and the actual character are immediately read, and the character is written n times on the output stream. 4 5 6