
An area where entropy theory has made a huge impact is data compression. In his
famous work “A Mathematical Theory of Communication”, Shannon showed how the concept of
entropy could be used to compress data optimally. In this section, we will introduce why data may
need to be compressed and what effect compression has on the original data. We shall then discuss
the different types of data that can be compressed, and which compression methods can be used for
particular data types. Then we will show how the Shannon-Fano and Huffman codings can
be used to compress data, and which method is the more efficient. We will then finish off this
section by summarizing the types of compression that are relevant to today’s modern computer files,
to give an insight into how data compression has advanced.
We can apply the concept of entropy to data compression, to find the most efficient way to
compress a piece of data. Data compression works by eliminating or minimizing redundancy in a
file, making your files smaller without losing any information. Every character on your computer,
every letter, digit and punctuation mark, is actually stored as a sequence of bits of computer code.
A simple example of compression: if you have the string of characters "AAAADDDDDDD", one
type of compression software can rewrite this as "4A7D", saving seven characters and making that
line about 64% smaller. Compression software uses algorithms to do this.
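As a rough illustration of this idea (a simple run-length encoder, not the exact scheme any particular compression program uses; the function name run_length_encode is just a label chosen here), the rewriting step could be sketched as follows:

def run_length_encode(text):
    # Collapse runs of repeated characters into count-plus-character pairs.
    if not text:
        return ""
    encoded = []
    run_char, run_length = text[0], 1
    for ch in text[1:]:
        if ch == run_char:
            run_length += 1
        else:
            encoded.append(f"{run_length}{run_char}")
            run_char, run_length = ch, 1
    encoded.append(f"{run_length}{run_char}")
    return "".join(encoded)

print(run_length_encode("AAAADDDDDDD"))  # prints 4A7D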
Fig 1. Basic Compression Concept: the original data passes through a compressor to give the
compressed data, and a decompressor turns the compressed data back into the original data.
Compression makes files smaller so they take up less storage space and can be transferred
faster from machine to machine. Combined with archiving, it becomes a useful way to organize
files and projects. Data compression also removes some of the redundancies in a non-compressed
text file, which actually contributes to data security and has found its application in constructing the
so-called ‘ideal’ secrecy systems. The reader should refer to the section on Cryptography for a
detailed explanation of this.
Data compression also has its disadvantages. One of them is that reliability is reduced:
compression removes redundancy, and redundancy is useful for error detection.
Data compression is most useful for large archival files and for files that need to be transferred
over long communication distances, for example over the internet, as data compression can offer
dramatic savings on storage requirements.
Data compression can be split into two categories, Lossless (entropy encoding) and Lossy
(source coding).
Lossy and Lossless
Some compression techniques (usually for image and multimedia files such as JPEG and mp3)
lose information when the data is compressed, reducing the quality of the file. Compressing these
types of files is irreversible, so the decompressed file cannot match the original. The smaller the
compressed file, the greater the quality degradation. Therefore this type of compression is not used
with files that need to be restored to their original form. This is called Lossy data compression.
When the data of the file needs to be conserved, the so-called Lossless data compression is
used. This ensures that no data is lost during the compression process, so the information
contained in the data is the same before and after compression. Repeatedly compressing a
Lossless file does not mean a continual reduction in size, as there is a lower limit to what the file
can be compressed to - its entropy.
Lossless Data Compression (Entropy Coding)
As has already been mentioned, lossless data compression uses a data compression algorithm
that allows the original data to be reconstructed fully from the compressed data. It is used when
the original piece of data must be fully reconstructed: once decompressed, the data should be
identical to the data before compression.
As previously discussed, Shannon defined the entropy of a set of discrete events {x1, x2,…,xk}
as:
H = - Σ Pi log Pi
where Pi is the probability of the event xi, i∈[1,k], the sum is taken over i from 1 to k, and the
logarithm is to base 2.
By using the entropy of a set of symbols and their probabilities, we can deduce the optimum
compression ratio we can achieve (see the section ‘More about Entropy’). For example, the English
language has an entropy of about 1.3 bits per character, so at the optimum level we need only about
1.3 bits per character.
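As a quick check of this formula, the entropy of a set of symbol probabilities can be computed directly. The following minimal sketch (the function name entropy is our own choice) uses base-2 logarithms, so the result is in bits per symbol; the probabilities are those of the table used later in this section:

from math import log2

def entropy(probabilities):
    # Shannon entropy H = -sum(p * log2(p)), in bits per symbol.
    return -sum(p * log2(p) for p in probabilities if p > 0)

probs = [0.20, 0.18, 0.13, 0.10, 0.09, 0.08, 0.08, 0.07, 0.04, 0.03]
print(round(entropy(probs), 4))  # prints 3.1263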
The Shannon-Fano and Huffman compression methods are based on statistics obtained from
the data. These statistics take into consideration the probability, or how often each symbol will
appear, and with this information we assign a binary string for each symbol. The aim is to assign
the most occurring symbol with the shortest binary code and the least occurring with the longest.
This allows the coded data to be smaller in size than the original data.
Compression of data using Shannon-Fano coding
As we have already discussed the Shannon-Fano code in the previous chapter, we can skip the
method of computing the code strings and jump to how the Shannon-Fano code can be
effective in compression.
So, if we consider the set of symbols with their corresponding probabilities of occurrence:
Set of Symbols (x)      Probability of Occurrence P(xi)
a                       P(a) = 0.20
b                       P(b) = 0.18
c                       P(c) = 0.13
d                       P(d) = 0.10
e                       P(e) = 0.09
f                       P(f) = 0.08
g                       P(g) = 0.08
h                       P(h) = 0.07
i                       P(i) = 0.04
j                       P(j) = 0.03
Applying the Shannon-Fano compression method, we can use this order of decreasing
probabilities of symbols to help compress the data. Firstly, we divide the list of decreasing
probabilities into two groups, so that the sum of the probabilities in each group is approximately
half of the total probability. Then we continue this division process within each group until there is
only one symbol left in each group.
We will now demonstrate this method:
Group one: a, b, c               (total probability 0.51)
Group two: d, e, f, g, h, i, j   (total probability 0.49)
A tree diagram can represent the further splitting of these groups, with each further split
appending one more binary digit to a symbol's code.
The entropy of this string of data is:
H = - ((0.2)log(0.2) + (0.18)log(0.18) + (0.13)log(0.13) + (0.1)log(0.1) + (0.09)log(0.09) +
(0.08)log(0.08) + (0.08)log(0.08) + (0.07)log(0.07) + (0.04)log(0.04) +
(0.03)log(0.03)) ≈ 3.1263 bits
Now each symbol has a binary string attached to it. With these binary strings and the
probabilities of occurrence, we can calculate the average length of the code and see how close it is
to the entropy of the data. The average length is the sum, over all symbols, of each symbol's
binary string length multiplied by its probability of occurrence.
So the average length = (2 × 0.2) + (3 × (0.18 + 0.13 + 0.10)) + (4 × (0.09 + 0.08 + 0.08 + 0.07 + 0.04 + 0.03))
= 3.19 bits
We can see how close the average length of the data is to its entropy. Using this, we can
compare which method of coding is the most effective by seeing which one leaves the
average length closest to the entropy.
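To make the splitting procedure concrete, here is a minimal Shannon-Fano sketch (our own illustrative code, not a production implementation). It assumes the table is already in decreasing order of probability, and at each step it splits the list at the point that best balances the two halves, prefixing 0 to one half and 1 to the other. Because near-ties at the split points can be resolved in different ways, the code lengths it produces differ slightly from the tree above; this particular rule gives an average length of about 3.18 bits, very close to the 3.19 bits obtained above.

def shannon_fano(symbols):
    # symbols: list of (symbol, probability) pairs in decreasing order of probability.
    # Returns a dict mapping each symbol to its binary code string.
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    best_split, best_diff, running = 1, float("inf"), 0.0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(running - total / 2)   # how unbalanced this split would be
        if diff < best_diff:
            best_diff, best_split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:best_split]).items():
        codes[sym] = "0" + code           # first group gets a leading 0
    for sym, code in shannon_fano(symbols[best_split:]).items():
        codes[sym] = "1" + code           # second group gets a leading 1
    return codes

table = [("a", 0.20), ("b", 0.18), ("c", 0.13), ("d", 0.10), ("e", 0.09),
         ("f", 0.08), ("g", 0.08), ("h", 0.07), ("i", 0.04), ("j", 0.03)]
codes = shannon_fano(table)
average_length = sum(p * len(codes[s]) for s, p in table)
print(codes)
print(round(average_length, 2))  # about 3.18 bits with this splitting rule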
Compression of data using Huffman Coding
Again, as we have already discussed the Huffman code in the previous chapter, we can skip the
explanation of the algorithm for constructing the codes.
So, if we consider the data used above, with the same probabilities, we have:
Set of Symbols (x)      Probability of Occurrence P(xi)
a                       P(a) = 0.20
b                       P(b) = 0.18
c                       P(c) = 0.13
d                       P(d) = 0.10
e                       P(e) = 0.09
f                       P(f) = 0.08
g                       P(g) = 0.08
h                       P(h) = 0.07
i                       P(i) = 0.04
j                       P(j) = 0.03
As we can see, the Huffman compression method, like the Shannon-Fano compression
method, starts from this list of symbols in order of decreasing probability. However,
that is where the similarities end. In the Huffman compression method, we build a binary tree
from the bottom up: at each step we take the two least probable symbols (or groups of symbols
already merged) and join them under a new node whose probability is the sum of the two, and we
repeat this until only the root remains. Finally, we label the two branches leaving each node with
the binary digits 0 and 1, and the bit sequence read from the root down to a symbol is its Huffman
code.
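This merging procedure can be sketched in a few lines with a priority queue. The following minimal illustration (our own sketch, using Python's standard heapq module, with a counter only to break ties between equal probabilities) returns the code length of each symbol rather than the codes themselves, which is all we need to compute the average length below:

import heapq
from itertools import count

def huffman_code_lengths(prob_table):
    # prob_table: list of (symbol, probability) pairs.
    # Returns a dict mapping each symbol to its Huffman code length in bits.
    tiebreak = count()
    # Each heap entry: (probability, tie-breaker, {symbol: depth so far})
    heap = [(p, next(tiebreak), {sym: 0}) for sym, p in prob_table]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)   # least probable node
        p2, _, group2 = heapq.heappop(heap)   # next least probable node
        merged = {s: depth + 1 for s, depth in {**group1, **group2}.items()}
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

table = [("a", 0.20), ("b", 0.18), ("c", 0.13), ("d", 0.10), ("e", 0.09),
         ("f", 0.08), ("g", 0.08), ("h", 0.07), ("i", 0.04), ("j", 0.03)]
lengths = huffman_code_lengths(table)
average_length = sum(p * lengths[s] for s, p in table)
print(lengths)
print(round(average_length, 2))  # 3.17 bits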
As computed earlier, the entropy of this string of data is:
H = - ((0.2)log(0.2) + (0.18)log(0.18) + (0.13)log(0.13) + (0.1)log(0.1) + (0.09)log(0.09) +
(0.08)log(0.08) + (0.08)log(0.08) + (0.07)log(0.07) + (0.04)log(0.04) +
(0.03)log(0.03)) ≈ 3.1263 bits
As we now have the binary strings (and hence the code lengths) for each symbol derived from the
Huffman tree, we can calculate the average length and see how close it is to the entropy of the data.
Again, the average length is the sum, over all symbols, of each symbol's binary string length
multiplied by its probability of occurrence. For these probabilities, the Huffman procedure gives a
2-bit code to a; 3-bit codes to b, c, d and e; 4-bit codes to f, g and h; and 5-bit codes to i and j.
So the average length = (2 × 0.20) + (3 × (0.18 + 0.13 + 0.10 + 0.09)) + (4 × (0.08 + 0.08 + 0.07)) +
(5 × (0.04 + 0.03)) = 3.17 bits
We can see how close the average length of the data is to its entropy. From the two
compression methods, we can deduce that the Huffman compression method is more effective, as
the average length of its code (3.17 bits) is closer to the entropy than that of the Shannon-Fano
code (3.19 bits). This is why, today, Huffman coding is used more often than the Shannon-Fano
method to compress data (see also the Dice problem).
Other Data Compression Algorithms
Compression using Lempel-Ziv method
The Lempel-Ziv method has been explained in the previous chapter, so we can just summarise the
use of this compression method. The Lempel-Ziv compression method is used primarily to
compress text files. In it, the input sequence is parsed into non-overlapping blocks of different
lengths, and as we do so we build a dictionary of the blocks that we have already seen.
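As a rough sketch of this dictionary idea (an LZ78-style parse written for illustration here, not necessarily the exact variant described in the previous chapter), each new block is the longest block already in the dictionary plus one extra character:

def lz78_parse(text):
    # Parse text into (dictionary index, next character) pairs, LZ78-style.
    # Index 0 stands for the empty prefix.
    dictionary = {"": 0}
    output = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                               # keep extending the block
        else:
            output.append((dictionary[current], ch))    # emit (index of known prefix, new char)
            dictionary[current + ch] = len(dictionary)  # remember the new block
            current = ""
    if current:                                         # flush any unfinished block
        output.append((dictionary[current], ""))
    return output

print(lz78_parse("ABAABABAABAB"))  # [(0, 'A'), (0, 'B'), (1, 'A'), (2, 'A'), (4, 'A'), (4, 'B')]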
Lossy Data Compression
Lossy compression is a data compression method that discards some of the data in order to
achieve its goal of compression. When the data is decompressed, its content is different from
the original, though similar enough to still be useful. Lossy compression is most
commonly used to compress multimedia data, especially in applications such as streaming media
and internet telephony.
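As a toy illustration of the lossy idea (simple quantisation of sample values, not the actual JPEG or MP3 algorithms, and with made-up numbers), storing samples on a coarser scale saves bits, but the rounding error can never be undone:

def quantize(samples, step):
    # Replace each sample by the nearest multiple of 'step', stored as a small integer.
    return [round(s / step) for s in samples]

def dequantize(codes, step):
    # Reconstruct approximate samples; the information lost in rounding is gone for good.
    return [c * step for c in codes]

samples = [0.12, 0.47, 0.81, 0.33]
codes = quantize(samples, step=0.25)   # [0, 2, 3, 1]: small integers, so fewer bits
print(dequantize(codes, step=0.25))    # [0.0, 0.5, 0.75, 0.25]: close to the originals, not identical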
Compressing and Archiving
Compression is typically applied to a single file and compressed formats only contain a single item.
Archiving allows files and folders to be grouped together for compression. Archive formats
can contain a single item or many items, and preserve the hierarchy of nested folders. In practice,
most archive formats include compression as part of the archiving process.
Expansion and Extraction
When an archive is created, it can be accessed in two different modes. One of the modes is
Expansion, where the entire contents of an archive are expanded out at once. The second is the
Browse mode, where the archive is accessed like a folder. The hierarchical structure can be
navigated and individual files can be extracted without having to expand the entire contents of the
archive. In some cases, archive content can be manipulated: items can be renamed, moved, deleted,
added, etc.
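As a concrete illustration of these two modes (using Python's standard zipfile module; the file and folder names here are made up), an archive can be browsed and a single item extracted, or its entire contents expanded at once:

import os
import zipfile

# Create a couple of small example files to archive (names made up for illustration).
os.makedirs("data", exist_ok=True)
with open("notes.txt", "w") as f:
    f.write("some notes\n")
with open("data/results.csv", "w") as f:
    f.write("x,y\n1,2\n")

# Archiving: group the files together and compress them.
with zipfile.ZipFile("project.zip", "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.write("notes.txt")
    archive.write("data/results.csv")

# Browse mode: list the contents and extract a single item.
with zipfile.ZipFile("project.zip") as archive:
    print(archive.namelist())
    archive.extract("notes.txt", path="restored")

# Expansion: expand the entire contents at once.
with zipfile.ZipFile("project.zip") as archive:
    archive.extractall("restored_all")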
Re-compression
Re-compression involves making files smaller by disassembling the structure of the data and
then compressing the file more efficiently. When the file is later expanded, the data structure
for that file is reassembled.
Re-compression usually results in an output which is 100% identical to the original, although in
some cases it may not be. In those cases the content and data are never lost, but the
encoding may be slightly different.