
An area where entropy theory has made a huge impact is data compression. In his
famous work “A Mathematical Theory of Communication”, Shannon showed how the concept of
entropy could be used to compress data optimally. In this section, we will introduce why data may
need to be compressed and what effect compression has on the original data. We shall then discuss
the different types of data that can be compressed, and which compression methods can be used for
particular data types. Then we will show how the Shannon-Fano and Huffman codings can
be used to compress data, and which method is the more efficient. We will then finish off this
section by summarizing the types of compression that are relevant to today’s modern computer files,
to give an insight into how data compression has advanced.
We can apply the concept of entropy to data compression, to find the most efficient way to
compress a piece of data. Data compression works by eliminating or minimizing redundancy in a
file, making your files smaller without losing any information. Every character on your computer,
every letter, digit and punctuation mark, is actually stored as a sequence of bits of computer code.
A simple example of compression: if you have the string of characters "AAAADDDDDDD", one
type of compression software can rewrite this as "4A7D", saving seven characters and making that
line about 64% smaller. Compression software uses algorithms to do this.
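As a rough illustration of this idea (a simple run-length encoder, not the exact scheme any particular compression program uses; the function name run_length_encode is just a label chosen here), the rewriting step could be sketched as follows:

def run_length_encode(text):
    # Collapse runs of repeated characters into count-plus-character pairs.
    if not text:
        return ""
    encoded = []
    run_char, run_length = text[0], 1
    for ch in text[1:]:
        if ch == run_char:
            run_length += 1
        else:
            encoded.append(f"{run_length}{run_char}")
            run_char, run_length = ch, 1
    encoded.append(f"{run_length}{run_char}")
    return "".join(encoded)

print(run_length_encode("AAAADDDDDDD"))  # prints 4A7D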
Fig 1. Basic Compression Concept: the original data passes through a compressor to give the
compressed data, and a decompressor turns the compressed data back into the original data.
Compression makes files smaller so they take up less storage space and can be transferred
faster from machine to machine. Combined with archiving, it becomes a useful way to organize
files and projects. Data compression also removes some of the redundancies in a non-compressed
text file, which actually contributes to data security and has found its application in constructing the
so-called ‘ideal’ secrecy systems. The reader should refer to the section on Cryptography for a
detailed explanation of this.
Data compression also has its disadvantages. One of them is that reliability is reduced:
compression removes redundancy, and redundancy is useful for error detection.
Data compression is most useful for large archival files and for files that need to be transferred
over long communication distances, for example over the internet, as data compression can offer
dramatic savings on storage requirements.
Data compression can be split into two categories, Lossless (entropy encoding) and Lossy
(source coding).
Lossy and Lossless
Some compression techniques (usually for image and multimedia files such as JPEG and mp3)
lose information when the data is compressed, reducing the quality of the file. Compressing these
types of files is irreversible, so the decompressed file cannot match the original. The smaller the
compressed file, the greater the quality degradation. Therefore this type of compression is not used
with files that need to be restored to their original form. This is called Lossy data compression.
When the data of the file needs to be conserved, the so-called Lossless data compression is
used. This ensures that no data is lost during the compression process, so the information
contained in the data is the same before and after compression. Repeatedly compressing a
Lossless file does not mean a continual reduction in size, as there is a lower limit to what the file
can be compressed to - its entropy.
Lossless Data Compression (Entropy Coding)
As has already been mentioned, lossless data compression uses a data compression algorithm
that allows the original data to be reconstructed fully from the compressed data. It is used when
the original piece of data must be fully reconstructed: once decompressed, the data should be
identical to the data before compression.
As previously discussed, Shannon defined the entropy of a set of discrete events {x1, x2,…,xk}
as:
H = - Σ Pi log Pi
where Pi is the probability of the event xi, i∈[1,k], the sum is taken over i from 1 to k, and the
logarithm is to base 2.
By using the entropy of a set of symbols and their probabilities, we can deduce the optimum
compression ratio we can achieve (see the section ‘More about Entropy’). For example, the English
language has an entropy of about 1.3 bits per character, so at the optimum level we need only about
1.3 bits per character.
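As a quick check of this formula, the entropy of a set of symbol probabilities can be computed directly. The following minimal sketch (the function name entropy is our own choice) uses base-2 logarithms, so the result is in bits per symbol; the probabilities are those of the table used later in this section:

from math import log2

def entropy(probabilities):
    # Shannon entropy H = -sum(p * log2(p)), in bits per symbol.
    return -sum(p * log2(p) for p in probabilities if p > 0)

probs = [0.20, 0.18, 0.13, 0.10, 0.09, 0.08, 0.08, 0.07, 0.04, 0.03]
print(round(entropy(probs), 4))  # prints 3.1263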
The Shannon-Fano and Huffman compression methods are based on statistics obtained from
the data. These statistics take into consideration the probability, or how often each symbol will
appear, and with this information we assign a binary string for each symbol. The aim is to assign
the most occurring symbol with the shortest binary code and the least occurring with the longest.
This allows the coded data to be smaller in size than the original data.
Compression of data using Shannon-Fano coding
As we have already discussed the Shannon-Fano code in the previous chapter, we can skip the
method of computing the code strings and jump to how the Shannon-Fano code can be
effective in compression.
So, if we consider the set of symbols with their corresponding probabilities of occurrence:
Set of Symbols (x)      Probability of Occurrence P(xi)
a                       P(a) = 0.20
b                       P(b) = 0.18
c                       P(c) = 0.13
d                       P(d) = 0.10
e                       P(e) = 0.09
f                       P(f) = 0.08
g                       P(g) = 0.08
h                       P(h) = 0.07
i                       P(i) = 0.04
j                       P(j) = 0.03
Applying the Shannon-Fano compression method, we can use this order of decreasing
probabilities of symbols to help compress the data. Firstly, we divide the list of decreasing
probabilities into two groups, so that the sum of the probabilities in each group is approximately
half of the total probability. Then we continue this division process within each group until there is
only one symbol left in each group.
We will now demonstrate this method:
Group one: a, b, c               (total probability 0.51)
Group two: d, e, f, g, h, i, j   (total probability 0.49)
A tree diagram can represent the further splitting of these groups, with each further split
appending one more binary digit to a symbol's code.
The entropy of this string of data is:
H = - ((0.2)log(0.2) + (0.18)log(0.18) + (0.13)log(0.13) + (0.1)log(0.1) + (0.09)log(0.09) +
(0.08)log(0.08) + (0.08)log(0.08) + (0.07)log(0.07) + (0.04)log(0.04) +
(0.03)log(0.03)) ≈ 3.1263 bits
Now each symbol has a binary string attached to it. With these binary strings and the
probabilities of occurrence, we can calculate the average length of the code and see how close it is
to the entropy of the data. The average length is the sum, over all symbols, of each symbol's
binary string length multiplied by its probability of occurrence.
So the average length = (2 × 0.2) + (3 × (0.18 + 0.13 + 0.10)) + (4 × (0.09 + 0.08 + 0.08 + 0.07 + 0.04 + 0.03))
= 3.19 bits
We can see how close the average length of the data is to its entropy. Using this, we can
compare which method of coding is the most effective by seeing which one leaves the
average length closest to the entropy.
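To make the splitting procedure concrete, here is a minimal Shannon-Fano sketch (our own illustrative code, not a production implementation). It assumes the table is already in decreasing order of probability, and at each step it splits the list at the point that best balances the two halves, prefixing 0 to one half and 1 to the other. Because near-ties at the split points can be resolved in different ways, the code lengths it produces differ slightly from the tree above; this particular rule gives an average length of about 3.18 bits, very close to the 3.19 bits obtained above.

def shannon_fano(symbols):
    # symbols: list of (symbol, probability) pairs in decreasing order of probability.
    # Returns a dict mapping each symbol to its binary code string.
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    best_split, best_diff, running = 1, float("inf"), 0.0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(running - total / 2)   # how unbalanced this split would be
        if diff < best_diff:
            best_diff, best_split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:best_split]).items():
        codes[sym] = "0" + code           # first group gets a leading 0
    for sym, code in shannon_fano(symbols[best_split:]).items():
        codes[sym] = "1" + code           # second group gets a leading 1
    return codes

table = [("a", 0.20), ("b", 0.18), ("c", 0.13), ("d", 0.10), ("e", 0.09),
         ("f", 0.08), ("g", 0.08), ("h", 0.07), ("i", 0.04), ("j", 0.03)]
codes = shannon_fano(table)
average_length = sum(p * len(codes[s]) for s, p in table)
print(codes)
print(round(average_length, 2))  # about 3.18 bits with this splitting rule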
Compression of data using Huffman Coding
Again, as we have already discussed the Huffman code in the previous chapter, we can skip the
explanation of the algorithm for constructing the codes.
So, if we consider the data used above, with the same probabilities, we have:
Set of Symbols (x)      Probability of Occurrence P(xi)
a                       P(a) = 0.20
b                       P(b) = 0.18
c                       P(c) = 0.13
d                       P(d) = 0.10
e                       P(e) = 0.09
f                       P(f) = 0.08
g                       P(g) = 0.08
h                       P(h) = 0.07
i                       P(i) = 0.04
j                       P(j) = 0.03
As we can see, the Huffman compression method, like the Shannon-Fano compression
method, starts from this list of symbols in order of decreasing probability. However,
that is where the similarities end. In the Huffman compression method, we build a binary tree
from the bottom up: at each step we take the two least probable symbols (or groups of symbols
already merged) and join them under a new node whose probability is the sum of the two, and we
repeat this until only the root remains. Finally, we label the two branches leaving each node with
the binary digits 0 and 1, and the bit sequence read from the root down to a symbol is its Huffman
code.
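This merging procedure can be sketched in a few lines with a priority queue. The following minimal illustration (our own sketch, using Python's standard heapq module, with a counter only to break ties between equal probabilities) returns the code length of each symbol rather than the codes themselves, which is all we need to compute the average length below:

import heapq
from itertools import count

def huffman_code_lengths(prob_table):
    # prob_table: list of (symbol, probability) pairs.
    # Returns a dict mapping each symbol to its Huffman code length in bits.
    tiebreak = count()
    # Each heap entry: (probability, tie-breaker, {symbol: depth so far})
    heap = [(p, next(tiebreak), {sym: 0}) for sym, p in prob_table]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)   # least probable node
        p2, _, group2 = heapq.heappop(heap)   # next least probable node
        merged = {s: depth + 1 for s, depth in {**group1, **group2}.items()}
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

table = [("a", 0.20), ("b", 0.18), ("c", 0.13), ("d", 0.10), ("e", 0.09),
         ("f", 0.08), ("g", 0.08), ("h", 0.07), ("i", 0.04), ("j", 0.03)]
lengths = huffman_code_lengths(table)
average_length = sum(p * lengths[s] for s, p in table)
print(lengths)
print(round(average_length, 2))  # 3.17 bits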
As computed earlier, the entropy of this string of data is:
H = - ((0.2)log(0.2) + (0.18)log(0.18) + (0.13)log(0.13) + (0.1)log(0.1) + (0.09)log(0.09) +
(0.08)log(0.08) + (0.08)log(0.08) + (0.07)log(0.07) + (0.04)log(0.04) +
(0.03)log(0.03)) ≈ 3.1263 bits
As we now have the binary strings (and hence the code lengths) for each symbol derived from the
Huffman tree, we can calculate the average length and see how close it is to the entropy of the data.
Again, the average length is the sum, over all symbols, of each symbol's binary string length
multiplied by its probability of occurrence. For these probabilities, the Huffman procedure gives a
2-bit code to a; 3-bit codes to b, c, d and e; 4-bit codes to f, g and h; and 5-bit codes to i and j.
So the average length = (2 × 0.20) + (3 × (0.18 + 0.13 + 0.10 + 0.09)) + (4 × (0.08 + 0.08 + 0.07)) +
(5 × (0.04 + 0.03)) = 3.17 bits
We can see how close the average length of the data is to its entropy. From the two
compression methods, we can deduce that the Huffman compression method is more effective, as
the average length of its code (3.17 bits) is closer to the entropy than that of the Shannon-Fano
code (3.19 bits). This is why, today, Huffman coding is used more often than the Shannon-Fano
method to compress data (see also the Dice problem).
Other Data Compression Algorithms
Compression using Lempel-Ziv method
The Lempel-Ziv method has been explained in the previous chapter, so we can just summarise the
use of this compression method. The Lempel-Ziv compression method is used primarily to
compress text files. In it, the input sequence is parsed into non-overlapping blocks of different
lengths, and as we do so we build a dictionary of the blocks that we have already seen.
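As a rough sketch of this dictionary idea (an LZ78-style parse written for illustration here, not necessarily the exact variant described in the previous chapter), each new block is the longest block already in the dictionary plus one extra character:

def lz78_parse(text):
    # Parse text into (dictionary index, next character) pairs, LZ78-style.
    # Index 0 stands for the empty prefix.
    dictionary = {"": 0}
    output = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                               # keep extending the block
        else:
            output.append((dictionary[current], ch))    # emit (index of known prefix, new char)
            dictionary[current + ch] = len(dictionary)  # remember the new block
            current = ""
    if current:                                         # flush any unfinished block
        output.append((dictionary[current], ""))
    return output

print(lz78_parse("ABAABABAABAB"))  # [(0, 'A'), (0, 'B'), (1, 'A'), (2, 'A'), (4, 'A'), (4, 'B')]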
Lossy Data Compression
Lossy compression is a data compression method that discards some of the data in order to
achieve its goal of compression. When the data is decompressed, its content is different from
the original, though similar enough to still be useful. Lossy compression is most
commonly used to compress multimedia data, especially in applications such as streaming media
and internet telephony.
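As a toy illustration of the lossy idea (simple quantisation of sample values, not the actual JPEG or MP3 algorithms, and with made-up numbers), storing samples on a coarser scale saves bits, but the rounding error can never be undone:

def quantize(samples, step):
    # Replace each sample by the nearest multiple of 'step', stored as a small integer.
    return [round(s / step) for s in samples]

def dequantize(codes, step):
    # Reconstruct approximate samples; the information lost in rounding is gone for good.
    return [c * step for c in codes]

samples = [0.12, 0.47, 0.81, 0.33]
codes = quantize(samples, step=0.25)   # [0, 2, 3, 1]: small integers, so fewer bits
print(dequantize(codes, step=0.25))    # [0.0, 0.5, 0.75, 0.25]: close to the originals, not identical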
Compressing and Archiving
Compression is typically applied to a single file and compressed formats only contain a single item.
Archiving allows files and folders to be grouped together for compression. Archive formats
can contain a single item or many items, and preserve the hierarchy of nested folders. In practice,
most archive formats include compression as part of the archiving process.
Expansion and Extraction
When an archive is created, it can be accessed in two different modes. One of the modes is
Expansion, where the entire contents of an archive are expanded out at once. The second is the
Browse mode, where the archive is accessed like a folder. The hierarchical structure can be
navigated and individual files can be extracted without having to expand the entire contents of the
archive. In some cases, archive content can be manipulated: items can be renamed, moved, deleted,
added, etc.
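As a concrete illustration of these two modes (using Python's standard zipfile module; the file and folder names here are made up), an archive can be browsed and a single item extracted, or its entire contents expanded at once:

import os
import zipfile

# Create a couple of small example files to archive (names made up for illustration).
os.makedirs("data", exist_ok=True)
with open("notes.txt", "w") as f:
    f.write("some notes\n")
with open("data/results.csv", "w") as f:
    f.write("x,y\n1,2\n")

# Archiving: group the files together and compress them.
with zipfile.ZipFile("project.zip", "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.write("notes.txt")
    archive.write("data/results.csv")

# Browse mode: list the contents and extract a single item.
with zipfile.ZipFile("project.zip") as archive:
    print(archive.namelist())
    archive.extract("notes.txt", path="restored")

# Expansion: expand the entire contents at once.
with zipfile.ZipFile("project.zip") as archive:
    archive.extractall("restored_all")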
Re-compression
Re-compression involves making files smaller by disassembling the structure of the data and
then compressing the file more efficiently. When the file is later expanded, the data structure
for that file is reassembled.
Re-compression usually results in an output which is 100% identical to the original, although in
some cases it may not be. In those cases the content and data are never lost, but the
encoding may be slightly different.