Efficient Data Compression Using Prime Numbers

Debasish Chakraborty1,∗, Snehasish Kar2 and Kalyan Guchait3
1 Assistant Professor, Dept. of Computer Science & Engineering, St. Thomas’ College of Engineering & Technology, Kolkata, West Bengal, India.
2 Grad. Student, Dept. of Information Technology, St. Thomas’ College of Engineering & Technology, Kolkata, West Bengal, India.
3 Grad. Student, Dept. of Information Technology, St. Thomas’ College of Engineering & Technology, Kolkata, West Bengal, India.
e-mail: 1 sunnydeba@gmail.com
Abstract. In this paper, we propose a new methodology for data compression and decompression using prime numbers. The idea of the algorithm is that four-character strings are read from a file one at a time, arranged according to frequency and each assigned a prime number. The strings are then rescanned, and the indices of a set of strings are multiplied to create a single number, which replaces those strings in the file. This encoding method shows a consistent compression ratio irrespective of the content of the corresponding text file.
Keywords: Lossless compression, Bit saving, Prime numbers.
1. Introduction
Data compression deals with representing data using fewer bits than required by the original data [1,3–5]. The basic aim of data compression is to reduce the size of the data. This is generally done to reduce the memory required to store the data or the bandwidth required to transfer it over a network. Since we tend to store old data rather than delete it, data compression gives us the option of storing more data than would normally be possible. Data compression is carried out using encoding schemes designed to reduce the size of the data. Every encoding scheme is accompanied by a decoding scheme, which may be used to recover the original data from the compressed data [2,6,7]. The decoding scheme is unique to the encoding scheme and may only be used on data compressed by that encoding scheme. Different compression schemes produce different degrees of compression and have different implementation complexities.
Data compression techniques may be classified into two categories: 1. lossless compression, 2. lossy compression [8]. A lossless technique exploits statistical redundancy to reduce the size of the data. With lossless compression, the original data may be retrieved exactly from the compressed file upon decompression, as every bit of data is preserved. Lossy compression refers to techniques that remove or round off less important data in order to achieve compression. Lossy compression is used on data where some loss of information is acceptable; the limits are determined by human perception. Better compression is achieved using lossy compression techniques.
However, the decompressed data differs from the original data. Lossy compression is generally used on image or video files, where a small loss of data goes unnoticed. In contrast, text files, particularly files containing computer programs, may be rendered valueless if even one bit is modified. Such files should only be compressed using lossless compression techniques.
Here, our chief aim is to implement a text compression algorithm. The lossless compression technique aims to attain a compression ratio close to 50% or more, irrespective of the data present in the text files. Our approach is to take 10 text files and check the file sizes before and after compression, thus comparing the original sizes to the reduced sizes. The compression ratio, given by (1 − size of compressed file / size of original file) × 100%, is calculated for each file and compared.
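For example, a 1,000,000-byte file compressed to 500,000 bytes has a compression ratio of (1 − 500,000/1,000,000) × 100% = 50%.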
∗ Corresponding author
The structure followed in the rest of the paper is as follows. Section 2 summarizes the proposed algorithm strategy.
Section 3 provides the calculations. Section 4 contains the algorithm which may be used to implement the technique.
Section 5 shows the experimental results and Section 6 contains the conclusion of the paper.
2. Algorithm Strategy
This algorithm defines a lossless strategy which makes use of the prime numbers in the decimal number system [9]. Prime numbers are numbers which have only two factors, 1 and the number itself. So when two or more prime numbers are multiplied to form a composite number, that composite number has only prime factors. Thus the prime factors may be extracted from the composite number without any loss of data or precision.
If such prime numbers are used as indices to a dictionary, and the dictionary contains the unique strings present in the file, the entire file may be replaced by a set of numbers created by multiplying prime numbers. Each such number encodes a set of prime indices which may be extracted from it without any loss of precision. Since 1 divides every number, it cannot serve as an index, i.e. indexing has to start from 2. Since a single composite number encodes several symbols at once, storing the composite number requires fewer bits than storing the prime-indexed strings individually. Hence a certain level of compression is achieved.
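For example, if three strings are assigned the primes 2, 3 and 5, the stored value is 2 × 3 × 5 = 30; since 30 has no prime factorization other than 2 · 3 · 5, the three indices can be recovered from it exactly.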
3. Calculation
The technique may be best explained with the help of an example. Let us consider the string:
THE QUICK SILVER FOX JUMPS OVER THE LAZY DOG
We scan all 4-byte symbols from the string and store them in a dictionary along with their frequency of occurrence. For convenience and better understanding, we replace spaces with the @ symbol.
Table 1.

Element    Frequency
THE@       2
QUIC       1
K@SI       1
LVER       1
@FOX       1
@JUM       1
PS@O       1
VER@       1
LAZY       1
@DOG       1
Arranging the elements according to frequency in descending order and assigning prime numbers to the sorted
dictionary, we have:
Table 2.

Element    Frequency    Prime Index
THE@       2            2
QUIC       1            3
K@SI       1            5
LVER       1            7
@FOX       1            11
@JUM       1            13
PS@O       1            17
VER@       1            19
LAZY       1            23
@DOG       1            29
Now, replacing the strings with their prime numbers and multiplying until the maximum value of a 32-bit unsigned long, i.e. 4,294,967,295, is reached, we have:
2 * 3 * 5 * 7 * 11 * 13 * 17 * 19 * 2 * 23 * 29 = 12,939,386,460
But as this number exceeds the maximum limit, we multiply only up to 23:
2 * 3 * 5 * 7 * 11 * 13 * 17 * 19 * 2 * 23 = 446,185,740
The next prime, 29, goes into the next number set.
The file is thus replaced by the above numbers, i.e. 446,185,740 and 29.
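The grouping arithmetic above can be illustrated with a short C++ sketch. This is only an illustrative sketch, not the authors' implementation; the array of prime indices and the 32-bit limit are taken directly from the example.

#include <iostream>
#include <vector>

int main() {
    const unsigned long long LIMIT = 4294967295ULL;   // maximum 32-bit unsigned value
    // Prime indices of the example string in file order (THE@ occurs twice, hence the second 2).
    unsigned long long indices[] = {2, 3, 5, 7, 11, 13, 17, 19, 2, 23, 29};
    unsigned long long count = 1;
    std::vector<unsigned long long> groups;

    for (size_t i = 0; i < sizeof(indices) / sizeof(indices[0]); ++i) {
        if (count * indices[i] > LIMIT) {   // the next factor would exceed the limit
            groups.push_back(count);        // emit the current composite number
            count = 1;                      // start a new group
        }
        count *= indices[i];
    }
    groups.push_back(count);                // emit the final (possibly partial) group

    for (size_t i = 0; i < groups.size(); ++i)
        std::cout << groups[i] << "\n";     // prints 446185740 and 29
    return 0;
}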
4. Algorithm
Steps for compression:
1. Scan the file and store every 4-byte symbol present in the file in an array, along with its frequency of occurrence. This forms the dictionary.
2. Arrange the symbols in descending order of frequency.
3. Assign prime numbers to the symbols, starting with 2. Store the dictionary in a file.
4. Let count = 1.
5. Rescan the file from the beginning.
6. Read the next symbol from the file and find the corresponding prime number in the array.
7. Multiply count by this prime number.
8. Repeat steps 6 and 7 while count remains below the maximum value of a 32-bit unsigned long (4,294,967,295).
9. Write the value of count to the output file and reset count to 1.
10. Repeat steps 6 to 9 until the end of the file is reached.
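A minimal C++ sketch of the encoding loop (steps 4 to 10) is given below. It is an illustration under stated assumptions rather than the authors' code: the dictionary from steps 1 to 3 is assumed to be available as a map from each 4-byte symbol to its prime index, the composite numbers are written as decimal text, and the input length is assumed to be a multiple of four bytes, as in the example of Section 3.

#include <istream>
#include <ostream>
#include <map>
#include <string>

void encode(const std::map<std::string, unsigned long long>& primeIndex,
            std::istream& in, std::ostream& out) {
    const unsigned long long LIMIT = 4294967295ULL;  // maximum 32-bit unsigned value
    unsigned long long count = 1;                    // step 4
    char buf[4];
    while (in.read(buf, 4)) {                        // step 6: scan a 4-byte symbol
        std::string symbol(buf, 4);
        // Every symbol is present, since the dictionary was built from this same file.
        unsigned long long p = primeIndex.find(symbol)->second;
        if (count * p > LIMIT) {                     // step 8: the limit would be exceeded
            out << count << '\n';                    // step 9: write the composite and reset
            count = 1;
        }
        count *= p;                                  // step 7: multiply with count
    }
    if (count > 1)
        out << count << '\n';                        // flush the last group
}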
Steps of decompression:
1. Scan the compressed file and reconstruct the dictionary.
2. Read the next number from the file and store it in count.
3. Extract the prime factors of count.
4. Replace each prime factor with the symbol from the dictionary that it indexes, writing the symbols to a new file.
5. Repeat steps 2 to 4 until the end of the file is reached.
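Steps 2 to 4 amount to trial division of each composite number by the primes stored in the dictionary. The sketch below assumes the dictionary has been reconstructed as a map keyed by prime index and that the composites are stored as decimal text, matching the encoding sketch above; these are illustrative assumptions, not the authors' file format. Note that trial division yields the factors in ascending prime order, so within a single group this sketch emits the symbols in dictionary order rather than in their original positions.

#include <istream>
#include <ostream>
#include <map>
#include <string>

void decode(const std::map<unsigned long long, std::string>& symbolOf,
            std::istream& in, std::ostream& out) {
    unsigned long long count;
    while (in >> count) {                                      // step 2: read a composite number
        std::map<unsigned long long, std::string>::const_iterator it;
        for (it = symbolOf.begin(); it != symbolOf.end() && count > 1; ++it) {
            while (count % it->first == 0) {                   // step 3: extract a prime factor
                out << it->second;                             // step 4: write the indexed symbol
                count /= it->first;                            // remove the factor and repeat
            }
        }
    }
}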
Graph 1.
Table 3.

File Name     Original File Size    Huffman Compression    LZW Compression       Proposed Algorithm
              (in bytes)            (in bytes)             (in bytes)            (in bytes)
File1.txt     762,839               450,304 (40.97%)       369,630 (51.55%)      414,590 (45.65%)
File2.txt     1,059,851             625,175 (41.02%)       518,013 (51.12%)      552,350 (47.88%)
File3.txt     1,221,084             725,342 (40.60%)       623,487 (48.93%)      631,238 (48.31%)
File4.txt     1,372,074             814,325 (40.65%)       692,485 (49.53%)      703,736 (48.71%)
File5.txt     1,492,876             883,036 (40.85%)       751,812 (49.64%)      761,665 (48.98%)
File6.txt     1,601,578             948,706 (40.76%)       781,885 (51.18%)      810,510 (49.39%)
File7.txt     1,692,885             1,003,919 (40.10%)     876,123 (48.24%)      854,718 (49.51%)
File8.txt     1,750,460             1,036,622 (40.78%)     904,287 (48.34%)      879,606 (49.75%)
File9.txt     1,850,340             1,094,476 (40.85%)     951,814 (48.56%)      927,205 (49.89%)
File10.txt    1,940,297             1,150,491 (40.71%)     1,006,812 (48.11%)    970,090 (50.00%)
5. Experimental Results
The developed Algorithm has been simulated in C++ in Visual Studio 6. The input files are text files (.txt files).
Files verified are of different sizes ranging from 50 KB to 2 MB. The compression ratios are tabularized in Table 3.
Experimental outcomes show that irrespective of the file size, the recommended algorithm tends to achieve a stable
compression ratio. A chart given below displays the stability of the compression ratio achieved irrespective of the
file size. The experimental results obtained concur with the theoretical calculations and results found in Section 3.
Additionally, the text files rebuilt from the compressed file have the same file size as the original. Therefore the
proposed algorithm portrays lossless compression behavior. Here, the results of Compression ratio performance are
displayed using the graphical figure Graph 1.
Here, calculated compression ratios for the different size of text files have been tabulated. The data has been shown
in the table Table 3.
6. Conclusion
In this paper, a new compression method has been proposed for text files. The benefit of this compression algorithm is its consistency: the compression ratio remains close to 50% (approximately) regardless of the contents of the file. Furthermore, the formation of the active dictionary enables memory to be handled efficiently with a suitable data structure.
7. Acknowledgements
First, we would like to thank Professor Subarna Bhattacharjee [10] for her valuable advice, support and constant encouragement. Her constant assessments and reviews gave us the much needed theoretical clarity. We owe a great deal to all the faculty members of the Department of Computer Science and Engineering and the Department of Information Technology. We would also like to thank our friends for patiently listening to our explanations; their reviews and comments were exceptionally helpful. And of course, we owe our ability to complete this project to our families, whose love and encouragement have remained our cornerstone.
References
[1] J. Ziv and A. Lempel, “Compression of Individual Sequences via Variable-Rate Coding”, IEEE Transactions on Information Theory, Vol. 24: pp. 530–536, 1978.
[2] Gonzalo Navarro and Mathieu Raffinot, “A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text”, Proc. CPM’99, LNCS 1645, pp. 14–36.
[3] Khalid Sayood, “An Introduction to Data Compression”, Academic Press, 1996.
[4] David Salomon, “Data Compression: The Complete Reference”, Springer Publication, 2000.
[5] Mark Nelson and Jean-Loup Gailly, “The Data Compression Book”, Second Edition, M&T Books.
[6] M. Atallah and Y. Genin, “Pattern matching text compression: Algorithmic and empirical results”, International Conference on Data Compression, Vol. II: pp. 349–352, Lausanne, 1996.
[7] Timothy C. Bell, “Text Compression”, Prentice Hall Publishers, 1990.
[8] Ranjan Parekh, “Principles of Multimedia”, Tata McGraw-Hill Companies, 2006.
[9] Debashis Chakraborty, Sandipan Bera, Anil Kumar Gupta and Soujit Mondal, “Efficient Data Compression using Character
Replacement through Generated Code”, IEEE NCETACS 2011, Shillong, India, March 4–5, 2011.
[10] S. Bhattacharjee, J. Bhattacharya, U. Raghavendra, D. Saha and P. Pal Chaudhuri, “A VLSI architecture for cellular automata
based parallel data compression”, IEEE-2006, Bangalore, India, January 03–06.
Mr. Debashis Chakraborty received the B.Sc. degree with Honours in Computer Science and M.Sc. degree in
Computer and Information Science from the University of Calcutta, West Bengal, India, in 2001 and 2003, respectively. He obtained the M.Tech. degree in Computer Science and Engineering, from Vinayaka Mission University,
Salem, Tamilnadu, India, in 2007. At present he is pursuing his Ph.D. in the field of Data and Image Compression
from University of Calcutta, West Bengal, India. Mr. Chakraborty is a Lecturer in the department of Computer Science
and Engineering, St. Thomas’ College of Engineering and Technology, Kolkata, West Bengal, India. He has authored
or co-authored over six conference papers in the area of Data and Image Compression.