Efficient Data Compression Using Prime Numbers

Debasish Chakraborty1,∗, Snehasish Kar2 and Kalyan Guchait3

1 Dept. of Computer Science & Engineering, Assistant Professor, St. Thomas’ College of Engineering & Technology, Kolkata, West Bengal, India.
2 Dept. of Information Technology, Grad. Student, St. Thomas’ College of Engineering & Technology, Kolkata, West Bengal, India.
3 Dept. of Information Technology, Grad. Student, St. Thomas’ College of Engineering & Technology, Kolkata, West Bengal, India.
e-mail: 1 sunnydeba@gmail.com

Abstract. In this paper, we propose a new methodology for data compression and decompression using prime numbers. The algorithm takes strings of four characters from a file one at a time, arranges them according to frequency and assigns a prime number to each. The strings are then rescanned and the prime indices of a set of strings are multiplied to create a number, which replaces those strings in the file. This encoding method shows a consistent compression ratio irrespective of the content of the text file.

Keywords: Lossless compression, Bit saving, Prime numbers.

1. Introduction

Data compression deals with representing data using fewer bits than the original data requires [1,3–5]. The basic aim of data compression is to reduce the size of the data. This is generally done to reduce the memory required to store the data or the bandwidth required to transfer the data over a network. Since we tend to store old data rather than delete it, data compression gives us the option of storing more data than would normally be possible. Data compression is carried out using encoding schemes designed to reduce the size of the data. Every encoding scheme is accompanied by a decoding scheme, which may be used to get back the original data from the compressed data [2,6,7].
The decoding scheme is unique to the encoding scheme and may only be used on data compressed by that encoding scheme. Different compression schemes produce different degrees of compression and differ in implementation complexity.

Data compression techniques may be classified into two categories: 1. Lossless compression, 2. Lossy compression [8]. A lossless technique exploits statistical redundancy to reduce the size of the data. Because every bit of data is preserved, the original data may be retrieved exactly from the compressed file upon decompression. Lossy compression refers to techniques which remove or round off redundant data in order to achieve compression. It is used on data where some loss is acceptable, with the limits determined by human perception. Lossy techniques achieve better compression, but the decompressed data differs from the original data. Lossy compression is generally used on image or video files, where a small loss of data goes unnoticed. In contrast, text files, particularly files containing computer programs, may be rendered valueless if even one bit is modified. Such files should only be compressed using lossless compression techniques.

Here, our chief aim is to implement a text compression algorithm. The proposed lossless technique aims to attain a compression ratio close to 50% or more, irrespective of the data present in the text files. Our approach is to take 10 text files and compare the file sizes before and after compression. The compression ratio, given by (1 − size of compressed file / size of original file) × 100%, is calculated for each file and compared.

∗ Corresponding author
© Elsevier Publications 2013.

The structure followed in the rest of the paper is as follows. Section 2 summarizes the proposed algorithm strategy.
Section 3 provides the calculations. Section 4 contains the algorithm which may be used to implement the technique. Section 5 shows the experimental results and Section 6 contains the conclusion of the paper.

2. Algorithm Strategy

This algorithm defines a lossless strategy which makes use of the prime numbers present in the decimal number system [9]. Prime numbers are numbers which have exactly two factors, 1 and the number itself. So when two or more prime numbers are multiplied to form a composite number, the only prime factors of the composite number are those primes. Thus the prime factors may be extracted from the composite number without any loss of data or precision. If such prime numbers are used as indices to a dictionary, and the dictionary contains the unique strings present in the file, the entire file may be replaced by a set of numbers created by multiplying prime numbers. Each number contains a set of prime indices which may be extracted from it without any loss of precision. Since 1 is a factor of every number, it cannot serve as an index, i.e. indexing has to start from 2. Since a single large composite number can be represented in fewer bits than the characters of the strings whose prime indices it combines, storing the composite number in place of those strings achieves a certain level of compression.

3. Calculation

The technique may be best explained with the help of an example. Let us consider the string:

THE QUICK SILVER FOX JUMPS OVER THE LAZY DOG

We scan all 4-byte symbols from the file and store them in a dictionary along with their frequency of occurrence. For convenience and better understanding, we replace spaces with the @ symbol.

Table 1.

Element   Frequency
THE@      2
QUIC      1
K@SI      1
LVER      1
@FOX      1
@JUM      1
PS@O      1
VER@      1
LAZY      1
@DOG      1

Arranging the elements according to frequency in descending order and assigning prime numbers to the sorted dictionary, we have:

Table 2.
Element   Frequency   Prime Index
THE@      2           2
QUIC      1           3
K@SI      1           5
LVER      1           7
@FOX      1           11
@JUM      1           13
PS@O      1           17
VER@      1           19
LAZY      1           23
@DOG      1           29

Now, replacing the strings with their prime numbers and multiplying until the 32-bit limit of long, i.e. 2^32 = 4294967296, is reached, we have:

2 * 3 * 5 * 7 * 11 * 13 * 17 * 19 * 2 * 23 * 29 = 12,939,386,460

But as this number exceeds the limit, we multiply only up to 23:

2 * 3 * 5 * 7 * 11 * 13 * 17 * 19 * 2 * 23 = 446,185,740

The next number, 29, goes into the next number set. The file gets replaced by the above numbers, i.e. 446,185,740 and 29.

4. Algorithm

Steps for compression:
1. Scan the file and store all 4-byte symbols present in the file in an array along with their frequency of occurrence in the file. This forms a dictionary.
2. Arrange the symbols in descending order of frequency.
3. Assign prime numbers to the symbols, starting with 2. Store the dictionary in a file.
4. Let count = 1.
5. Rescan the file from the beginning.
6. Scan a symbol from the file and find the corresponding prime number in the array.
7. Multiply count by the prime number.
8. Repeat steps 6 and 7 as long as count stays below the limit of long (4294967296); a prime that would push count past the limit starts the next number.
9. Write the value of count to the file and reset count to 1.
10. Repeat steps 6 to 9 till the end of the file is reached.

Steps for decompression:
1. Scan the compressed file and reconstruct the dictionary.
2. Scan a number from the file and store it in count.
3. Extract the prime numbers from count.
4. Replace the number with symbols from the dictionary, using each prime number as an index, to create a new file.
5. Repeat steps 2 to 4 till the end of the file is reached.

Graph 1.

Table 3.
All sizes are in bytes. The Huffman, LZW and Proposed Algorithm columns give the compressed file size, with the compression ratio in parentheses.

File Name    Original File Size   Huffman Compression   LZW Compression      Proposed Algorithm
File1.txt      762,839              450,304 (40.97%)      369,630 (51.55%)     414,590 (45.65%)
File2.txt    1,059,851              625,175 (41.02%)      518,013 (51.12%)     552,350 (47.88%)
File3.txt    1,221,084              725,342 (40.60%)      623,487 (48.93%)     631,238 (48.31%)
File4.txt    1,372,074              814,325 (40.65%)      692,485 (49.53%)     703,736 (48.71%)
File5.txt    1,492,876              883,036 (40.85%)      751,812 (49.64%)     761,665 (48.98%)
File6.txt    1,601,578              948,706 (40.76%)      781,885 (51.18%)     810,510 (49.39%)
File7.txt    1,692,885            1,003,919 (40.10%)      876,123 (48.24%)     854,718 (49.51%)
File8.txt    1,750,460            1,036,622 (40.78%)      904,287 (48.34%)     879,606 (49.75%)
File9.txt    1,850,340            1,094,476 (40.85%)      951,814 (48.56%)     927,205 (49.89%)
File10.txt   1,940,297            1,150,491 (40.71%)    1,006,812 (48.11%)     970,090 (50.00%)

5. Experimental Results

The developed algorithm has been simulated in C++ in Visual Studio 6. The input files are text files (.txt files). Files verified are of different sizes ranging from 50 KB to 2 MB. The compression ratios calculated for the text files of different sizes are tabulated in Table 3 and displayed graphically in Graph 1. The experimental outcomes show that, irrespective of the file size, the proposed algorithm tends to achieve a stable compression ratio. The experimental results concur with the theoretical calculations found in Section 3. Additionally, the text files rebuilt from the compressed files have the same file size as the originals. Therefore the proposed algorithm exhibits lossless compression behavior.

6. Conclusion

In this paper, a new compression method has been proposed for text files. The benefit of this compression algorithm is its consistency: the compression ratio remains approximately 50% regardless of the contents of the file. Furthermore, building the dictionary dynamically lets us manage memory effectively with a suitable data structure.

7. Acknowledgements

First, we would like to thank Professor Subarna Bhattacharjee [10] for her valuable advice, support and constant encouragement. Her constant assessments and reviews gave us much needed theoretical clarity. We owe a great deal to all the faculty members of the Department of Computer Science and Engineering and the Department of Information Technology. We would also like to thank our friends for patiently listening to our explanations. Their reviews and comments were exceptionally helpful. And of course, we owe our capability to complete this project to our families, whose love and encouragement have remained our cornerstone.

References

[1] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding”, IEEE Transactions on Information Theory, Vol. 24: pp. 530–536, 1978.
[2] Gonzalo Navarro and Mathieu Raffinot, “A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text”, Proc. CPM’99, LNCS 1645, pp. 14–36.
[3] Khalid Sayood, “An Introduction to Data Compression”, Academic Press, 1996.
[4] David Salomon, “Data Compression: The Complete Reference”, Springer Publication, 2000.
[5] Mark Nelson and Jean-Loup Gailly, “The Data Compression Book”, Second Edition, M&T Books.
[6] M. Atallah and Y. Genin, “Pattern matching text compression: Algorithmic and empirical results”, International Conference on Data Compression, Vol. II: pp. 349–352, Lausanne, 1996.
[7] Timothy C. Bell, “Text Compression”, Prentice Hall Publishers, 1990.
[8] Ranjan Parekh, “Principles of Multimedia”, Tata McGraw-Hill Companies, 2006.
[9] Debashis Chakraborty, Sandipan Bera, Anil Kumar Gupta and Soujit Mondal, “Efficient Data Compression using Character Replacement through Generated Code”, IEEE NCETACS 2011, Shillong, India, March 4–5, 2011.
[10] S. Bhattacharjee, J. Bhattacharya, U. Raghavendra, D. Saha and P. Pal Chaudhuri, “A VLSI architecture for cellular automata based parallel data compression”, IEEE-2006, Bangalore, India, January 03–06.

Mr. Debashis Chakraborty received the B.Sc. degree with Honours in Computer Science and the M.Sc. degree in Computer and Information Science from the University of Calcutta, West Bengal, India, in 2001 and 2003, respectively. He obtained the M.Tech. degree in Computer Science and Engineering from Vinayaka Mission University, Salem, Tamilnadu, India, in 2007. At present he is pursuing his Ph.D. in the field of Data and Image Compression at the University of Calcutta, West Bengal, India. Mr. Chakraborty is a Lecturer in the Department of Computer Science and Engineering, St. Thomas’ College of Engineering and Technology, Kolkata, West Bengal, India. He has authored or co-authored over 6 conference papers in the area of Data and Image Compression.
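
The compression steps of Section 4 and the worked example of Section 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the C++ implementation used for the experiments: the names `primes`, `build_dictionary`, `compress` and `factorize` are hypothetical, ties in frequency are assumed to keep first-appearance order (as in Table 2), and a prime that would push the product to the limit is assumed to start the next number. The sketch stops at extracting the prime indices, since the text does not specify how the original order of the symbols is restored from them.

```python
from collections import Counter

LIMIT = 4294967296   # limit of long quoted in Section 3 (2**32)
SYMBOL_SIZE = 4      # 4-byte symbols, as in Section 4, step 1


def primes():
    """Yield 2, 3, 5, ... by trial division (adequate for a small sketch)."""
    found = []
    n = 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1


def build_dictionary(text):
    """Steps 1-3: split into symbols, rank by frequency, assign primes."""
    symbols = [text[i:i + SYMBOL_SIZE] for i in range(0, len(text), SYMBOL_SIZE)]
    # most_common() sorts by descending count; equal counts keep
    # first-appearance order (assumption matching Table 2).
    ranked = [sym for sym, _ in Counter(symbols).most_common()]
    return symbols, dict(zip(ranked, primes()))


def compress(text):
    """Steps 4-10: multiply prime indices until the limit would be reached."""
    symbols, index = build_dictionary(text)
    numbers, count = [], 1
    for sym in symbols:
        p = index[sym]
        if count * p >= LIMIT:      # assumed overflow rule, see lead-in
            numbers.append(count)
            count = 1
        count *= p
    if count > 1:
        numbers.append(count)
    return index, numbers


def factorize(n):
    """Decompression step 3: prime factors of n in ascending order."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


text = "THE QUICK SILVER FOX JUMPS OVER THE LAZY DOG".replace(" ", "@")
index, numbers = compress(text)
print(numbers)                 # the two numbers from Section 3
print(factorize(numbers[0]))   # prime indices contained in the first number
```

Run on the example string, the sketch reproduces the two numbers 446,185,740 and 29 given in Section 3, and `factorize` returns the prime indices of Table 2 that were multiplied into the first number.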