Data String Compression using K-Nets

Kumar Sourav
Software Engineer, Computer Science Corporation
DLF IT Park, A-44/45, Sector 62, Noida, UP, India
sourav.revo@gmail.com

Abstract – Data compression reduces the size of given data in order to save space on storage devices and to remove unnecessary redundancy. Data mainly consists of strings of characters and digits. Data compression algorithms are either lossless or lossy. In lossless algorithms no data is lost during operation; in lossy algorithms some data is lost, but the compression achieved is higher. Data can be compressed by exploiting factors such as repetition of characters, dictionaries, and prediction. This paper is about compression of data strings used in various programming languages using the technique of K-Nets, which is a lossless technique. A K-Net is an N-dimensional geometrical figure, generally a square or a rectangle for N=2, that is, in the two-dimensional plane. In this paper we first represent a data string of characters using points in the K-Net plane and then calculate the space saved by this representation for long data strings. The programming language used in this paper for performance analysis and calculating results is Java.

I. INTRODUCTION

Data compression is an important element of programming and storage. Data compression is the process of encoding data so that it takes less storage space or less transmission time than it would if it were not compressed. Compression is possible because we can exploit the redundancies in real-world data. There are two main types of compression techniques: lossless compression and lossy compression. In lossy compression the data is compressed a lot but some data is lost during the process, as in the case of music and video; in lossless compression the data is compressed and no data is lost, as in the case of Word, Excel and text files. In this paper I will mainly focus on lossless data compression techniques, as data string compression using K-Nets is also a lossless technique.

LZ77 and LZ78 are lossless compression techniques based on dictionary coders. Dictionary-based algorithms can have a static dictionary, a semi-adaptive dictionary or an adaptive dictionary. Lempel Ziv is an adaptive dictionary technique. LZ77 maintains a sliding window during compression. This was later shown to be equivalent to the explicit dictionary constructed by LZ78. LZW, a variant of LZ78, was the algorithm of the widely used UNIX file compression utility compress, and is used in the GIF image format. However, the performance of Lempel Ziv is limited by memory and other resources [9]. We can also optimize its performance by using the partial matching technique and running it on faster resources [8, 9]. LZMW, LZAP and LZWL are variants of Lempel Ziv. LZMW searches the input for the longest string already in the dictionary and adds the concatenation of the previous match with the current match to the dictionary. LZAP is a modification of LZMW: instead of adding just the concatenation of the previous match with the current match to the dictionary, it adds the concatenations of the previous match with each initial substring of the current match. LZWL can work with syllables obtained by all algorithms of decomposition into syllables; this algorithm can be used for words too.

Fig 1: Lempel Ziv coding based on the probability of occurrence of characters.

Huffman coding [10, 11, 12] is an entropy encoding algorithm used for lossless data compression. Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code that expresses the most common source symbols using shorter strings of bits than are used for less common source symbols.

Fig 2: Huffman Coding based on probability of occurrence of symbols.
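As an illustration of the Huffman coding described above, the following is a minimal sketch and not code from this paper. It counts symbol frequencies, repeatedly merges the two least frequent nodes, and reads the codes off the resulting tree. The class name HuffmanSketch, the method names, and the tie-breaking by creation order are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical illustration of Huffman code construction (names are assumptions).
final class HuffmanSketch {

    // A tree node: leaves carry a symbol, internal nodes carry two children.
    static final class Node {
        final char symbol;
        final int weight;
        final int order;      // creation order, used only to break ties deterministically
        final Node left;
        final Node right;

        Node(char symbol, int weight, int order, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.order = order;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Builds a prefix code in which frequent symbols get codes no longer than rare ones.
    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }

        PriorityQueue<Node> queue = new PriorityQueue<>((a, b) ->
                a.weight != b.weight
                        ? Integer.compare(a.weight, b.weight)
                        : Integer.compare(a.order, b.order));
        int order = 0;
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), order++, null, null));
        }

        // Merge the two lightest nodes until a single tree remains.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node('\0', a.weight + b.weight, order++, a, b));
        }

        Map<Character, String> codes = new HashMap<>();
        if (!queue.isEmpty()) {
            assign(queue.poll(), "", codes);
        }
        return codes;
    }

    // Walks the tree: a left edge appends 0, a right edge appends 1.
    static void assign(Node node, String prefix, Map<Character, String> codes) {
        if (node.isLeaf()) {
            // A one-symbol alphabet still needs a one-bit code.
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assign(node.left, prefix + "0", codes);
        assign(node.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        // 'a' occurs most often, so it receives the shortest code.
        System.out.println(buildCodes("aaaabbc"));
    }
}
```

For the input "aaaabbc" the most frequent symbol 'a' receives a one-bit code while 'b' and 'c' receive two-bit codes, which is the behaviour the paragraph above describes.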
Run Length Coding is a very simple form of data compression in which runs of data are stored as a single data value and count, rather than as the original run. This is most useful on data that contains many such runs: for example, simple graphic images such as icons, line drawings, and animations. It is not useful with files that don't have many runs, as it could greatly increase the file size. The Kolakoski sequence, Golomb coding and the Look-and-say sequence are based on Run Length Encoding. The Kolakoski sequence [13] is an infinite sequence of symbols {1, 2} which is its own run-length encoding. Alphabets following a geometric distribution have a Golomb code [14] as an optimal prefix code, making Golomb coding highly suitable for situations in which the occurrence of small values in the input stream is significantly more likely than large values. The Look-and-say sequence is the sequence of integers beginning as follows: 1, 11, 21, 1211, 111221, 312211, 13112221, 1113213211. To generate a member of the sequence from the previous member, read off the digits of the previous member, counting the number of digits in groups of the same digit.

Fig 3: Run Length Encoding over a stream of characters.

There are a number of compression techniques to compress data, such as Shannon-Fano Coding, Static Huffman Coding, Lempel-Ziv Codes, Adaptive Codes and many other techniques [1]. Data strings are compressed by removing repetitive characters in the string and by square detection [2, 6], and also on the basis of the frequency of occurrence of the characters. In fact there are many algorithms for string, tree and sequence compression and pattern matching [3, 4, 5] which use different approaches to arrive at a similar compression result.

This paper is about compression of strings using the technique of K-Nets. A K-Net is an N-dimensional figure in which some or all sides represent a character sequence. A group of characters is represented by a point in the K-Net plane. In this way we can encode and decode the string pattern using a simple K-Net implementation in a programming language like Java. This technique is efficient for strings of large size. This paper compares the storage space saved by this technique for a number of strings of different lengths. A K-Net becomes multidimensional as N increases. In this paper K-Nets with N=2 are used most frequently.

None of these techniques guarantees constant string compression, as their compression depends on the type of data given, but string compression using K-Nets guarantees constant compression irrespective of the data given, although it can be made adaptive by combining it with other compression techniques.

The basic structure of K-Nets in different dimensions is described in section II, the point representation of characters in K-Nets in section III, the algorithm to implement K-Nets in section IV, the Java implementation of the algorithm in section V, and the comparison of compression results in section VI.

II. BASIC K-NET STRUCTURE

A K-Net is a geometric structure which can be N-dimensional depending on the requirements. As N increases, the number of characters stored by the K-Net also increases and hence the compression also increases. Programmatically it is easier to implement K-Nets with a lower N than K-Nets with a higher N. There are two types of K-Nets: the half K-Net and the full K-Net. A half K-Net has only half of its sides labeled with characters, and a full K-Net has all of its sides labeled with characters. As the domain of characters by which the string is represented grows, the size of the K-Net also grows.

Fig 4: A half K-Net with N = 2

Figure 4 shows a half K-Net with N=2, which means that half the sides of the K-Net are labeled with characters and that it is a two-dimensional K-Net. The range of the K-Net given is (A-K).
The range can depend on the strings that your program is going to handle.

Fig 5: A full K-Net with N = 2

Figure 5 shows a full K-Net with N=2. In this K-Net all the sides are labeled with characters and represent sets of characters. This K-Net will offer more compression than a half K-Net in some cases, where the characters of a string contain a pattern or symmetry.

Fig 6: A half K-Net with N=3.

Figure 6 shows a half K-Net with N=3. It is a three-dimensional K-Net, half of whose sides are labeled with characters. It can store more information than a two-dimensional K-Net, although its implementation is more complex in programming. Similarly, we can have both half and full K-Nets with higher values of N. In this paper a half K-Net with N=2 is used to discuss the compression results.

III. REPRESENTATION OF CHARACTERS IN K-NETS

The basic idea of using K-Nets is that a set of characters can be represented by a point in the K-Net plane. For example, a point in the K-Net plane may represent "abc" or "af", depending on the K-Net used.

Fig 7: Representation with a half K-Net and N=2.

Figure 7 shows a half K-Net with N=2. In this K-Net every cell is assigned a number which represents a pair of characters from A to K. So the storage required for two characters is replaced by the storage required for one number. In the K-Net, the characters are considered clockwise. For each pair of characters in the string, the number that corresponds to that pair is marked in the K-Net. Suppose we have a string which contains "AA" as a part of it; then the K-Net equivalent of that part is "1".

For a K-Net with N=2:

$$K(n) = \text{``}C_1 C_2\text{''} \qquad (1)$$

where $n$ is a number and $C_1 C_2$ are the two characters that correspond to that number in the K-Net. In more general form,

$$K(n_1 n_2 n_3 \ldots n_k) = \text{``}C_1 C_2 C_3 \ldots C_{2k}\text{''} \qquad (2)$$

Also, along with the K sequence to be printed out, a record is kept which contains the row corresponding to each element. It is used to retrieve the original string from the K-Net.
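The mapping K(n) between a cell number and a pair of characters can be sketched as follows for a half K-Net with N=2 over the range A to K. This is only an illustration: the figure with the actual cell numbering is not reproduced here, so the sketch assumes that cells are numbered row by row starting from 1, which agrees with the statement that "AA" corresponds to 1. The class and method names are hypothetical.

```java
// Hypothetical illustration of K(n) = "C1 C2" for N=2 (row-by-row numbering is an assumption).
final class KNetCellSketch {
    static final char FIRST = 'A';
    static final char LAST = 'K';
    static final int SIDE = LAST - FIRST + 1; // 11 characters in the range A to K

    // Returns the cell number n that stands for the pair c1 c2.
    static int encodePair(char c1, char c2) {
        check(c1);
        check(c2);
        return (c1 - FIRST) * SIDE + (c2 - FIRST) + 1;
    }

    // Returns the pair of characters K(n) that cell n stands for.
    static String decodeCell(int n) {
        if (n < 1 || n > SIDE * SIDE) {
            throw new IllegalArgumentException("cell out of range: " + n);
        }
        int index = n - 1;
        char c1 = (char) (FIRST + index / SIDE); // row selects the first character
        char c2 = (char) (FIRST + index % SIDE); // column selects the second character
        return "" + c1 + c2;
    }

    static void check(char c) {
        if (c < FIRST || c > LAST) {
            throw new IllegalArgumentException("character outside K-Net range: " + c);
        }
    }

    public static void main(String[] args) {
        int n = encodePair('A', 'A');
        System.out.println(n + " <-> " + decodeCell(n));
    }
}
```

Under this numbering every pair over A to K gets a distinct number between 1 and 121, so decoding a cell number always returns the pair it was built from.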
As the value of N increases, the compression ratio also increases.

IV. ALGORITHM FOR K-NET

The algorithm used for K-Net is described below.

Algorithm prerequisites: The user enters a general string like "abc1", which is represented in the form of a K-Net code like "12", thus saving space. The original sequence can be retrieved by using the reverse K-Net process.

Algorithm:

Step 1: Take a string str from the user.
Step 2: Get the degree of the K-Net, which is N.
Step 3: Get the largest character in str, which determines the character range, and build the K-Net using a HashMap that contains the combinations of characters of the length specified by N.
Step 4: Using the HashMap, convert string str, N characters at a time, into the K-Net string kstr.
Step 5: Repeat Step 4 until no more characters are left in string str to convert into kstr.
Step 6: This kstr is the equivalent of str in the K-Net. Use kstr to get the original string back when needed.

V. JAVA IMPLEMENTATION OF K-NET WITH N=2 AND FOR CHARACTERS "ABC"

Here is the code for a sample K-Net with N=2 and for the characters "abc", using JCreator.

CODE:

    import java.util.*;

    public class knet
    {
        // Maps each pair of characters to its two-digit K-Net number (row digit, column digit).
        HashMap<String, Integer> net = new HashMap<String, Integer>();
        String rows = "";   // row digit recorded for each pair; needed for decoding
        String res = "";    // column digit for each pair; this is the K-Net string

        public static void main(String[] args)
        {
            knet obj = new knet();
            String str = "cbabcbacaaaaaaaaabbbbbbcccccaabb";
            String kstring = obj.toKnet(str);
            String original = "";

            System.out.println(kstring);
            original = obj.fromKnet(kstring);
            System.out.println(original);
        }

        // Encodes str two characters at a time; str must have even length.
        public String toKnet(String str)
        {
            String word = "";
            int num = 0;
            int temp = 0;
            int len = str.length();

            // First pass: register every pair of str in the K-Net.
            for (int i = 0; i < len; i += 2)
            {
                word = str.substring(i, i + 2);
                buildnet(word);
            }

            // Second pass: split each pair's number into its column and row digits.
            for (int i = 0; i < len; i += 2)
            {
                word = str.substring(i, i + 2);
                num = net.get(word);
                temp = num % 10;
                res += Integer.toString(temp);
                num = num / 10;
                rows += Integer.toString(num);
            }
            return res;
        }

        // Adds the K-Net number for one pair over the characters "abc".
        public void buildnet(String str)
        {
            if (str.equals("aa")) { net.put("aa", 11); }
            else if (str.equals("ab")) { net.put("ab", 12); }
            else if (str.equals("ac")) { net.put("ac", 13); }
            else if (str.equals("ba")) { net.put("ba", 21); }
            else if (str.equals("bb")) { net.put("bb", 22); }
            else if (str.equals("bc")) { net.put("bc", 23); }
            else if (str.equals("ca")) { net.put("ca", 31); }
            else if (str.equals("cb")) { net.put("cb", 32); }
            else if (str.equals("cc")) { net.put("cc", 33); }
        }

        // Decodes a K-Net string using the rows recorded by toKnet.
        public String fromKnet(String str)
        {
            String temp = "";
            String original = "";
            int len = str.length();
            for (int i = 0; i < len; i++)
            {
                // The row digit gives the first character of the pair.
                temp = "" + rows.charAt(i);
                if (temp.equals("1")) { original += "a"; }
                else if (temp.equals("2")) { original += "b"; }
                else if (temp.equals("3")) { original += "c"; }

                // The column digit gives the second character of the pair.
                temp = "" + str.charAt(i);
                if (temp.equals("1")) { original += "a"; }
                else if (temp.equals("2")) { original += "b"; }
                else if (temp.equals("3")) { original += "c"; }
            }
            return original;
        }
    }

Input: cbabcbacaaaaaaaaabbbbbbcccccaabb
K-Net output: 2223111122233312

Space saved (counting the printed K-Net string only; the rows record is kept separately): (1 - length(output)/length(input)) * 100 = (1 - 16/32) * 100 = 50%.

VI. RESULTS

As we can see, we get a constant compression using K-Net which depends on the value of N rather than on the length of the string. For:

N=2, compression = (1/2) * length of string
N=3, compression = (2/3) * length of string
N=k, compression = ((k-1)/k) * length of string

Fig 8: Line Graph of Compression Percentage as N changes.

Figure 8 shows how the compression percentage changes with N.
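The relation between N and the space saved stated in the results above can be tabulated with a short sketch. It simply evaluates ((N-1)/N) * 100 and assumes, as the results do, that one K-Net symbol takes the same storage as one input character and that the separately kept rows record is not counted. The class and method names are hypothetical.

```java
// Hypothetical helper that evaluates the space saved for a K-Net of degree N.
final class KNetSavingsSketch {

    // Percentage of the input length saved when N characters become one symbol.
    static double spaceSavedPercent(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("N must be at least 1");
        }
        return 100.0 * (n - 1) / n;
    }

    public static void main(String[] args) {
        // Print the saving for a range of N to show how the curve flattens.
        for (int n = 2; n <= 10; n++) {
            System.out.printf("N=%d -> %.1f%%%n", n, spaceSavedPercent(n));
        }
    }
}
```

Going from N to N+1 adds only 100/(N(N+1)) percentage points, so each further increase in N contributes less than the one before.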
As the value of N increases the compression percentage also increases, although it increases slowly for higher values of N. So we have to choose the value of N which is optimal for our program.

VII. CONCLUSION

K-Nets provide a strong implementation which can give constant compression for strings. This technique exploits the fact that data strings can be stored in a map data structure and can be represented as alternative symbols. We can get the original data back by reversing the process by which we compressed the data. Data compression using K-Nets depends on the value of N, which is the dimension of the structure in which the data is stored. As the value of N increases, the compression percentage becomes higher. K-Nets can be implemented to store more data with compression where storage capacity is restricted.

VIII. FUTURE SCOPE

K-Nets can be further refined programmatically for higher compression rates and better performance. We can implement K-Nets for higher values of N programmatically. We can also refine the building of K-Nets to offer more compression. K-Nets can also be applied to data warehouses and other data repositories, where they can achieve a higher rate of compression. This technique can also be used in systems where cryptography and compression are required at the same time, as data compressed through K-Nets is in ciphertext form.

REFERENCES

[1] Debra A. Lelewer and Daniel S. Hirschberg: Data Compression.
[2] Wataru Matsubara, Shunsuke Inenaga, Ayumi Shinohara: An Efficient Algorithm to Test Square-Freeness of Strings Compressed by Balanced Straight Line Programs. Chicago Journal of Theoretical Computer Science, 2010.
[3] Dan Gusfield: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
[4] M. Karpinski, W. Rytter, and A. Shinohara: An efficient pattern-matching algorithm for strings with short descriptions. Nordic Journal of Computing, 4:172–186, 1997.
[5] M. Karpinski, W. Rytter, and A. Shinohara: Pattern-matching for strings with short descriptions. In Proc. Combinatorial Pattern Matching, volume 637 of Lecture Notes in Computer Science, pp. 205–214. Springer-Verlag, 1995.
[6] Alberto Apostolico: Optimal parallel detection of squares in strings. Algorithmica, 8:285–319, 1992.
[7] D.T. Hoang, P.M. Long, J.S. Vitter: Multiple-dictionary compression using partial matching. Data Compression Conference (DCC '95), Proceedings, 28-30 Mar 1995.
[8] Dzung T. Hoang, Philip M. Long, Jeffrey Scott Vitter: Dictionary selection using partial matching. Information Sciences, Volume 119, Issues 1–2, 1 October 1999.
[9] T. Shamoon, C. Heegard: Adaptive update algorithms for fixed dictionary lossless data compressors. Information Theory, 1994. Proceedings, 27 Jun-1 Jul 1994.
[10] Donald E. Knuth: Dynamic Huffman coding. Journal of Algorithms, Volume 6, Issue 2, June 1985.
[11] A. Jas, J. Ghosh-Dastidar, Mom-Eng Ng, N.A. Touba: An efficient test vector compression scheme using selective Huffman coding. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 22, Issue 6, June 2003.
[12] Variations on a theme by Huffman. IEEE Transactions on Information Theory, Volume 24, Issue 6, Nov 1978.
[13] Bertran Steinsky: A Recursive Formula for the Kolakoski Sequence A000002. Journal of Integer Sequences, Vol. 9 (2006).
[14] R. Yu, C.C. Ko, S. Rahardja, X. Lin: Bit-plane Golomb coding for sources with Laplacian distributions. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Proceedings, Volume 4, 6-10 April 2003.