Data String Compression using K-Nets
Kumar Sourav
Software Engineer
Computer Science Corporation
DLF It Park, a-44/45, Sector 62 Noida UP India
sourav.revo@gmail.com
Abstract – Data compression reduces the size of given
data in order to save space on storage devices and to
remove unnecessary redundancy. Data mainly consists of
strings of characters and digits. Data compression
algorithms are either lossless or lossy. In lossless
algorithms no data is lost during operation; in lossy
algorithms some data is lost, but the compression
achieved is higher. Data can be compressed by exploiting
factors such as repetition of characters, dictionaries
and prediction. This paper is about the compression of
data strings used in various programming languages using
the technique of K-Nets, which is a lossless technique.
A K-Net is an N-dimensional geometrical figure, which
for N=2 (the two-dimensional plane) is generally a
square or a rectangle. In this paper we first represent
a data string of characters using points in the K-Net
plane and then calculate the space saved by this
representation for long data strings. The programming
language used in this paper for performance analysis and
for calculating results is Java.
I. INTRODUCTION
Data compression is an important element of
programming and storage. Data compression is the
process of encoding data so that it takes less storage
space or less transmission time than it would if it were
not compressed. Compression is possible because we
can exploit the redundancies in real-world data.
There are two main types of compression techniques:
lossless compression and lossy compression. In the
lossy compression technique the data is compressed a
lot but some data is lost during the process, as in the
case of music and video; in the lossless compression
technique the data is compressed and no data is lost,
as in the case of Word, Excel and text files. In this
paper I will mainly focus on lossless data compression
techniques, as data string compression using K-Nets is
also a lossless technique.
LZ77 and LZ78 are lossless compression
techniques based on dictionary coders. Dictionary-based
algorithms can have a static dictionary, a
semi-adaptive dictionary or an adaptive dictionary.
Lempel-Ziv is an adaptive dictionary technique. LZ77
maintains a sliding window during compression. This was
later shown to be equivalent to the explicit dictionary
constructed by LZ78. The Lempel-Ziv-Welch (LZW) variant
was the algorithm of the widely used UNIX file
compression utility compress, and is used in the GIF
image format. However, the performance of Lempel-Ziv is
limited by memory and other resources [9]. We can also
optimize its performance by using a partial matching
technique and running it on faster resources [8, 9].
LZMW, LZAP and LZWL are variants of Lempel-Ziv.
LZMW searches the input for the longest string already
in the dictionary and adds the concatenation of the
previous match with the current match to the dictionary.
LZAP is a modification of LZMW: instead of adding just
the concatenation of the previous match with the current
match to the dictionary, it adds the concatenations of
the previous match with each initial substring of the
current match. LZWL can work with syllables obtained by
any algorithm of decomposition into syllables; it can
be used for words too.
Fig 1: Lempel Ziv coding based on the probability of
occurrence of characters.
Huffman coding [10, 11, 12] is an entropy
encoding algorithm used for lossless data
compression. Huffman coding uses a specific method
for choosing the representation for each symbol,
resulting in a prefix code that expresses the most
common source symbols using shorter strings of bits
than are used for less common source symbols.
Fig 2: Huffman Coding based on probability of occurrence of
symbols.
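The construction described above can be sketched in Java as follows. This is a minimal illustration, not code from the cited works: the class name HuffmanSketch and its methods are hypothetical, and the symbol frequencies are simply counted from the input string. The two least frequent nodes are merged repeatedly, so frequent symbols end up closer to the root and receive shorter codes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration class; not part of the paper's implementation.
final class HuffmanSketch {
    // Tree node: a leaf carries a symbol, an internal node carries two children.
    static final class Node {
        final int weight;
        final Character symbol;
        final Node left;
        final Node right;

        Node(int weight, Character symbol, Node left, Node right) {
            this.weight = weight;
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }
    }

    // Builds a prefix code from the symbol frequencies of text.
    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getValue(), e.getKey(), null, null));
        }
        // Repeatedly merge the two lightest nodes until one tree remains.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(a.weight + b.weight, null, a, b));
        }
        Map<Character, String> codes = new HashMap<>();
        if (!queue.isEmpty()) {
            assign(queue.poll(), "", codes);
        }
        return codes;
    }

    // Walks the tree: a left edge appends 0, a right edge appends 1.
    static void assign(Node node, String prefix, Map<Character, String> codes) {
        if (node.symbol != null) {
            // A single-symbol input still needs a one-bit code.
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assign(node.left, prefix + "0", codes);
        assign(node.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        // 'a' occurs four times, 'b' twice, 'c' once:
        // 'a' gets a 1-bit code, 'b' and 'c' get 2-bit codes.
        System.out.println(buildCodes("aaaabbc"));
    }
}
```

The exact bit patterns depend on how ties are broken, so only the code lengths should be relied on.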
Run Length Coding is a very simple form of data
compression in which runs of data are stored as a
single data value and count, rather than as the original
run. This is most useful on data that contains many
such runs: for example, simple graphic images such as
icons, line drawings, and animations. It is not useful
with files that don't have many runs as it could greatly
increase the file size. Kolakoski sequence, Golomb
coding and Look-and-say sequence are based on Run
Length Encoding.
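A minimal Java sketch of run-length coding is given below. The class name RunLengthSketch and the count-then-symbol output format are assumptions made for illustration, and the input is assumed to contain no digits so that the output stays unambiguous.

```java
// Hypothetical illustration class; not part of the paper's implementation.
final class RunLengthSketch {
    // Encodes each run as <count><symbol>, e.g. "aaabcc" becomes "3a1b2c".
    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char symbol = input.charAt(i);
            int run = 1;
            // Extend the run while the same symbol repeats.
            while (i + run < input.length() && input.charAt(i + run) == symbol) {
                run++;
            }
            out.append(run).append(symbol);
            i += run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("aaaaaaaabbbb")); // prints 8a4b
        System.out.println(encode("abcd"));         // prints 1a1b1c1d
    }
}
```

The second call shows the weakness noted above: an input without runs doubles in size.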
The Kolakoski sequence [13] is an infinite sequence of
symbols {1, 2} which is its own run-length encoding.
In Golomb coding, alphabets following a geometric
distribution have a Golomb code [14] as an optimal
prefix code, making Golomb coding highly suitable for
situations in which the occurrence of small values in
the input stream is significantly more likely than
large values. The Look-and-say sequence is the
sequence of integers beginning as follows: 1, 11, 21,
1211, 111221, 312211, 13112221, 1113213211. To
generate a member of the sequence from the previous
member, read off the digits of the previous member,
counting the number of digits in groups of the same
digit.
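The generation rule for the Look-and-say sequence can be sketched in Java as below. The class name LookAndSaySketch is hypothetical; each term is produced by run-length encoding the digits of the previous term.

```java
// Hypothetical illustration class; not part of the paper's implementation.
final class LookAndSaySketch {
    // Reads off the digits of term, emitting <count><digit> for each group.
    static String next(String term) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < term.length()) {
            char digit = term.charAt(i);
            int run = 1;
            while (i + run < term.length() && term.charAt(i + run) == digit) {
                run++;
            }
            out.append(run).append(digit);
            i += run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String term = "1";
        // Prints the first eight members listed in the text, one per line.
        for (int k = 0; k < 8; k++) {
            System.out.println(term);
            term = next(term);
        }
    }
}
```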
Fig 3: Run Length Encoding over a stream of characters.
There are a number of compression techniques to
compress data, such as Shannon-Fano coding, static
Huffman coding, Lempel-Ziv codes, adaptive codes
and many others [1]. Data strings are compressed by
removing repetitive characters in the string, by
square detection [2, 6], and also on the basis of the
frequency of occurrence of the characters in it. In
fact there are many algorithms for string, tree and
sequence compression and pattern matching [3, 4, 5]
which use different approaches to arrive at a similar
result of compression. This paper is about compression
of strings using the technique of K-Nets. A K-Net is
an N-dimensional figure in which some or all sides
represent a character sequence. A group of characters
is represented by a point in this K-Net plane. In this
way we can encode and decode the string pattern
using a simple K-Net implementation in a
programming language like Java. This technique is
efficient for strings of large size. A comparison of
the storage space saved for a number of strings of
different lengths using this technique is made in this
paper. A K-Net becomes multidimensional as N
increases. In this paper K-Nets with N=2 are most
frequently used. None of the techniques above
guarantees constant string compression, as their
compression depends on the type of data given, but
string compression using K-Nets guarantees constant
compression irrespective of the data given, and it
can be made adaptive by combining it with other
compression techniques. The basic structure of K-Nets
in different dimensions is described in section II,
the point representation of characters in K-Nets in
section III, the algorithm to implement K-Nets in
section IV, the Java implementation of the algorithm
in section V, and the comparison of compression
results in section VI.
II. BASIC K-NET STRUCTURE
A K-Net is a geometric structure which can be
N-dimensional depending on the requirements. As N
increases, the number of characters stored by the K-Net
also increases and hence the compression also
increases. Programmatically it is easier to implement
K-Nets with a lower N than K-Nets with a higher N.
There are two types of K-Nets: the half K-Net and the
full K-Net. A half K-Net has only half of its sides
labeled with characters and a full K-Net has all its
sides labeled with characters. As the domain of
characters by which the string is represented grows,
the size of the K-Net also grows.
Fig 4: A half K-Net with N = 2
Figure 4 shows a half K-Net with N=2, which
means that half the sides of the K-Net are labeled with
characters and that it is a two-dimensional K-Net. The
range of the K-Net shown is (A-K). The range can
depend on the strings that your program is going to
handle.
Fig 5: A full K-Net with N = 2
Figure 5 shows a full K-Net with N=2. In this
K-Net all the sides are labeled with characters and
represent a set of characters. This K-Net will offer
more compression than a half K-Net in some cases,
where the characters of the string contain a pattern
or symmetry.
Fig 6: A half K-Net with N=3.
Figure 6 shows a half K-Net with N=3. It is a
three-dimensional K-Net, half of whose sides are
labeled with characters. It can store more information
than a two-dimensional K-Net, although its
implementation is more complex in programming.
Similarly, we can have both half and full K-Nets
with higher values of N. In this paper the half K-Net
with N=2 is used to discuss the compression results.
III. REPRESENTATION OF CHARACTERS IN K-NETS
The basic idea of using K-Nets is that a set of
characters can be represented by a point in the K-Net
plane. For example, a point in the K-Net plane may
represent "abc" or "af", depending on the K-Net used.
Fig 7: Representation with a half K-Net and N=2.
Figure 7 shows a half K-Net with N=2. In this
K-Net every cell is assigned a number which represents
a pair of characters from A to K. So the storage
required for two characters is replaced by that of a
single number. In the K-Net, the characters are
considered clockwise. For each pair of characters in
the string, the number in the K-Net that corresponds
to that pair is marked. Suppose we have a string which
contains "AA" as a part of it; then the K-Net
equivalent of that part is "1".
For a K-Net with N=2:
K(n) = "C1 C2"    (1)
where n is a number and C1 C2 are the two characters
that correspond to that number in the K-Net.
In more general form,
K(n1 n2 n3 ... nk) = "C1 C2 C3 ... C2k"    (2)
Along with the K sequence that is printed out, a
record is kept which contains the row corresponding to
each element. It is used to retrieve the original
string from the K-Net. As the value of N increases the
compression ratio also increases.
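Equation (1) can be sketched in Java as a mapping between a pair of characters and a cell number. The sketch assumes that cells are numbered row by row starting from 1, so that "AA" maps to 1 as in the example above; the actual numbering in Figure 7 may differ. The class name KNetCellSketch, the constant ALPHABET and both method names are hypothetical.

```java
// Hypothetical illustration class; the row-by-row numbering is an assumption.
final class KNetCellSketch {
    // Assumed character range of the K-Net (A-K, as in Figure 4).
    static final String ALPHABET = "ABCDEFGHIJK";

    // Maps the pair "C1 C2" to its cell number n, with "AA" mapped to 1.
    static int toCell(char c1, char c2) {
        int row = ALPHABET.indexOf(c1);
        int col = ALPHABET.indexOf(c2);
        if (row < 0 || col < 0) {
            throw new IllegalArgumentException("character outside the K-Net range");
        }
        return row * ALPHABET.length() + col + 1;
    }

    // Inverse mapping K(n): recovers the pair "C1 C2" from the cell number n.
    static String fromCell(int n) {
        int size = ALPHABET.length();
        if (n < 1 || n > size * size) {
            throw new IllegalArgumentException("cell number outside the K-Net");
        }
        int row = (n - 1) / size;
        int col = (n - 1) % size;
        return "" + ALPHABET.charAt(row) + ALPHABET.charAt(col);
    }

    public static void main(String[] args) {
        System.out.println(toCell('A', 'A')); // prints 1
        System.out.println(toCell('K', 'K')); // prints 121
        System.out.println(fromCell(12));     // prints BA
    }
}
```

With this numbering the cell number alone identifies both characters, so no separate row record is needed; the implementation in section V instead keeps the row and the column of each pair apart.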
IV. ALGORITHM FOR K-NET
The algorithm used for K-Net is described below.
Algorithm prerequisites:
The user enters a general string like "abc1", which is
represented in the form of a K-Net code like "12",
thus saving space. The original sequence can be
retrieved by using the reverse K-Net process.
Algorithm:
Step 1: Take a string str from the user.
Step 2: Get the degree of the K-Net, which is N.
Step 3: Determine the range of characters that occur in
str and build the K-Net using a HashMap, which contains
combinations of characters of the length specified by N.
Step 4: Using the HashMap, convert N characters of str
at a time into the K-Net string kstr.
Step 5: Repeat Step 4 until no more characters are left
in str to convert into kstr.
Step 6: kstr is now the equivalent of str in the K-Net.
Use kstr to get the original string back when needed.
V. JAVA IMPLEMENTATION OF K-NET WITH N=2 AND
FOR CHARACTERS "ABC"
Here is the code for a sample K-Net with N=2 and
for the characters "abc", written using JCreator.
CODE :
import java.util.*;

public class knet {
    // Maps each pair of characters to its K-Net cell:
    // tens digit = row, units digit = column.
    HashMap<String, Integer> net = new HashMap<String, Integer>();
    String rows = "";   // row record kept alongside the K-Net string
    String res = "";    // K-Net string (column of each pair)

    public static void main(String[] args) {
        knet obj = new knet();
        String str = "cbabcbacaaaaaaaaabbbbbbcccccaabb";
        String kstring = obj.toKnet(str);
        String original = "";
        System.out.println(kstring);
        original = obj.fromKnet(kstring);
        System.out.println(original);
    }

    // Converts str (even length, characters a-c) into its K-Net string.
    public String toKnet(String str) {
        String word = "";
        int num = 0;
        int temp = 0;
        int len = str.length();
        // Build the K-Net for every pair that occurs in str.
        for (int i = 0; i < len; i += 2) {
            word = str.substring(i, i + 2);
            buildnet(word);
        }
        // Replace each pair by its cell: column to res, row to rows.
        for (int i = 0; i < len; i += 2) {
            word = str.substring(i, i + 2);
            word = net.get(word).toString();
            num = Integer.parseInt(word);
            temp = num % 10;
            res += Integer.toString(temp);
            num = num / 10;
            rows += Integer.toString(num);
        }
        return res;
    }

    // Adds the cell for one pair of characters to the K-Net.
    public void buildnet(String str) {
        if (str.equals("aa")) {
            net.put("aa", 11);
        } else if (str.equals("ab")) {
            net.put("ab", 12);
        } else if (str.equals("ac")) {
            net.put("ac", 13);
        } else if (str.equals("ba")) {
            net.put("ba", 21);
        } else if (str.equals("bb")) {
            net.put("bb", 22);
        } else if (str.equals("bc")) {
            net.put("bc", 23);
        } else if (str.equals("ca")) {
            net.put("ca", 31);
        } else if (str.equals("cb")) {
            net.put("cb", 32);
        } else if (str.equals("cc")) {
            net.put("cc", 33);
        }
    }

    // Rebuilds the original string from the K-Net string and the row record.
    public String fromKnet(String str) {
        String temp = "";
        String original = "";
        int len = str.length();
        for (int i = 0; i < len; i++) {
            // First character of the pair comes from the row record.
            temp = "" + rows.charAt(i);
            if (temp.equals("1")) {
                original += "a";
            } else if (temp.equals("2")) {
                original += "b";
            } else if (temp.equals("3")) {
                original += "c";
            }
            // Second character of the pair comes from the K-Net string.
            temp = "" + str.charAt(i);
            if (temp.equals("1")) {
                original += "a";
            } else if (temp.equals("2")) {
                original += "b";
            } else if (temp.equals("3")) {
                original += "c";
            }
        }
        return original;
    }
}
Input: cbabcbacaaaaaaaaabbbbbbcccccaabb
K-Net output: 2223111122233312
Space saved: (1 - length(output)/length(input))*100 =
(1 - 16/32)*100 = 50%.
VI. RESULTS
As we can see, we get a constant compression using
K-Net which depends on the value of N rather than on
the length of the string.
For: N=2, compression = (1/2) * length of string
N=3, compression = (2/3) * length of string
N=k, compression = ((k-1)/k) * length of string
Fig 8: Line Graph of Compression Percentage as N
changes.
As we can see in Figure 8, as the value of N
increases the compression percentage also increases,
although it increases slowly for higher values of N.
So we have to choose the value of N which is optimal
for our program.
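The trend can be reproduced with a few lines of Java. This is a small sketch assuming the relation stated in this section, namely that a K-Net of degree N saves (N-1)/N of the string length; the class name KNetSavingsSketch and the method percentSaved are hypothetical.

```java
// Hypothetical illustration class; evaluates ((N-1)/N) * 100 for several N.
final class KNetSavingsSketch {
    // Percentage of characters saved when N characters collapse into one symbol.
    static double percentSaved(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("N must be at least 1");
        }
        return (n - 1) * 100.0 / n;
    }

    public static void main(String[] args) {
        // The gain from each extra dimension shrinks: 50, 66.7, 75, 80, ...
        for (int n = 2; n <= 8; n++) {
            System.out.println("N=" + n + ": " + percentSaved(n) + "% saved");
        }
    }
}
```

Going from N=2 to N=3 adds about 16.7 percentage points, while going from N=7 to N=8 adds less than 2, which is the flattening described above.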
VII. CONCLUSION
K-Nets provide a strong implementation which can
give constant compression for strings. This technique
exploits the fact that data strings can be stored in a
map data structure and represented by alternative
symbols. We can get the original data back by
reversing the process by which we compressed the
data. Data compression using K-Nets depends on the
value of N, which is the dimension of the structure in
which the data is stored. As the value of N increases
the compression percentage becomes higher. K-Nets can
be implemented to store more data with compression
where storage capacity is restricted.
VIII. FUTURE SCOPE
K-Nets can be further refined programmatically for
higher compression rates and better performance. We
can implement K-Nets for higher values of N
programmatically. We can also refine the building of
K-Nets to offer more compression. K-Nets can also be
applied to data warehouses and other data repositories,
where they can achieve a higher rate of compression.
This technique can also be used in systems where
cryptography and compression are required at the
same time, as data compressed through K-Nets is in
ciphertext form.
REFERENCES
[1] Debra A. Lelewer and Daniel S. Hirschberg: Data
Compression.
[2] Wataru Matsubara, Shunsuke Inenaga, Ayumi Shinohara:
An Efficient Algorithm to Test Square-Freeness of Strings
Compressed by Balanced Straight Line Programs. Chicago
Journal of Theoretical Computer Science, 2010.
[3] Dan Gusfield: Algorithms on Strings, Trees, and
Sequences. Cambridge University Press, 1997.
[4] M. Karpinski, W. Rytter, and A. Shinohara: An
efficient pattern-matching algorithm for strings with short
descriptions. Nordic Journal of Computing, 4:172–186, 1997.
[5] M. Karpinski, W. Rytter, and A. Shinohara:
Pattern-matching for strings with short descriptions. In Proc.
Combinatorial Pattern Matching, volume 637 of Lecture
Notes in Computer Science, pp. 205–214. Springer-Verlag,
1995.
[6] Alberto Apostolico: Optimal parallel detection of
squares in strings. Algorithmica, 8:285–319, 1992.
[7] D.T. Hoang, P.M. Long, J.S. Vitter: Multiple-dictionary
compression using partial matching. Data Compression
Conference, 1995. DCC '95. Proceedings, 28-30 Mar 1995.
[8] Dzung T. Hoang, Philip M. Long, Jeffrey Scott Vitter:
Dictionary selection using partial matching. Information
Sciences, Volume 119, Issues 1–2, 1 October 1999.
[9] T. Shamoon, C. Heegard: Adaptive update algorithms for
fixed dictionary lossless data compressors. Information
Theory, 1994. Proceedings, 27 Jun-1 Jul 1994.
[10] Donald E. Knuth: Dynamic Huffman coding. Elsevier
Journal of Algorithms, Volume 6, Issue 2, June 1985.
[11] A. Jas, J. Ghosh-Dastidar, Mom-Eng Ng, N.A. Touba: An
efficient test vector compression scheme using selective
Huffman coding. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, Volume 22, Issue 6,
June 2003.
[12] Variations on a theme by Huffman. IEEE Transactions on
Information Theory, Volume 24, Issue 6, Nov 1978.
[13] Bertran Steinsky: A Recursive Formula for the Kolakoski
Sequence A000002. Journal of Integer Sequences, Vol. 9 (2006).
[14] R. Yu, C.C. Ko, S. Rahardja, X. Lin: Bit-plane Golomb
coding for sources with Laplacian distributions. 2003 IEEE
International Conference on Acoustics, Speech, and Signal
Processing (ICASSP '03), Proceedings, Volume 4, 6-10 April
2003.