Network Conscious Text Compression System (NCTCSys)
Nitin Motgi, Amar Mukherjee
School of Electrical Engineering and Computer Science,
University of Central Florida,
Orlando, FL 32816
{nmotgi,amar}@cs.ucf.edu
Abstract
This paper proposes a “Network Conscious Text Compression System” (NCTCSys) to address the problem of transmitting explosively growing volumes of text data on the Internet. NCTCSys can be integrated at the application level into text-based transfer protocols such as FTP, HTTP and SMTP to handle text data transmission. We apply the newly developed LIPT [1] encoding method along with BZIP2 and GZIP to compress text, and the NCTCSys selection algorithm decides which compression method is best suited at the time of transmission. To share a common encoding/decoding dictionary between the client and the server, we use a Dictionary Management Protocol. In a typical scenario of congestion and fluctuating bandwidth, the above method reduces transmission times by 60.24% using LIPT+GZIP and by 85.20% using LIPT+BZIP2.
1. Introduction
During recent years, there has been a profound increase in the amount of global information and information distribution on the Internet, resulting in great demand for efficient text compression on web servers, mail servers, news servers and proxy implementations. Text data accounts for roughly 45% of the total bandwidth on the lossy packet-switched Internet, and it is estimated that this traffic will reach around 27K Gbps on the backbone by 2002 [6]. If we compress this text stream intelligently by 70% on average, we reduce the total bandwidth consumed by text to about 14% and reduce the transfer time of text over the Internet, resulting in lower congestion and increased transfer speed over high-latency networks. With the current trend moving towards multilingual support on the Internet, compression or transformation of data prior to transmission is becoming common, and the methods that support multilingual transmission use dictionaries to compress the data sent over the Internet. In this paper, we propose to develop a Network Conscious Text Compression System (NCTCSys) for transferring English documents. It uses sophisticated compression algorithms proposed for lossless text compression, namely BZIP2 based on the Burrows-Wheeler Transform (BWT) [3], GZIP based on the Lempel-Ziv (LZ) family [12], and LIPT (Length Index Preserving Transformation) [1] for preprocessing English text. Preprocessing achieves higher compression by encoding the input file word by word into a transformed file. The intelligence of NCTCSys lies in sensing a set of network and server parameters in order to choose an appropriate method, balancing the load and performance of the server and network for the transmission of text files. The system finds its application in compressing HTML streams generated by web servers, text messages of emails, news on news servers, and large text manuals (in the form of PDF or PostScript) transferred by FTP or FTP over HTTP.
2. Related works
There have been earlier efforts to apply different compression techniques to text on servers, but only a few techniques, such as deflate [8] and the LZ algorithms [7][12], have been successful. The latest release of IIS 5.0 (Internet Information Services) [13] on Windows 2000 by Microsoft has integrated these methods; it can compress the HTML stream statically and on demand. These methods have been reported to increase server performance by up to 400% and to reduce network traffic significantly. Other services such as email have embedded this compression technique [14] into POP3 and SMTP servers, while news and FTP servers use none of the above methods during transmission. These algorithms have been successful because they are fast and achieve a substantial compression ratio without using a significant amount of resources (memory, I/O and time) on the server. Although the compression they achieve is not high compared to the best available in the literature, they are faster and utilize fewer resources on the server. However, all these systems lack the capability to decide when the data should be compressed, when it should not, and which method should be used to compress it.
3. Design goals

- NCTCSys is designed so that it can be integrated into any text-based transmission system to enhance the transmission rate, bandwidth utilization and load-sensing capabilities.
- Develop simple and easily manageable NCTCSys-compliant client modules that can be integrated into client applications.
- Take a modular approach to the overall development of the system to support future enhancements.
- Provide scalability and a platform to integrate new-generation compression algorithms into the system without any major modifications.
- Provide common-language transmission even for multilingual compression.
The system currently uses BZIP2 or GZIP as front-end or back-end compression engines, thereby decreasing bandwidth usage and transmission time, and it is easily extensible to allow other compression algorithms in the future. If LIPT is used as a pre-processor (for details, see reference [1] in these Proceedings), the encoder uses a pair of static dictionaries. Decompression on the client side needs the same pair of dictionaries used by the server, so, in order to manage the dictionaries between client and server, we propose a Dictionary Management Protocol (DMP) that handles the addition and deletion of words in the client dictionaries whenever a new version of the dictionary is used to preprocess the text on the server. The system also incorporates a caching mechanism to store the most frequently requested file(s) in compressed form. Our approach to testing NCTCSys consisted of two phases. In phase one, we developed a prototype of NCTCSys to observe the relationship between lossless text compression algorithms and their integration into high-level protocols like HTTP, FTP, SMTP, POP3 and NNTP for transmission over the Internet. In this phase, we also attempted to improve some of the processing stages, such as the encoding and decoding process of the LIPT compression technique. Our research demonstrates that with a combination of innovative changes in higher-level protocols and only a small penalty for compressing the text, we can achieve better transfer times and a reduced amount of data transmitted over the network. This paper presents the architecture of NCTCSys, the interfaces and methods used, and preliminary implementation results for transmitting text data over the Internet using the above system. The Canterbury, Calgary [15] and Gutenberg [16] corpora are used as benchmarks to test the performance of the system for text file transmission. Integration of NCTCSys into the FTP service shows an average decrease in bandwidth of 73% (preliminary test results), and the time required for transmitting text data is reduced on average by 60-63% over the corpus.
We make the decision to choose either the GZIP or BZIP2 algorithm for transmitting data depending on parameters such as the current file size, line speed, available bandwidth and congestion; the decision also covers whether to apply the reversible LIPT transformation to the text data. Compression can be applied at different points depending on the type of service into which NCTCSys is integrated: for example, when integrated into the HTTP service, the compression point is on the server for HTML data (which is text), on the SMTP server for mail, on the news server for news, and so on.
As the recent trend moves towards globalization and multilingual support on the Internet, compression capabilities for multiple languages will become an essential component of the Internet. Extending this system to other languages is simple, as we use language dictionaries to transform the text into a sequence of symbols that is easily compressible.
4. NCTCSys architecture description
NCTCSys consists of five main independent modules, namely the compression module, decompression module, cache management module, dictionary management module and network module, coordinated by the NCTCSys controller. The proposed server architecture requires NCTCSys-enabled client modules to extract the information. The same architecture is proposed for NCTCSys clients, with minor modifications in the handling of the local cache. Figure 1 illustrates the structure of NCTCSys.
The compression module compresses text files based on inputs from the network module and the cache management module. Upon compressing a requested file, it stores the compressed file in the cache to facilitate a fast response the next time the file is requested.

Figure 1. Network Conscious Text Compression System (NCTCSys): the compression, decompression, cache management, dictionary management and network modules coordinated by the NCTCSys controller, with a cache of transformed and compressed files and disk storage holding the file(s) and dictionary.
The network module uses the NCTCSys selection algorithm to find the method best suited for transmission at that point in time. The file is then transmitted with header information that facilitates recovery of the file at the receiver.
The decompression module is used to recover files sent by other NCTCSys-compliant nodes. The parameters describing the type of compression and the reversible transformation applied at the source are sent in the header. Based on this header information, the decompression module invokes the corresponding module operations and decompresses the transmitted text stream.
The cache management module manages temporary compressed files on the server. Files compressed by the compression module are placed in the cache for serving similar requests; however, this works only when the files on the server are static. This module also decides which files should be cached and which should be removed from the NCTCSys cache.
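As an illustration of this module's decision logic, the short Python sketch below keeps the most recently requested compressed files and evicts the least recently used one when the cache is full. The class name, the capacity limit and the LRU policy are assumptions made for illustration; the paper does not fix a particular caching policy.

from collections import OrderedDict

class CompressedFileCache:
    """Tiny LRU cache mapping a file name to its compressed bytes (illustrative)."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()      # file name -> compressed data

    def get(self, name):
        if name not in self.entries:
            return None
        self.entries.move_to_end(name)    # mark as most recently used
        return self.entries[name]

    def put(self, name, compressed_data):
        self.entries[name] = compressed_data
        self.entries.move_to_end(name)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used file

# Example: cache a compressed page, then serve it on the next request.
cache = CompressedFileCache(capacity=2)
cache.put("index.html", b"...compressed bytes...")
print(cache.get("index.html") is not None)   # -> True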
The dictionary management module, in conjunction with the Dictionary Management Protocol (DMP), manages the sharing and synchronization of the dictionaries used for the reversible transformation between NCTCSys-compliant nodes. It supports incremental as well as full updates of the dictionary on NCTCSys clients.
5. Selection of algorithm based on network
parameters
The NCTCSysSelection algorithm selects the compression method best suited for transmission based on network and host parameters. It is invoked only when the transmission of a file is initiated, and its decision is based on various factors that characterize the network and the machine on which NCTCSys is running. The network module senses the available bandwidth on the network (A-BW), the current client line speed (C-LS), the number of currently connected clients (CC) and the server load (SL, in terms of memory and processing power). Based on the selection algorithm, NCTCSys creates the appropriate header information to be transmitted along with the compressed file. The header includes the type of compression used, the dictionary used (if any), the dictionary version, and the language of the dictionary. This information facilitates efficient retrieval of the message from the compressed stream.
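As a concrete illustration of such a header, the Python sketch below packs and parses the four fields mentioned above (compression method, dictionary name, dictionary version, dictionary language). The field order, the separator and the specific codes are assumptions made for illustration only; the paper does not define an exact header layout.

def build_header(method, dictionary, version, language):
    """Build a simple text header, e.g. 'LIPT+BZIP2;english.dict;01.05;01'."""
    return ";".join([method, dictionary or "-", version or "-", language or "-"])

def parse_header(header):
    """Recover the fields needed to pick the right decompression path."""
    method, dictionary, version, language = header.split(";")
    return {
        "method": method,            # e.g. GZIP, BZIP2, LIPT+GZIP, LIPT+BZIP2
        "dictionary": None if dictionary == "-" else dictionary,
        "version": None if version == "-" else version,
        "language": None if language == "-" else language,   # 01 = English
    }

# Example round trip for a LIPT+BZIP2 transfer using English dictionary 01.05.
print(parse_header(build_header("LIPT+BZIP2", "english.dict", "01.05", "01")))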
The NCTCSysSelection procedure is as follows:

Procedure NCTCSysSelection (availableBandWidth, serverLoad,
    clientLineSpeed, noOfClients, fileSize : integer);
{The threshold limits for each parameter are set by the user,
 e.g. the maximum file size allowed to be compressed and the
 maximum server load for which this algorithm is valid.}
begin
    if noOfClients > MAX_CLIENTS then
        {Node capacity exceeded; make a silent exit}
        return NoCompression;
    endif;
    if fileSize > MAX_FILE_SIZE then
        {Applying compression is too expensive}
        return NoCompression;
    endif;
    if serverLoad > T_serverLoad then
        {Load exceeds the threshold load on the server:
         do not preprocess for maximum compression;
         use GZIP to compress and transfer}
        return GZIP;
    else
        if clientLineSpeed < T_clientLineSpeed then
            {Client is connected on a slow line, so use
             maximum compression}
            return LIPT+BZIP2;
        else
            if availableBandWidth < T_aBandWidth then
                return BZIP2;
            else
                return LIPT+GZIP;
            endif;
        endif;
    endif;
end;
LIPT+BZIP2 is selected when clients are connected over low-speed lines and the server has enough resources to run this heavier process; LIPT+GZIP is chosen in the remaining cases, when neither the server load, the client line speed nor the available bandwidth forces a different choice. With this selection algorithm we achieve a significant decrease in transmission time and increased performance of the service.
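For readers who prefer executable form, the following Python fragment mirrors the selection procedure above. The threshold names and numeric values are illustrative assumptions, not values given in the paper; a real deployment would tune them per server.

MAX_CLIENTS = 200                  # assumed node capacity
MAX_FILE_SIZE = 50 * 2**20         # assumed limit (50 MB) above which compression is skipped
T_SERVER_LOAD = 0.8                # assumed load threshold (fraction of CPU/memory)
T_CLIENT_LINE_SPEED = 56_000       # assumed slow-line threshold, bits/sec
T_AVAILABLE_BANDWIDTH = 1_000_000  # assumed congestion threshold, bits/sec

def nctcsys_selection(available_bandwidth, server_load, client_line_speed,
                      no_of_clients, file_size):
    """Return the compression method to use, mirroring NCTCSysSelection."""
    if no_of_clients > MAX_CLIENTS:
        return "NoCompression"     # node capacity exceeded, silent exit
    if file_size > MAX_FILE_SIZE:
        return "NoCompression"     # compression would be too expensive
    if server_load > T_SERVER_LOAD:
        return "GZIP"              # busy server: skip LIPT preprocessing
    if client_line_speed < T_CLIENT_LINE_SPEED:
        return "LIPT+BZIP2"        # slow client line: maximum compression
    if available_bandwidth < T_AVAILABLE_BANDWIDTH:
        return "BZIP2"
    return "LIPT+GZIP"

# Example: a lightly loaded server sending a 300 KB page to a dial-up client.
print(nctcsys_selection(2_000_000, 0.3, 33_600, 40, 300_000))  # -> LIPT+BZIP2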
6. Dictionary Management Protocol (DMP)
The dictionary management protocol (DMP) manages the language dictionaries shared between an NCTCSys server and its clients. This protocol is initiated only if the compression algorithm uses LIPT as a pre-processing step; it is not applicable if a stand-alone compression algorithm such as GZIP or BZIP2 is used. Once the requested file has been compressed using the appropriate method, the information used to compress it is sent to the client in the header. The header contains the characteristics of the dictionary used to compress the file, including the dictionary version in the form XX.YY, where XX is the major version and YY the minor version. When a new block of words is added to the dictionary, XX is updated; when words are added to existing blocks, YY is updated. This versioning scheme facilitates common dictionary sharing between client and server. The language of the dictionary is also given in the header as a field LL; if LL = 01, the English dictionary is used. This field can be used to handle dictionaries of different languages.
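To make the versioning concrete, the short Python sketch below parses a dictionary version of the form XX.YY and decides what kind of update a client needs. The function names and the update-type labels ("full", "incremental", "none") are illustrative assumptions; the paper specifies the version format but not how the update type is derived from it.

def parse_version(version):
    """Split a dictionary version string 'XX.YY' into (major, minor)."""
    major, minor = version.split(".")
    return int(major), int(minor)

def update_type(client_version, server_version):
    """Return the kind of dictionary update a client needs (assumed labels)."""
    c_major, c_minor = parse_version(client_version)
    s_major, s_minor = parse_version(server_version)
    if (c_major, c_minor) >= (s_major, s_minor):
        return "none"          # client is up to date (or newer)
    if c_major < s_major:
        return "full"          # new word blocks were added: full update
    return "incremental"       # only existing blocks grew: send the new words

# Example: client holds 01.03, server compressed the file with 01.05.
print(update_type("01.03", "01.05"))   # -> incremental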
Upon receiving the header, the NCTCSys client module makes a version check; if it finds a discrepancy between the local dictionary and the one used for compressing the file, DMP is used to update the dictionary on the client. A sequence of messages is then exchanged between the server and the client during the course of the update. The basic service primitives for DMP are listed below:
NS-OPEN        Establish association with the server.
NS-CLOSE       Release association with the server.
NS-S-ABORT     Server-initiated abort.
NS-C-ABORT     Client-initiated abort.
NS-C-VERSION   Dictionary version of the client.
NS-S-VERSION   Dictionary version of the server.
NS-UTYPE       Update type of the dictionary (information used by the client).
NS-TX-START    Start of a word block for update.
NS-TX-STOP     End of a word block for update.
NS-TX-WL       Length of the words that will be transferred after NS-TX-START.
NS-DATA        Words are sent with this primitive.
NS-ACK         Acknowledgement for NS-DATA and NS-S-VERSION.
The first four primitives are basic to any client/server-modeled application and are used to establish and release the connection with the server. NS-C-ABORT is a release message sent to the server requesting that the connection be aborted; similarly, NS-S-ABORT is a message from the server to the client to abort the current transfer. If this message is received after NS-TX-START but before NS-TX-STOP, all previously received words are discarded and the dictionary on the client is not updated, in order to maintain integrity and make it easy to repeat the sequence. NS-TX-WL is used to negotiate the length of the words that will be transferred within one NS-TX-START/NS-TX-STOP block, while NS-TX-START carries the number of words that will be transferred in that block. NS-C-VERSION specifies the client dictionary version to the server, and NS-S-VERSION specifies the server dictionary version to the client. NS-UTYPE gives the type of update the server will make on the dictionary present on the client. NS-DATA carries the set of words to be transferred as parameters; for each NS-DATA, the server receives an NS-ACK from the client upon reception.
The interaction of client and server when DMP is in action is as follows:

1. Client: Upon receiving the header, extract the version; if there is a version discrepancy, initiate an update and send NS-OPEN.
2. Server: Accept the client connection.
3. Client: Send NS-C-VERSION.
4. Server: Compute the difference of the versions, determine the update type and send NS-UTYPE.
5. Client: Enter dictionary update mode; expect the word length and position pointer in the dictionary.
6. Server: Send NS-TX-WL.
7. Server: Send NS-TX-START (number of words).
8. Server: Send NS-DATA {word list}. Client: Accept the NS-DATA word list and send NS-ACK.
9. Server: Receive the NS-ACK and send further NS-DATA if any remain; when finished, send NS-TX-STOP.
10. Server: Send the server dictionary version (NS-S-VERSION) so that it is updated on the client side, then terminate the connection by sending NS-CLOSE. Client: Update the dictionary version and send NS-ACK.
After the connection has been terminated, the NCTCSys client and server hold a common dictionary; the new dictionary is loaded into memory and decoding and decompression are performed. During the dictionary version check, if the client dictionary version is newer than the server's, no update is performed.
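The sketch below illustrates, in Python, how an NCTCSys client might drive this exchange over an already established connection. The one-text-line-per-primitive encoding, the send_line/recv_line helpers and the length-indexed dictionary layout are all assumptions made for illustration; the paper defines the primitives and their ordering, not a wire format.

def dmp_client_update(conn, local_version, dictionary):
    """Client side of a DMP dictionary update (illustrative only).

    conn is assumed to expose send_line(str) and recv_line() -> str;
    dictionary maps word length -> list of words, mirroring LIPT's
    length-indexed organization.
    """
    conn.send_line("NS-OPEN")
    conn.send_line("NS-C-VERSION " + local_version)
    word_length = None
    pending = []                              # words received in the current block
    while True:
        msg = conn.recv_line()
        if msg.startswith("NS-UTYPE"):
            pass                              # update kind is informational in this sketch
        elif msg.startswith("NS-TX-WL"):
            word_length = int(msg.split()[1])  # length of the words in the next block
        elif msg.startswith("NS-TX-START"):
            pending = []
        elif msg.startswith("NS-DATA"):
            pending.extend(msg.split()[1:])   # words carried by the primitive
            conn.send_line("NS-ACK")
        elif msg.startswith("NS-TX-STOP"):
            dictionary.setdefault(word_length, []).extend(pending)
        elif msg.startswith("NS-S-ABORT"):
            return local_version              # discard the partial block, keep the old dictionary
        elif msg.startswith("NS-S-VERSION"):
            local_version = msg.split()[1]    # adopt the server's version
            conn.send_line("NS-ACK")
        elif msg.startswith("NS-CLOSE"):
            return local_version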
7. Relation between BPC and transmission
times
Let the BPC (bits per character) of the compressed file be BPCcomp and that of the uncompressed file be BPCuncomp. Their ratio Rbpc gives the improvement in BPC obtained by compressing the text data: Rbpc = BPCuncomp / BPCcomp. Similarly, let Tuncomp and Tcomp be the times taken to transfer the uncompressed and compressed text data; their ratio Rtransmission gives the improvement in transmission time for compressed data over uncompressed data: Rtransmission = Tuncomp / Tcomp. If we assume zero losses over the network and a transfer time determined only by the amount of data sent, then Rbpc = Rtransmission.
Our experimental results for transmission over a network involving the Internet show that Rbpc < Rtransmission. The explanation for this inequality is as follows. In packet-switched networks, the file being transmitted is subject to fragmentation based on the transmission capabilities of the underlying network at the source node: large files are broken down into smaller units of size Nd, called packets, before transmission, and each packet is then independently routed to the destination node. Different end-to-end delays are associated with the transmission of each packet over a packet-switched network. Delays such as the Transmission Delay (Td), Propagation Delay (Pd), Packet Processing Delay (PPd) and Queuing Delay (Qd) affect the transmission of each individual packet, and they combine as follows:

Packet Transmission Time (Pt) = Td + Pd + PPd + Qd

- Packet Transmission Time (Pt): the total time required for a packet to travel from the source node to the destination node.
- Transmission Delay (Td): the time required to physically send a packet, from the moment the first bit of the packet is sent until the last bit is sent.
- Propagation Delay (Pd): the time the packet spends on the physical link.
- Packet Processing Delay (PPd): the time required by a router to forward a packet; the router picks a packet from its queue and sends it on a communication link attached to it.
- Queuing Delay (Qd): the waiting time in the queues of the routers.
When there is no congestion in the network, the queuing delay Qd = 0. Delay times over Internet communication links follow an exponential distribution, and such exponential delays can be computed with a simple queuing model; a discussion of these models is outside the scope of this paper. Now consider transmitting a file of size F (F > Nd). The file is broken down by the underlying network into np packets, where np = ceil(F / Nd), so the task of transmitting a file of size F becomes the task of transmitting the np packets that represent it. The total time to transfer the file of size F is therefore the sum of the times to transmit all np packets generated for the file:

TTf = SUM (Td + Pd + PPd + Qd) over all np packets.
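The following Python fragment illustrates this packet-level model. Only the relations np = ceil(F/Nd) and TTf = sum of per-packet delays come from the text; the packet size and the per-packet delay values are invented for illustration, and a constant per-packet delay is assumed for simplicity.

import math

def total_transfer_time(file_size, nd, td, pd, ppd, qd):
    """TTf for a file of file_size bytes split into packets of nd bytes."""
    np_packets = math.ceil(file_size / nd)    # np = ceil(F / Nd)
    per_packet = td + pd + ppd + qd           # Pt = Td + Pd + PPd + Qd
    return np_packets * per_packet

# Assumed numbers: 1500-byte packets, 10 ms of combined delay per packet.
uncompressed = total_transfer_time(2_810_784, 1500, 0.004, 0.003, 0.002, 0.001)
compressed   = total_transfer_time(  801_073, 1500, 0.004, 0.003, 0.002, 0.001)
print(uncompressed, compressed, uncompressed / compressed)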
We also know that the higher the BPC of a text file, the less compression has been achieved: at BPC = 8 the file is uncompressed, and at BPC = 1 the file is compressed to its maximum. Transmission cost grows with BPC; the higher the BPC, the more time is needed to transmit the file, and vice versa.

Let us now consider transmitting the uncompressed file. From the relation above,

Tuncomp = SUM (Td + Pd + PPd + Qd) over all np_uncomp packets,

and similarly, for the compressed file,

Tcomp = SUM (Td + Pd + PPd + Qd) over all np_comp packets.

It is evident from these two equations that the number of packets np_uncomp required to transmit the uncompressed text data is greater than the number np_comp required to transmit the compressed text data: the higher the BPC of a file, the more packets are generated for its transmission, and hence the larger the accumulated delays. Therefore,

BPCuncomp / BPCcomp < {SUM (Td + Pd + PPd + Qd) over all np_uncomp packets} / {SUM (Td + Pd + PPd + Qd) over all np_comp packets},

that is, Rbpc < Rtransmission, which is exactly what we observe in our experiments. For example, with BPCuncomp = 8.0 and BPCcomp = 2.28 (average compression using the BZIP2 algorithm) and the corresponding transmission times Tuncomp = 171.4415 sec and Tcomp = 33.517 sec, we get Rbpc = 8.0/2.28 = 3.51 < Rtransmission = 171.4415/33.517 = 5.115.
8. Experimental results
Table 1 shows some experimental results obtained during preliminary tests of the system, performed on the Canterbury, Calgary and Gutenberg corpora. The TCP/IP network consisted of two nodes, one located in India and the other at the Florida Solar Energy Center, Cocoa Beach, Florida; the NCTCSys server was located at the University of Central Florida. The server was a 360 MHz Sun Microsystems UltraSPARC-IIi machine running SunOS 5.7 Generic_106541-04. Each transmission time is the average of two transmissions, the second starting 1 second after the first completed. We summarize the transmission times for the whole corpus in Table 1; without compression, the transmission time for the corpus is 171.4415 seconds.
The results show that the method used to compress the text definitely has an impact on its transfer rate over the Internet. In exchange for the improvement in transmission time and the reduction in bandwidth, NCTCSys consumes some additional computing power on the server, and it has an overhead of transmitting an approximately 200K dictionary (sent in compressed form) when the client side does not yet have the dictionary used for decompression; any transform dictionaries that are needed are then generated by the client NCTCSys. We are currently working on improving the underlying methods to reduce the load that NCTCSys introduces on the server and on improving the decision algorithm used to select the compression algorithm based on network parameters.
Acknowledgment
This work is supported by the National Science Foundation (NSF), Grant Number IIS-9977336.
Table 1. Transmission time comparison using the Canterbury, Calgary and Gutenberg corpora.

                          GZIP (LZ77)          BZIP2 (BWT)          % improvement over       % improvement over        % improvement of
LIPT transformation       compression (sec)    compression (sec)    no compression: GZIP     no compression: BZIP2     BZIP2 over GZIP
Without LIPT              70.164               33.517               59.07%                   80.45%                    52.23%
With LIPT                 68.164               25.365               60.24%                   85.20%                    62.79%

Percentage improvement with LIPT over without LIPT: 1.97% for GZIP and 5.91% for BZIP2.
References
[1] F. Awan and A. Mukherjee, "LIPT: A Lossless Text Transform to Improve Compression", Proc. International Conference on Information Technology: Coding and Computing, Las Vegas, April 2-4, 2001.
[2] T. C. Bell and A. Moffat, "A Note on the DMC Data Compression Scheme", Computer Journal, Vol. 32, No. 1, 1989, pp. 16-20.
[3] M. Burrows and D. J. Wheeler, "A Block-sorting Lossless Data Compression Algorithm", SRC Research Report 124, Digital Systems Research Center.
[4] D. Clark and D. Tennenhouse, "Architectural Considerations for a New Generation of Protocols", Proc. ACM SIGCOMM '90, Sept. 1990, pp. 201-208.
[5] H. Kruse and A. Mukherjee, "Preprocessing Text to Improve Compression Ratios", Proc. Data Compression Conference, 1998, p. 556.
[6] M. Lottor, Network Wizards, http://www.isc.org/ISC/news/pr2-10-2000.html
[7] RFC 1952, GZIP File Format Specification, Version 4.3.
[8] RFC 1951, DEFLATE Compressed Data Format Specification, Version 1.3.
[9] J. Seward, "On the Performance of BWT Sorting Algorithms", Proc. Data Compression Conference, 2000, pp. 173-182.
[10] L. Slothouber, "A Model of Web Server Performance", StarNine Technologies, Inc.
[11] T. Welch, "A Technique for High Performance Data Compression", IEEE Computer, Vol. 17, No. 6, 1984.
[12] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Trans. Information Theory, IT-23(3), pp. 337-343, May 1977.
[13] Using HTTP Compression on Your IIS 5.0 Web Site, http://www.microsoft.com/TechNet/iis/httpcomp.asp
[14] A. Nand and T. Yu, "Mail Servers with Embedded Data Compression Mechanisms", Data Compression Conference, 1998, p. 566.
[15] Canterbury corpus: http://corpus.canterbury.ac.nz
[16] Gutenberg corpus: http://www.promo.net/pg/