Network Conscious Text Compression System (NCTCSys)

Nitin Motgi, Amar Mukherjee
School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816
{nmotgi,amar}@cs.ucf.edu

Abstract

This paper proposes a "Network Conscious Text Compression System" (NCTCSys) to tackle the problem of transmitting the explosively growing volume of data on the Internet. It can be integrated at the application level into text-based transfer protocols such as FTP, HTTP and SMTP to handle text data transmission. We apply the newly developed LIPT [1] encoding method along with BZIP2 and GZIP to compress text. The NCTCSys selection algorithm chooses the best compression method. To share a common encoding/decoding dictionary between the client and the server we use a Dictionary Management Protocol. In a typical scenario of congestion and fluctuating bandwidth, this approach reduced transmission times by 60.24% using LIPT+GZIP and by 85.20% using LIPT+BZIP2.

1. Introduction

In recent years there has been a profound increase in the amount of global information and in its distribution over the Internet, resulting in great demand for efficient text compression on web servers, mail servers, news servers and proxy implementations. Text data accounts for 45% of the total bandwidth on the lossy packet-switched network (the Internet). It is estimated that this will reach around 27K Gbps on the backbone by 2002 [6]. If we compress this text stream intelligently by 70% on average, we will reduce the share of total bandwidth used for text to 14% and reduce the transfer time of text over the Internet, resulting in lower congestion and faster transfers over high-latency networks. With the current trend towards multilingual support on the Internet, compression or transformation of data prior to transmission is becoming common. All methods that support multilingual transmission use a dictionary to compress the data sent over the Internet.

In this paper, we propose to develop a Network Conscious Text Compression System (NCTCSys) for transferring English documents. It uses sophisticated algorithms proposed for lossless text compression, namely BZIP2, based on the Burrows-Wheeler Transform (BWT) [3]; GZIP, based on the Lempel-Ziv (LZ) family [12]; and LIPT (Length Index Preserving Transformation) [1] for preprocessing English text. Preprocessing achieves higher compression by encoding the input file word by word into a transformed file. The intelligence of NCTCSys lies in sensing a set of network and server parameters in order to choose an appropriate method that balances the load and performance of the server and network when transmitting text files. The system finds application in compressing HTML streams generated by web servers, the text of email messages, news on news servers, and large text manuals (in the form of PDF or PostScript) transferred by FTP or FTP over HTTP.

2. Related works

There have been earlier efforts to apply different compression techniques to text on servers, but only a few techniques, such as deflate [8] and the LZ algorithm [7][12], have been successful. The latest release of IIS 5.0 (Internet Information Server) [13] on Windows 2000 by Microsoft has integrated these methods. It can compress the HTML stream statically and on demand. These methods are reported to increase server performance by 400% and to reduce network traffic significantly. Other services such as email have embedded this compression technique [14] into POP3 and SMTP servers. News and FTP servers use none of the above methods during transmission. These algorithms were successful because they are fast and achieve a substantial compression ratio without using a significant amount of resources (memory, I/O and time) on the server.
The compression achieved by these algorithms is not high compared to the best available in the literature, but they are faster and use fewer resources on the server. However, all these systems lack the capability to decide when to compress the data and when not to, and which method should be used to compress it.

3. Design goals

NCTCSys is designed so that it can be integrated into any text-based transmission system to enhance the transmission rate, bandwidth utilization and load-sensing capabilities. The other design goals are to develop simple and easily manageable NCTCSys-compliant client modules that can be integrated into client applications; to take a modular approach to the overall development so that the system can be enhanced; to provide scalability and a platform for integrating new-generation compression algorithms without major modifications to the system; and to provide common language transmission even for multilingual compression.

The system currently uses BZIP2 or GZIP as front-end or back-end compression engines, thereby decreasing bandwidth usage and transmission time. The system is easily expandable to allow any other compression algorithm in the future. If LIPT is used as a pre-processor (for details, see reference [1] in these Proceedings), the encoder uses a pair of static dictionaries. Decompression on the client side needs the same pair of dictionaries as the server used. To manage the dictionaries between client and server, we propose a Dictionary Management Protocol (DMP) that handles the addition and deletion of words in the clients' dictionaries when a new version of the dictionary is used to preprocess the text on the server. The system also incorporates a caching mechanism to store the most frequently requested files in compressed format.

Our approach to testing NCTCSys consisted of two phases.
In phase one, we developed a prototype of NCTCSys to observe the relationship between lossless text compression algorithms and their integration into high-level protocols such as HTTP, FTP, SMTP, POP3 and NNTP for transmission over the Internet. In this phase, we also attempted to improve some of the processing steps, such as the encoding and decoding processes of the LIPT compression technique. Our research demonstrates that, with a combination of innovative changes in higher-level protocols and only a small penalty for compressing the text, we can achieve a better transfer time and reduce the amount of data transmitted over the network.

This paper covers the architecture of NCTCSys, the interfaces and methods used, and preliminary implementation results for transmitting text data over the Internet using the system. The Canterbury, Calgary [15] and Gutenberg [16] corpora are used as benchmarks to test the performance of the system for text file transmission. Integrating NCTCSys into an FTP service shows an average decrease in bandwidth of 73% (preliminary test results), and the time required to transmit text data is reduced on average by 60-63% over the corpus. We choose between the GZIP and BZIP2 algorithms for transmitting data depending on parameters such as current file size, line speed, available bandwidth and congestion. The decision also covers whether to apply the reversible transformation LIPT to the text data.

Compression can be applied at different points depending on the type of service into which NCTCSys is integrated. For example, when it is integrated into an HTTP service, the compression point is on the server for HTML data (which is text); for mail it is the SMTP server, for news the news server, and so on. As the recent trend moves towards globalization and multilingual support on the Internet, compression capabilities for multiple languages will become an essential component of the Internet.
Extending this system to any language is simple, as we use language dictionaries to transform the language into a set of sequences that are easily compressible.

4. NCTCSys architecture description

NCTCSys consists of five main independent modules, namely the compression module, decompression module, cache management module, dictionary management module and network module, together with the NCTCSys controller. The proposed server architecture requires NCTCSys-enabled client modules to extract the information. The same architecture is proposed for NCTCSys clients, with minor modifications in the handling of the local cache. Figure 1 illustrates the structure of NCTCSys.

[Figure 1. Network Conscious Text Compression System. Components shown: Compression Module, Decompression Module, Cache, Cache Management Module, Dictionary Management Module, Network Module, NCTCSys Controller, disk interface, and Disk Storage holding the file(s), the dictionary, and the transformed and compressed files.]

The compression module compresses text files based on inputs from the network module and the cache management module. After compressing the requested file, it stores the compressed file in the cache to enable a fast response the next time the file is requested. The network module uses the NCTCSys selection algorithm to find the method best suited for transmission at that point in time. The file is then transmitted with header information to facilitate its recovery.

The decompression module is used to recover a file sent by another NCTCSys-compliant system. The type of compression and the reversible transformation used at the source are sent as parameters in the header. Based on the header information, the decompression module invokes the corresponding operation and decompresses the transmitted text stream.

The cache management module manages temporary compressed files on the server. Files compressed by the compression module are placed in the cache to serve similar requests.
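
The interplay between compression, the cache and header-driven decompression described above can be illustrated with a minimal Python sketch. It is not the NCTCSys implementation: the class name CompressedFileCache, the function recover and the dictionary-style header are hypothetical, and LIPT is left out because no implementation of it is assumed here. Only the standard gzip and bz2 modules are used.

```python
import bz2
import gzip

# Method names mirror the paper; the header layout is an assumption.
# LIPT is left out because this sketch assumes no implementation of it.
CODECS = {
    "GZIP": (gzip.compress, gzip.decompress),
    "BZIP2": (bz2.compress, bz2.decompress),
}


class CompressedFileCache:
    """Hypothetical cache of compressed files keyed by (file name, method)."""

    def __init__(self):
        self._entries = {}
        self.hits = 0

    def fetch(self, name, method, text):
        """Return (header, payload), compressing only on the first request."""
        key = (name, method)
        if key in self._entries:
            self.hits += 1
        else:
            compress, _ = CODECS[method]
            header = {"compression": method}
            self._entries[key] = (header, compress(text.encode("utf-8")))
        return self._entries[key]


def recover(header, payload):
    """Decompress a payload with the method named in its header."""
    _, decompress = CODECS[header["compression"]]
    return decompress(payload).decode("utf-8")


document = "network conscious text compression " * 40
cache = CompressedFileCache()
header, payload = cache.fetch("manual.txt", "BZIP2", document)
restored = recover(header, payload)
```

A second request for the same file and method is served from the cache without recompressing, which is the behaviour the cache management module is meant to provide for static files.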
Caching works only when the files on the server are static, however. This module also decides which files should be cached and which should be removed from the NCTCSys cache.

The dictionary management module, in conjunction with the Dictionary Management Protocol (DMP), manages the sharing and synchronization of the dictionaries used for the reversible transformation between NCTCSys-compliant systems. It supports incremental and full updates of the dictionary on NCTCSys clients.

5. Selection of algorithm based on network parameters

The NCTCSysSelection algorithm selects the compression algorithm best suited for transmission, based on a set of network and server parameters. It is invoked only when transmission of a file is initiated. The decision is based on various factors that characterize the network and the machine on which NCTCSys is running. The network module senses the available bandwidth on the network (A-BW), the current client line speed (C-LS), the number of clients currently connected (CC) and the server load (SL), measured in terms of memory and processing power. Based on the selection algorithm, NCTCSys creates appropriate header information to be transmitted along with the compressed file. The header includes the type of compression used, the dictionary used (if any), the dictionary version and the language of the dictionary. This information facilitates efficient retrieval of the message from the compressed stream. The NCTCSysSelection algorithm is as follows:

Procedure NCTCSysSelection (availableBandWidth, serverLoad, clientLineSpeed, noOfClients, fileSize : integer);
{ The threshold limits for each parameter are set by the user, e.g. the maximum file size allowed for compression and the maximum server load for which this algorithm is valid. }
begin
    if noOfClients > MAX_CLIENTS then
        { Node capacity has been exceeded; make a silent exit }
        return NoCompression;
    endif;
    if fileSize > MAX_FILE_SIZE then
        { Applying compression is expensive }
        return NoCompression;
    endif;
    if serverLoad > T_serverLoad then
        { Load exceeds the threshold load on the server: do not preprocess for maximum compression; use GZIP to compress and transfer }
        return GZIP;
    else if clientLineSpeed < T_clientLineSpeed then
        { Client is connected on a slow line, so use maximum compression }
        return LIPT+BZIP2;
    else if availableBandWidth < T_aBandWidth then
        return BZIP2;
    else
        return LIPT+GZIP;
    endif;
end;

LIPT+BZIP2 is selected when the clients are connected on low-speed lines and we have enough resources to initiate this process. LIPT+GZIP is considered when we have less bandwidth for transmission. With this selection algorithm we achieve a significant decrease in transmission time and increased performance of the service.

6. Dictionary Management Protocol (DMP)

The Dictionary Management Protocol (DMP) manages the language dictionaries shared between the NCTCSys server and client. The protocol is initiated only if the compression method uses LIPT as a pre-processing step; it is not applicable when a stand-alone compression algorithm such as GZIP or BZIP2 is used. Once the requested file has been compressed using the appropriate method, the information used to compress it is sent to the client in the header. The header describes the characteristics of the dictionary used to compress the file. It includes the dictionary version in the form XX.YY, where XX is the major version and YY the minor version. When a new block of words is added to the dictionary, XX is updated; when words are added to existing blocks, YY is updated. This versioning scheme facilitates sharing a common dictionary between client and server.
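
As an illustration of the XX.YY versioning scheme, the following Python sketch parses a version field and reports which kind of change separates two dictionary versions. The names DictionaryVersion, parse_version and describe_change are hypothetical, and the textual field format with a single dot separator is an assumption based on the XX.YY notation above.

```python
from typing import List, NamedTuple


class DictionaryVersion(NamedTuple):
    major: int  # XX: changes when a new block of words is added
    minor: int  # YY: changes when words are added to existing blocks


def parse_version(field: str) -> DictionaryVersion:
    """Parse a header field of the assumed textual form 'XX.YY'."""
    major_text, minor_text = field.split(".")
    return DictionaryVersion(int(major_text), int(minor_text))


def describe_change(old: str, new: str) -> List[str]:
    """List the kinds of dictionary change implied by two version fields."""
    before, after = parse_version(old), parse_version(new)
    changes = []
    if after.major != before.major:
        changes.append("new word blocks added")
    if after.minor != before.minor:
        changes.append("words added to existing blocks")
    return changes
```

For example, describe_change("02.05", "02.07") reports only that words were added to existing blocks, since XX is unchanged.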
The language of the dictionary is also given in the header as LL. If LL = 01, the English dictionary is used. This field can be used to handle dictionaries for different languages. Upon receiving the header, the NCTCSys client module performs a version check, and if it finds a discrepancy between the local dictionary and the one used to compress the file, DMP is used to update the dictionary on the client. A sequence of messages is exchanged between the server and the client during the update. The basic service primitives for DMP are as follows:

NS-OPEN        Establish association with server.
NS-CLOSE       Release association with server.
NS-S-ABORT     Server-initiated abort.
NS-C-ABORT     Client-initiated abort.
NS-C-VERSION   Dictionary version of client.
NS-S-VERSION   Dictionary version of server.
NS-UTYPE       Update type of dictionary (information used by client).
NS-TX-START    Start of word block for update.
NS-TX-STOP     Stop of word block for update.
NS-TX-WL       Length of words that will be transferred on NS-TX-START.
NS-DATA        Words are sent with this primitive.
NS-ACK         Acknowledges NS-DATA and NS-S-VERSION.

The first four primitives are basic to any client/server-modeled application. They are used to establish and release the connection with the server. NS-C-ABORT is a release message sent to the server requesting that the connection be aborted. Similarly, NS-S-ABORT is a message from the server to the client to abort the current transfer. If this message is received after NS-TX-START and before NS-TX-STOP, all the words received so far are discarded and the dictionary on the client is not updated, which maintains its integrity and makes it easy to repeat the sequence. NS-TX-WL is used to negotiate the length of the words that will be transferred in one NS-TX-START/NS-TX-STOP block; NS-TX-START carries the number of words that will be transferred in the block. NS-C-VERSION specifies the client dictionary version to the server, and NS-S-VERSION specifies the server dictionary version to the client.
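
The abort rule for a word block can be sketched in Python as follows. This is only an illustration: the class DictionaryUpdateClient and its receive method are hypothetical, primitives are modeled as plain strings, and committing the received words when NS-TX-STOP arrives is an assumption of the sketch rather than something the protocol description states.

```python
class DictionaryUpdateClient:
    """Hypothetical client-side handler for one DMP word-block transfer."""

    def __init__(self, words):
        self.dictionary = list(words)
        self._pending = None  # words received since NS-TX-START, if any

    def receive(self, primitive, words=()):
        """Process one primitive; return the reply primitive or None."""
        if primitive == "NS-TX-START":
            self._pending = []
        elif primitive == "NS-DATA":
            if self._pending is None:
                raise ValueError("NS-DATA received outside a word block")
            self._pending.extend(words)
            return "NS-ACK"
        elif primitive == "NS-TX-STOP":
            if self._pending is None:
                raise ValueError("NS-TX-STOP received without NS-TX-START")
            # Assumption: the block is committed when it closes cleanly.
            self.dictionary.extend(self._pending)
            self._pending = None
        elif primitive == "NS-S-ABORT":
            # Discard everything received in the open block.
            self._pending = None
        else:
            raise ValueError(f"unsupported primitive: {primitive}")
        return None
```

Because received words are held apart from the dictionary until the block closes, an NS-S-ABORT in the middle of a block leaves the client dictionary exactly as it was, and the same sequence can simply be repeated.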
NS-UTYPE gives the type of update the server will make to the dictionary on the client. NS-DATA carries the set of words to be transferred as its parameters. For each NS-DATA, the server receives an NS-ACK from the client upon reception. The interaction between client and server when DMP is in action is shown below.

Sl. No | NCTCSys Client | NCTCSys Server
1 | Upon receiving the header, extract the version. If there is a version discrepancy, initiate the update. Send NS-OPEN. |
2 | | Accept client connection.
3 | Send NS-C-VERSION. |
4 | | Compute the difference between the versions and find the update type. Send NS-UTYPE.
5 | Set the client in update mode for the dictionary. Expect word length and position pointer in dictionary. |
6 | | Send NS-TX-WL.
7 | | Send NS-TX-START (number of words).
8 | Accept the NS-DATA word list. Send NS-ACK. | Send NS-DATA {word list}.
9 | | Receive ACK and send further NS-DATA if any is available. At the end, send NS-TX-STOP.
10 | Update the dictionary version on the client and send NS-ACK. | Send the server dictionary version so that it is updated on the client side: send NS-S-VERSION. Terminate the connection by sending the NS-CLOSE command to the client.

After the connection has been terminated, the NCTCSys client and server have a common dictionary. The new dictionary is then loaded into memory, and decoding and decompression are performed. During the dictionary version check, if the client's dictionary version is later than the server's, no update is made on the server.

7. Relation between BPC and transmission times

Let BPCcomp be the BPC (bits per character) of the compressed file and BPCuncomp that of the uncompressed file. The ratio Rbpc of BPCuncomp to BPCcomp gives the improvement in BPC after the text data has been compressed:

    Rbpc = BPCuncomp / BPCcomp

Tuncomp and Tcomp are the times taken to transfer the uncompressed and the compressed text data. Their ratio Rtransmission gives the improvement in transmission of compressed data over uncompressed data:

    Rtransmission = Tuncomp / Tcomp
When the compression algorithm achieves little compression, and assuming zero losses over the network for both compressed and uncompressed data, we have Rbpc = Rtransmission. Our experimental results for transmission over a network involving the Internet show that Rbpc < Rtransmission. The explanation for this inequality is as follows.

In packet-switched networks, a transmitted file is subject to fragmentation at the source node, based on the transmission capabilities of the underlying network. Large files are broken down before transmission into smaller units of size Nd called 'packets'. Each packet is then routed independently to the destination node. Different end-to-end delays are associated with the transmission of each packet over a packet-switched network: transmission delay (Td), propagation delay (Pd), packet processing delay (PPd) and queuing delay (Qd) all affect the transmission of each individual packet. These delays are related as follows:

    Pt = Td + Pd + PPd + Qd

Packet transmission time (Pt): the total time required for a packet to be sent from the source node to the destination node.
Transmission delay (Td): the time required for a packet to be sent physically, from the moment the first bit of the packet is sent until the last bit is sent.
Propagation delay (Pd): the time during which the packet stays in the physical link.
Packet processing delay (PPd): the time required by a router to forward a packet. The router picks up a packet from its queue and sends it on a communication link attached to it.
Queuing delay (Qd): the waiting time in the routers' queues. When there is no congestion in the network, Qd = 0.

Delay times over Internet communication links follow an exponential distribution. Exponential delays can be calculated with a simple queuing model; discussion of these models is beyond the scope of this paper.
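
The per-packet relation above can be turned into a small Python sketch that estimates the transfer time of a whole file. It makes two simplifying assumptions that are not part of the paper's model: the number of packets is taken as the file size divided by Nd, rounded up, and every packet experiences the same four delays. The names PacketDelays, packet_count and transfer_time are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PacketDelays:
    transmission: float   # Td, seconds
    propagation: float    # Pd, seconds
    processing: float     # PPd, seconds
    queuing: float = 0.0  # Qd, seconds; zero when there is no congestion

    def packet_time(self) -> float:
        """Pt = Td + Pd + PPd + Qd for a single packet."""
        return (self.transmission + self.propagation
                + self.processing + self.queuing)


def packet_count(file_size: int, packet_size: int) -> int:
    """Number of packets of size Nd needed for a file, rounded up."""
    return math.ceil(file_size / packet_size)


def transfer_time(file_size: int, packet_size: int, delays: PacketDelays) -> float:
    """Total time, assuming every packet sees the same delays."""
    return packet_count(file_size, packet_size) * delays.packet_time()
```

Under these assumptions a smaller (compressed) file yields fewer packets and therefore a proportionally shorter transfer; the sketch does not model the congestion-dependent queuing effects that the measurements reflect.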
Now consider transmitting a file of size F (F > Nd). The underlying network breaks the file into np packets, where np = ceil(F / Nd). The task of transmitting a file of size F thus becomes that of transmitting the np packets that represent the file. The total time TTf to transfer the file of size F is therefore the sum of the times to transmit all np packets generated for the file:

    TTf = SUM (Td + Pd + PPd + Qd) over all np packets        (4)

We also know that a higher BPC for a text file means that less compression has been achieved. When BPC = 8 the file is uncompressed, and when BPC = 1 the file is compressed to its maximum. Transmission time increases with BPC: the higher the BPC, the more time is needed to transmit the file, and vice versa.

Now consider transmitting the uncompressed file. From equation (4) we have

    Tuncomp = SUM (Td + Pd + PPd + Qd) over all npuncomp packets

Similarly, for the compressed file we have

    Tcomp = SUM (Td + Pd + PPd + Qd) over all npcomp packets

It is evident from these two equations that the number of packets npuncomp required to transmit the uncompressed text data is greater than the number of packets npcomp required to transmit the compressed text data. The higher the BPC of the file, the more packets are generated for its transmission, and hence the greater the delays associated with transmitting it. From the above statements we can therefore conclude that

    BPCuncomp / BPCcomp < [SUM (Td + Pd + PPd + Qd) over all npuncomp packets] / [SUM (Td + Pd + PPd + Qd) over all npcomp packets]

This is also observed in our experiments. For example, we have BPCuncomp = 8.0 and BPCBZIPcomp = 2.28 (average compression using the BZIP2 algorithm), with corresponding transmission times Tuncomp = 171.4415 sec and TBZIPcomp = 33.517 sec, so that

    Rbpc = 8.0 / 2.28 = 3.51 < Rtransmission = 171.4415 / 33.517 = 5.115

8. Experimental results

Table 1 shows some experimental results obtained during the preliminary test of the system. The test was performed on the Canterbury, Calgary and Gutenberg corpora. The TCP/IP network consisted of two nodes, one located in India and the other at the Florida Solar Energy Center, Cocoa Beach, Florida. The NCTCSys server was located at the University of Central Florida. The server is a 360 MHz UltraSPARC-IIi Sun Microsystems machine running SunOS 5.7 Generic_106541-04. Each transmission time is the average of two transmissions, the second starting 1 sec after the first completed. We summarize the transmission times for the whole corpus in Table 1. Without compression, the transmission time for the corpus is 171.4415 seconds.

The results show that the compression method used has a definite impact on the transfer rate of text over the Internet. Against the improvement in transmission time and the reduction in bandwidth, NCTCSys places some additional computing load on the server. It also has the overhead of transmitting a 200K dictionary when the client does not have the dictionary used for decompression. The dictionary is compressed before it is transmitted. Any transform dictionaries that are needed are generated by the client NCTCSys. We are currently working on improving the underlying methods to reduce the load introduced on the server by NCTCSys, and on improving the decision algorithm used to select the compression algorithm based on network parameters.

Acknowledgment

The work is supported by the National Science Foundation (NSF), Grant Number IIS-9977336.

Table 1. Comparison of transmission times using the Canterbury, Calgary and Gutenberg corpora.
Transformation (LIPT) used/unused | Files transmitted using GZIP (LZ77) compression (sec) | Files transmitted using BZIP2 (BWT) compression (sec) | % improvement over no-compression transmission, GZIP | % improvement over no-compression transmission, BZIP2 | % improvement of BZIP2 over GZIP transmission
Without LIPT transformation | 70.164 | 33.517 | 59.07% | 80.45% | 52.23%
With LIPT transformation | 68.164 | 25.365 | 60.24% | 85.20% | 62.79%
% improvement with LIPT over without LIPT | | | 1.97% | 5.91% |

References

[1] F. Awan and A. Mukherjee, "LIPT: A Lossless Text Transform to Improve Compression", accepted by International Conference on Information Technology: Coding and Computing, Las Vegas, April 2-4, 2001.
[2] T.C. Bell and A. Moffat, "A Note on the DMC Data Compression Scheme", Computer Journal, Vol. 32, No. 1, 1989, pp. 16-20.
[3] M. Burrows and D. J. Wheeler, "A Block-sorting Lossless Data Compression Algorithm", SRC Research Report 124, Digital Systems Research Center.
[4] D. Clark and D. Tennenhouse, "Architecture consideration for a new generation of protocols", Proceedings of ACM SIGCOMM '90, Sept. 1990, pp. 201-208.
[5] H. Kruse and A. Mukherjee, "Preprocessing Text to Improve Compression Ratios", Proc. Data Compression Conference, 1998, p. 556.
[6] M. Lottor, Network Wizards, http://www.isc.org/ISC/news/pr2-10-2000.html
[7] RFC 1952, GZIP file format specification version 4.3.
[8] RFC 1951, Deflate Compressed Data Format Specification version 1.3.
[9] J. Seward, "On the performance of BWT sorting Algorithm", Proceedings of Data Compression Conference, 2000, pp. 173-182.
[10] L. Slothouber, "A Model of Web Server Performance", StarNine Technologies, Inc.
[11] T. Welch, "A Technique for High Performance Data Compression", IEEE Computer, Vol. 17, No. 6, 1984.
[12] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Trans. Information Theory, IT-23(3), pp. 337-343, May 1977.
[13] Using HTTP Compression on your IIS 5.0 Web site, http://www.microsoft.com/TechNet/iis/httpcomp.asp
[14] A. Nand and T. Yu, "Mail Servers with Embedded Data Compression Mechanisms", Data Compression Conference, 1998, p. 566.
[15] Canterbury corpus: http://corpus.canterbury.ac.nz
[16] Gutenberg corpus: http://www.promo.net/pg/