ALGORITHMS TO IMPROVE THE EFFICIENCY OF DATA COMPRESSION AND CACHING ON WIDE-AREA
NETWORKS.
Amar Mukherjee (Principal Investigator)
School of Electrical Engineering and Computer Science,
VLSI Lab.
Contact Information:
Amar Mukherjee
School of Electrical Engineering and Computer Science,
University of Central Florida,
Orlando, FL-32816.
Phone: (407) 823-2763
Fax: (407) 823-5419
Email: amar@cs.ucf.edu
http://vlsi.cs.ucf.edu/director.html
WWW Page: http://vlsi.cs.ucf.edu/datacomp/nsf/report/nsf2000report.html
List of Supported Students
Nan Zhang, Graduate Research Assistant
Fauzia Salim Awan, Graduate Research Assistant
Nitin Jeevan Motgi, Graduate Research Assistant
Project Award Information
Award number: IIS-9977336
Duration: Period of performance of entire project, 10/01/1999-9/30/2002
Current Reporting Period: 10/01/2000-9/30/2001
Title: Algorithms To Improve the Efficiency of Data Compression and Caching on Wide-Area Networks.
Keywords
Data compression, NCTCSys, dictionary, English dictionary, Huffman, lossless text compression, bandwidth, bzip2, PPMD, LIPT, GZIP,
text transformation.
Project Summary
The objective of this research project is to develop new lossless text compression algorithms and software tools that incorporate compression into standard text transmission over the Internet. The approach exploits the natural redundancy of language by encoding text into an intermediate form that increases the context available for compression. The encoding scheme uses dictionaries to correlate words in the text with their transformed counterparts. A comprehensive understanding of the interaction between the encoding schemes and compression algorithms such as bzip2 and PPM is being developed. Algorithm performance is measured in terms of compression metrics, such as compression ratio and encoding and decoding times, and transmission metrics, such as available bandwidth, traffic congestion, and server load. Tools such as the Network Conscious Text Compression System (NCTCSys) are being developed to embed compression into systems that involve text transmission. A dictionary management protocol for the dictionaries used by our compression algorithms is also being developed. The goal of the research is to impact the future of information technology by developing data delivery systems that make efficient use of communication bandwidth. The new lossless text compression algorithms are expected to improve compression ratios by 5% to 10% over the best known pre-existing compression algorithms, which translates into an average reduction of more than 60% in text traffic on the Internet. The experimental research is linked to educational goals through rapid dissemination of results in reports, conference and journal papers, doctoral dissertations, and master's theses, and by transferring the research knowledge into the graduate curriculum. Software tools developed under this grant will be shared via a website.
Goals and Objectives
The major purpose of this project is to develop lossless text compression algorithms that can be used in data delivery systems for efficient utilization of communication bandwidth as well as archival storage. The specific objectives are: to develop new text compression algorithms together with a basic understanding of the interaction between encoding schemes and compression algorithms; to measure the performance of these algorithms taking into account both compression and communication metrics; and to build software tools that incorporate compression into text transmission over the Internet.
Targeted Activities
First Year
During this period, we developed several transformations for pre-processing text to make it more compressible by existing algorithms: most of the available compression algorithms compress the transformed text better than the original. We proposed the following transformations: Star Encoding, where a word is replaced by strings of the character '*', retaining at most two characters of the original word; Fixed Context Length Preserving Transformation (LPT), where strings of '*' characters are replaced by fixed sequences of characters in alphabetic order sharing common suffixes that depend on the lengths of the strings (e.g. 'stuvw' or 'qrstuvw'); Fixed Context Reverse LPT (RLPT), which is the same as LPT with the character sequence reversed; and Shortened-Context LPT (SCLPT), where only the first character of the LPT sequence is kept, which still uniquely identifies it. All of these transforms improve compression performance and uniformly beat almost all of the best available compression algorithms over an extensive text corpus. Along with compression ratios, we also measured the performance of these algorithms in terms of encoding and decoding time and storage overhead.
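The star-encoding idea above can be sketched in a few lines. The dictionary, the code-assignment rule, and the single disambiguating letter below are simplified illustrations, not the exact published scheme; the point is that the transformed text becomes dominated by runs of one symbol that a back-end compressor (gzip, bzip2, PPM) can exploit.

```python
def build_star_codes(words):
    """Assign each dictionary word a same-length code made mostly of '*'.

    The first word of a given length becomes all stars; later words of
    that length keep one trailing letter ('a', 'b', ...) so the mapping
    stays uniquely decodable. A simplification of the published scheme.
    """
    codes, seen_per_len = {}, {}
    for w in words:
        n = seen_per_len.get(len(w), 0)
        seen_per_len[len(w)] = n + 1
        if n == 0:
            codes[w] = '*' * len(w)
        else:
            # keep one original-alphabet letter to disambiguate
            codes[w] = '*' * (len(w) - 1) + chr(ord('a') + n - 1)
    return codes

def star_encode(text, codes):
    # words outside the dictionary pass through unchanged
    return ' '.join(codes.get(w, w) for w in text.split())

def star_decode(text, codes):
    inverse = {c: w for w, c in codes.items()}
    return ' '.join(inverse.get(t, t) for t in text.split())

# toy dictionary and round trip
codes = build_star_codes(['the', 'and', 'cat', 'sat'])
msg = 'the cat sat'
enc = star_encode(msg, codes)          # e.g. '*** **b **c'
assert star_decode(enc, codes) == msg
```

Because encoder and decoder share the same dictionary, the transform is fully reversible; the actual schemes keep the original word lengths so that line and paragraph structure survive the transform.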
Second Year
We developed a new text transformation called LIPT (Length Index Preserving Transformation). In LIPT, the length of the input word and the offset of the word within the dictionary are denoted by letters. Our encoding scheme exploits the recurrence of words of the same length in the English language to create context in the transformed text that the entropy coders can exploit. LIPT achieves some compression at the preprocessing stage itself, while retaining enough context and redundancy for the compression algorithms to give better results.
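A toy version of the LIPT idea is sketched below, assuming a hypothetical dictionary partitioned by word length. The choice of the length letter and the base-26 offset encoding here are illustrative only, not the exact published encoding.

```python
import string

def lipt_encode_word(word, dict_by_len):
    """Replace a word by '*', a letter for its length, and letters for
    its offset in the same-length dictionary (illustrative encoding)."""
    words = dict_by_len.get(len(word), [])
    if word not in words:
        return word                          # unknown words pass through
    offset = words.index(word)
    length_char = string.ascii_lowercase[(len(word) - 1) % 26]
    # encode the offset in base 26 using one or two letters
    hi, lo = divmod(offset, 26)
    suffix = (string.ascii_lowercase[hi - 1] if hi else '') + string.ascii_lowercase[lo]
    return '*' + length_char + suffix

def lipt_decode_token(tok, dict_by_len):
    """Invert lipt_encode_word for tokens produced by it."""
    if not tok.startswith('*'):
        return tok
    length = string.ascii_lowercase.index(tok[1]) + 1
    digits = [string.ascii_lowercase.index(c) for c in tok[2:]]
    offset = digits[-1] + (26 * (digits[0] + 1) if len(digits) == 2 else 0)
    return dict_by_len[length][offset]

# hypothetical shared dictionary, grouped by word length
dict_by_len = {3: ['the', 'and', 'cat'], 5: ['hello', 'world']}
token = lipt_encode_word('cat', dict_by_len)   # '*cc'
assert lipt_decode_token(token, dict_by_len) == 'cat'
```

Because every three-letter word starts with the same '*c' prefix, words of equal length share context in the transformed text, which is precisely the redundancy the entropy coders pick up.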
During this period we also developed infrastructure and tools to integrate the new text compression into Web servers, mail servers, and news servers. Corresponding clients for specific applications were created as part of the tool development. All of this made bandwidth utilization more efficient and reduced the time to transfer text by about 60% on average.
We wrote several papers for presentation at conferences and are in the process of submitting others for publication in journals. We conducted group discussions and wrote annual progress reports.
Third Year
The activities of the second year will continue into the third year, with additional emphasis on theoretical understanding of our proposed transforms. We expect to complete two M.S. theses and one Ph.D. dissertation.
Indication of success
We have discovered a new modeling scheme, LIPT (Length Index Preserving Transform), for pre-processing the input text. This scheme provides faster and better compression than the earlier schemes LPT, RLPT, and SCLPT. It uniformly obtains better results on all the text corpora we tried, with file-size reductions ranging from 0.28% to 19.63%; the average reduction in file size achieved by LIPT over the corpus is 9.47%. LIPT+BZIP2 outperforms the original BZIP2 by 5.3%, LIPT+PPMD outperforms PPMD by 4.5%, and LIPT+GZIP outperforms GZIP by 6.7%. We also compared our method with word-based Huffman coding: our method achieves an average of 2.169 bits per character (BPC) over the corpus, compared to 2.506 BPC for word-based Huffman, an improvement of 13.45%. Transmission-time improvement for transfer of the corpus is 1.97% with LIPT+GZIP over the original GZIP, and 5.91% with LIPT+BZIP2 over BZIP2. Transmission using LIPT+BZIP2 is 42.90% faster than plain GZIP, the current standard for compression.
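The improvement figures quoted above are computed relative to the baseline's value; for example, the word-based Huffman comparison works out as follows:

```python
def improvement(baseline, ours):
    """Percentage improvement of `ours` over `baseline` (smaller is better)."""
    return (baseline - ours) / baseline * 100

# 2.506 BPC for word-based Huffman vs. 2.169 BPC for LIPT-based compression
print(round(improvement(2.506, 2.169), 2))  # 13.45 (% fewer bits per character)
```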
Project Impact
This project will have impact on data delivery systems such as Web servers, mail servers, and news servers, where transferring text data is a primary concern. We expect to develop faster and better lossless text compression algorithms that improve compression ratios by 5% to 10% over the best known algorithms, with minimal degradation in time performance. With this development, a major portion of text data can be compressed before transmission, resulting in efficient utilization of bandwidth within and outside network boundaries. We will develop a Network Conscious Text Compression System (NCTCSys), a pluggable module for existing text transmission systems that improves the transmission of text files over the Internet. Currently, one Ph.D. student (Nan Zhang) and two Masters students (Nitin Motgi and Fauzia S. Awan) are working on the project. Other members of the M5 Research Group at the School of Electrical Engineering and Computer Science, Dr. Kunal Mukherjee and Mr. Tao Tao, made critical comments and observations during the course of this work. The overall effect of these activities is to train graduate students in current research at the forefront of technology.
Professor Amar Mukherjee has been invited to give a number of colloquium talks on text compression at several universities in California, namely the University of California at Santa Cruz, Riverside, Santa Barbara, San Diego, and Davis. He was also invited to give talks at the IBM Almaden Research Center in San Jose, California, and at Oregon State University in Corvallis, Oregon.
We have introduced a graduate-level course, CAP 5515 "Multimedia Compression on the Internet" (http://www.cs.ucf.edu/courses/cap5515/), based on the research we have been conducting on data compression. The course material covers both text and image compression, including content from research supported by the current NSF grant.
Project References
Early papers that established this work are as follows:
1. R. Franceschini and A. Mukherjee, "Data Compression Using Encrypted Text", Proceedings of the Third Forum on Research and Technology Advances in Digital Libraries (ADL '96), May 13-15, 1996, pp. 130-138.
2. H. Kruse and A. Mukherjee, "Data Compression Using Text Encryption", Proceedings of the Data Compression Conference, IEEE Computer Society Press, 1997, p. 447.
3. H. Kruse and A. Mukherjee, "Preprocessing Text to Improve Compression Ratios", Proceedings of the Data Compression Conference, IEEE Computer Society Press, 1998, p. 556.
Project Publications
We have submitted a journal paper and seven conference papers of which three have been accepted as of January 2001. Copies of these
papers will be made available via our project website and also by hard copy.
1. R. Franceschini, H. Kruse, N. Zhang, R. Iqbal and A. Mukherjee, "Lossless, Reversible Transformations that Improve Text Compression Ratios", submitted to IEEE Transactions on Multimedia Systems (June 2000).
2. F. Awan and A. Mukherjee, "LIPT: A Lossless Text Transform to Improve Compression", Proceedings of the International Conference on Information Technology: Coding and Computing, IEEE Computer Society, Las Vegas, Nevada, April 2001.
3. N. Motgi and A. Mukherjee, "Network Conscious Text Compression Systems (NCTCSys)", Proceedings of the International Conference on Information Technology: Coding and Computing, IEEE Computer Society, Las Vegas, Nevada, April 2001.
4. F. Awan, N. Zhang, N. Motgi, R. Iqbal and A. Mukherjee, "LIPT: A Reversible Lossless Text Transform to Improve Compression Performance", Proceedings of the Data Compression Conference, Snowbird, Utah, March 2001.
5. F. Awan, N. Zhang, N. Motgi and A. Mukherjee, "Data Compression: A Key for Digital Archival Storage and Transmission", submitted to the Joint Conference on Digital Libraries (JCDL), sponsored by the Association for Computing Machinery, Washington, D.C., June 24-28, 2001.
6. F. Awan, R. Iqbal and A. Mukherjee, "Length Index Preserving Transform: A Lossless Reversible Text Preprocessor for Bzip2 and PPM", submitted to the IEEE International Symposium on Information Theory, Washington, D.C., 2001.
7. N. Motgi and A. Mukherjee, "High Speed Text Data Transmission over Internet using Compression", submitted to the IEEE International Symposium on Information Theory, Washington, D.C., 2001.
8. N. Zhang and A. Mukherjee, "Dictionary Based Transformations", submitted to the IEEE International Symposium on Information Theory, Washington, D.C., 2001.
Area Background
In the last decade, we have seen an unprecedented explosion of text information transferred through email, Web browsing, and digital library and information retrieval systems. It is estimated that this traffic grows by 100% every year. Text data accounts for about 45% of total Internet traffic, owing to the downloading of web pages and the movement of email, news groups, forums, etc. With the continuously increasing use of the Internet, the efficient use of available resources, in particular hardware resources, has been a primary concern. One way of improving performance is to compress text data faster and better using improved compression methods, ensuring that as little data as possible is sent in response to a client's request. We propose to integrate the new compression schemes into a network-conscious system that senses the traffic on the server on which these algorithms are hosted and makes an appropriate decision about which method to use to compress the information before transmitting it.
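The decision logic of such a network-conscious system might look like the following sketch. The thresholds, method names, and fallback rules are invented for illustration; they are not NCTCSys itself, only one plausible way to trade CPU time against compression ratio.

```python
import bz2
import gzip

def choose_compressor(cpu_load, bandwidth_kbps):
    """Pick a compression method from rough load and bandwidth estimates.

    Hypothetical policy: a saturated server sends raw text; a slow link
    justifies the slower, stronger compressor; otherwise a fast default.
    """
    if cpu_load > 0.9:
        return 'none', lambda data: data      # server saturated: send raw
    if bandwidth_kbps < 128:
        return 'bzip2', bz2.compress          # slow link: best ratio wins
    return 'gzip', gzip.compress              # default: fast and adequate

# example: lightly loaded server on a slow link chooses bzip2
name, compress = choose_compressor(cpu_load=0.3, bandwidth_kbps=64)
payload = compress(b'some text to ship ' * 100)
```

A real deployment would refresh the load and bandwidth estimates periodically and would also have to tell the client which decompressor to apply, e.g. via a header on the transmitted stream.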
Area References
1. I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes, 2nd Edition, Morgan Kaufmann Publishers, 1999.
2. D. Salomon, Data Compression, 2nd Edition, Springer-Verlag, 2000.
3. K. Sayood, Introduction to Data Compression, 2nd Edition, Morgan Kaufmann, 2000.
4. Using Compression on Webservers (IIS 5.0), http://www.microsoft.com/TechNet/iis/httpcomp.asp
5. Compression for HTML Streams, http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html
Potential Related Projects
A research project has been completed, under the supervision of the Principal Investigator, on Wavelet Based Image Compression and
Transmission supported by Intel Corporation. A research proposal on this subject is under consideration by NSF for possible funding. A
research project on hardware implementation of the BWT compression algorithm on FPGA is underway in collaboration with a research
team in Germany (Technical University of Darmstadt).