Cellular short message service using data compaction with bounded character synchronization

ACM Fong¹ and B Fong²
¹ School of Computer Engineering, Nanyang Technological University, Blk N4, #02-37, Nanyang Avenue, Singapore 639798.
² CTS Center, Lucent Technologies, Singapore 469004.
¹ ascmfong@ntu.edu.sg

Abstract: Short Message Service (SMS) has been incorporated into digital cellular phone standards such as the Global System for Mobile Communications (GSM) standard for years. It has also been offered by a number of cellular service providers. However, in many parts of the world, the full potential of SMS has not been realized. In the GSM standard, 140 bytes are used to encode up to 160 alphanumeric characters. This paper presents a data compaction application that increases the number of alphanumeric characters by at least 21%, with an added advantage of automatic character synchronization. It presents practical aspects of the proposal, as well as application variations such as the treatment of statistically time varying and multilingual information sources.

Keywords: Cellular Short message service, source coding, code word synchronization

Introduction
Digital cellular wireless protocols include many user-friendly features such as the Short Message Service (SMS). Not available in earlier analog cellular systems, SMS is a bi-directional service for short messages of up to 140 bytes as defined in the Global System for Mobile Communications (GSM) standard. Short messages are sent in a store-and-forward fashion via SMS centers that are operated by cellular phone network operators. For alphanumeric messages, the 140 bytes of information translate to 160 ASCII characters of 7 bits each (140 × 8 / 7 = 160) for each short message. Key advantages of SMS include confirmation of message delivery (an advantage over paging) and simultaneous transmission with voice, data and fax services.
Significantly, these features are not expected to be incorporated in planned services such as General Packet Radio Service (GPRS). One major disadvantage of SMS is the limit of 160 alphanumeric characters. As one might expect, message length for non-Latin languages, such as Chinese, Japanese, Korean and Arabic, is further constrained to a maximum of 70 characters. While this is often adequate for users to send and receive short text messages (in much the same way as two-way paging), the potential of SMS technology cannot be fully explored. This is an important consideration, particularly if SMS is to be used for such value-added services as prepayment, advertising and electronic commerce, with multilingual capabilities.

Current Status, Trends and Directions
SMS traffic volume is increasing rapidly in Europe, where GSM is most prevalent. It has been reported that there were some 4 billion SMS transactions in the month of March 2000 alone in Europe [1]. Worldwide, however, the same report shows that the total SMS traffic was 5 billion transactions in the same month. This indicates that SMS is not widely used in many parts of the world outside of Europe.

By considering the modes of SMS operation, one can account for the various SMS-related services that are currently available and predict which services are likely to have a major impact on further increases in SMS usage. In addition to point-to-point SMS, a message could also be broadcast. The two basic modes of SMS operation are person-to-person and cell-broadcast mode. The former is like two-way paging, where a sender sends text messages via a data center to a recipient. The latter is for sending messages, such as traffic updates or news updates, from a data center. In addition, messages can also be stored in the SIM card for later retrieval. Based on these modes of SMS operation, first generation SMS data centers typically provide Email services and Information services.
Email services follow from the person-to-person mode of operation. An introduction of Email services typically leads to an increase of 20% in overall SMS transactions [1]. Information services follow from the cell-broadcast mode of operation. These services tend to build up slowly, as third parties are often involved in providing the source information such as world news, financial news and sport news. An introduction of Information services typically leads to an increase of 10% in overall SMS transactions [1].

The next major leap in SMS traffic can only occur if SMS finds widespread commercial applications, such as value-added services like prepayment, advertising and various forms of electronic trading, with multilingual capabilities. To realize the full potential of SMS technology, a larger amount of information must be packed into the available 140 bytes. Variable-length coding (VLC) is often employed to enhance data compression capabilities. This paper focuses on a class of VLC that has demonstrable self-synchronizing properties that also enhance data integrity.

New Proposal
The existing SMS standards already include attempts to increase the amount of information that can be transmitted within the 140-byte constraint. These include concatenation and compression techniques [1]. The authors propose the use of self-synchronizing T-codes for SMS message encoding. Similar approaches to data compaction have been demonstrated in other areas such as computer data communications [2][3][4], moving-picture source coding [5] and paging message encoding [6]. The automatic character synchronization capability of T-codes is well documented and is best demonstrated with an example as shown in Figure 1. The word ‘MESSAGE’ is encoded using Titchener’s character assignment table [7]. That particular assignment table does not appear to be optimal in any way. Rather, it provides a convenient ‘off-the-shelf’ character-to-code mapping.
In general, and in common with other variable-length codes, optimization of T-code selection can only be performed if the statistical nature of the information source is known. From [7], the word ‘MESSAGE’ is encoded as

M       E    S      S      A     G       E
1001111 0000 101101 101101 00101 1010101 0000

Suppose a random error occurs that causes 5 bits to be lost. The bit stream becomes

100111100001011011011010010110101010000

Then the bit stream is decoded as

1001111 0000101101 101101 00101 1010101 0000
M A S A G E

Figure 1 Automatic character synchronization (bit streams are continuous and spaces are inserted for clarity only)

In Figure 1, the word ‘MESSAGE’ is represented using 37 bits (5.28 bits per character). Among the 37 bits, 5 are presumed lost (13.5%) for illustration purposes. Figure 1 demonstrates that synchronization is regained within 2 characters; for a typical message it generally occurs within 1.5 characters [8]. Any error corruption is therefore very localized. This observation is typical of T-codes. The strong tendency for T-codes to resynchronize is embodied in the augmentation process of T-code construction. In addition, the deterministic and bounded character synchronizing properties of T-codes [7,8] mean that once the decoder establishes that an error has occurred, it can determine the point at which resynchronization will occur. Thus, the erroneous bits that are received can be ignored or corrected as appropriate.

Practical Application
It has already been mentioned that the statistical nature of an information source must be known (modelled) to take full advantage of variable-length coding. It is reasonable to assume that a statistical model for the information source is available or obtainable. The source may be general person-to-person English text messages, much like the paging messages that users send. It may also be more specific, e.g. financial market data (stock prices, volume, etc.).
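As an illustration only, a statistical model of this kind could be estimated by counting symbol frequencies in a sample of messages. The Python sketch below computes relative frequencies and the first-order entropy, which is a lower bound on the average code word length of any symbol-by-symbol code, and then estimates how many characters fit in the 140-byte (1120-bit) payload for a given average code word length. The function names and the sample message are hypothetical and do not come from the paper; the value of 5.5 bits per character is the one quoted from [6] in the Conclusion.

```python
import math
from collections import Counter

PAYLOAD_BITS = 140 * 8  # 140-byte SMS payload, as stated in the paper


def symbol_model(text):
    """Return {symbol: relative frequency} estimated from a sample text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}


def entropy_bits(model):
    """First-order entropy of the model, in bits per symbol."""
    return -sum(p * math.log2(p) for p in model.values() if p > 0)


def capacity(avg_bits_per_char):
    """Whole characters that fit in the payload at a given average length."""
    return int(PAYLOAD_BITS // avg_bits_per_char)


if __name__ == "__main__":
    sample = "MEET ME AT THE STATION AT SEVEN"  # hypothetical sample message
    model = symbol_model(sample)
    print(f"entropy of sample: {entropy_bits(model):.2f} bits/char")
    print("capacity at 7 bits/char:", capacity(7))      # fixed-length ASCII
    print("capacity at 5.5 bits/char:", capacity(5.5))  # figure quoted from [6]
```

Under these assumptions the 7-bit case reproduces the 160-character limit, while 5.5 bits per character gives 203 characters, i.e. roughly 27% more characters for a saving of about 21% in bits.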
For example, the letter ‘E’ may be found to appear 14% of the time in any typical English text transmitted by SMS users. On the other hand, stock data typically include many numeric characters, and frequently occurring company names can be encoded using a company-name-to-code map (rather than mapping individual characters to code words). This is similar to ‘canned messages’ used in paging. The proposal presented above does not preclude the possibility of dynamically updating the dictionary if the statistical nature of the source is time varying. This is discussed in the next section.

The required degree of T-augmentation is based on the number of symbols emitted by the information source. From [7], the number of symbols N that can be encoded using a T-code set of augmentation degree Q is given by (1). From (1), the augmentation degree Q that is required for T-encoding of N discrete levels is given by (2).

$N = 2^{Q} + 1$ (1)

$Q \geq \log_2 (N - 1)$ (2)

Thus, a T-augmentation degree of 7 is compatible with the ASCII character set ($2^7 + 1 = 129$ code words for 128 characters). From this, the most suitable T-code subgroup of augmentation degree 7 can be determined (the degree may change for different formats). This establishes the average code word length (ACL) distribution that best matches the source entropy. The next step is to determine the best synchronizing T-code set within the chosen subgroup. This information is obtained from available databases of best T-codes [8]. These can be fine-tuned using a fast algorithm for computing the average synchronization delay (ASD) of T-codes [9,10]. The most suitable T-code set has thus been identified. Finally, a dictionary of code words can be constructed for source encoding. The same dictionary is used for decoding at the receiver.
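The choice of augmentation degree and the dictionary-based encoding and decoding described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `required_degree` simply evaluates (2), and the dictionary holds only the five code words that appear in Figure 1 rather than a full 7th-degree T-code set, which would have to be taken from [7] or [8]. All function and variable names are hypothetical. The decoder relies only on the fact that no code word in this small dictionary is a prefix of another; it does not attempt the error detection or resynchronization analysis discussed earlier.

```python
import math


def required_degree(n_symbols):
    """Smallest integer Q satisfying Q >= log2(N - 1), as in equation (2)."""
    if n_symbols < 2:
        raise ValueError("at least two symbols are needed")
    return math.ceil(math.log2(n_symbols - 1))


# Code words copied from Figure 1 (a small excerpt, not a complete T-code set).
FIGURE1_CODEBOOK = {
    "M": "1001111",
    "E": "0000",
    "S": "101101",
    "A": "00101",
    "G": "1010101",
}


def encode(text, codebook):
    """Concatenate the code word of each character into one bit string."""
    return "".join(codebook[ch] for ch in text)


def decode(bits, codebook):
    """Decode a bit string by matching code words from left to right."""
    inverse = {code: sym for sym, code in codebook.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit stream ends inside a code word")
    return "".join(decoded)


if __name__ == "__main__":
    print("degree for 128 symbols:", required_degree(128))
    bits = encode("MESSAGE", FIGURE1_CODEBOOK)
    print(bits)
    print(decode(bits, FIGURE1_CODEBOOK))
```

For the 128 characters of 7-bit ASCII, `required_degree(128)` evaluates to 7, in agreement with the text, and the same `FIGURE1_CODEBOOK` object serves both the encoder and the decoder.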
The above process is summarized in Figure 2. In the event that the statistical nature of the information source is considered time-invariant, the above operational steps can be performed once at the beginning, so that the dictionary for source encoding and decoding is available before actual coding occurs.

Figure 2 The process of applying T-codes to SMS data compaction (flowchart steps: identify correct subgroup for information source through entropy matching; optimal subgroup found (best sync T-code is found when encoding is optimally efficient for the source); determine best sync T-code set within selected subgroup; use existing database of ‘best’ T-codes or fast ASD algorithms; T-code specification of the form S[0, x1, x2, x3, x4, x5, x6], where $1 \leq x_i \leq 2^{i} + 1$ and all prefixes must be different; build dictionary for selected 7th-degree T-code set (the decoder uses the same dictionary); encode short messages using the T-code dictionary; encoded short messages)

Adaptive T-coding
The need for adaptively changing the T-dictionary for source encoding and decoding arises from information sources that have a time-varying statistical model. For example, at certain times messages may include many telephone numbers, so that numerals occur much more frequently relative to letters of the alphabet than at other times. Another reason for having adaptive T-coding is for multilingual applications. This leads to the consideration of a switching system for multilingual text applications, as shown in Figure 3. Figure 3 depicts a system that has an information source capable of emitting symbols from different alphabets (e.g. English, German, Greek, etc.). In the wider sense, non-Latin languages (e.g. Chinese, Korean and Japanese) may also be considered as sets of symbols.

Figure 3 A switching system for multilingual T-encoding (a multilingual information source emitting symbols $S_{ij}$ is switched among Dictionary 1, Dictionary 2, Dictionary 3, ..., Dictionary m)

Each symbol emitted is denoted by $S_{ij}$, such that it is the ith symbol in alphabet (set of symbols) j. For example, the first alphabet is given by $S_1 = \{S_{11}, S_{21}, \ldots, S_{n1}\}$. There are m alphabets, requiring m different dictionaries. Each dictionary is determined based on the statistical nature of the corresponding language and, in general, the dictionaries can be predetermined and remain unchanged throughout the system operation. This system dynamically switches between the dictionaries according to j.

None of the above discussion on multilingual switching precludes the switching of dictionaries due to changes in statistical nature within any given language. Indeed, it is possible to adapt to such changes dynamically because the best T-codes in terms of efficiency and sync performance can be identified quite rapidly, so a moderately sized buffer would suffice. In the likely event that the statistical nature of a given alphabet can only change within a limited degree of freedom (e.g. only several dictionaries are needed), one could simply extend the arrangement depicted in Figure 3 by switching between more than m dictionaries. For example, if alphabet $S_1$ has 3 dictionaries, then one could introduce dictionaries 1.1, 1.2 and 1.3 in place of dictionary 1. At any rate, redundant (repeated) dictionaries should be avoided to enhance system performance.

Timing Considerations
The simple method of identification [11] makes T-codes particularly suitable for adaptive operations. Thus, following the statistical modelling stage, information can be passed to an adapter to dynamically switch to the most suitable T-code table (or codebook) to be used to encode the symbols. The adaptation and switching of the T-code table will introduce some delay. Thus, an additional stage is needed to add an appropriate amount of time delay t as shown in Figure 4.

Figure 4 Block diagram of a T-code entropy encoder with optional adaptive feature (shown in dotted line) (blocks shown: Textual information, Statistical Model, Adapter, T-codebook Selection, delay t, T-Encoder; signals: Symbols, Compacted data)

If the adapter is not active, a fixed T-code table has to be predetermined based on the statistical nature of the symbols.
That means the most recently selected codebook will be used for T-encoding until the next T-codebook selection occurs. This can occur automatically as a result of the activated adapter (the next time the adapter becomes active), or the codebook can be selected manually by the system designer based on entropy matching of the source. In the latter scenario, the source is taken to be relatively stable in terms of its statistical nature. Entropy matching is done in much the same manner as described above. Most commonly, the system will be used in automatic mode of operation. For automatic operation, the value of t does not need to be fixed. A number of logic gates can be used to control the flow of information as shown in Figure 5. The logic gates in Figure 5 are represented by their corresponding IEEE/ANSI standard symbols. For simplicity, the adapter and T-codebook selector are merged into one unit without loss of generality. It is understood that Adaptation occurs before Code Selection in sequence within the merged unit.

Figure 5 T-code entropy encoder with logic control to minimize processing delay (blocks shown: Textual information, Statistical Model, Adapter / T-codebook Selection, T-Encoder, logic gates marked ‘&’ and ‘1’; logic signals: adptEN, adaptOK, EN; output: Compacted data)

In Figure 5, data signals are shown in solid lines and logic signals are shown in dotted lines. Logic variables are shown in bold. An external input Adaptation Enable ‘adptEN’ is introduced to indicate whether or not adaptation should occur. The merged Adaptation / Codebook Selection process has an additional logic output ‘adaptOK’ to signal the completion of the process. An enable input ‘EN’ is added to the T-encoder such that actual information source coding only occurs when the T-codebook is ready.

Conclusion
A brief description of cellular short message service (SMS) has been presented. Currently, SMS is used mainly for person-to-person communications of short messages and broadcast of timely information from data centers. While SMS is adequate for such applications (and has distinctive advantages over paging), the limited maximum message length means that the potential of SMS cannot be fully realized. It has also been observed that SMS traffic volume is relatively low in regions outside of Europe. The next major increase in SMS traffic volume, as well as widespread acceptance worldwide, will come about when SMS can provide more value-added services such as electronic commerce.

In this paper, the authors have proposed the use of a class of variable-length codes (VLC) to improve the coding efficiency, given the constraint of 140 bytes. The advantage of the new proposal is twofold. First, each alphanumeric character requires, on average in a typical English text, 5.5 bits to represent [6]. This represents a saving of about 21% compared to the 7 bits required for ASCII representation (1 − 5.5/7 ≈ 0.21). For financial and other types of data, savings of almost 40% can be achieved [6]. Further, accurate modelling of the statistical properties of the various information sources can further improve coding efficiency. This compression is lossless and represents a step in the right direction. Being able to pack more information facilitates multilingual and multi-format operations. The second advantage is that automatic synchronization occurs within 1.5 characters as a direct consequence of the T-construction algorithm [8]. Practical aspects of the proposal have also been discussed.

References
[1] Background information about SMS can be found at these URLs: http://www.gsmworld.com/gsmdata http://www.cdg.org http://www.mobilesms.com/ http://www.gsm_pcs.org http://www.itu.int/itudoc/itu-t/rec/index.html
[2] ‘Secure communication system for re-establishing time limited communication between first and second computers before communication time period expiration using new random number’, US pat number 5428745, 6/27/1995.
[3] ‘Boundary markers for indicating the boundary of a variable length instruction to facilitate parallel processing of sequential instructions’, US pat number 5450650, 9/12/1995.
[4] ‘System and method for determining whether to transmit command to control computer by checking status of enable indicator associated with variable identified in the command’, US pat number 5561770, 10/01/1996.
[5] ‘Methods of coding and decoding moving-picture signals, using self-synchronizing variable length codes’, US pat number 5835144, 11/10/1998.
[6] Fong A. C. M. & Quay C., ‘Application of self-synchronizing T-codes to FLEX™ Suite message encoding’, Motorola Technical Developments, Vol. 40, Jan 2000, pp. 68-73.
[7] Titchener M. R., ‘Digital encoding by means of new T-codes to provide improved data synchronization and message integrity’, Proc. IEE Computers & Digital Techniques, Vol. 131, pp. 151-153, 1984.
[8] Higgie G. R., ‘Database of best T-codes’, Proc. IEE Computers & Digital Techniques, Vol. 143, pp. 213-218, 1996.
[9] Fong A. C. M. & Higgie G. R., ‘An improved algorithm for calculating the average synchronization delay of T-codes’, Computers & Industrial Engineering, Vol. 37, pp. 161-164, 1998.
[10] Fong A. C. M. & Higgie G. R., ‘Identification of T-codes that have minimal average synchronization delay’, to appear in IEE, Pt E, Computers & Digital Techniques.
[11] Titchener M. R., ‘Construction and properties of the augmented and binary depletion codes’, Proc. IEE, Pt E, Computers & Digital Techniques, Vol. 132, 1985, pp. 163-169.