Cellular short message service using data compaction with bounded
character synchronization
A C M Fong¹ and B Fong²
¹ School of Computer Engineering, Nanyang Technological University,
Blk N4, #02-37, Nanyang Avenue, Singapore 639798.
² CTS Center, Lucent Technologies, Singapore 469004.
¹ ascmfong@ntu.edu.sg
Abstract: Short Message Service (SMS) has been incorporated into digital cellular phone standards such
as the Global System for Mobile Communications (GSM) standard for years. It has also been offered by a
number of cellular service providers. However, in many parts of the world, the full potential of SMS has
not been realized. In the GSM standard, 140 bytes are used to encode up to 160 alphanumeric characters.
This paper presents a data compaction application that increases the number of alphanumeric characters
by at least 21%, with an added advantage of automatic character synchronization. It presents practical
aspects of the proposal, as well as application variations such as the treatment of statistically time varying
and multilingual information sources.
Keywords: Cellular Short message service, source coding, code word synchronization
Introduction
Digital cellular wireless protocols include many
user-friendly features such as the Short Message
Service (SMS). Not available in earlier analog
cellular systems, SMS is a bi-directional service
for short messages of up to 140 bytes as defined
in the Global System for Mobile Communications
(GSM) standard. Short messages are sent in a
store-and-forward fashion via SMS centers that are
operated by cellular phone network operators. For
alphanumeric messages, the 140 bytes of information
translate to 160 7-bit ASCII characters for each short
message. Key advantages of SMS include confirmation
of message delivery (an advantage over paging) and
simultaneous transmission with voice, data and fax
services. Significantly, these features are not
expected to be incorporated in planned services such
as General Packet Radio Service (GPRS).
One major disadvantage of SMS is the limit of 160
alphanumeric characters per message. Message length
for non-Latin languages, such as Chinese, Japanese,
Korean and Arabic, is further constrained to a maximum
of 70 characters. While this is often adequate for users
to send and receive short text messages (in much
the same way as two-way paging), the full potential of
SMS technology cannot be exploited. This is an
important consideration particularly if SMS is to be
used for such value-added services as prepayment,
advertising and electronic commerce, with
multilingual capabilities.
Current Status, Trends and Directions
SMS traffic volume is increasing rapidly in Europe,
where GSM is most prevalent. It has been reported
that there were some 4 billion SMS transactions in the
month of March 2000 alone in Europe [1].
Worldwide, however, the same report shows that the
total SMS traffic was 5 billion transactions in the same
month. This indicates that SMS is not widely used in
many parts of the world outside of Europe.
By considering the modes of SMS operation, one can
account for the various SMS-related services that are
currently available and predict what possible services
are likely to have a major impact on further increase in
SMS usage. In addition to point-to-point SMS, a
message could also be broadcast. The two basic modes
of SMS operation are person-to-person and cell-broadcast mode. The former is like two-way paging,
where a sender sends text messages via a data center
to a recipient. The latter is for sending messages, such
as traffic updates or news updates, from a data center.
In addition, messages can also be stored in the
SIM card for later retrieval.
Based on these modes of SMS operation, first
generation SMS data centers typically provide
Email services and Information services. Email
services follow from the person-to-person mode
of operation. An introduction of Email services
typically leads to an increase of 20% in overall
SMS transactions [1]. Information services
follow from the cell-broadcast mode of
operation. These services tend to build up
slowly, as third parties are often involved in
providing the source information such as world
news, financial news and sport news. An
introduction of Information services typically
leads to an increase of 10% in overall SMS
transactions [1].
The next major increase in SMS traffic can only
occur if SMS finds widespread commercial
applications, such as value-added services like
prepayment, advertising and various forms of
electronic trading, with multilingual capabilities. To
realize the full potential of SMS technology, a
larger amount of information must be packed into
the available 140 bytes. Variable-length coding
(VLC) is often employed to enhance data
compression capabilities. This paper focuses on
a class of VLC that has demonstrable
self-synchronizing properties that also enhance data
integrity.
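The relationship between the 140-byte payload and the number of characters per message can be checked with a short calculation. The sketch below is illustrative only: the function name is hypothetical, the 16-bit case is an assumption chosen to be consistent with the 70-character limit quoted for non-Latin languages, and 5.5 bits is the average per-character figure that this paper quotes from [6] for typical English text.

```python
# Sketch: whole characters that fit in one 140-byte short message
# for a given (average) number of bits per character.

SMS_PAYLOAD_BITS = 140 * 8  # 1120 bits per short message


def chars_per_message(bits_per_char: float) -> int:
    """Return the number of whole characters that fit in the payload."""
    if bits_per_char <= 0:
        raise ValueError("bits_per_char must be positive")
    return int(SMS_PAYLOAD_BITS // bits_per_char)


if __name__ == "__main__":
    # 7-bit fixed-length coding (the 160-character case in the text).
    print("7-bit fixed length :", chars_per_message(7))
    # 16-bit fixed-length coding (assumed; matches the 70-character limit).
    print("16-bit fixed length:", chars_per_message(16))
    # Variable-length coding at an average of 5.5 bits per character [6].
    print("5.5-bit average    :", chars_per_message(5.5))
```

For a variable-length code the result is an expected capacity only, since the actual number of characters depends on the content of each message.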
New Proposal
The existing SMS standards already include
attempts to increase the amount of information
that can be transmitted within the 140-byte
constraint. These include concatenation and
compression techniques [1]. The authors propose
the use of self-synchronizing T-codes for SMS
message encoding. Similar approaches to data
compaction have been demonstrated in other
areas such as computer data communications
[2][3][4], moving pictures source coding [5] and
paging messages encoding [6].
The automatic character synchronization
capability of T-codes is well documented and is
best demonstrated with an example as shown in
Figure 1. The word ‘MESSAGE’ is encoded
using Titchener’s character assignment table [7].
That particular assignment table does not appear to be
optimal in any way. Rather, it provides a convenient
‘off-the-shelf’ character-to-code mapping. In general,
as with other variable-length codes, T-code selection
can only be optimized if the statistical nature of the
information source is known.
From [7], the word ‘MESSAGE’ is encoded as
M       E    S      S      A     G       E
1001111 0000 101101 101101 00101 1010101 0000
Suppose a random error occurs that causes 5 bits to be lost,
the bit stream becomes
100111100001011011011010010110101010000
Then the bit stream is decoded as
1001111 0000101101 101101 00101 1010101 0000
M       A          S      A     G       E
Figure 1 Automatic character synchronization
(bit streams are continuous and spaces are inserted for
clarity only)
In Figure 1, the word ‘MESSAGE’ is represented
using 37 bits (5.28 bits per character). Of
the 37 bits, 5 are presumed lost (13.5%) for illustration
purposes. Figure 1 demonstrates that
synchronization is regained within 2 characters; in
general it is regained within 1.5 characters for a typical
message [8]. Any error corruption is therefore very
localized. This observation is typical of T-codes. The
strong tendency for T-codes to resynchronize is
embodied in the augmentation process of T-code
construction.
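The resynchronization behaviour of Figure 1 can be reproduced with a small prefix decoder. This is a sketch under stated assumptions, not the authors' implementation: the codebook contains only the five code words quoted above for ‘MESSAGE’ (not the full table from [7]), and the positions of the five lost bits are hypothetical, chosen so that the decoded output agrees with the M A S A G E result shown in the figure.

```python
# Partial codebook: only the code words quoted in Figure 1.
CODEBOOK = {
    "M": "1001111",
    "E": "0000",
    "S": "101101",
    "A": "00101",
    "G": "1010101",
}
DECODE = {bits: ch for ch, bits in CODEBOOK.items()}


def encode(text: str) -> str:
    """Concatenate the code words of each character."""
    return "".join(CODEBOOK[ch] for ch in text)


def decode(stream: str) -> str:
    """Emit a character whenever the buffered bits match a code word."""
    out, buf = [], ""
    for bit in stream:
        buf += bit
        if buf in DECODE:
            out.append(DECODE[buf])
            buf = ""
        elif not any(code.startswith(buf) for code in DECODE):
            # Only possible because the codebook here is partial.
            raise ValueError(f"bits {buf!r} are outside the partial codebook")
    return "".join(out)


def drop_bits(stream: str, positions: set) -> str:
    """Remove the bits at the given 0-based positions."""
    return "".join(b for i, b in enumerate(stream) if i not in positions)


sent = encode("MESSAGE")
# Hypothetical loss: the last two bits of 'E' and the last three bits
# of the first 'S' (five bits in total).
LOST_POSITIONS = {9, 10, 14, 15, 16}
received = drop_bits(sent, LOST_POSITIONS)

if __name__ == "__main__":
    print(decode(sent))      # error-free stream
    print(decode(received))  # stream with five bits missing
```

With this choice of lost bits the damage is confined to the second and third characters, and every character after that point is decoded correctly without any explicit resynchronization marker.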
In addition, the deterministic and bounded character
synchronizing properties of T-codes [7,8] mean that
once the decoder establishes that an error has
occurred, it can determine the point at which
resynchronization will occur. Thus, the erroneous bits
that are received can be ignored or corrected as
appropriate.
Practical Application
It has already been mentioned that the statistical
nature of an information source must be known
(modelled) to take full advantage of variable-length coding. It is reasonable to assume that a
statistical model for the information source is
available or obtainable. The source may be general
person-to-person English text messages, much
like paging messages that users send. It may also
be more specific, e.g. financial market data
(stock prices, volume, etc.). For example, the
letter ‘E’ may be found to appear 14% of the
time in any typical English text transmitted by
SMS users. On the other hand, stock data
typically include many numeric characters and
frequently occurring company names can be
encoded using a company name to code map
(rather than individual characters to code words).
This is similar to ‘canned messages’ used in
paging.
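A statistical model of the kind described above can be estimated from sample text. The following is a minimal sketch, assuming a simple zeroth-order model (independent symbol frequencies); the function names are hypothetical. It computes the symbol probabilities, the source entropy, and the average code word length (ACL) that a given codebook would achieve on that source, which are the quantities compared during entropy matching.

```python
from collections import Counter
from math import log2


def symbol_probabilities(text: str) -> dict:
    """Relative frequency of each symbol in the sample text."""
    if not text:
        raise ValueError("sample text must not be empty")
    counts = Counter(text)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}


def entropy_bits(probs: dict) -> float:
    """Shannon entropy of the source in bits per symbol."""
    return -sum(p * log2(p) for p in probs.values() if p > 0)


def average_code_length(probs: dict, codebook: dict) -> float:
    """Expected code word length (ACL) in bits per symbol."""
    return sum(p * len(codebook[sym]) for sym, p in probs.items())


if __name__ == "__main__":
    sample = "SEE THE MESSAGE"
    probs = symbol_probabilities(sample)
    print("P(E) =", round(probs["E"], 3))
    print("entropy =", round(entropy_bits(probs), 3), "bits per symbol")
```

The closer the ACL of a candidate code set is to the entropy of the source, the more efficient the encoding.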
The required degree of T-augmentation is based
on the number of symbols emitted by the
information source. From [7], the number of
symbols N that can be encoded using a T-code set
of augmentation degree Q is given by (1). From (1),
the augmentation degree Q that is required for
T-encoding of N discrete levels is given by (2).
$N = 2^{Q} + 1$    (1)
$Q \geq \log_2 (N - 1)$    (2)
Thus, a T-augmentation degree of 7 is
compatible with the ASCII character set. From
this, the most suitable 7th degree of augmentation
T-code subgroup can be determined (degree may
change for different formats). This establishes
the average code word length (ACL) distribution
that best matches the source entropy. The next
step is to determine the best synchronizing T-code
set within the chosen subgroup. This
information is obtained from available databases
of best T-codes [8]. These can be fine-tuned
using a fast algorithm for computing average
synchronization delay (ASD) of T-codes [9,10].
The most suitable T-code set has thus been
identified. Finally, a dictionary of code words
can be constructed for source encoding. The
same dictionary is used for decoding at the
receiver. The above process is summarized in
Figure 2.
In the event that the statistical nature of the
information source is considered time-invariant,
the above operational steps can be performed
once at the beginning, so that the dictionary for
source encoding and decoding is made available
before actual coding occurs. The proposal
presented above does not preclude dynamically
updating the dictionary if the statistical nature of
the source is time varying. This is discussed in the
next section.
[Figure 2: flowchart with the following steps.
T-code specification of the form S[0, x1, x2, x3, x4, x5, x6],
where $1 \leq x_i \leq 2^{i} + 1$ and all prefixes must be different
-> Identify correct subgroup for information source through
entropy matching
-> Optimal subgroup found (best sync T-code is found when
encoding is optimally efficient for the source)
-> Determine best sync T-code set within selected subgroup
(use existing database of ‘best’ T-codes or fast ASD algorithms)
-> Build dictionary for selected 7th degree T-code set (the
decoder uses the same dictionary)
-> Encode short messages using the T-code dictionary
-> Encoded short messages]
Figure 2 The process of applying T-codes to
SMS data compaction
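Equations (1) and (2) can be checked numerically by building a T-code set step by step. The sketch below assumes the simple form of T-augmentation commonly given in the T-code literature, in which one code word p of the current set S is chosen as a prefix and the new set is (S without p) together with p prepended to every word of S; whether [7] uses exactly this form is an assumption. The prefix choice in `build_t_code` (always the shortest word) is arbitrary and is not a ‘best’ T-code in the sense of [8].

```python
def symbols_for_degree(q: int) -> int:
    """Equation (1): number of code words at augmentation degree q."""
    return 2 ** q + 1


def min_degree(n: int) -> int:
    """Equation (2): smallest q whose T-code set can hold n symbols."""
    q = 0
    while symbols_for_degree(q) < n:
        q += 1
    return q


def t_augment(codes: frozenset, prefix: str) -> frozenset:
    """One simple T-augmentation step using `prefix` (must be in `codes`)."""
    if prefix not in codes:
        raise ValueError("prefix must be a code word of the current set")
    return (codes - {prefix}) | {prefix + c for c in codes}


def is_prefix_free(codes) -> bool:
    """True if no code word is a prefix of another."""
    ordered = sorted(codes)
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


def build_t_code(degree: int) -> frozenset:
    """Augment the binary alphabet {0, 1} `degree` times (arbitrary prefixes)."""
    codes = frozenset({"0", "1"})
    for _ in range(degree):
        prefix = min(codes, key=lambda c: (len(c), c))
        codes = t_augment(codes, prefix)
    return codes


if __name__ == "__main__":
    q = min_degree(128)  # 128 symbols, as for 7-bit ASCII
    codes = build_t_code(q)
    print("degree:", q, "code words:", len(codes))
    print("prefix-free:", is_prefix_free(codes))
```

Each augmentation step removes one code word and adds as many new ones as the set previously contained, so the set size goes 2, 3, 5, 9, and so on, in agreement with (1).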
Adaptive T-coding
The need for adaptively changing the T-dictionary for
source encoding and decoding arises from information
sources that have a time varying statistical model. For
example, at certain times numerals may occur much
more frequently relative to letters of the alphabet than
at other times, as in messages that include many
telephone numbers. Another reason for having adaptive
T-coding is multilingual applications. This leads to the
consideration of a switching system for multilingual
text applications, as shown in Figure 3.
Figure 3 depicts a system that has an information
source capable of emitting symbols from different
alphabets (e.g. English, German, Greek, etc.). In the
wider sense, non-Latin languages (e.g. Chinese,
Korean and Japanese) may also be considered as
sets of symbols.
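The dictionary switching idea can be sketched in a few lines. This is an illustration only: the two toy dictionaries below are hypothetical prefix-free codes, not real T-code sets, and each input symbol is assumed to arrive already tagged with the index j of the alphabet it belongs to. How the decoder learns j is outside the scope of this sketch.

```python
# Hypothetical per-alphabet dictionaries, indexed by alphabet number j.
DICTIONARIES = {
    1: {"a": "0", "b": "10", "c": "11"},   # toy code for alphabet 1
    2: {"α": "11", "β": "0", "γ": "10"},   # toy code for alphabet 2
}


def encode_multilingual(symbols) -> str:
    """Encode (symbol, j) pairs, switching dictionary according to j."""
    bits = []
    for symbol, j in symbols:
        dictionary = DICTIONARIES[j]   # select the dictionary for alphabet j
        bits.append(dictionary[symbol])
    return "".join(bits)


if __name__ == "__main__":
    message = [("a", 1), ("β", 2), ("c", 1)]
    print(encode_multilingual(message))
```

Because each dictionary is fixed in advance, switching amounts to a table lookup on j and adds no modelling cost at encoding time.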
[Figure 3: a multilingual information source emits symbols Sij;
a switch routes each symbol to one of Dictionary 1, Dictionary 2,
Dictionary 3, ..., Dictionary m.]
Figure 3 A switching system for multilingual
T-encoding
Each symbol emitted is denoted by Sij, such that
it is the ith symbol in alphabet (set of symbols) j.
For example, the first alphabet is given by S1 =
{S11, S21, .., Sn1}. There are m alphabets
requiring m different dictionaries. Each
dictionary is determined based on the statistical
nature of the corresponding language and in
general, the dictionaries can be predetermined
and remain unchanged throughout the system
operation. This system dynamically switches
between the dictionaries according to j.
None of the above discussion on multilingual
switching precludes the switching of dictionaries
due to changes in statistical nature within any
given language. Indeed, it is possible to adapt to
such changes dynamically because best T-codes
in terms of efficiency and sync performance can
be identified quite rapidly, so a moderately-sized
buffer would suffice. In the likely event
that the statistical nature of a given alphabet can
only change within a limited degree of freedom
(e.g. only several dictionaries needed), then one
could simply extend the arrangement depicted in
Figure 3 by switching between more than m
dictionaries. For example, if alphabet S1 has 3
dictionaries, then one could introduce
dictionaries 1.1, 1.2 and 1.3 in place of
dictionary 1. At any rate, redundant (repeated)
dictionaries should be avoided to enhance
system performance.
Timing Considerations
The simple method of identification [11] makes T-codes
particularly suitable for adaptive operations.
Thus, following the statistical modelling stage,
information can be passed to an adapter to
dynamically switch to the most suitable T-code table
(or codebook) to be used to encode the symbols. The
adaptation and switching of the T-code table will
introduce some delay. Thus, an additional stage is
needed to add an appropriate amount of time delay t,
as shown in Figure 4.
[Figure 4: block diagram with blocks Textual information,
Statistical Model, Adapter, T-codebook Selection, delay t and
T-Encoder; the signals are labelled Symbols and Compacted data.]
Figure 4 Block diagram of a T-code entropy encoder with
optional adaptive feature (shown in dotted line)
If the adapter is not active, a fixed T-code table has to
be predetermined based on the statistical nature of the
symbols. That means the most recently selected
codebook will be used for T-encoding until the next T-codebook selection occurs. This can occur
automatically as a result of the activated adapter (the
next time the adapter becomes active), or manually
selected by the system designer based on entropy
matching of the source. In the latter scenario, the
source is taken to be relatively stable in terms of its
statistical nature. Entropy matching is done in much
the same manner as described above.
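One way an adapter might choose among predetermined codebooks is to pick the one that encodes a recent window of symbols in the fewest bits. The sketch below is an assumption about how such a selection could work, not the method of this paper: the two codebooks are hypothetical toy prefix codes rather than T-code sets, and the window is assumed to contain only symbols present in every codebook.

```python
from collections import Counter

# Hypothetical candidate codebooks (toy prefix codes, not real T-codes).
CODEBOOKS = {
    "letters-first": {"A": "0", "B": "10", "1": "110", "2": "111"},
    "digits-first": {"1": "0", "2": "10", "A": "110", "B": "111"},
}


def encoded_length(window: str, codebook: dict) -> int:
    """Total bits needed to encode the window with the given codebook."""
    counts = Counter(window)
    return sum(n * len(codebook[sym]) for sym, n in counts.items())


def select_codebook(window: str, codebooks: dict = CODEBOOKS) -> str:
    """Name of the codebook that encodes the window in the fewest bits."""
    return min(codebooks, key=lambda name: encoded_length(window, codebooks[name]))


if __name__ == "__main__":
    print(select_codebook("AABA1"))  # mostly letters
    print(select_codebook("1121A"))  # mostly digits
```

Minimizing the encoded length over the window is equivalent to minimizing the average code word length for the window's symbol frequencies, which is the entropy-matching criterion applied to recent data.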
Most commonly, the system will be used in automatic
mode of operation. For automatic operation, the value
of t does not need to be fixed. A number of logic
gates can be used to control the flow of information as
shown in Figure 5. The logic gates in Figure 5 are
represented by their corresponding IEEE/ANSI
standard symbols. For simplicity, the adapter and T-codebook selector are merged into one unit without
loss of generality. It is understood that Adaptation
occurs before Code Selection in sequence within the
merged unit.
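The enable logic for the T-encoder can be expressed as a Boolean function of the signals adptEN, adaptOK and EN shown in Figure 5. The exact gate wiring is not recoverable from the text, so the expression below is one plausible reading and should be treated as an assumption: the encoder runs freely when adaptation is disabled, and otherwise waits until the merged Adapter / T-codebook Selection unit signals completion.

```python
from itertools import product


def encoder_enable(adpt_en: bool, adapt_ok: bool) -> bool:
    """Assumed gating: EN = (NOT adptEN) OR adaptOK."""
    return (not adpt_en) or adapt_ok


if __name__ == "__main__":
    # Print the truth table of the assumed enable logic.
    print("adptEN adaptOK EN")
    for adpt_en, adapt_ok in product([False, True], repeat=2):
        en = encoder_enable(adpt_en, adapt_ok)
        print(int(adpt_en), "     ", int(adapt_ok), "      ", int(en))
```

Under this reading the only case in which the encoder is held off is when adaptation has been requested but has not yet completed, so no fixed delay needs to be chosen in advance.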
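[Figure 5: block diagram with blocks Textual information,
Statistical Model, Adapter / T-codebook Selection and T-Encoder
(with enable input EN), output Compacted data; logic signals
adptEN and adaptOK; logic gates labelled ‘&’ and ‘1’.]
Figure 5 T-code entropy encoder with logic control to
minimize processing delay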
In Figure 5, data signals are shown in solid lines
and logic signals are shown in dotted lines.
Logic variables are shown in bold. An external
input Adaptation Enable ‘adptEN’ is introduced
to indicate whether or not adaptation should
occur. The merged Adaptation / Codebook
Selection process has an additional logic output
‘adaptOK’ to signal the completion of the
process. An enable input ‘EN’ is added to the
T-encoder such that actual information source
coding only occurs when the T-codebook is
ready.
Conclusion
A brief description of cellular short message
service (SMS) has been presented. Currently,
SMS is used mainly for person-to-person
communications of short messages and
broadcast of timely information from data
centers. While SMS is adequate for such
applications (and has distinctive advantages
over paging), the limited maximum message
length means that the full potential of SMS
cannot be realized. It has also been
observed that SMS traffic volume is relatively
low in regions outside of Europe. The next major
increase in SMS traffic volume, as well as
widespread acceptance worldwide, will come
about when SMS can provide more value-added
services such as electronic commerce.
In this paper, the authors have proposed the use of a
class of variable-length codes (VLC) to improve the
coding efficiency, given the constraint of 140 bytes.
The advantage of the new proposal is twofold. First,
each alphanumeric character requires, on average in a
typical English text, 5.5 bits to represent [6]. This
represents a saving of about 21% compared to the 7
bits required for ASCII representation. For financial
and other types of data, a saving of almost 40% can be
achieved [6]. Further, accurate modelling of the
statistical properties of the various information
sources can further improve coding efficiency. This
compression is lossless. Being able to pack more
information facilitates multilingual and multi-format
operations.
The second advantage is that automatic
synchronization occurs within 1.5 characters as a
direct consequence of the T-construction algorithm
[8]. Practical aspects of the proposal have also been
discussed.
References
[1] Background information about SMS can be found
at these URLs:
http://www.gsmworld.com/gsmdata
http://www.cdg.org
http://www.mobilesms.com/
http://www.gsm_pcs.org
http://www.itu.int/itudoc/itu-t/rec/index.html
[2] ‘Secure communication system for re-establishing
time limited communication between first and second
computers before communication time period
expiration using new random number’ US pat number
5428745, 6/27/1995.
[3] ‘Boundary markers for indicating the boundary of
a variable length instruction to facilitate parallel
processing of sequential instructions’, US pat number
5450650, 9/12/1995.
[4] ‘System and method for determining whether to
transmit command to control computer by checking
status of enable indicator associated with variable
identified in the command’ US pat number 5561770,
10/01/1996.
[5] ‘Methods of coding and decoding moving-picture
signals, using self-synchronizing variable length
codes’ US pat number 5835144, 11/10/1998.
[6] Fong A. C. M. & Quay C., ‘Application of
self-synchronizing T-codes to FLEX™ Suite message
encoding’, Motorola Technical Developments, Vol.
40, Jan 2000, pp. 68-73.
[7] Titchener M. R., ‘Digital encoding by means of
new T-codes to provide improved data
synchronization and message integrity’, Proc. IEE
Computers & Digital Techniques, Vol. 131, pp.
151-153, 1984.
[8] Higgie G. R., ‘Database of best T-codes’,
Proc. IEE Computers & Digital Techniques,
Vol. 143, pp. 213-218, 1996.
[9] Fong A. C. M. & Higgie G. R., ‘An
improved algorithm for calculating the average
synchronization delay of T-codes’, Computers &
Industrial Engineering, Vol. 37, pp. 161-164, 1998.
[10] Fong A. C. M. & Higgie G. R., ‘Identification
of T-codes that have minimal average synchronization
delay’, to appear in IEE, Pt E, Computers & Digital
Techniques.
[11] Titchener M. R., ‘Construction and properties of
the augmented and binary depletion codes’, Proc. IEE,
Pt E, Computers & Digital Techniques, Vol. 132,
1985, pp. 163-169.