
A Multiprocessor Implementation for the GSM Algorithm
by
Jennifer C. Kleiman
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degrees of
Bachelor of Science in Electrical Engineering and Computer Science
and Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
May 21, 1999
© Copyright Jennifer C. Kleiman 1999. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and
distribute publicly paper and electronic copies of this thesis
and to grant others the right to do so.
Author
Department of Electrical Engineering and Computer Science
May 21, 1999
Certified by
Dr. Christopher J. Terman
Thesis Supervisor
Accepted by
Prof. Arthur C. Smith
Chairman, Department Committee on Graduate Theses
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
JUL 15 1999
LIBRARIES
A Multiprocessor Implementation for the GSM Algorithm
by
Jennifer C. Kleiman
Submitted to the
Department of Electrical Engineering and Computer Science
May 21, 1999
In Partial Fulfillment of the Requirements for the Degrees of
Bachelor of Science in Electrical Engineering and Computer Science
and Master of Engineering in Electrical Engineering and Computer Science
ABSTRACT
Telecommunications plays an important role in the computer industry, and at the core of
this industry lies the digital signal processor. Many communication technologies rely
principally on signal processing, and at both the system and component level,
redundancy is present in most of these applications. Therefore, an opportunity exists to
optimize these technologies using parallel processing. Specifically, the DSPs in these
applications may be designed in a parallel configuration to achieve higher performance
and a reduction in dedicated hardware.
GSM, a mobile communications system, displays redundancy at core processing nodes in
its network as well as in its fundamental speech processing algorithm, thereby making it
an optimal choice for this implementation. This thesis describes the design methodology
for this implementation and evaluates several different configurations. As a result, a new
multiprocessor is proposed.
Thesis Supervisor: Christopher J. Terman
Title: Senior Lecturer, Dept. of Electrical Engineering and Computer Science
Acknowledgments
First, I would like to thank God for giving me the patience, endurance, and motivation to
complete this thesis.
I would like to thank my advisor, Chris Terman, for providing a wealth of knowledge and
assistance throughout my thesis endeavor. I am thankful for the opportunity to have
worked with him.
To my parents, I would like to give my deepest gratitude and honor. They have been a
constant source of love, support and encouragement since the moment I came to MIT. I
thank them for the sacrifices they made in order to provide me with an excellent
education. "Keep Going"
Contents
1 Introduction 7
1.1 Background ........................................................................................................ 7
1.2 Digital Signal Processors .................................................................................... 8
1.3 Network Architectures ........................................................................................ 9

2 Technologies of Interest 12
2.1 ADSL ................................................................................................................ 12
2.2 MPEG ............................................................................................................... 16
2.3 GSM ................................................................................................................. 19

3 The Bulk DSP Architecture 25
3.1 Parallel Computing ........................................................................................... 25
3.2 Processing Elements ......................................................................................... 27
3.3 Memory Hierarchy ............................................................................................ 28
3.4 GSM Implementation ....................................................................................... 29
3.5 Encoding and Decoding .................................................................................... 30

4 Scheduling 33
4.1 Design Methodology ........................................................................................ 33
4.2 GSM's Computational Modules ....................................................................... 34
4.3 Evaluation of Architectures .............................................................................. 37

5 Conclusion 44
5.1 Summary .......................................................................................................... 44
5.2 Future Work ..................................................................................................... 45
5.3 Final Thoughts ................................................................................................. 46

A Software Tools Used 48

B GSM Resources on the Web 49
List of Figures
2-1 ADSL Network Connection ................................................................................ 14
2-2 MPEG Encoding Algorithm ................................................................................ 18
2-3 GSM Network Architecture ................................................................................ 20
3-1 GSM Encoding Algorithm .................................................................................. 30
4-1 GSM Decoding Algorithm .................................................................................. 35
4-2 Long Term Synthesis Filtering Module .............................................................. 36
4-3 Preliminary Building Block of Bulk DSP ........................................................... 38
4-4 Segment of Schedule for Proposed Architecture ................................................ 40
4-5 Final Bulk DSP Architecture .............................................................................. 42
List of Tables
4.1 Computational Module Parameters .................................................................... 37
4.2 Comparison of Results with Best Architecture ................................................... 41
4.3 Final Results ....................................................................................................... 43
Chapter 1
Introduction
This thesis identifies the utility of a parallel multiprocessor for use in communication
technologies. In order to demonstrate this, several applications are explored, and the
basic architecture of the new parallel multiprocessor is presented. A specific technology
is then chosen as a vehicle to implement this multiprocessor, and the design process is
described. In addition, several architectures are analyzed based on experiments done
utilizing a set of software modules, and the performance results are presented. Finally,
some conclusions are drawn for the final implementation.
1.1 Background
Virtually everyone today uses some type of telephone service. Whether it is a 'plain old
telephone', a wireless cellular phone, or a modem in a computer, people heavily depend
on telephone systems as their primary source of communication. By the year 2001, it is
estimated that there will be 1 billion phone lines worldwide and 580 million cell phone
subscribers.
Communication, however, does not just encompass spoken conversation
between people. Technological advances have enabled communications or
telecommunications to also include transmitting sound, video, and digital data across
telephone lines, radio frequencies, and cable lines. Because these types of
communication have expanded so rapidly, there are now telephone systems in almost
every area of the world. Even so, there still remains a large demand for global
connectivity; the most prominent example of this today is the Internet. This demand
drives telecommunication technologies to provide an infrastructure that includes the
hardware, software, and network topology needed to further connect people around the
world while delivering as much information as possible. In addition, these communication
systems need to service millions of customers simultaneously and efficiently.
1.2 Digital Signal Processors
Communication technology relies on specialized hardware to perform digital
signal processing computations. In general, the computations performed are part or all of
a signal processing algorithm. Specifically, the numerous computations include
receiving, decoding, directing, encoding, and transmitting data. All of these operations
are done digitally by the DSPs. The benefits for communication applications reaped from
operating in the digital realm include reliability in the transmission process as well as
compression in data size, which makes transmission faster and easier. These procedures
are carried out on small portions of the data, known as frames, on each of the subscriber
lines in order to achieve "seamless" point-to-point communication. The hardware components
that perform these functions are known as digital signal processors (DSPs). These
processors have been designed to handle complex arithmetic computations precisely for
speech and data processing applications.
Digital signal processing applications tend to be repetitive in nature. Many signal
processing functions rely on the execution of the same computation on each byte of data
in order to complete the desired operation. In fact, many of these algorithms consist
of a relatively small set of instructions that are executed over and over again in loop
configurations. Because of this inherent repetition, the overall signal processing
algorithm can be broken down into smaller computing modules, which are then applied
repeatedly throughout the algorithm.
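A small sketch may make this concrete. The FIR filter below is illustrative (it does not appear in the thesis): the same multiply-accumulate instruction runs in a loop for every output sample, which is exactly the kind of repetition that lets a signal processing algorithm be decomposed into small computing modules.

```python
# Illustrative sketch: a finite impulse response (FIR) filter, a typical DSP
# computational module. The same multiply-accumulate operation executes in a
# loop for every output sample.

def fir_filter(samples, coeffs):
    """Convolve an input stream with a short set of filter coefficients."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coeffs):   # small instruction set, looped
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out

# A 2-tap averaging filter applied to a short frame of samples:
frame = [1.0, 3.0, 5.0, 7.0]
print(fir_filter(frame, [0.5, 0.5]))   # [0.5, 2.0, 4.0, 6.0]
```

The whole module is one short loop body applied repeatedly, so an algorithm built from such modules exposes its structure readily.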
1.3 Network Architectures
The network topologies of these communication systems vary somewhat depending on
the particular type of data they transmit and the communication medium used. Although
the current trend is to build networks that can transmit all types of data, there still exist
several methods of transmitting this data (e.g., physical lines vs. wireless). In general,
the basic foundation of these networks is a backbone network that connects everything
in the network. There are also several central locations in the network, which bring
together all the separate "channels" or subscriber lines in order to
process the data. Usually, these channels carry multiple simultaneous calls, and thus,
these central locations receive and transmit hundreds of channels at any given time.
Moreover, the central office or location is equipped with many computers whose tasks
are only to process these incoming and outgoing channels of communication. Because all
the channels come through a central office, the majority or bulk of the processing in the
network occurs at these central locations. The other major components of the network
infrastructure handle other operations such as transmitting and receiving the channels.
Since the central offices control much of the computation in a communication system,
they can be considered the "core processing nodes" of the network. The latency of these
core processing nodes, however, is much larger than anywhere else in the network, and
as a result much of the total computation time is spent here. Therefore, the efficiency of
these nodes plays a significant role in the overall network performance.
As stated, much of the processing in many communication networks occurs in
central locations. Present day infrastructures tend to use a single DSP processor for a
small number of channels routed through these points. Since there are usually many
channels transmitted through these locations, there are many DSPs located at these spots.
Ideally, a single DSP chip would process a large number of the channels at these
locations, which can vary anywhere from 100 to 1000 depending on the technology. This would
immensely reduce the number of processors in the network. Logically, the next step
towards achieving such an improvement is to design a DSP or multiprocessor that could
process much more than just a few channels. By taking advantage of the instruction
level parallelism (ILP) among the channels, a parallel computation architecture such as a
SIMD (Single Instruction stream, Multiple Data stream) vector processor might be used to
design a more efficient implementation. SIMD refers to an architecture in which the
same instruction is executed by multiple processors on different data streams. In
particular, for communication applications, a multiprocessor or 'Bulk DSP' might be
designed to contain many simple processors, such that each processor executes the same
instruction or set of instructions in parallel. This design would enable the Bulk DSP to
control and process many channels. The reduction in hardware would result in
significant cost savings for the given infrastructure. In addition, if successful, the Bulk
DSP would exceed the performance of existing communication networks, thereby
meeting the growing demands of communication technologies.
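As a rough sketch of the SIMD idea (the function and variable names here are hypothetical, not taken from any GSM or DSP library), one cycle of such a machine can be modeled as broadcasting a single instruction across all channels:

```python
# Hypothetical model of the Bulk DSP's SIMD operation: one instruction
# stream is applied in lockstep to many independent channels' data. The
# processor array is modeled as a plain loop over channels.

def simd_step(instruction, channel_data):
    """Apply the same instruction to every channel's current sample."""
    return [instruction(x) for x in channel_data]

# The "instruction" broadcast to all processing elements this cycle:
scale_and_offset = lambda x: 2 * x + 1

channels = [10, 20, 30, 40]                    # one sample per channel
print(simd_step(scale_and_offset, channels))   # [21, 41, 61, 81]
```

Each simulated processing element performs identical work on its own data stream, which is precisely the redundancy among channels that the Bulk DSP is meant to exploit.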
Chapter 2
Technologies of Interest
Of the many communication and digital signal processing applications, the three that will
be considered are ADSL (Asymmetric Digital Subscriber Line), MPEG (Moving
Picture Experts Group video compression), and GSM (Global System for Mobile
Communications). These technologies were chosen based on their significance to the
communications world as well as the similarities in their computation structure. They
each embody several key characteristics, which demonstrate the utility of a Bulk DSP as
their core processor.
2.1 ADSL
Asymmetric Digital Subscriber Line refers to a communication technology
implemented on the copper telephone lines found in all homes and businesses. ADSL is
an enhanced version of the basic phone line that provides much more data bandwidth to
the subscriber. It transfers voice, data and video at a significant increase in data rates.
The main point of attraction that drives this technology is a faster Internet connection. A
salient detail of this technology lies in the data rate transmitted. Much more bandwidth is
sent downstream (as from the Internet to one's computer) than in the reverse direction;
thus the name Asymmetric DSL. In fact, ADSL communication can theoretically reach
data rates of 8Mbps (megabits per second) downstream and 1Mbps upstream. ADSL
communication exists on the same twisted-pair wire as the telephone line and transmits
simultaneously with existing phone services without interruption or interference.
Because of this, no new phone lines need to be installed to implement ADSL, making it
an attractive choice of communication. Once companies start deploying ADSL,
customers need only to acquire an ADSL modem for their computers and to subscribe for
ADSL service in order to start using ADSL.
The ADSL network consists of the central office that contains the ADSL Modem
Rack, the phone line connection, a POTS (Plain Old Telephone System) splitter, and the
user-end ADSL modems. Figure 2-1 shows the details of the core network and how it
interfaces with the actual phone lines and other types of networks [2].
As shown by the diagram, the central office in the network manages all the
communication via phone lines to and from personal computers and corporate networks.
When a call comes into the central office, it is first passed through a POTS splitter, which
"splits" the call into voice and data signals and then directs them to the appropriate
device. A voice call will proceed to the Public Switched Telephone Network while a data
call will go to the ADSL Modem Rack. The Modem Rack consists of many line cards or
ATU-Cs (ADSL Transceiver Unit- Central Office) and is the key device of interest since
it uses digital signal processors. The ATU-C receives data from the access module and
converts the data into analog signals. It also receives and decodes data from customers
sent by the ATU-R (remote or end-user modem). Presently, each ATU-C can
accommodate up to 3 ADSL circuits, which means it can serve up to 3 individual phone
lines. These ADSL circuits are implemented with several integrated circuits such as a
core DSP to perform Discrete Multi-Tone (DMT) technology functions, a line
driver/receiver, a general-purpose DSP, and an ASIC to perform all the analog and
mixed-signal operations as well as the modem configuration software.

[Figure 2-1 shows the central office, containing the POTS splitter, the ADSL Modem
Rack, and the Public Switched Telephone Network connection, linked over the phone
line to a remote POTS splitter and ADSL modem at the subscriber's PC or network
computer, and reaching WWW or video servers through an Ethernet, Frame Relay, or
ATM Internet backbone.]
Figure 2-1: ADSL Network Connection
The heart of the ADSL technology rests largely in its transmission methodology. As
mentioned, ADSL transmits far more information downstream than it does upstream;
upstream and downstream refer to different "channels" which transmit information at
different frequencies. These channels are created by Discrete Multi-Tone (DMT). This
technique falls under the category of Frequency Division Multiplexing (FDM) and is a
multi-carrier modulation technology. Basically, it takes a band of frequencies (the input
data) and divides it into separate "channels" so that the channels have the same
bandwidth but different center frequencies. This allows the channels to be coded
individually and independently of each other. The DMT transmitter relies on the
efficiency of the IFFT (Inverse Fast Fourier Transform) to create these channels, while
the receiver uses the FFT (Fast Fourier Transform) to do the complement operation.
This transform pair represents
a key digital signal processing technique used by the ADSL technology. DMT uses the
band from 26KHz to 134KHz for the upstream channel and 138KHz to 1.1MHz for
downstream. DMT reserves a 4KHz band (0 to 4KHz) for POTS to accommodate the
ordinary phone line on the same copper wire [14]. ADSL modems also implement error
correction algorithms in order to reduce errors that occur on a network line such as
impulse noise or continuous noise coupled into a line. These operations, performed by
DSPs, combine the channels into blocks and use error correction codes on each block of
data. This method allows for effective and accurate transmission of both data and video
signals on the wire.
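The transform pair at the heart of DMT can be sketched in a few lines. This is an illustrative model only: a plain DFT stands in for the FFT/IFFT, and the symbol values are arbitrary; a real ADSL modem would use the fast algorithms and the standard's bit-loading rules.

```python
# Sketch of the DMT transform pair: the transmitter's inverse transform turns
# per-channel symbols into a time-domain signal, and the receiver's forward
# transform recovers the symbols.
import cmath

def idft(symbols):
    N = len(symbols)
    return [sum(symbols[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)) / N for n in range(N)]

def dft(signal):
    N = len(signal)
    return [sum(signal[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

symbols = [0, 3 + 1j, 0, 2 - 2j]   # arbitrary data on separate sub-channels
time_signal = idft(symbols)        # what the transmitter puts on the wire
recovered = dft(time_signal)       # what the receiver demodulates

# The forward transform undoes the inverse transform (to rounding error):
assert all(abs(a - b) < 1e-9 for a, b in zip(recovered, symbols))
```

The roundtrip shows why the FFT/IFFT pair is called a complement operation: each sub-channel's symbol survives modulation and demodulation intact.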
The ADSL network implements a high-speed transmission technology on normal
copper phone lines. Because it uses the phone lines, it does not require much equipment
from the customer and is easy and inexpensive to use. In addition, it meets or exceeds
customer requirements with respect to Internet access. Examination of the ADSL
network reveals the importance of DSPs to the basic functionality of this technology. In
fact, DSPs coordinate and perform the main computations in the transmission technology.
These DSPs are found on the ATU-C located at the central office in the ADSL network.
Because the ATU-C contains up to 3 ADSL circuits, and in turn each ADSL circuit has at
least 2 DSPs, the minimum number of DSPs on each ATU-C is 6. This corresponds to 2
DSPs for each phone line. Presently 560 million copper phone lines exist worldwide.
Therefore, if each of these phone lines subscribed to ADSL, over a billion DSPs would
be needed in this network alone!
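The count above can be checked with simple arithmetic (the figures are those cited in the text, not independent data):

```python
# The counting argument worked out: 3 ADSL circuits per ATU-C and at least
# 2 DSPs per circuit gives a minimum of 6 DSPs per line card, i.e. 2 DSPs
# for each phone line served.
circuits_per_atuc = 3
dsps_per_circuit = 2
print(circuits_per_atuc * dsps_per_circuit)   # 6 DSPs per ATU-C minimum

# Scaled to the 560 million copper lines cited, at 2 DSPs per line:
phone_lines = 560_000_000
print(phone_lines * 2)                        # 1120000000, over a billion
```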
2.2 MPEG
Although this next application is not the same type of communication technology as the
other two studied, it exhibits some of the same characteristics, such as the need for many
repetitive DSP computations. This is the MPEG (Moving Picture Experts Group)
standard, which describes a compression technology for video. MPEG compresses video
data into a smaller format so that more information can fit on a storage disk or more data
can transfer across a network. The compression ratio achieved with MPEG ranges from
30:1 to 8:1, depending on the complexity of the video. One of the most popular
applications today that employs MPEG compression is DVD (Digital Versatile Disk).
This technology stores video information on DVDs similar to VHS video tapes and is
played on special DVD players (like VCRs). Each disk can hold up to 17 Gigabytes of
information. That's a lot of data!
A second, very popular application that employs
MPEG compression is video conferencing. This application relies heavily on the
transmission of data across different types of networks. Thus, using the MPEG algorithm
to compress data enables video conferencing applications to achieve real-time point-to-point communication. With MPEG compression, other technologies such as High
Definition Television (HDTV) can be transmitted at 24 frames per second while movies
and live broadcast at 30 frames per second in order to produce high quality resolution
pictures. An update to the standard, MPEG-2, adds the functionality to transmit
high-quality broadcast video. The main difference between the standards is the data rate at
which they can transmit video sequences. The MPEG-1 standard targets a data rate of
1.5Mbps, which transmits over most transmission links that support the MPEG format,
namely the Internet, cable networks, and ADSL networks; MPEG-2 transmits at a data
rate of 4-8Mbps. MPEG-2 supports a broader range of applications including digital TV
and coding of interlaced video, retaining all of the MPEG-1 syntax and functionality.
The MPEG compression algorithm depends largely on motion compensation and
estimation. A block diagram of the algorithm is shown in Figure 2-2 [2]. It first takes
low resolution video and converts the images to YUV space. In this domain, the U and V
(color) components can be compressed to a greater degree than the Y component without
affecting the picture quality. Video pictures characteristically do not contain a lot of
movement, and in many cases that movement can be predicted in an intelligent manner.
MPEG compression does this prediction by estimation and
interpolative algorithms. Specifically, these techniques perform inter-frame coding
which means motion is predicted from frame to frame in the temporal direction. The
MPEG video stream consists of three types of frames. These frames are defined based on
whether their spatial or temporal redundancy is eliminated. They are also grouped
together to form GOPs (Groups of Pictures) or the MPEG bit stream. The I (Intra-coded)
frames are coded by eliminating spatial redundancy using a technique derived from JPEG
compression and serve as a reference point for the sequence. The I frame originates as a
sequence of raw images, which are then split into three 8x8 blocks of pixels (one block
for luminance and the other two for chrominance). These blocks then pass through a
Discrete Cosine Transform (DCT), are quantized, and finally proceed through an
entropy encoder which transforms the images into an MPEG bit stream (see Figure 2-2).
The second type of frame is the P (Predictive-coded) frame, which is coded using motion
estimation and depends on the preceding I or P frame. In addition to the motion
estimation and compensation operations, P frames require a DCT computation as well.
The DCTs performed on I and P frames serve to eliminate the spatial redundancy found
in these frames.

[Figure 2-2 shows the encoding loop: low resolution input passes through the DCT and
quantizer (Q) to produce the compressed data, with an inverse quantizer (IQ), inverse
DCT (IDCT), filter, and motion compensation (M.C.) feeding a motion estimation
stage.]
Figure 2-2: MPEG Encoding Algorithm

Finally, the B (Bidirectionally predictive-coded) frames are predicted based on the two
closest P or I frames and are the smallest frames in the sequence.
Although this type of coding exploits similarities with future images and reduces
temporal redundancy, it still introduces a large delay in the overall algorithm [5].
Because images are not compressed as a single frame, an MPEG bit stream
usually consists of thousands of blocks, which together represent the individual images.
In essence these blocks are just smaller images and are encoded as described above using
the MPEG algorithm. Consequently, the process to compress an image is a highly repetitive one
since the same operations are executed on each block. In addition, these computations
are independent of each other and require digital signal processing. Specifically DSPs
are used to compute DCTs as well as the other signal processing operations required by
the MPEG algorithm. A Bulk DSP implementation for this algorithm could reduce the
overall compression time by performing many of these operations in parallel.
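For concreteness, the per-block transform can be sketched directly. The code below implements the orthonormal 2-D DCT-II on one 8x8 block; it is an illustrative stand-in for the hand-optimized DCT kernels a real encoder's DSPs would run.

```python
# Sketch of the per-block DCT step: the orthonormal 2-D DCT-II used by
# JPEG/MPEG intra coding. Every 8x8 block of an image passes through this
# same computation, which is why the encoding workload is so repetitive.
import math

N = 8

def dct_2d(block):
    def c(u):
        return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
    return [[c(u) * c(v) * sum(
                block[x][y]
                * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

# A flat (constant) block concentrates all its energy in the DC coefficient:
flat = [[16] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0]))   # 128 (all other coefficients are ~0)
```

Since each block is transformed independently, many such blocks could be dispatched across the Bulk DSP's processing elements at once.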
2.3 GSM
Mobile communications technology has undergone a major change in the last several
years. The mobile or cellular world has transferred from the analog to the digital domain.
Previously, cellular phones used a strictly analog protocol to transmit signals. However
with the increased number of cellular users, the push for faster data rates, and the need for
better service, cellular technology has moved into the digital realm. GSM (Global
System for Mobile Communications), a digital cellular radio network, relies on digital
cellular technology. It has been widely used in Europe for several years and is gaining
popularity in the US. GSM implements the Personal Communication System (PCS), which
delivers more than just a wireless phone service. PCS incorporates the transfer of calls,
voice mail, and other data transfers anywhere, anytime. In fact, each GSM phone has a
personal identifier, which is unique to the phone and identifies itself on the GSM network
from any location. PCS also includes the ability to connect your GSM phone to a laptop
or computers in order to send and receive faxes, email, or connect to the Internet. GSM
has stepped to the forefront in mobile communications and provides its services in over
200 countries worldwide [9].
The GSM network architecture consists of three main functional entities that
interface with each other to provide end-to-end communication. The subsystems are the
Base Station Subsystem, the Network and Switching Subsystem, and the Operation and
Management Subsystem (see Figure 2-3).

[Figure 2-3 shows multiple BTSs grouped under BSCs within the BSS, the BSS linked
to the NSS and ISDN, and the OMCs and NMC forming the management subsystem.]
Figure 2-3: GSM Network Architecture
GSM subscribers connect to the GSM network via a radio link from their phone
(the Mobile Station) to the Base Station Subsystem (BSS). The BSS is actually
composed of multiple Base Transceiver Stations (BTS) and a Base Station Controller
(BSC). The BTS includes all the transmission and reception equipment such as the
antennas and transceivers in order to conduct the radio protocols and signal processing
over the radio link. The BSC controls the set of BTSs in its service area and controls
radio-channel setup, frequency administration, and handovers for each call. In addition,
the multiplexing of speech data is performed by the Transcoder Rate Adaptation Unit
(TRAU) which is located at either the BTS, BSC, or the MSC (Mobile Service Switching
Center) depending on the configuration. The BSS subsystem interfaces with the Network
and Switching Subsystem, specifically, the Base Station Controller connects to the main
component of the Network and Switching Subsystem, the MSC. The MSC manages
communication with other fixed telecommunication networks such as ISDN (Integrated
Services Digital Network) and PSTN (Public Switched Telephone Network), and it also
performs paging, resource allocation, location registration, authentication, and encryption
functionality required to service a mobile subscriber. Finally, all the equipment in the
BSC and the Switching System connect to the Operations and Maintenance Center
(OMC), which handles the operation and maintenance of GSM equipment and supports
the operator network interface. The OMC performs mostly administrative functions such
as billing within a region. Depending on the size of the network, there may only be one
OMC in a country in which case the OMC is responsible for the network administration
in the entire country [13].
GSM's technology allocates a range of frequencies to a GSM system and divides
that band of frequencies into individual simultaneous data channels. Each GSM system
has a bandwidth of 25MHz that allows for 124 carriers with a bandwidth of 200KHz
each. There are 8 users per carrier, and as a result approximately 1000 total speech or
data channels. The maximum speech rate on the channel (known as full-rate speech) is
13Kbps (Kilobits/sec) and the maximum data rate is 9.6Kbps.
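These capacity figures follow from simple arithmetic on the numbers just cited:

```python
# The channel arithmetic worked out: a 25MHz allocation divided into 200KHz
# carriers, with 8 time-division users per carrier.
bandwidth_hz = 25_000_000
carrier_hz = 200_000
carriers = 124                        # usable carriers cited in the text
users_per_carrier = 8

print(bandwidth_hz // carrier_hz)     # 125 slots, of which 124 carry traffic
print(carriers * users_per_carrier)   # 992, i.e. roughly 1000 channels
```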
GSM's main purpose is to transmit information (either speech or data) reliably in
wireless form from one location to another. The following explanation will describe the
full-rate speech transmission in order to highlight the main details of GSM. The process
begins when the mobile station (the GSM phone) receives an audio signal (speech)
through a microphone. This signal must first be converted from an analog to a digital
signal before processing begins. This occurs by first filtering the signal so that it only
contains frequency components below 4KHz. This frequency characterizes baseband
voice signals and is the minimum bandwidth necessary to accurately recognize a voice.
Once filtered, the signal is sampled at a rate of 8000 samples per second (8KHz), which
corresponds to the minimum sampling frequency needed in order not to lose any
information. As the signal is sampled, it is quantized into 13-bit words. Thus, the
output of this analog to digital converter is a bit stream of 104Kbps (13 x 8000) which
then becomes the input to the GSM speech codec. The speech codec's job is to reduce
this data rate to a size more appropriate for radio transmission. In essence, it removes all
the redundant information in the data stream. The codec uses the Linear Predictive
Coding (LPC) and Regular Pulse Excitation (RPE) algorithms to perform this function
and executes at a bit rate of 6.5Kbps. GSM's codec collects segments of the data stream
every 20ms and produces speech frames of 260 bits every 20ms. This corresponds to a
speech rate of 13Kbps. From there, the data is transmitted via the radio link to the Base
Transceiver Station. The next step in the process occurs at the Base Station, where the
BTS receives the signal and proceeds to demodulate it and recover the data.
The signal is then directed and further transmitted by the Base Station Controller to the
MSC where the GSM transcoder (speech encoder and decoder) converts the GSM
formatted encoding into either a speech format for the PSTN or to 13Kbps data for GSM
mobile station functions.
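The codec's rate figures quoted above are consistent, as a quick check shows:

```python
# The codec bit-rate figures checked by arithmetic: 13-bit samples at 8000
# samples/sec enter the codec, and 260-bit frames leave it every 20ms.
bits_per_sample = 13
sample_rate = 8000
print(bits_per_sample * sample_rate)     # 104000 bps = 104Kbps into the codec

frame_bits = 260
frame_period_s = 0.020
print(int(frame_bits / frame_period_s))  # 13000 bps = 13Kbps full-rate speech
```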
The essential part of the GSM technology depends on digital signal processing to
encode and decode bits of information into the GSM format. Specifically DSP
processors perform the speech, modem, and channel coding, as well as decoding
operations. The DSP operation of interest computes the encoding and decoding part of
the GSM algorithm and is performed by the GSM transcoder or codec. There are many
of these DSPs in the network located in the mobile station as well as in the BTS, BSC, or
MSC depending on the network configuration. The BSC and MSC can be considered
central processing locations in the GSM infrastructure since most of the phone calls are
routed through these units. Effectively the transcoders here encode and decode
individual phone calls where the likely configuration is one transcoder per channel, thus
operating on one channel at a time. As with the ADSL network, GSM uses DSPs to
perform key processing operations on each channel of communication. So again there is
a central place in the network where "bulk processing" occurs and at which repetitive
computations are executed among its processors.
The ADSL, MPEG, and GSM technologies share similarities at two different
levels; first in their governing algorithms and second, in their system architectures. At
the algorithm level, they each require a lot of digital signal processing which
characterizes most of their computations. As stated, DSP operations tend to be composed
of a relatively small set of instructions, which are executed repetitively. Thus, each DSP
operation in the algorithm can be treated as a separate computational module. If the
algorithm is subdivided into these modules then the steps of the algorithm are easily
identified and an instruction level parallelism (ILP) results. GSM, ADSL, and MPEG
exemplify this level of parallelism in their algorithms. The second similarity exists at the
system level and also exemplifies a type of ILP. Each technology operates on multiple
independent data streams in parallel; thus, there is an inherent repetition among the
computations performed. The system architecture of each technology dedicates multiple
processors to work on the data even though they are all essentially doing the same thing.
Therefore the utility of a Bulk DSP is evident. This multiprocessor could take advantage
of this inherent repetition to increase computing power and thereby exceed
the performance of modern day microprocessors. Indeed, one would assume that if the
DSP were designed with "N" simple processors, the improvement in performance would
equal that of "N" present day DSP processors. However, if the Bulk DSP were designed
to optimize a particular algorithm, one could imagine exceeding a factor of N
improvement in performance with the use of "N" processors. Therefore, a single
application has been chosen and a Bulk DSP designed in an effort to achieve this type of
improvement.
Chapter 3
The Bulk DSP Architecture
3.1 Parallel Computing
Parallel processing describes a computational style suited to applications
that exhibit some type of parallel algorithmic behavior. These applications usually
consist of small computational modules that are reused throughout the
algorithm. Given that there is a set number of transistors available to design a parallel
multiprocessor, the question is how best to utilize these transistors to maximize
performance. In order to answer this question, performance-critical aspects of the
algorithm must be considered; they include the amount of data processed, the load
balance among the computational modules, the parallel structure, the distribution of data,
and the spatial and temporal access patterns to memory of the algorithm [10]. These
factors determine the design parameters of the multiprocessor such as the architecture of
the simple processing elements, the allocation of memory resources, the communication
protocol, and ultimately the number of simple processors. With the exception of the
communication protocol, all of these parameters were considered in this design process.
Bulk DSP is basically a subset of this type of processor architecture. Bulk DSP aims at
connecting many simple processors on a single chip instead of designing one large
complex processor. The gain in performance stems from using these simple processors in
a parallel structure. Bulk DSP differs from modern day microprocessors in that its basic
building block consists of simple hardware and a reduced instruction set. Unlike Intel's
Pentium processor, which incorporates branch prediction and multiple instructions per
clock cycle, Bulk DSP relies on a simple set of instructions using a RISC-like structure
[11]. Also, in contrast to the Pentium, the Bulk DSP does not include an extensive
memory hierarchy. The memory components of the Bulk DSP consist of a simple
instruction cache and data cache. The instruction cache does not have to be large due to
the small number of instructions utilized by the algorithm; the main part of memory is
dedicated to the data cache that serves as a buffer memory between the modules of
computation. This architecture also differs from another type of modern day processor
named IRAM (Intelligent RAM), which was designed at the University of California,
Berkeley [12]. This processor relies on the ability to place one billion transistors on a
single chip, made possible by advances in integrated circuit technology. In having such a
large transistor budget, IRAM is able to allocate a large portion of its transistors to
memory, specifically on-chip DRAM. Its main purpose is to diminish the gap between
microprocessor performance and the latency of main memory accesses. Although the
Bulk DSP would also rely on being able to integrate a large number of transistors on a
chip, the Bulk DSP allocates these resources to computing or processing power rather
than memory. Instead of dedicating 80% of the transistor budget to memory as IRAM
does, the Bulk DSP might dedicate this percentage to processors. Bulk DSP applications
require a lot of arithmetic computations and thus would benefit from more processing
power than memory. Both of these processors, the Pentium and IRAM, are beneficial for
certain classes of applications. The Pentium is designed for general-purpose applications
that don't necessarily exemplify a specific type of algorithmic structure while the IRAM
targets memory-intensive applications such as database and multimedia programs. The
Bulk DSP targets neither of these areas; rather, it aims at improving applications
that require a lot of parallel signal processing. Thus, in comparison, an architecture
such as the Bulk DSP would be more advantageous in performance than a Pentium or
IRAM processor for this class of parallel applications. Additionally, the Bulk DSP does
better from a cost standpoint; the cost to have many simple processors on a chip is less
than the cost of a lot of DRAM or other specialized hardware characteristic of modern
day microprocessors.
3.2 Processing Elements
The architecture and organization of the Bulk DSP's simple processors model the
processing elements used in SIMD parallel processing. A SIMD multiprocessor usually
contains many simple processors, called processing elements, and a single control unit
with only one instruction and data memory resource. These processing elements are
characterized by their simplicity. Their main function is to execute the instructions given
to them by a control unit that distributes the instructions to all processors. The ILP
present in these programs implies that short instruction sequences will be carried out in
parallel. Because each simple processor only carries out the given instructions, it
contains minimal control logic. In fact, these simple processors have a RISC
architecture, which basically just fetches an instruction and data, executes the assigned
computation, outputs the new data, and fetches the next instruction in a continuous cycle.
Essentially, these simple processors contain only basic hardware components and do not
require a lot of complexity, therefore, they are inexpensive and easy to replicate on a
chip. The number of simple processors used in this Bulk DSP will be discussed in the
scheduling section.
3.3 Memory Hierarchy
The memory resources of the Bulk DSP play a large part in the design process. For this
multiprocessor, a single 2KB (kilobyte) cache for both instructions and data will be
allocated to each simple processor. The caches will be subdivided into 256-byte sections,
which can be designated as either instruction or data memory. The instruction size of the
computing module(s) in each processor determines the portion of the cache used for
instructions, and the number of input and output bytes for each module determines the
amount used for data cache. Because the algorithm will be subdivided into computational
modules among the processors, data will need to propagate from one processor to
another. This means that each processor will have to both read data from and write to
other processor memories. This data movement can be set up in such a way that it
occurs in the "background." Consequently, the processors will not have to
wait for data, and no cycles will be wasted on data movement. This concept will be
enabled by the buffer memories between the computing modules. For each processor,
there will essentially be four buffer memories: two on the input "side" and two on the
output "side." One buffer memory on each side will be dedicated to the current set of
data being processed; this will give the processor a place from which to read current data
and another place to write out current data. The other two memories associated with each
processor are for other processors to write to or read from while the processor associated
with those buffer memories is busy working on the current set of data.
The focus of the remainder of the investigation explores how to best implement a
Bulk DSP. GSM serves as an excellent application for the Bulk DSP and will be the
main application of the designed processor, for several reasons. First, the
Base Transceiver Station in the GSM network acts as a core processing node at which
many DSP computations take place. Second, a software library implementing the GSM
algorithm was found and proved useful for this investigation. Third, GSM is a popular
mobile cellular system, which has gained acceptance around the world thereby making it
a very relevant and useful technology to explore.
3.4 GSM Implementation
Because GSM is a cellular phone network, human speech encompasses the
majority of the information transmitted across the network. As mentioned, the speech
compression algorithm used in GSM is Regular Pulse Excitation-Long-Term
Prediction (RPE-LTP), specified in the GSM 06.10 standard [9]. A block diagram of the
encoding algorithm is shown in Figure 2-1. This algorithm is executed in the GSM codec
and serves as its primary functionality. The input frames to the codec consist of 160
signed 13-bit linear PCM values, sampled at 8 kHz. They come from
either the audio part of the mobile station or from the PSTN. These frames last for 20ms,
and thus, cover about one glottal period of a very low voice or 10 periods for a very high
voice. Because this is a relatively short period of time, the speech wave does not change
much, and thus the algorithm will not lose any information by dividing up the speech
signal in this way.

Figure 2-1: GSM Encoding Algorithm (block diagram: the pre-processed input signal
[0..159] passes through Short-Term LPC Analysis, producing the Log Area Ratios, and
Short-Term Analysis; Long-Term (LTP) Analysis produces the LTP parameters; RPE
Analysis produces the RPE parameters and grid. Labeled signals: (1) short-term
residual, (2) long-term residual, (3) short-term residual estimate, (4) reconstructed
short-term residual, (5) quantized long-term residual.)

The encoder divides the input speech samples into a short-term
predictable part, a long-term predictable part, and a residual pulse. It then
encodes and quantizes the residual pulse and the parameters for the two predictors. The
decoder applies the long-term prediction to the received residual pulse in order to
reconstruct the speech and then passes the filtered output through the short-term
predictor [6].
3.5 Encoding and Decoding
The first step in the encoding algorithm consists of preprocessing the samples to produce
an offset-free signal and then passing them through a first-order preemphasis filter. The
resulting 160 samples are then analyzed to determine the coefficients of the short-term
analysis filter. This short-term linear-predictive filter (LPC analysis) is the first stage of
compression and the last stage of decompression. The speech compression in this
algorithm is achieved by modeling the human speech system with two filters and an
initial excitation, of which LPC is the first filter. In this process, the short-term filter acts
as the human vocal and nasal tract: when excited by a mixture of glottal wave and
noise, it produces speech that is, ideally, similar to the original. The set of
coefficients determined from the preprocessed signal is applied, along with the 160
samples, to produce a weighted sum of the previous output, which is termed the
short-term residual signal. In addition, the filter parameters, named the
reflection coefficients, are transformed into log-area ratios (LARs) before transmission
since they will be used for the short-term synthesis filter in the decoder. The next stage
in processing involves the long-term analysis, where the main computation is the
long-term prediction filter. Before filtering, the speech frame is subdivided into
40-sample blocks of the short-term residual signal. Also, the parameters of the long-term
analysis filter, the LTP lag, which describes the source of the copy in time, and the LTP
gain, a scaling factor, are estimated and updated in the LTP analysis block. Both of these
prediction parameters are calculated based on the current sub-block and the previous 120
reconstructed short-term residual samples. With these parameters, an estimate of the
short-term residual signal is found via the long-term prediction filter. Then, the last stage
of this section subtracts the estimated short-term residual signal from the actual
short-term signal to produce the long-term residual signal. With each 40-sample
sub-block iteration, 56 bits of the GSM encoded frame are produced. The resulting 40
samples of the long-term residual signal are then passed to the regular pulse excitation
analysis for the primary
compression operation of the algorithm. Here, each sub-segment of the residual signal is
filtered by an FIR (Finite Impulse Response) algorithm and then down-sampled by a
factor of 3. This yields four candidate sequences of length 13. The sequence with the
most energy is chosen, and its 13 samples are quantized by block adaptive PCM
(APCM) encoding. The chosen grid position is passed on to the decoder via a 2-bit grid
selection.
Lastly, the encoder updates the reconstructed short-term residual in order to prepare the
next LTP analysis. In summary, the speech codec, or encoder, compresses an input of 160
samples into an output frame of 260 bits every 20ms. Therefore, one can see that one
second of speech equals 1625 bytes and one megabyte of compressed data holds about 10
minutes of speech [6].
The decoder mirrors many of the encoding computations. Decoding occurs when
a call is received from the PSTN or from the Mobile Station (the cellular phone) at the
Base Station. The decoding algorithm begins by multiplying the 13 3-bit samples by the
scaling factor and expanding them back to 40 sample sub-blocks. This residual signal
passes through the long-term predictor, which consists of a feedback loop similar to the
one in the encoder. The long-term synthesis filter retrieves 40 samples of the previously
estimated short-term signal, scales them by the LTP gain, and adds them to the incoming
pulse. This new short-term residual becomes part of the source for the next three
predictions. In
addition, these samples are applied to the short-term synthesis filter, which uses the
reflection coefficients calculated by the LPC module. Finally, the de-emphasis filter
processes the samples whose output should resemble the original speech signal.
Chapter 4
Scheduling
In an effort to design a Bulk DSP that will optimize the performance of the GSM
algorithm, different organizations of the algorithm's computational modules were
arranged and considered for the building block of the multiprocessor. These architectures
differ based on the number of computing modules grouped together in a single processor
and the schedule in which the modules are executed by the control unit. Changing the
architecture based on these parameters allowed the designer to explore the parallel
structure already present in the GSM algorithm.
4.1 Design Methodology
At first, the "best architecture" for this Bulk DSP might seem to be a set of simple
processors with a fixed memory resource each assigned to process the entire decoding
algorithm. This architecture results in each simple processor working on an entire frame
at a time. The benefit from this solution is that each processor will continuously process
data. The only idle time induced would occur the first time an instruction
is called within a frame, which incurs some memory access time to fetch the instruction
into the cache. Due to the limited memory resource, the entire decoding program cannot
fit into the on-chip cache, hence idle cycles are spent waiting for memory accesses.
design represents the scheme where a factor of N improvement is achieved by simply
replicating N simple processors on chip. However, this architecture does not
take advantage of any instruction-level parallelism present in the algorithm, and
therefore, the idea is that a more efficient scheme using the same amount of hardware
exists. Thus, given that this Bulk DSP is composed of simple processors with 2KB of
memory each, what is the best organization of these resources? In order to best address
this question, careful study of each computational module is required. Accordingly, a
discussion of the computational modules will follow.
4.2 GSM's Computational Modules
The GSM algorithm consists of two main operations: encoding and decoding. Each of
these operations can be easily subdivided into a set of independent computing modules.
This modularity allows flexibility in organizing the simple processors. Here, we
will focus on the decoding part of the algorithm in the design of the Bulk DSP. Because
many of the modules are the same for decoding as they are for encoding, an approach
similar to the one taken here may also serve to design a multiprocessor for the encoder.
In order to subdivide the algorithm, specific functions within the overall computation
were identified (see Figure 4-1). Ten independent modules were distinguished, differing
in instruction and data size. Due to the nature of the GSM algorithm, several of these
modules can be executed in parallel thus providing a means to optimize the architectural
organization of the algorithm. The number of instructions executed characterizes each
module. There are four important parameters that determine the above information for
each module. They are the numbers of Loads, Stores, Arithmetics, and Shifts encountered
in the instruction set of the module. A Load represents a processor reading from memory
(this could be either instructions or data). Specifically, a Load fetches two bytes of
information at a time. Stores symbolize the times the processor writes to memory.

Figure 4-1: GSM Decoding Algorithm (block diagram: APCM inverse quantization and
RPE grid positioning, long-term (LTP) synthesis, decoding of the coded LARs,
LAR-to-reflection-coefficient conversion, short-term synthesis, and de-emphasis
producing the output signal [0..159])

Arithmetics are the actual computations executed by the processor, and Shifts
correspond to the computations for indexing a data array. Another aspect of these
modules is the presence of loops in their structure. As noted earlier, DSP computations
require many of the same operations repetitively, which accounts for the large number of
loops found in these modules. An example of this is demonstrated in Figure 4-2, which
shows the code for one of the computing modules, Long_Term_Synthesis_Filtering. An
example of the Load (L), Store (S), Arithmetic (A), and Shift accounting is also
demonstrated. So, to determine the number of instructions executed by this
computational module, the sum of Loads, Stores, Arithmetics, and Shifts was calculated
without regard to the loops. This information designates the number of bytes stored in
the instruction cache (I-cache). One assumption made here is that each instruction equals
4 bytes of memory. This is a typical number for most modern day instruction sets. The
second calculation counts the same parameters, but this time includes the loop
iterations. The sum of these parameters represents the total number of
void Gsm_Long_Term_Synthesis_Filtering (
    struct gsm_state * S,
    word               Ncr,
    word               bcr,
    register word    * erp,
    register word    * drp
)
{
    register longword ltmp;   /* used by the GSM_ADD macro */
    register int      k;
    word              brp, drpp, Nr;

    Nr = Ncr < 40 || Ncr > 120 ? S->nrp : Ncr;
    S->nrp = Nr;
    assert(Nr >= 40 && Nr <= 120);

    brp = gsm_QLB[ bcr ];
    assert(brp != MIN_WORD);

    for (k = 0; k <= 39; k++) {
        drpp   = GSM_MULT_R( brp, drp[ k - Nr ] );
        drp[k] = GSM_ADD( erp[k], drpp );
    }

    for (k = 0; k <= 119; k++) drp[ -120 + k ] = drp[ -80 + k ];
}   /* 8 Loads, 3 Stores, 5 Arithmetics, 2 Shifts */

Figure 4-2: Long_Term_Synthesis_Filtering Module
operations executed by the processor in a module. Operations process two bytes of data
at a time, thus processing 16-bit operands. The number of total operations will also be
used as the main factor to study the relative length of time the modules require to execute
all their operations. Lastly, one final calculation was done to determine the number of
data bytes that enter and exit each module. Basically, the input and output data indicate
the number of bytes processed by the module, in addition to revealing the data movement
to and from the modules (see Table 4.1).
Computational Module                         In/Out    Operations   Instructions
Gsm_APCM_quantization_xmaxc_to_exp_mant        4/2          436           53
APCM_inverse_quantization                     30/26         745          109
RPE_grid_positioning                          28/80         699           65
Gsm_Long_Term_Synthesis_Filtering            328/82        3451           92
Gsm_Long_Term_Add                            320/320        920           16
Decoding_of_the_coded_Log_Area_Ratios         16/16         334          334
Coefficients                                  16/16         248          122
LAR_to_rp                                     16/16         242           34
Short_Term_Synthesis_Filtering (k=13)         50/26        6680           90
Short_Term_Synthesis_Filtering (k=14)         54/28        7193           90
Short_Term_Synthesis_Filtering (k=120)       240/272      61571           90
Postprocessing                                42/26        4650           39

Table 4.1: Computational Module Parameters
4.3 Evaluation of the Architectures
There are two main aspects to the design of the multiprocessor: the organization, which
includes the number of simple processors and the assignment of computational modules
to each processor, and secondly, the schedule of the modules, which represents the order
and time at which the modules are called. For each organization, a schedule was
constructed and the time it would take for the organization to produce 1, 10, and 25
frames was calculated. Frames represent segments of individual phone calls.
Several approaches were taken in order to reach the optimal design for the
building block of this multiprocessor. The first, perhaps an obvious choice, involved
assigning each module to its own processor and arranging them as a pipeline in the order
in which they are executed by the algorithm. Thus, the scheduler for this organization
simply called each module as the previous one completed. Because of the buffer
memories associated with each processor, data movement between modules essentially
occurred in the "background," and consequently, the calls to the modules could be made
as soon as the previous one finished. Again, the number used as the time to complete a
particular module was the total number of operations for that module. The next
modification was made after observing the instruction-level parallel structure present in
the algorithm. While maintaining the one or two modules per processor approach, the
scheduler was modified to call the modules in parallel wherever possible as shown by
Figure 4-3 where 'M' denotes the module.
Figure 4-3: Preliminary Building Block of Bulk DSP (the ten modules M1-M10
assigned to processors, with independent modules scheduled in parallel)
This change, of course, decreased the total computation time in comparison to the
previous organizations. These designs were fairly straightforward, though it was evident
that better performance could be achieved, given the large number of idle cycles incurred.
The next approach included looking more closely at the length of time it took each
processor to execute its module. Unfortunately, there exists great disparity among the
modules with respect to the computation time. As a result, more processors were
dedicated to modules with the greatest number of computations. Many of these modules
are called four times each (four times per frame), so it was easy to assign four processors
to one module without having to break up the module. This also seemed to be a good idea
since the modules could be executed in parallel and as a result performance improved.
At this stage in the design, however, an important issue arose. How were these
architectures comparing to our best-case scenario? There was basically no metric used to
see if the increase in performance was really significant or if it still lagged the
one-processor-per-frame scheme. Up until then, it was thought that the total time it took
each organization to compute 1, 10, or 25 frames could be used to compare the different
organizations. But this number does not account for the number of wasted cycles
incurred in each type of architecture. As stated, the idle time in the
one-processor-per-phone-call scenario derives from the limitation of the 2KB cache,
which does not hold the entire program for the algorithm. Here, the minimum number of
memory accesses each processor exercises equals the number of instructions in the
entire algorithm. In
contrast, in the schemes which hold just a few computational modules, the 2KB cache is
sufficient to hold the instructions for the individual modules. However, in these
architectures, there is idle time in the latency between processors since the number of
operations vary greatly among the modules. In order to determine the time for memory
access, a few assumptions were made. First, each access to memory takes 20 cycles and
second, each access to memory fetches 4 instructions. Hence, the number of wasted
cycles due to memory access equates to the total number of instructions multiplied by 5.
The latency between modules was simply calculated from the scheduler. Due to the fact
that certain modules take longer than others, some processors would often have to wait a
long time before they could start their computation because they were waiting
for input from another processor. This is shown in Figure 4-4, which demonstrates a
segment of the schedule for one of the proposed architectures. The numbered 'P's denote
the simple processors, while the numbers represent the computation times for the
modules.
[Figure shows processors P1, P2, and P3 executing modules with computation times
such as 345, 920, 6680, 7193, and 61571 operations, with idle time between them.]
Figure 4-4: Segment of Schedule for Proposed Architecture
The idle time is denoted as the time in between processors. This turned out to be a major
problem in all of the architectures considered thus far. The number of wasted cycles due
to latency was calculated for each organization and compared to the idle time (memory
accesses) for the first pass "best architecture" to see if performance was better.
Unfortunately, the added hardware was not being used efficiently and the number of
wasted cycles outweighed any gain in performance. These results are summarized in
Table 4-2. Architecture 1 represents the building block shown in Figure 4-3 while
Architecture 2 represents the architecture from Figure 4-4, which includes a total of 4
simple processors. The performance of the best architecture is measured based on the
number of simple processors in each proposed scenario.
                                 Frames Produced    Avg. Idle Cycles/Proc
Architecture 1                         96                  71526
Best Architecture, Scenario 1          64                   4770
Architecture 2                        423                  56495
Best Architecture, Scenario 2          26                   4770

Table 4.2: Comparison of Results with Best Architecture
Another approach was attempted, but this time careful attention was paid to the
number of instructions and data for each module. The goal was to pack as many modules
into one processor that would fit into the 2KB cache in an effort to keep each processor
busy all the time, thus reducing idle time. The previous considerations such as
computation time per module were also considered. It was noted that grouping together
modules that did not depend on each other's execution order worked better.
Moreover, modules that shared no data dependency, such as Postprocessing and
Long_Term_Synthesis_Filtering, were grouped together. This reduced the idle time for
each processor. It soon became apparent, though, that no organization could eliminate
the large number of idle cycles in these architectures. The problem stems from the
disparity of one particular module's computation time (Short_Term_Synthesis_Filtering)
in comparison to all the others (see Table 4.1). This module was broken into four
sub-modules based on the four times it is called in the frame. Three of these iterations
had computational lengths on the same order; however, the last sub-module is nearly ten
times longer than the first two. This was the major source of processor idle time since
essentially all the other processors had to wait until this computation was done before the
next frame could be processed. A completely new approach was needed in order to
exceed the performance of the first proposed architecture.
The final iteration necessary to produce a better, more efficient architecture than
the original one entailed completely re-thinking the approach to the problem. This time,
the design was done without a fixed cache size for the processors. So instead of trying to
group the modules together based on their instruction sizes, the modules were grouped
sequentially, and the algorithm was divided up evenly based on the number of total
operations. Four processors were chosen as the number of processors needed to perform
the complete decoding algorithm. This number was selected based on the results of the
previous iterations and the number of buffer memories needed in this new scheme.
Essentially, more buffer memories were required because the longest module,
Short_Term_Synthesis_Filtering, was broken into four pieces, and thus information
regarding the state within a loop needed to pass from one processor to the next. In
essence, the four processors executed approximately the same number of operations and
required a total of 12KB of cache altogether. Figure 4-5 shows a picture of the final Bulk
DSP including all the processing elements, memory and control units.
Figure 4-5: Final Bulk DSP Architecture (four processing elements P1-P4 with their
buffer memories, shared memory, and control logic)
The assignment of computational modules to the four processors is:
P1: M1-M8, M9
P2: M9
P3: M9
P4: M9-M10
In order to accurately compare this architecture to the original one proposed, it was
necessary to modify the first one to equate the hardware resources allocated to each.
Thus, 3KB of cache was allotted to each processor in the original architecture, so that
evaluating the performance of four of these processors would accurately compare to the
performance of four processors in the final architecture. Interestingly enough, though,
adding 1KB of cache to the first architecture does not really affect its number of idle
cycles, since 3KB of memory is not enough to store the entire program. However, the
cache scheme assumed for this architecture is simply a least recently used (LRU)
method, which removes the oldest-touched data in the cache when it needs to store
new data, and this is not the most efficient way to utilize a cache. So in order to
significantly reduce this idle time, a new cache scheme would need to be implemented.
One can imagine reducing this number by a factor of two if half of the instruction set
could be kept resident in the cache. Because the percentage of idle time in the original
architecture represents such a small part (4%) of the overall computation time, it is
difficult to design an architecture with a significant increase in performance over the
original architecture.
                     Frames Produced    Avg. Idle Cycles/Proc
Best Architecture          64                  4770
Final Architecture         70                   2.5

Table 4.3: Final Results
Chapter 5
Conclusion
In this chapter, a summary of the work completed will be presented as well as several
suggestions for future work. Lastly some final thoughts on this investigation will
conclude the thesis.
5.1 Summary
Recognizing the improvements achieved with parallel processing technologies motivated
the idea for a Bulk DSP. This multiprocessor is intended for technologies that
demonstrate repetition at two levels: system and algorithmic. The former is characteristic
of communication technologies, which generally entail some type of core processing node
in their networks, and the latter of DSP technologies, which exhibit repetitive instructions
in their algorithms. This investigation aimed to research several technologies that
demonstrate these characteristics and to apply the concepts of parallel processing to
design a multiprocessor implementation. As a result, a specific technology, GSM, was
chosen and a design methodology carried out. This process was described in order to
present some of the architectures designed and analyzed for this investigation. Finally,
the most prominent architecture was deemed the best implementation of a GSM Bulk
DSP based on the increase in performance it would bring to the existing technology.
5.2
Future Work
The final architecture proposed succeeded in achieving only marginal improvements over
the N-DSP baseline. Thus, there might yet be another way to realize a larger improvement
in performance. As stated, the limiting factor in utilizing the presence of multiple
processors was the modules' disparity in computation length. If the
implementation of the decoding algorithm were modified to reduce this disparity, the added
processors might be used more efficiently. Specifically, the long sub-module of
ShortTermSynthesisFiltering could be divided into smaller, equal portions that
match the length of the other sub-modules. This would eliminate the wasted cycles due
to latency between the processors in the organizations that grouped only a few modules
together in an effort to take advantage of the instruction-level parallelism present in the
algorithm.
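The even division suggested above can be sketched with a simple partitioning calculation. This is an illustrative sketch under the assumption that the long sub-module can be sliced by contiguous sample or cycle ranges; the function names are hypothetical and not drawn from the thesis.

```c
#include <assert.h>

/* Number of equal slices of at most short_len cycles needed to cover a
   sub-module of long_len cycles (ceiling division). */
int balanced_slices(int long_len, int short_len) {
    return (long_len + short_len - 1) / short_len;
}

/* First index handled by slice i when `total` units of work are divided as
   evenly as possible among n slices; earlier slices absorb the remainder. */
int slice_start(int total, int n, int i) {
    int base = total / n, extra = total % n;
    return i * base + (i < extra ? i : extra);
}

/* One past the last index handled by slice i. */
int slice_end(int total, int n, int i) {
    return slice_start(total, n, i + 1);
}
```

Dividing the 160 samples of one decoded GSM frame among four processors, for example, gives each a 40-sample range; when the division is uneven, slice lengths differ by at most one unit, which keeps the processors' finishing times close together.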
A second way to improve the design might be to consider incorporating more of
the overall GSM algorithm that gets executed at the Base Station into the Bulk DSP.
This idea is based on the fact that the encoding and decoding instruction set is simply not
very large and therefore does not require a lot of computing power. If more functionality
were required of the Bulk DSP, its parallel structure and added processing power might
be efficiently used to achieve greater performance. The channel encoding/decoding, as
well as error protection and encryption of the radio channel are all parts of the GSM
system that occur at the Base Station and would be likely candidates to incorporate into
the Bulk DSP.
Finally, the majority of the work done in this investigation occurred at a
theoretical level, based on "paper calculations." Although the results gained from these
calculations are extremely revealing and necessary, there is another level of investigation
needed before the multiprocessor can actually be implemented in hardware. The next
step in design is to simulate the proposed organizations at both the algorithmic level and
the hardware level. The software analysis can be done by taking the best architectures
proposed and using a GSM decoding library to simulate these configurations. Actual
speech or GSM encoded data can be used as test vectors to see how long it takes each
architecture to process a given number of frames or data inputs. These results should
confirm which architecture is indeed the best to implement. Finally, the hardware should
be specified, which consists of designing the exact hardware components needed to
execute GSM's decoding algorithm, along with the control logic and memory hierarchy
to be used. Several well-known hardware description languages, such as VHDL and
Verilog, can aid in this part of the design. These languages provide a way to implement
the architecture at the circuit level as well as a method to simulate and verify its
functionality.
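The proposed software analysis, timing how long an architecture takes to process a given number of frames, might be harnessed as in the following sketch. The decode stub is a stand-in for real work; an actual harness would call the Degener/Bormann library's gsm_decode() on 33-byte encoded frames. All names here are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Stand-in for decoding one GSM frame; an actual test would invoke the
   GSM 06.10 library on real encoded test vectors. */
int decode_frame_stub(void) {
    volatile int acc = 0;
    for (int i = 0; i < 1000; i++)
        acc += i;                    /* simulated per-frame work */
    return acc;
}

/* Decode n frames and return the elapsed CPU time in seconds. */
double time_frames(int n) {
    clock_t t0 = clock();
    for (int i = 0; i < n; i++)
        decode_frame_stub();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Running such a harness once per candidate configuration, with identical test vectors, would yield the frames-per-second comparison the paragraph above calls for.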
5.3 Final Thoughts
The main conclusions drawn from this investigation are threefold.
1.) The GSM encoding/decoding algorithm is simply not complex enough to take
advantage of parallel processing. It contains only a small set of instructions, which does
not allow for efficient use of replicated processing power. The added processing power
is essentially lost in the latency that stems from the difference in the processing lengths of
its computing modules.
2.) Parallel processing involves much more than simply replicating processors on a
chip. In fact, replicating N processors on a chip does not always result in a factor-of-N
improvement. The effect of parallel processing largely depends on the algorithm it
tries to optimize. The number of instructions in the algorithm plays an important role,
as does the schedule in which those instructions are executed. As seen from the stated
results, many instructions are necessary in order to benefit from an increased number of
processors. The optimal program or algorithm for parallel processing is one that contains
many instructions, including large decision trees, so that only a very large cache could
store all of them. Without such a cache present, cache misses would claim a significant
amount of the total computation time; in that scenario, added processors could do useful
work during what would otherwise be idle time.
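The limit described in this point can be made precise with Amdahl's law, which the text does not state explicitly but which captures the same argument: if a fraction p of the computation can be parallelized, the speedup on N processors is bounded by

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad
S(4)\Big|_{p = 0.5} = \frac{1}{0.5 + 0.125} = 1.6
```

Quadrupling the processors on a half-parallelizable workload thus yields only a 1.6x speedup, which mirrors the marginal gains observed here for the GSM algorithm.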
3.) Because the Bulk DSP is essentially a parallel multiprocessor, it will only benefit
technologies with "optimal" parallel properties. In this study, it was applied specifically
to the GSM decoding algorithm and did not prove extremely beneficial. However, the
Bulk DSP could still be the most desirable processor for the other technologies
researched, such as MPEG and ADSL, since their specific algorithms were neither
identified nor considered. Furthermore, a GSM implementation that incorporates more
functionality might also result in a more beneficial Bulk DSP.
Appendix A
Software Tools Used
Jutta Degener and Carsten Bormann, of the Technical University of Berlin, developed
the GSM 06.10 software used to research and study the encoding and decoding
algorithms. The software consists of a C library as well as a stand-alone program. It was
first designed for a UNIX-like environment, although the library has been ported to VMS
and DOS environments; the DOS port was the implementation used in this investigation.
Several other tools were used to test and exercise the library. The code was compiled
with the Microsoft Visual C++ compiler. Cool Edit, a digital audio editor developed by
Syntrillium Software, was used to convert GSM files to raw PCM format in order to test
the encoding modules.
Appendix B
GSM Resources on the Web
Many web sites containing both official and unofficial information about GSM and the
telecommunications industry were found throughout this research. A list of the most
helpful and relevant sites follows.
* Dr. Dobb's Journal -- a good technical explanation of GSM encoding/decoding:
  http://www.ddj.com/articles/1994/9412/9412b/9412b.htm
* GSM Encoding/Decoding C library -- Jutta Degener and Carsten Bormann:
  http://www.kbs.cs.tu-berlin.de/~jutta/toast.html
* GSM Information Network: http://www.gin.nl/
* GSMag International: http://www.gsmag.com/
* GSM Online Journal: http://www.gsmdata.com/today.html
* GSM Specification -- ETSI: http://www.etsi.org/
* GSM Streaming Audio for the Web: http://itre.ncsu.edu/gsm/
* GSM World -- GSM MOU Association: http://www.gsmworld.com/
* Intel's GSM Data Knowledge Site: http://www.gsmdata.com/
* International Telecommunication Union: http://www.itu.int/
* Audio Clips in different formats: http://www.geek-girl.com/audioclips.html
* Total Telecom: http://www.totaltele.com/
* Universal Mobile Telecommunications: http://www.umts-forum.org/