A Multiprocessor Implementation for the GSM Algorithm

by Jennifer C. Kleiman

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science in Electrical Engineering and Computer Science and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

May 21, 1999

© Copyright Jennifer C. Kleiman 1999. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, May 21, 1999

Certified by: Dr. Christopher J. Terman, Thesis Supervisor

Accepted by: Prof. Arthur C. Smith, Chairman, Department Committee on Graduate Theses

A Multiprocessor Implementation for the GSM Algorithm

by Jennifer C. Kleiman

Submitted to the Department of Electrical Engineering and Computer Science on May 21, 1999, in Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science in Electrical Engineering and Computer Science and Master of Engineering in Electrical Engineering and Computer Science

ABSTRACT

Telecommunications plays an important role in the computer industry, and at its core lies the digital signal processor. Many communication technologies rely principally on signal processing, and redundancy is present in most of these applications at both the system and component level. An opportunity therefore exists to optimize these technologies using parallel processing. Specifically, the DSPs in these applications may be designed in a parallel configuration to achieve higher performance and a reduction in dedicated hardware.
GSM, a mobile communications system, displays redundancy at the core processing nodes of its network as well as in its fundamental speech processing algorithm, making it an optimal choice for this implementation. This thesis describes the design methodology for this implementation and evaluates several different configurations. As a result, a new multiprocessor is proposed.

Thesis Supervisor: Christopher J. Terman
Title: Senior Lecturer, Dept. of Electrical Engineering and Computer Science

Acknowledgments

First, I would like to thank God for giving me the patience, endurance, and motivation to complete this thesis. I would like to thank my advisor, Chris Terman, for providing a wealth of knowledge and assistance throughout my thesis endeavor. I am thankful for the opportunity to have worked with him. To my parents, I would like to give my deepest gratitude and honor. They have been a constant source of love, support and encouragement since the moment I came to MIT. I thank them for the sacrifices they made in order to provide me with an excellent education. "Keep Going"

Contents

1 Introduction 7
  1.1 Background ......................................... 7
  1.2 Digital Signal Processors .......................... 8
  1.3 Network Architectures .............................. 9
2 Technologies of Interest 12
  2.1 ADSL ............................................... 12
  2.2 MPEG ............................................... 16
  2.3 GSM ................................................ 19
3 The Bulk DSP Architecture 25
  3.1 Parallel Computing ................................. 25
  3.2 Processing Elements ................................ 27
  3.3 Memory Hierarchy ................................... 28
  3.4 GSM Implementation ................................. 29
  3.5 Encoding and Decoding .............................. 30
4 Scheduling 33
  4.1 Design Methodology ................................. 33
  4.2 GSM's Computational Modules ........................ 34
  4.3 Evaluation of Architectures ........................ 37
5 Conclusion 44
  5.1 Summary ............................................ 44
  5.2 Future Work ........................................ 45
  5.3 Final Thoughts ..................................... 46
A Software Tools Used 48
B GSM Resources on the Web 49

List of Figures

2-1 ADSL Network Connection ............................. 14
2-2 MPEG Encoding Algorithm ............................. 18
2-3 GSM Network Architecture ............................ 20
3-1 GSM Encoding Algorithm .............................. 30
4-1 GSM Decoding Algorithm .............................. 35
4-2 Long Term Synthesis Filtering Module ................ 36
4-3 Preliminary Building Block of Bulk DSP .............. 38
4-4 Segment of Schedule for Proposed Architecture ....... 40
4-5 Final Bulk DSP Architecture ......................... 42

List of Tables

4.1 Computational Module Parameters ..................... 37
4.2 Comparison of Results with Best Architecture ........ 41
4.3 Final Results ....................................... 43

Chapter 1

Introduction

This thesis identifies the utility of a parallel multiprocessor for use in communication technologies. In order to demonstrate this, several applications are explored, and the basic architecture of the new parallel multiprocessor is presented. A specific technology is then chosen as a vehicle to implement this multiprocessor, and the design process is described. In addition, several architectures are analyzed based on experiments done utilizing a set of software modules, and the performance results are presented. Finally, some conclusions are drawn for the final implementation.

1.1 Background

Virtually everyone today uses some type of telephone service. Whether it is a 'plain old telephone', a wireless cellular phone, or a modem in a computer, people depend heavily on telephone systems as their primary means of communication. By the year 2001, it is estimated that there will be 1 billion phone lines worldwide and 580 million cell phone subscribers. Communication, however, does not just encompass spoken conversation between people.
Technological advances have enabled communications, or telecommunications, to also include transmitting sound, video, and digital data across telephone lines, radio frequencies, and cable lines. Because these types of communication have expanded so rapidly, there are now telephone systems in almost every area of the world. Even so, there remains a large demand for global connectivity; the most prominent example of this today is the Internet. This demand drives telecommunication technologies to provide an infrastructure, comprising the hardware, software, and network topology, to further connect people around the world while delivering as much information as possible. In addition, these communication systems need to service millions of customers simultaneously and efficiently.

1.2 Digital Signal Processors

Communication technology relies on specialized hardware to perform digital signal processing computations. In general, the computations performed make up part or all of a signal processing algorithm. Specifically, the numerous computations include receiving, decoding, directing, encoding, and transmitting data. All of these operations are done digitally. The benefits reaped by communication applications from operating in the digital realm include reliability in the transmission process as well as compression in data size, which makes transmission faster and easier. These procedures are carried out on small portions of the data, known as frames, on each of the subscriber lines in order to achieve "seamless" point-to-point communication. The hardware components that perform these functions are known as digital signal processors (DSPs).
These processors have been designed to handle complex arithmetic computations precisely for speech and data processing applications. Digital signal processing applications tend to be repetitive in nature. Many signal processing functions rely on the execution of the same computation on each byte of data in order to complete the desired operation. In fact, many of these algorithms consist of a relatively small set of instructions that are executed over and over again in loops. Because of this inherent repetition, the overall signal processing algorithm can be broken down into smaller computing modules, which are then applied repeatedly throughout the algorithm.

1.3 Network Architectures

The network topologies of these communication systems vary somewhat depending on the particular type of data they transmit and the communication medium used. Although the current trend is to build networks that can transmit all types of data, there still exist several media over which to transmit this data (i.e., physical lines vs. wireless). In general, these networks are built on a backbone network that connects everything in the network. There are also several central locations in the network, which bring together all the separate "channels", or subscriber lines, in order to process the data. Usually, these channels carry multiple simultaneous calls, and thus these central locations receive and transmit hundreds of channels at any given time. Moreover, the central office or location is equipped with many computers whose only task is to process these incoming and outgoing channels of communication. Because all the channels come through a central office, the majority, or bulk, of the processing in the network occurs at these central locations. The other major components of the network infrastructure handle other operations such as transmitting and receiving the channels.
Since the central offices control much of the computation in a communication system, they can be considered the "core processing nodes" of the network. The latency at these core processing nodes, however, is much larger than anywhere else in the network, and as a result much of the total computation time is spent here. Therefore, the efficiency of these nodes plays a significant role in overall network performance. As stated, much of the processing in many communication networks occurs in central locations. Present-day infrastructures tend to use a single DSP for a small number of channels routed through these points. Since there are usually many channels transmitted through these locations, there are many DSPs located at these spots. Ideally, a single DSP chip would process a large number of the channels at these locations, which can vary anywhere from 100 to 1000 depending on the technology. This would immensely reduce the number of processors in the network. Logically, the next step towards achieving such an improvement is to design a DSP or multiprocessor that could process much more than just a few channels. By taking advantage of the instruction-level parallelism (ILP) among the channels, a parallel computation architecture such as a SIMD (Single Instruction stream, Multiple Data stream) vector processor might be used to design a more efficient implementation. SIMD refers to an architecture in which the same instruction is executed by multiple processors on different data streams. In particular, for communication applications, a multiprocessor or 'Bulk DSP' might be designed to contain many simple processors, such that each processor executes the same instruction or set of instructions in parallel. This design would enable the Bulk DSP to control and process many channels. The reduction in hardware would result in significant cost savings for the given infrastructure.
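As an illustrative sketch only (not a design from this thesis), the SIMD idea can be mimicked in NumPy: one vectorized operation plays the role of a single instruction applied across many independent channel data streams at once. The channel count and frame size below are arbitrary example values.

```python
import numpy as np

def apply_gain_simd(frames, gain=0.5):
    """One 'instruction' (a scalar multiply) applied to every channel's
    frame simultaneously, SIMD-style. frames has one row per channel."""
    return gain * frames

# 1000 channels, each holding a 160-sample frame (e.g., 20 ms at 8 kHz)
frames = np.ones((1000, 160))
scaled = apply_gain_simd(frames)
```

In a Bulk DSP, each row would instead belong to one of N simple processors, all executing the same instruction stream on their own channel.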
In addition, if successful, the Bulk DSP would exceed the performance of existing communication networks, thereby meeting the growing demands of communication technologies.

Chapter 2

Technologies of Interest

Of the many communication and digital signal processing applications, the three that will be considered are ADSL (Asymmetric Digital Subscriber Line), MPEG (Moving Picture Experts Group video compression), and GSM (Global System for Mobile Communications). These technologies were chosen based on their significance to the communications world as well as the similarities in their computation structure. They each embody several key characteristics that demonstrate the utility of a Bulk DSP as their core processor.

2.1 ADSL

Asymmetric Digital Subscriber Line refers to a communication technology implemented on the copper telephone lines found in all homes and businesses. ADSL is an enhanced version of the basic phone line that provides much more data bandwidth to the subscriber. It transfers voice, data, and video at significantly increased data rates. The main attraction driving this technology is a faster Internet connection. A salient detail of this technology lies in the data rates transmitted. Much more bandwidth is sent downstream (as from the Internet to one's computer) than in the reverse direction; thus the name Asymmetric DSL. In fact, ADSL communication can theoretically reach data rates of 8 Mbps (megabits per second) downstream and 1 Mbps upstream. ADSL exists on the same twisted-pair wire as the telephone line and transmits simultaneously with existing phone services without interruption or interference. Because of this, no new phone lines need to be installed to implement ADSL, making it an attractive choice of communication. Once companies start deploying ADSL, customers need only acquire an ADSL modem for their computers and subscribe to ADSL service.
The ADSL network consists of the central office that contains the ADSL Modem Rack, the phone line connection, a POTS (Plain Old Telephone System) splitter, and the user-end ADSL modems. Figure 2-1 shows the details of the core network and how it interfaces with the actual phone lines and other types of networks [2]. As shown by the diagram, the central office in the network manages all the communication via phone lines to and from personal computers and corporate networks. When a call comes into the central office, it is first passed through a POTS splitter, which "splits" the call into voice and data signals and then directs them to the appropriate device. A voice call will proceed to the Public Switched Telephone Network while a data call will go to the ADSL Modem Rack. The Modem Rack consists of many line cards, or ATU-Cs (ADSL Transceiver Unit - Central Office), and is the key device of interest since it uses digital signal processors. The ATU-C receives data from the access module and converts the data into analog signals. It also receives and decodes data from customers sent by the ATU-R (remote or end-user modem). Presently, each ATU-C can accommodate up to 3 ADSL circuits, which means it can serve up to 3 individual phone lines. These ADSL circuits are implemented with several integrated circuits such as a core DSP to perform Discrete Multi-Tone (DMT) technology functions, a line driver/receiver, a general-purpose DSP, and an ASIC to perform all the analog and mixed-signal operations as well as run the modem configuration software.

Figure 2-1: ADSL Network Connection

The ADSL technology rests largely in its transmission methodology.
As mentioned, ADSL transmits far more information downstream than it does upstream; upstream and downstream refer to different "channels" which transmit information at different frequencies. These channels are created by Discrete Multi-Tone (DMT) modulation. This technique falls under the category of Frequency Division Multiplexing (FDM) and is a multi-carrier modulation technology. Basically, it takes a band of frequencies (input data) and divides it into separate "channels" so that the channels have the same bandwidth but a different center frequency. This allows the channels to be coded individually and independently of each other. The DMT transmitter relies on the efficiency of the IFFT (Inverse Fast Fourier Transform) to create these channels while the receiver uses the FFT (Fast Fourier Transform) to perform the complementary operation. This transform pair represents a key digital signal processing technique used by the ADSL technology. DMT uses the band from 26 kHz to 134 kHz for the upstream channel and 138 kHz to 1.1 MHz for downstream. DMT reserves a 4 kHz band (0 to 4 kHz) for POTS to accommodate the ordinary phone line on the same copper wire [14]. ADSL modems also implement error correction algorithms in order to reduce errors that occur on a network line, such as impulse noise or continuous noise coupled into a line. These operations, performed by DSPs, combine the channels into blocks and use error correction codes on each block of data. This method allows for effective and accurate transmission of both data and video signals on the wire. The ADSL network implements a high-speed transmission technology on normal copper phone lines. Because it uses the phone lines, it does not require much equipment from the customer and is easy and inexpensive to use. In addition, it meets or exceeds customer requirements with respect to Internet access. Examination of the ADSL network reveals the importance of DSPs to the basic functionality of this technology.
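The IFFT/FFT transform pair at the heart of DMT can be illustrated with a toy round trip. This sketch omits everything a real ADSL modem adds (QAM constellation mapping, a cyclic prefix, bit loading), and the per-tone values are arbitrary:

```python
import numpy as np

# Arbitrary per-tone symbols: one complex value per DMT sub-channel.
tones = np.array([1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j,
                  1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j])

# Transmitter: the IFFT combines all tones into one time-domain signal.
signal = np.fft.ifft(tones)

# Receiver: the FFT is the complementary operation and recovers each tone.
recovered = np.fft.fft(signal)
```

Because the FFT and IFFT are exact inverses, the receiver recovers every sub-channel's symbol intact, which is what lets DMT code each channel independently.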
In fact, DSPs coordinate and perform the main computations in the transmission technology. These DSPs are found on the ATU-C located at the central office in the ADSL network. Because the ATU-C contains up to 3 ADSL circuits, and in turn each ADSL circuit has at least 2 DSPs, the minimum number of DSPs on each ATU-C is 6. This corresponds to 2 DSPs for each phone line. Presently, 560 million copper phone lines exist worldwide. Therefore, if each of these phone lines subscribed to ADSL, over a billion DSPs would be needed in this network alone!

2.2 MPEG

Although this next application is not the same type of communication technology as the other two studied, it exhibits some of the same characteristics, such as the need for many repetitive DSP computations. This is the MPEG (Moving Picture Experts Group) standard, which describes a compression technology for video. MPEG compresses video data into a smaller format so that more information can fit on a storage disk or more data can transfer across a network. The compression ratio achieved with MPEG ranges from 8:1 to 30:1, depending on the complexity of the video. One of the most popular applications today that employs MPEG compression is DVD (Digital Versatile Disk). This technology stores video information on DVDs, which serve a role similar to VHS video tapes and are played on special DVD players (like VCRs). Each disk can hold up to 17 gigabytes of information. That's a lot of data! A second, very popular application that employs MPEG compression is video conferencing. This application relies heavily on the transmission of data across different types of networks. Thus, using the MPEG algorithm to compress data enables video conferencing applications to achieve real-time point-to-point communication. With MPEG compression, other technologies such as High Definition Television (HDTV) can be transmitted at 24 frames per second, while movies and live broadcasts transmit at 30 frames per second, in order to produce high-quality pictures.
An update to the standard, MPEG-2, adds the functionality to transmit high-quality broadcast video. The main difference between the standards is the data rate at which they can transmit video sequences. The MPEG-1 standard targets a data rate of 1.5 Mbps, which can be carried over most transmission links that support the MPEG format, namely the Internet, cable networks, and ADSL networks; MPEG-2 transmits at a data rate of 4-8 Mbps. MPEG-2 supports a broader range of applications, including digital TV and coding of interlaced video, while retaining all of the MPEG-1 syntax and functionality. The MPEG compression algorithm depends largely on motion compensation and estimation. A block diagram of the algorithm is shown in Figure 2-2 [2]. It first takes low-resolution video and converts the images to YUV space. In this domain, the U and V (color) components can be compressed to a greater degree than the Y component without affecting the picture quality. Video pictures characteristically do not contain much movement, and in many cases the movement can be predicted if done in an intelligent manner. MPEG compression does this prediction with estimation and interpolative algorithms. Specifically, these techniques perform inter-frame coding, which means motion is predicted from frame to frame in the temporal direction. The MPEG video stream consists of three types of frames. These frames are defined based on whether their spatial or temporal redundancy is eliminated. They are also grouped together to form GOPs (Groups of Pictures), which make up the MPEG bit stream. The I (Intra-coded) frames are coded by eliminating spatial redundancy using a technique derived from JPEG compression and serve as a reference point for the sequence. The I frame originates as a sequence of raw images that are then split into three 8x8 blocks of pixels (one block for luminance and the other two for chrominance).
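The transform applied to these 8x8 blocks can be sketched directly from the DCT-II definition. This naive version is for clarity only; real encoders use fast factorizations rather than the O(N^4) loop below, and the flat test block is an arbitrary example:

```python
import math

N = 8

def dct2d(block):
    """Naive 2-D DCT-II of an NxN pixel block, as applied to MPEG 8x8 blocks."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            cu = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
            cv = math.sqrt(1.0 / N) if v == 0 else math.sqrt(2.0 / N)
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = cu * cv * s
    return out

# A flat (constant) block concentrates all its energy in the DC coefficient,
# which is why DCT output quantizes and entropy-codes so compactly.
flat = [[10.0] * N for _ in range(N)]
coeffs = dct2d(flat)
```

For the flat block, only `coeffs[0][0]` is nonzero; smooth image regions behave similarly, leaving mostly near-zero coefficients for the quantizer to discard.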
These blocks then pass through a Discrete Cosine Transform (DCT), are quantized, and finally proceed through an entropy encoder, which transforms the images into an MPEG bit stream (see Figure 2-2). The second type of frame is the P (Predictive-coded) frame, which is coded using motion estimation and depends on the preceding I or P frame. In addition to the motion estimation and compensation operations, P frames require a DCT computation as well. The DCTs performed on I and P frames serve to eliminate the spatial redundancy found in these frames. Finally, the B (Bidirectionally predictive-coded) frames are predicted based on the two closest P or I frames and are the smallest frames in the sequence. Although this type of coding exploits similarities with future images and reduces temporal redundancy, it still introduces a large delay in the overall algorithm [5].

Figure 2-2: MPEG Encoding Algorithm

Because images are not compressed as a single frame, an MPEG bit stream usually consists of thousands of blocks, which together represent a single image. In essence, these blocks are just smaller images and are encoded as described above using the MPEG algorithm. Consequently, the process of compressing an image is a highly repetitive one, since the same operations are executed on each block. In addition, these computations are independent of each other and require digital signal processing. Specifically, DSPs are used to compute DCTs as well as the other signal processing operations required by the MPEG algorithm. A Bulk DSP implementation for this algorithm could reduce the overall compression time by performing many of these operations in parallel.

2.3 GSM

Mobile communications technology has undergone a major change in the last several years. The mobile or cellular world has moved from the analog to the digital domain.
Previously, cellular phones used a strictly analog protocol to transmit signals. However, with the increased number of cellular users, the push for faster data rates, and the need for better service, cellular technology has moved into the digital realm. GSM (Global System for Mobile Communications) is a digital cellular radio network. It has been widely used in Europe for several years and is gaining popularity in the US. GSM implements the Personal Communication System (PCS), which delivers more than just a wireless phone service. PCS incorporates the transfer of calls, voice mail, and other data transfers anywhere, anytime. In fact, each GSM phone has a personal identifier, which is unique to the phone and identifies it on the GSM network from any location. PCS also includes the ability to connect a GSM phone to a laptop or other computer in order to send and receive faxes and email, or to connect to the Internet. GSM has stepped to the forefront in mobile communications and provides its services in over 200 countries worldwide [9]. The GSM network architecture consists of three main functional entities that interface with each other to provide end-to-end communication. The subsystems are the Base Station Subsystem, the Network and Switching Subsystem, and the Operation and Management Subsystem (see Figure 2-3).

Figure 2-3: GSM Network Architecture

GSM subscribers connect to the GSM network via a radio link from their phone (the Mobile Station) to the Base Station Subsystem (BSS). The BSS is actually composed of multiple Base Transceiver Stations (BTS) and a Base Station Controller (BSC). The BTS includes all the transmission and reception equipment, such as the antennas and transceivers, needed to conduct the radio protocols and signal processing over the radio link.
The BSC controls the set of BTSs in its service area and handles radio-channel setup, frequency administration, and handovers for each call. In addition, the multiplexing of speech data is performed by the Transcoder and Rate Adaptation Unit (TRAU), which is located at either the BTS, the BSC, or the MSC (Mobile Service Switching Center), depending on the configuration. The BSS interfaces with the Network and Switching Subsystem; specifically, the Base Station Controller connects to the main component of the Network and Switching Subsystem, the MSC. The MSC manages communication with other fixed telecommunication networks such as ISDN (Integrated Services Digital Network) and the PSTN (Public Switched Telephone Network), and it also performs the paging, resource allocation, location registration, authentication, and encryption functionality required to service a mobile subscriber. Finally, all the equipment in the BSC and the Switching System connects to the Operational and Maintenance Center (OMC), which handles the operation and maintenance of GSM equipment and support of the operator network interface. The OMC performs mostly administrative functions, such as billing, within a region. Depending on the size of the network, there may be only one OMC in a country, in which case the OMC is responsible for the network administration of the entire country [13]. GSM allocates a range of frequencies to a GSM system and divides that band of frequencies into individual simultaneous data channels. Each GSM system has a bandwidth of 25 MHz, which allows for 124 carriers with a bandwidth of 200 kHz each. There are 8 users per carrier and, as a result, approximately 1000 total speech or data channels. The maximum speech rate on a channel (known as full-rate speech) is 13 kbps (kilobits per second) and the maximum data rate is 9.6 kbps. GSM's main purpose is to transmit information (either speech or data) reliably in wireless form from one location to another.
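The channel arithmetic above can be checked in a few lines. Note that 25 MHz divided into 200 kHz slots gives 125, so the quoted figure of 124 carriers implies one slot is held back (e.g., as a guard band); that reservation is my inference rather than a statement from the text:

```python
band_hz = 25_000_000         # total bandwidth of one GSM system
carrier_hz = 200_000         # bandwidth per carrier
users_per_carrier = 8        # TDMA users sharing each carrier

slots = band_hz // carrier_hz            # 125 raw 200 kHz slots
carriers = slots - 1                     # 124 usable carriers (one reserved)
channels = carriers * users_per_carrier  # "approximately 1000" channels
```

The result, 992 channels, matches the text's "approximately 1000 total speech or data channels."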
The following explanation will describe full-rate speech transmission in order to highlight the main details of GSM. The process begins when the mobile station (the GSM phone) receives an audio signal (speech) through a microphone. This signal must first be converted from an analog to a digital signal before processing begins. This occurs by first filtering the signal so that it only contains frequency components below 4 kHz. This bandwidth characterizes baseband voice signals and is the minimum necessary to accurately recognize a voice. Once filtered, the signal is sampled at a rate of 8000 samples per second (8 kHz), which corresponds to the minimum sampling frequency needed in order not to lose any information. As the signal is sampled, it is quantized into 13-bit words. Thus, the output of this analog-to-digital converter is a bit stream of 104 kbps (13 x 8000), which then becomes the input to the GSM speech codec. The speech codec's job is to reduce this data rate to a size more appropriate for radio transmission. In essence, it removes all the redundant information in the data stream. The codec uses the Linear Predictive Coding (LPC) and Regular Pulse Excitation (RPE) algorithms to perform this function and executes at a bit rate of 6.5 kbps. GSM's codec collects segments of the data stream every 20 ms and produces speech frames of 260 bits every 20 ms. This corresponds to a speech rate of 13 kbps. From there, the data is transmitted via the radio link to the Base Transceiver Station. The next step in the process occurs at the Base Station, where the BTS receives the signal and proceeds to recover the modulation.
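The sampling and framing arithmetic in this paragraph can be verified directly:

```python
sample_rate_hz = 8_000       # sampling rate after the 4 kHz low-pass filter
bits_per_sample = 13         # quantization word width

# A/D converter output rate feeding the speech codec: 13 x 8000
adc_rate_bps = sample_rate_hz * bits_per_sample

# Codec output: one 260-bit speech frame every 20 ms
frame_bits = 260
frame_period_s = 0.020
codec_rate_bps = frame_bits / frame_period_s
```

This reproduces the two figures in the text: a 104 kbps stream into the codec and the 13 kbps full-rate speech stream out of it, an 8:1 reduction.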
The essential part of the GSM technology depends on digital signal processing to encode and decode bits of information into the GSM format. Specifically DSP processors perform the speech, modem, and channel coding, as well as decoding operations. The DSP operation of interest computes the encoding and decoding part of the GSM algorithm and is performed by the GSM transcoder or codec. There are many of these DSPs in the network located in the mobile station as well as in the BTS, BSC, or MSC depending on the network configuration. The BSC and MSC can be considered central processing locations in the GSM infrastructure since most of the phone calls are routed through these units. Effectively the transcoders here encode and decode individual phone calls where the likely configuration is one transcoder per channel, thus operating on one channel at a time. As with the ADSL network, GSM uses DSPs to perform key processing operations on each channel of communication. So again there is a central place in the network where "bulk processing" occurs and at which repetitive computations are executed among its processors. The ADSL, MPEG, and GSM technologies share similarities at two different levels; first in their governing algorithms and second, in their system architectures. At the algorithm level, they each require a lot of digital signal processing which characterizes most of their computations. As stated, DSP operations tend to be composed 23 of a relatively small set of instructions, which are executed repetitively. Thus, each DSP operation in the algorithm can be treated as a separate computational module. If the algorithm is subdivided into these modules then the steps of the algorithm are easily identified and an instruction level parallelism (ILP) results. GSM, ADSL, and MPEG exemplify this level of parallelism in their algorithms. The second similarity exists at the system level and also exemplifies a type of ILP. 
Each technology operates on multiple independent data streams in parallel; thus, there is an inherent repetition among the computations performed. The system architectures of each technology dedicate multiple processors to work on the data even though they are all essentially doing the same thing. Therefore the utility of a Bulk DSP is evident. This multiprocessor could take advantage of this inherent repetitive scheme by increasing computing power and thereby exceeding the performance of modern-day microprocessors. Indeed, one would assume that if the DSP were designed with N simple processors, the improvement in performance would equal that of N present-day DSP processors. However, if the Bulk DSP were designed to optimize a particular algorithm, one could imagine exceeding a factor-of-N improvement in performance with the use of N processors. Therefore, a single application has been chosen and a Bulk DSP designed in an effort to achieve this type of improvement. Chapter 3 The Bulk DSP Architecture 3.1 Parallel Computing Parallel processing describes a style of computation suited for applications which exhibit some type of parallel algorithmic behavior. These applications usually consist of small computational modules that are reused throughout the algorithm. Given that there is a set number of transistors available to design a parallel multiprocessor, the question is how best to utilize these transistors to maximize performance. In order to answer this question, performance-critical aspects of the algorithm must be considered; they include the amount of data processed, the load balance among the computational modules, the parallel structure, the distribution of data, and the spatial and temporal access patterns to memory of the algorithm [10].
These factors determine the design parameters of the multiprocessor, such as the architecture of the simple processing elements, the allocation of memory resources, the communication protocol, and ultimately the number of simple processors. With the exception of the communication protocol, all of these parameters were considered in this design process. Bulk DSP is basically a subset of this type of processor architecture. Bulk DSP aims at connecting many simple processors on a single chip instead of designing one large complex processor. The gain in performance stems from using these simple processors in a parallel structure. Bulk DSP differs from modern-day microprocessors in that its basic building block consists of simple hardware and a reduced instruction set. Unlike Intel's Pentium processor, which incorporates branch prediction and multiple instructions per clock cycle, Bulk DSP relies on a simple set of instructions using a RISC-like structure [11]. Also, in contrast to the Pentium, the Bulk DSP does not include an extensive memory hierarchy. The memory components of the Bulk DSP consist of a simple instruction cache and data cache. The instruction cache does not have to be large due to the small number of instructions utilized by the algorithm; the main part of memory is dedicated to the data cache, which serves as a buffer memory between the modules of computation. This architecture also differs from another type of modern-day processor called IRAM (Intelligent RAM), which was designed at the University of California, Berkeley [12]. This processor relies on the ability to place 1 billion transistors on a single chip, made possible by advances in integrated circuit technology. In having such a large transistor budget, IRAM is able to allocate a large portion of its transistors to memory, specifically on-chip DRAM. Its main purpose is to diminish the gap between microprocessor performance and the latency of main memory accesses.
Although the Bulk DSP would also rely on being able to integrate a large number of transistors on a chip, the Bulk DSP allocates these resources to computing or processing power rather than memory. Instead of dedicating 80% of the transistor budget to memory as IRAM does, the Bulk DSP might dedicate this percentage to processors. Bulk DSP applications require a lot of arithmetic computations and thus would benefit from more processing power than memory. Both of these processors, the Pentium and IRAM, are beneficial for certain classes of applications. The Pentium is designed for general-purpose applications that don't necessarily exemplify a specific type of algorithmic structure, while the IRAM targets memory-intensive applications such as database and multimedia programs. The Bulk DSP targets neither of these areas; rather, it aims at improving applications which require a lot of parallel signal processing. Thus, in comparison, an architecture such as the Bulk DSP would be more advantageous in performance than a Pentium or IRAM processor for this class of parallel applications. Additionally, the Bulk DSP does better from a cost standpoint; the cost of many simple processors on a chip is less than the cost of a lot of DRAM or other specialized hardware characteristic of modern-day microprocessors. 3.2 Processing Elements The architecture and organization of the Bulk DSP's simple processors model the processing elements used in SIMD parallel processing. A SIMD multiprocessor usually contains many simple processors called processing elements and a single control unit with only one instruction and data memory resource. These processing elements are characterized by their simplicity. Their main function is to execute the instructions given to them by a control unit that distributes the instructions to all processors. The ILP present in these programs implies that short instruction sequences will be carried out in parallel.
Because each simple processor only carries out the given instructions, it contains minimal control logic. In fact, these simple processors have a RISC architecture, which basically just fetches an instruction and data, executes the assigned computation, outputs the new data, and fetches the next instruction in a continuous cycle. Essentially, these simple processors contain only basic hardware components and do not require a lot of complexity; therefore, they are inexpensive and easy to replicate on a chip. The number of simple processors used in this Bulk DSP will be discussed in the scheduling section. 3.3 Memory Hierarchy The memory resources of the Bulk DSP play a large part in the design process. For this multiprocessor, a single 2KB (kilobyte) cache for both instructions and data will be allocated to each simple processor. The caches will be subdivided into 256-byte sections, which can be designated as either instruction or data memory. The instruction size of the computing module(s) in each processor determines the portion of the cache used for instructions, and the number of input and output bytes for each module determines the amount used for data cache. Because the algorithm will be subdivided into computational modules among the processors, data will need to propagate from one processor to another. This means that each processor will have to both read data from and write data to other processors' memories. This data movement can be set up in such a way that it occurs in the "background." Consequently, the processors will not have to wait for data and no cycles will be wasted on data movement. This concept is enabled by the buffer memories between the computing modules. For each processor, there will essentially be four buffer memories: two on the input "side" and two on the output "side."
One buffer memory on each side will be dedicated to the current set of data being processed; this gives the processor a place from which to read current data and another place to write out current data. The other two memories associated with each processor are for other processors to write to or read from while the processor associated with those buffer memories is busy working on the current set of data. The remainder of this investigation explores how to best implement a Bulk DSP. GSM serves as an excellent application for the Bulk DSP and will be the main application of the designed processor. This is due to several reasons. First, the Base Transceiver Station in the GSM network acts as a core processing node at which many DSP computations take place. Second, a software library implementing the GSM algorithm was found and proved useful for this investigation. Third, GSM is a popular mobile cellular system which has gained acceptance around the world, thereby making it a very relevant and useful technology to explore. 3.4 GSM Implementation Because GSM is a cellular phone network, human speech encompasses the majority of the information transmitted across the network. As mentioned, the speech compression algorithm used in GSM is the Regular Pulse Excitation-Long-Term Prediction (RPE-LTP) algorithm specified in the GSM 06.10 standard [9]. A block diagram of the encoding algorithm is shown in Figure 2-1. This algorithm is executed in the GSM codec and serves as its primary functionality. The input frames to the codec consist of 160 signed 13-bit linear PCM values sampled at 8 kHz. They come from either the audio part of the mobile station or from the PSTN. These frames last for 20ms and thus cover about one glottal period of a very low voice or 10 periods of a very high voice.
Because this is a relatively short period of time, the speech wave does not change much, and thus the algorithm will not lose any information by dividing up the speech signal this way.

[Figure 2-1: GSM Encoding Algorithm. Block diagram showing the pre-process stage, short-term LPC analysis producing the log-area ratios, short-term analysis, long-term (LTP) analysis, and RPE analysis on the input signal s[0..159]. Signal labels: (1) short-term residual, (2) long-term residual, (3) short-term residual estimate, (4) reconstructed short-term residual, (5) quantized long-term residual. Outputs include the LARs, the RPE parameters and grid, and the LTP parameters.]

The encoder divides the input speech samples into a short-term predictable part, a long-term predictable part, and the rest into the residual pulse. Then, it encodes and quantizes the residual pulse and the parameters for the two predictors. The decoder applies the long-term prediction to the residual pulse in order to reconstruct the speech and then passes the filtered output through the short-term predictor [6]. 3.5 Encoding and Decoding The first step in the encoding algorithm consists of preprocessing the samples to produce an offset-free signal and then passing them through a first-order preemphasis filter. The resulting 160 samples are then analyzed to determine the coefficients of the short-term analysis filter. This short-term linear-predictive filter (LPC analysis) is the first stage of compression and the last stage of decompression. The speech compression in this algorithm is achieved by modeling the human-speech system with two filters and an initial excitation, of which LPC is the first filter. In this process, the short-term filter acts as the human vocal and nasal tract such that, when excited by a mixture of glottal wave and noise, it produces speech that is hopefully similar to the speech being compressed.
This is done by using the set of coefficients determined from the preprocessed signal, together with the 160 samples, to produce a weighted sum of the previous output, which is termed the short-term residual signal. In addition, the filter parameters, named the reflection coefficients, are transformed into log-area-ratios (LARs) before transmission since they will be used for the short-term synthesis filter in the decoder. The next stage in processing involves the long-term analysis, where the main computation is the long-term prediction filter. Before filtering, the speech frame is subdivided into 40-sample blocks of the short-term residual signal. Also, the parameters of the long-term analysis filter, the LTP lag (which describes the source of the copy in time) and the LTP gain (a scaling factor), are estimated and updated in the LTP analysis block. Both of these prediction parameters are calculated based on the current sub-block and the previous 120 reconstructed short-term residual samples. With these parameters, an estimate of the short-term residual signal is found via the long-term prediction filter. Then, in the last stage of this section, the estimated short-term residual signal is subtracted from the actual short-term signal to produce the long-term residual signal. With each 40-sample sub-block iteration, 56 bits of the GSM encoded frame are produced. The resulting 40 samples of the long-term signal are then passed to the regular pulse excitation analysis for the primary compression operation of the algorithm. Here, each sub-segment of the residual signal is filtered by an FIR (Finite Impulse Response) algorithm and then down-sampled by a factor of 3. This results in four candidate sequences of length 13 each. The sequence with the most energy is chosen, and its 13 samples are quantized by block-adaptive PCM (APCM) encoding. The result is passed on to the decoder along with a 2-bit grid selection that identifies the chosen sequence.
Lastly, the encoder updates the reconstructed short-term residual in order to prepare the next LTP analysis. In summary, the speech codec, or encoder, compresses an input of 160 samples into an output frame of 260 bits every 20ms. Therefore, one can see that one second of speech equals 1625 bytes and one megabyte of compressed data holds about 10 minutes of speech [6]. The decoder mirrors many of the encoding computations. Decoding occurs when a call is received from the PSTN or from the Mobile Station (the cellular phone) at the Base Station. The decoding algorithm begins by multiplying the 13 3-bit samples by the scaling factor and expanding them back into 40-sample sub-blocks. This residual signal passes through the long-term predictor, which consists of a feedback loop similar to the one in the encoder. The long-term synthesis filter takes 40 samples of the previously estimated short-term signal, scales them by the LTP gain, and adds them to the incoming pulse. This new short-term residual becomes part of the source for the next three predictions. In addition, these samples are applied to the short-term synthesis filter, which uses the reflection coefficients calculated by the LPC module. Finally, the de-emphasis filter processes the samples, whose output should resemble the original speech signal. Chapter 4 Scheduling In an effort to design a Bulk DSP that will optimize the performance of the GSM algorithm, different organizations of the algorithm's computational modules were arranged and considered for the building block of the multiprocessor. These architectures differ based on the number of computing modules grouped together in a single processor and the schedule in which the modules are executed by the control unit. Changing the architecture based on these parameters allowed the designer to explore the parallel structure already present in the GSM algorithm.
4.1 Design Methodology At first, the "best architecture" for this Bulk DSP might seem to be a set of simple processors with a fixed memory resource, each assigned to process the entire decoding algorithm. This architecture results in each simple processor working on an entire frame at a time. The benefit of this solution is that each processor will continuously process data. The only exception, or idle time incurred, would occur the first time an instruction is called within a frame; this results in some memory access time to fetch the instruction into the cache. Due to the limited memory resource, the entire decoding program cannot fit into the on-chip cache, hence the idle cycles spent waiting for memory accesses. This design represents the scheme where a factor-of-N improvement is achieved by simply replicating N simple processors on a chip. However, this architecture does not take advantage of any instruction-level parallelism present in the algorithm, and therefore the idea is that a more efficient scheme using the same amount of hardware exists. Thus, given that this Bulk DSP is composed of simple processors with 2KB of memory each, what is the best organization of these resources? In order to best address this question, careful study of each computational module is required. Accordingly, a discussion of the computational modules follows. 4.2 GSM's Computational Modules The GSM algorithm consists of two main operations: encoding and decoding. Each of these operations can be easily subdivided into a set of independent computing modules. This modularity allows flexibility in organizing the simple processors. Here, we will focus on the decoding part of the algorithm in the design of the Bulk DSP. Because many of the modules are the same for decoding as they are for encoding, an approach similar to the one taken here may also serve to design a multiprocessor for the encoder.
In order to subdivide the algorithm, specific functions within the overall computation were identified (see Figure 4-1). Ten independent modules were distinguished, differing in instruction and data size. Due to the nature of the GSM algorithm, several of these modules can be executed in parallel, thus providing a means to optimize the architectural organization of the algorithm. The number of instructions executed characterizes each module. There are four important parameters that determine the above information for each module: the numbers of Loads, Stores, Arithmetics, and Shifts encountered in the instruction set of the module. A Load represents a processor reading from memory (either instructions or data); specifically, a Load fetches two bytes of information at a time.

[Figure 4-1: GSM Decoding Algorithm. Block diagram of the decoding modules, including inverse APCM quantization, RPE grid positioning, long-term (LTP) synthesis, decoding of the LARs, LAR-to-coefficient conversion, short-term synthesis, and de-emphasis, producing the output signal s[0..159].]

Stores represent the times the processor writes to memory. Arithmetics are the actual computations executed by the processor, and Shifts correspond to the computation of indexing a data array. Another aspect of these modules is the presence of loops in their structure. As noted earlier, DSP computations require many of the same operations repetitively, which accounts for the large number of loops found in these modules. An example of this is demonstrated in Figure 4-2, which shows the code for one of the computing modules, Long_Term_Synthesis_Filtering. An example of the Load (L), Store (S), Arithmetic (A), and Shift accounting is also demonstrated. So, to determine the number of instructions executed by this computational module, the sum of Loads, Stores, Arithmetics, and Shifts was calculated without regard to the loops. This information designates the number of bytes stored in the instruction cache (I-cache).
One assumption made here is that each instruction equals 4 bytes of memory; this is a typical number for most modern-day instruction sets. The second calculation counts the same parameters, but this time includes the loops. The sum of these parameters represents the total number of operations executed by the processor in a module.

    void Gsm_Long_Term_Synthesis_Filtering(
        struct gsm_state *S,
        word              Ncr,
        word              bcr,
        register word    *erp,
        register word    *drp)
    {
        register longword ltmp;
        register int      k;
        word              brp, drpp, Nr;

        Nr = Ncr < 40 || Ncr > 120 ? S->nrp : Ncr;
        S->nrp = Nr;
        assert(Nr >= 40 && Nr <= 120);

        brp = gsm_QLB[bcr];
        assert(brp != MIN_WORD);

        for (k = 0; k <= 39; k++) {
            drpp   = GSM_MULT_R(brp, drp[k - Nr]);
            drp[k] = GSM_ADD(erp[k], drpp);
        }

        for (k = 0; k <= 119; k++)
            drp[-120 + k] = drp[-80 + k];
    }
    /* 8 Loads, 3 Stores, 5 Arithmetics, 2 Shifts */

Figure 4-2: Long_Term_Synthesis_Filtering Module

Operations process two bytes of data at a time, thus processing 16-bit operands. The number of total operations will also be used as the main factor to study the relative length of time the modules require to execute all their operations. Lastly, one final calculation was done to determine the number of data bytes that enter and exit each module. Basically, the input data and output data indicate the number of bytes processed by the module, in addition to revealing the data movement to and from the modules (see Table 4.1).
    Computational Module                       In/Out    Operations   Instructions
    Gsm_APCM_quantization_xmaxc_to_exp_mant      4/2          436           53
    APCM_inverse_quantization                   30/26         745          109
    RPE_grid_positioning                        28/80         699           65
    Gsm_Long_Term_Synthesis_Filtering          328/82        3451           92
    Gsm_Long_Term_Add                         320/320         920           16
    Decoding_of_the_coded_Log_Area_Ratios       16/16         334          334
    Coefficients                                16/16         248          122
    LARp_to_rp                                  16/16         242           34
    Short_Term_Synthesis_Filtering (k=13)       50/26        6680           90
    Short_Term_Synthesis_Filtering (k=14)       54/28        7193           90
    Short_Term_Synthesis_Filtering (k=120)    240/272       61571           90
    Postprocessing                              42/26        4650           39

Table 4.1: Computational Module Parameters

4.3 Evaluation of the Architectures There are two main aspects to the design of the multiprocessor: the organization, which includes the number of simple processors and the assignment of computational modules to each processor, and the schedule of the modules, which represents the order and time at which the modules are called. For each organization, a schedule was composed and the time it would take for the organization to produce 1, 10, and 25 frames was calculated. Frames represent segments of individual phone calls. Several approaches were taken in order to reach the optimal design for the building block of this multiprocessor. The first, perhaps an obvious choice, involved assigning each module to its own processor and arranging the processors as a pipeline in the order in which the modules are executed by the algorithm. Thus, the scheduler for this organization simply called each module as the previous one completed. Because of the buffer memories associated with each processor, data movement between modules essentially occurred in the "background," and consequently, the calls to the modules could be made as soon as the previous one finished. Again, the number used as the time to complete a particular module was the total number of operations for that module. The next modification was made after observing the instruction-level parallel structure present in the algorithm.
While maintaining the one-or-two-modules-per-processor approach, the scheduler was modified to call the modules in parallel wherever possible, as shown in Figure 4-3, where 'M' denotes a module.

[Figure 4-3: Preliminary Building Block of Bulk DSP. Modules M1 through M10 arranged as a pipeline, with independent modules (for example, M5 with M9, and M7 with M10) scheduled in parallel.]

This change, of course, decreased the total computation time in comparison to the previous organizations. These designs were fairly straightforward, though it was evident that better performance could be achieved due to the large number of idle cycles incurred. The next approach involved looking more closely at the length of time it took each processor to execute its module. Unfortunately, there exists great disparity among the modules with respect to computation time. As a result, more processors were dedicated to the modules with the largest number of computations. Many of these modules are called four times per frame, so it was easy to assign four processors to one module without having to break up the module. This also seemed to be a good idea since the modules could be executed in parallel, and as a result performance improved. At this stage in the design, however, an important issue arose: how did these architectures compare to the best-case scenario? There was basically no metric used to see if the increase in performance was really significant or if it still lagged the one-processor-per-frame scheme. Up until then, it was thought that the total time it took each organization to compute 1, 10, or 25 frames could be used to compare the different organizations. But this number does not account for the number of wasted cycles incurred in each type of architecture. As stated, the idle time in the one-processor-per-phone-call scenario derives from the limitation of the 2KB cache, which does not hold the entire program for the algorithm.
Here, the minimum number of memory accesses each processor exercises equals the number of instructions in the entire algorithm. In contrast, in the schemes which hold just a few computational modules, the 2KB cache is sufficient to hold the instructions for the individual modules. However, in these architectures, there is idle time in the latency between processors, since the number of operations varies greatly among the modules. In order to determine the time for memory access, a few assumptions were made: first, each access to memory takes 20 cycles, and second, each access to memory fetches 4 instructions. Hence, the number of wasted cycles due to memory access equates to the total number of instructions multiplied by 5. The latency between modules was simply calculated from the scheduler. Due to the fact that certain modules take longer than others, it was often the case that some processors would have to wait a long time before they could start their computations because they were waiting for input from another processor. This is shown in Figure 4-4, which demonstrates a segment of the schedule for one of the proposed architectures. The numbered 'P's denote the simple processors, while the numbers represent the computation time for each module.

[Figure 4-4: Segment of Schedule for Proposed Architecture. A timeline for the simple processors (P1, P2, P3, ...) showing module computation times (among them 920, 6680, 7193, and 61571 operations), with gaps between processors indicating idle time.]

The idle time is denoted as the time in between processors. This turned out to be a major problem in all of the architectures considered thus far. The number of wasted cycles due to latency was calculated for each organization and compared to the idle time (memory accesses) of the first-pass "best architecture" to see if performance was better. Unfortunately, the added hardware was not being used efficiently and the number of wasted cycles outweighed any gain in performance. These results are summarized in Table 4-2.
Architecture 1 represents the building block shown in Figure 4-3, while Architecture 2 represents the architecture from Figure 4-4, which includes a total of 4 simple processors. The performance of the best architecture is measured based on the number of simple processors in each proposed scenario.

                                      Frames Produced    Avg. Idle Cycles/Proc
    Best Architecture (Scenario 1)          423                  4770
    Architecture 1                           96                 71526
    Best Architecture (Scenario 2)           64                  4770
    Architecture 2                           26                 56495

Table 4.2: Comparison of Results with Best Architecture

Another approach was then attempted, this time paying careful attention to the number of instructions and the amount of data for each module. The goal was to pack into one processor as many modules as would fit into the 2KB cache, in an effort to keep each processor busy all the time and thus reduce idle time. The previous considerations, such as computation time per module, were also taken into account. It was noted that grouping together modules that did not need to execute in parallel with each other worked better. Moreover, modules that shared no data dependency, such as Postprocessing and Long_Term_Synthesis_Filtering, were grouped together. This reduced the idle time for each processor. It soon became apparent, though, that no organization could eliminate the large number of idle cycles in these architectures. The problem stems from the disparity of one particular module's computation time (Short_Term_Synthesis_Filtering) in comparison to all the others (see Table 4.1). This module was broken into four sub-modules based on the four times it is called per frame. Three of these iterations had computational lengths of the same order; however, the last sub-module is nearly ten times longer than the others. This was the major source of processor idle time, since essentially all the other processors had to wait until this computation was done before the next frame could be processed. A completely new approach was needed in order to exceed the performance of the first proposed architecture.
The final iteration, necessary to produce a better, more efficient architecture than the original one, entailed completely re-thinking the approach to the problem. This time, the design was done without a fixed cache size for the processors. So instead of trying to group the modules together based on their instruction sizes, the modules were grouped sequentially, and the algorithm was divided up evenly based on the number of total operations. Four processors were chosen as the number needed to perform the complete decoding algorithm. This number was selected based on the results of the previous iterations and the number of buffer memories needed in this new scheme. Essentially, more buffer memories were required because the longest module, Short_Term_Synthesis_Filtering, was broken up across the four processors, and thus information regarding the state within a loop needed to pass from one processor to the next. In essence, the four processors executed approximately the same number of operations and required a total of 12KB of cache altogether. Figure 4-5 shows a picture of the final Bulk DSP, including all the processing elements, memory, and control units.

[Figure 4-5: Final Bulk DSP Architecture. Four processing elements (P1 through P4) with their buffer memories, shared memory, and control logic.]

The modules were assigned to the processors as follows:

    P1: M1-M8, M9
    P2: M9
    P3: M9
    P4: M9-M10

In order to accurately compare this architecture to the original one proposed, it was necessary to modify the first one to equate the hardware resources allocated to each. Thus, 3KB of cache was allotted to each processor in the original architecture, so that evaluating the performance of four of these processors would accurately compare to the performance of four processors in the final architecture. Interestingly enough, though, adding 1KB of cache to the first architecture does not really affect its number of idle cycles, since 3KB of memory is still not enough to store the entire program.
However, the cache scheme assumed for this architecture is simply a least-recently-used (LRU) method, which just removes the oldest-touched data in the cache when it needs to store new data; this is not the most efficient way to utilize a cache. So in order to significantly reduce this idle time, a new cache scheme would need to be implemented. One can imagine reducing this number by a factor of two if half of the instruction set could remain resident in the caches of the four processors. Because the percentage of idle time in the original architecture represents such a small part (4%) of the overall computation time, it is difficult to design an architecture with a significant increase in performance over the original architecture.

                          Frames Produced    Avg. Idle Cycles/Proc
    Best Architecture           64                  4770
    Final Architecture          70                   2.5

Table 4.3: Final Results

Chapter 5 Conclusion In this chapter, a summary of the work completed will be presented as well as several suggestions for future work. Lastly, some final thoughts on this investigation will conclude the thesis. 5.1 Summary Recognizing the improvements achieved with parallel processing technologies motivated the idea for a Bulk DSP. This multiprocessor is intended for technologies which demonstrate repetition at two levels: system and algorithmic. The former is characteristic of communication technologies, which generally entail some type of core processing node in their networks, and the latter of DSP technologies, which exhibit repetitive instructions in their algorithms. This investigation aimed at researching several technologies that demonstrate these characteristics and applying the concepts of parallel processing to design a multiprocessor implementation. As a result, a specific technology, GSM, was chosen and a design methodology carried out. This process was described in order to present some of the architectures designed and analyzed for this investigation.
Finally, the most prominent architecture was deemed the best implementation of a GSM Bulk DSP based on the increase in performance it would bring to the existing technology. 5.2 Future Work The final architecture proposed only succeeded in achieving marginal improvements over the factor-of-N DSP baseline. Thus, there might yet be another way to realize a larger improvement in performance. As stated, the limiting factor in utilizing the presence of multiple processors was the modules' disparity in computation length. If one modified the implementation of the decoding algorithm to reduce this disparity, the added processors might be used more efficiently. Specifically, the long sub-module of Short_Term_Synthesis_Filtering could be divided into smaller, equal portions which match the length of the other sub-modules. This would eliminate the wasted cycles due to latency between the processors in the organizations which grouped only a few modules together in an effort to take advantage of the instruction-level parallelism present in the algorithm. A second way to improve the design might be to consider incorporating into the Bulk DSP more of the overall GSM algorithm that gets executed at the Base Station. This idea is based on the fact that the encoding and decoding instruction set is just not very large and, therefore, does not require a lot of computing power. If more functionality were required of the Bulk DSP, its parallel structure and added processing power might be used efficiently to achieve greater performance. The channel encoding/decoding, as well as the error protection and encryption of the radio channel, are all parts of the GSM system that occur at the Base Station and would be likely candidates to incorporate into the Bulk DSP. Finally, the majority of the work done in this investigation occurred at a theoretical level, based on "paper calculations."
Although the results gained from these calculations are revealing and necessary, another level of investigation is needed before the multiprocessor can actually be implemented in hardware. The next step in the design is to simulate the proposed organizations at both the algorithmic level and the hardware level. The software analysis can be done by taking the best architectures proposed and using a GSM decoding library to simulate these configurations. Actual speech or GSM-encoded data can be used as test vectors to measure how long each architecture takes to process a given number of frames or data inputs. These results should confirm which architecture is indeed the best to implement. Finally, the hardware should be specified, which consists of determining the exact hardware components needed to execute GSM's decoding algorithm, along with the control logic and exact memory hierarchy to be used. Several well-known hardware description languages, such as VHDL and Verilog, can aid in this part of the design; they provide a way to describe the architecture at the circuit level as well as a method to simulate and verify its functionality.

5.3 Final Thoughts

The main conclusions drawn from this investigation are threefold.

1.) The GSM encoding/decoding algorithm is simply not complex enough to take advantage of parallel processing. It contains only a small set of instructions, which does not allow for efficient use of replicated processing power. The added processing power is essentially lost in the latency that stems from the difference in the processing lengths of its computing modules.

2.) Parallel processing involves much more than simply replicating processors on a chip. In fact, replicating N processors on a chip does not always result in a factor-of-N improvement. The effect of parallel processing largely depends on the algorithm it tries to optimize.
The number of instructions in the algorithm plays an important role, as does the schedule in which those instructions are executed. As the stated results show, many instructions are necessary in order to benefit from an increased number of processors. The optimal program or algorithm for parallel processing is one that contains a large number of instructions, including large decision trees. Such a program would actually make use of the added processors, since only a very large cache could store all of its instructions; without such a cache, cache misses would claim a significant amount of the total computation time, and adding processors would effectively fill what would otherwise be idle time.

3.) Because the Bulk DSP is essentially a parallel multiprocessor, it will only benefit technologies with "optimal" parallel properties. In this study, it was applied specifically to the GSM decoding algorithm and did not prove extremely beneficial. However, the Bulk DSP could still be the most desirable processor for the other technologies researched, such as MPEG and ADSL, since their specific algorithms were neither identified nor considered. Furthermore, a GSM implementation that incorporates more functionality might also result in a more beneficial Bulk DSP.

Appendix A

Software Tools Used

Jutta Degener and Carsten Bormann, of the Technical University of Berlin, developed the GSM 06.10 software used to research and study the encoding and decoding algorithms. The software consists of a C library as well as a stand-alone program. It was first designed for a UNIX-like environment, although the library has since been ported to VMS and DOS environments, the latter of which was the implementation used in this investigation. Several other tools were used to test and exercise the library. The code was compiled with the Microsoft Visual C++ compiler. A digital audio editor named Cool Edit was used to convert GSM files to raw PCM format in order to test the encoding modules.
Syntrillium Software developed this program.

Appendix B

GSM Resources on the Web

Many web sites containing both official and unofficial information about GSM and the telecommunications industry were found throughout this research. A list of the most helpful and relevant sites follows.

* Dr. Dobb's Journal -- a good technical explanation of GSM encoding/decoding: http://www.ddj.com/articles/1994/9412/9412b/9412b.htm
* GSM Encoding/Decoding C library -- Jutta Degener and Carsten Bormann: http://www.kbs.cs.tu-berlin.de/~jutta/toast.html
* GSM Information Network: http://www.gin.nl/
* GSMag International: http://www.gsmag.com/
* GSM Online Journal: http://www.gsmdata.com/today.html
* GSM Specification -- ETSI: http://www.etsi.org/
* GSM Streaming Audio for the Web: http://itre.ncsu.edu/gsm/
* GSM World -- GSM MOU Association: http://www.gsmworld.com/
* Intel's GSM Data Knowledge Site: http://www.gsmdata.com/
* International Telecommunication Union: http://www.itu.int/
* Audio Clips in different formats: http://www.geek-girl.com/audioclips.html
* Total Telecom: http://www.totaltele.com/
* Universal Mobile Telecommunications: http://www.umts-forum.org/

Bibliography

[1] ADSL Forum, ADSL Tutorial, available at http://www.adsl.com/adslforum.html.

[2] Analog Devices, A Fast Ramp to the Information Superhighway and How Pieces Fit Together, available at http://www.analog.com/publications/whitepapers/products/backadsl/index.html.

[3] Array Microsystems, Inc., VideoFlow Architecture, available at http://www.array.com.

[4] Baldi, Mario, and Ofek, Yoram, End-to-End Delay of Videoconferencing over Packet Switched Networks, Yorktown Heights, NY: IBM T.J. Watson Research Center, 1996.

[5] Chen, Ming-Syan, and Kandlur, Dilip D., Stream Conversion to Support Interactive Playout of Videos in a Client Station, Yorktown Heights, NY: IBM T.J. Watson Research Center, 1994.

[6] Degener, Jutta, Digital Speech Compression, available at http://www.ddj.com/articles/1994/9412/9412b/9412b.htm.
[7] DSL Knowledge Center, How Does ADSL Work?, available at http://www.orckit.com/orckitdsl_center.html.

[8] GSM MOU Association, GSM World, available at http://www.gsmworld.com.

[9] GSM 06.10 - European digital cellular telecommunications system (Phase 2); Full rate speech transcoding, ETS 300 580-2, European Telecommunications Standards Institute, March 1998.

[10] Hennessy, J., and Patterson, D., Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1996.

[11] Intel Corporation, Pentium III Processor, available at http://developer.intel.com/design/pentiumiii.

[12] Kozyrakis, C., et al., Scaling Processors to 1 Billion Transistors and Beyond: IRAM, IEEE Computer, September 1997, pp. 75-78.

[13] Mehrotra, Asha, GSM System Engineering, Norwood, MA: Artech House, 1997.

[14] Motorola, DMT Line Code, available at http://mot-sps.com/sps/General/chips-nav.html.

[15] Motorola, Echo Cancellation, available at http://mot-sps.com/sps/General/chips-nav.html.

[16] Patterson, D., et al., A Case for Intelligent RAM: IRAM, IEEE Micro, vol. 17, no. 2 (April 1997), pp. 34-44.

[17] Redl, Siegmund M., Weber, Matthias K., and Oliphant, Malcolm W., GSM and Personal Communications Handbook, Norwood, MA: Artech House, 1998.

[18] Redl, Siegmund M., Weber, Matthias K., and Oliphant, Malcolm W., An Introduction to GSM, Norwood, MA: Artech House, 1995.

[19] Tisal, Joachim, GSM Cellular Radio Telephony, West Sussex PO19 1UD, England: John Wiley & Sons Ltd, 1997.

[20] Turletti, Thierry, Bentzen, Hans, and Tennenhouse, David, Towards the Software Realization of a GSM Base Station, to appear in JSAC issue on software radios, 4th