Department of Electrical Engineering (Institutionen för systemteknik)

Master’s thesis

Design and Implementation of a DMA Controller for Digital Signal Processor

Master’s thesis in Computer Engineering carried out at Linköping Institute of Technology (Tekniska högskolan i Linköping)
by Guoyou Jiang

LiTH-ISY-EX--10/4244--SE

Supervisor: Dake Liu, ISY, Linköpings universitet
Examiner: Dake Liu, ISY, Linköpings universitet

Linköping, 12 August 2010

Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Division, Department: Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden
Date: 2010-08-12
Language: English
Report category: Master’s thesis (Examensarbete)
ISRN: LiTH-ISY-EX--10/4244--SE
URL for electronic version: http://www.da.isy.liu.se, http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-58868

Abstract

This thesis work was conducted in the Division of Computer Engineering at the Department of Electrical Engineering, Linköping University. During the thesis work, a configurable Direct Memory Access (DMA) controller was designed and implemented. The DMA controller runs at 200 MHz in 65 nm digital CMOS technology, and the estimated gate count is 26595. The DMA controller has two address generators and can provide two clock sources, so it can handle data reads and writes simultaneously. There are 16 channels built into the DMA controller, and the data width can be 16, 32, or 64 bits. The DMA controller supports 2D data access through configuration of its intelligent linking table. The DMA controller is designed for advanced DSP applications; it is not dedicated to cache, which has a fixed priority.

Keywords: direct memory access, DMA, digital signal processing, DSP, linking table, processor, peripherals, scalability, testbench, verification

Acknowledgments

This thesis is the result of master’s thesis work carried out from spring 2009 to spring 2010 at Linköping University. First of all, I would like to thank my supervisor and examiner, Professor Dake Liu, who gave me the great opportunity to do this final-year project. The thesis could not have been completed without his experience and support. Second, I would like to express my gratitude to the Ph.D. students in the Division of Computer Engineering, whose experience in digital signal processor design helped me a lot: Jian Wang, who helped me with some key issues in the design of the behavioral model; Di Wu, who introduced me to this topic; and Olof Kraigher, who helped me solve some programming problems in the C++ model.
I also want to thank He Zhang, who helped me by discussing some example applications of the design. I would also like to thank Thomas Österholm, who helped me integrate my design into the complete DSP system, and Andreas Ehliar and Johan Eilert, who gave me a lot of advice while I implemented my design as an ASIC. Last but not least, I want to express my appreciation to my parents in my hometown, Shanghai; their love and support have been unlimited throughout my entire academic career far away from home.

Contents

1 Introduction
  1.1 Scope
  1.2 Method
  1.3 Thesis Overview
  1.4 Notations
  1.5 Abbreviations
2 Background
  2.1 DMA Basics
  2.2 DMA Operations
    2.2.1 Normal DMA Operation
    2.2.2 Chain Operation
    2.2.3 Linking Table Operation
3 Application Requirements
  3.1 Application Analysis
  3.2 Requirement Specification
4 Interfaces
  4.1 Host Interface
    4.1.1 Main Status Register
    4.1.2 Main Control Register
    4.1.3 Special Memory Control Register
  4.2 Memory Interface
  4.3 Behavior model of I/O
  4.4 Task Packet Specification
5 DMA Hardware
  5.1 Host Interface
    5.1.1 Block Diagram
    5.1.2 Interface
  5.2 Source Address Generator
    5.2.1 Block Diagram
    5.2.2 Interface
  5.3 Destination Address Generator
    5.3.1 Block Diagram
    5.3.2 Interface
  5.4 Source Decoder
    5.4.1 Block Diagram
    5.4.2 Interface
  5.5 Destination Decoder
    5.5.1 Block Diagram
    5.5.2 Interface
  5.6 Transaction FSM
    5.6.1 Interface
6 Integration
  6.1 Hardware Integration
  6.2 Software Integration
  6.3 DMA Programming
    6.3.1 Initialize the DMA Controller
    6.3.2 Poll the DMA Controller
    6.3.3 Handle the DMA Interrupt
7 Verification
  7.1 Functional Verification
  7.2 Hardware Implementation
8 Conclusion
  8.1 Achieved Results
    8.1.1 DMA Benchmark
    8.1.2 Comparison
    8.1.3 Conclusion
  8.2 Future Work
Bibliography
A DMA Simulator C++ Header
B DMA Simulator C++ Code

List of Figures

1.1 DIT butterfly of Radix-2 FFT
2.1 System overview
2.2 Basic DMA operation to save processor run time.
2.3 DMA Chain operation example.
2.4 An example of DMA linking table operation.
3.1 Matrix Transposition
3.2 Transfer decomposition of Example 3.1
3.3 Transfer decomposition of Example 3.2
3.4 Neighbor Searching in Motion Estimation
3.5 Transfer decomposition of Example 3.3
4.1 DMA configuration
5.1 DMA Hardware architecture
5.2 DMA Controller Block Diagram
5.3 Block diagram of Host Interface Module
5.4 Block diagram of Source address generator
5.5 Block diagram of Destination address generator
5.6 Block diagram of Source decoder
5.7 Block diagram of Destination decoder
5.8 Finite State Machine of the control logic
7.1 DMA Functional Verification Flow
8.1 Timing diagram of basic DMA operation.
8.2 Timing diagram of linking table operation.

List of Tables

3.1 Preparing DMA for Motion Estimation
3.2 Requirement Specification
4.1 Host Interface
4.2 DMA Registers specification
4.3 Main status register specification
4.4 Main control register specification
4.5 Special memory control register specification
4.6 Memory Interface
4.7 Task packet specification
4.8 Control Vector 1
4.9 Control Vector 2
4.10 Control Vector 3
4.11 Control Vector 4 & 5
4.12 Control Vector 6 & 7 & 8
5.1 Interface of Host Interface Module
5.2 Interface of Source address generator
5.3 Interface of Destination address generator
5.4 Interface of Source decoder
5.5 Interface of Destination decoder
5.6 Interface of Transaction FSM
8.1 Synthesis Result of DMA controller
8.2 Results Comparison with and without DMA
Chapter 1 Introduction

As technology evolves, a large number of DSP applications are emerging. The demand for rich multimedia content such as HDTV and 3D display is huge. Behind this demand, there are always technologies driving the need for a better experience of electronic products. One of them is digital signal processing (DSP). DSP techniques have provided improvements in traditional signal processing applications such as audio, visual, radar, and communications [9, p.1].

The component that performs digital signal processing is called a digital signal processor. A specially designed peripheral can help the processor access memories; that peripheral is called a DMA controller. With the help of a DMA controller, the processor can spend more of its time on computation while the data transfer is in progress.

Since most memory accesses are hidden inside DSP algorithms, it is important to expose them [6]. A DMA controller is a great help with respect to both power consumption and performance.

For example, the DIT butterfly, which is the basis of the FFT algorithm, can be divided into the following steps, as shown in Figure 1.1:

1. Load two complex operands;
2. Load one complex coefficient and perform one complex multiplication;
3. Perform two complex additions;
4. Store two complex results.

This is a simple example of the memory accesses hidden in basic DSP algorithms; a more detailed discussion is presented in Chapter 3.

Figure 1.1. DIT butterfly of Radix-2 FFT

1.1 Scope

The scope of this thesis work is to design and implement a DMA peripheral for Senior, a DSP processor developed at the Division of Computer Engineering at Linköping University. The interface between the DMA controller and the DSP core had already been completed in another project [7, p.53].
The design work started with the definition of the DMA specification. Many DSP applications benefit from a technique called the linking table, which accelerates the processing of two-dimensional arrays [6, p.584]. The linking table is therefore supported in the current DMA design. To make sure the design is correct, a test bench was also developed to verify the functionality of the designed modules. Since the DMA controller has to work with the Senior DSP, the test bench was written on the basis of the Senior test bench.

1.2 Method

To design the DMA module, the specification is first defined from the requirements of the applications. Since the DMA controller is designed to meet the needs of the Senior DSP, a behavioral model of the DMA module is also added to the existing Senior instruction set simulator. Developing the behavioral model is important because it can be used not only to obtain performance benchmarks for the hardware, but also as a reference against which the actual hardware is verified. Once the behavioral model is done, the RTL implementation translates it into a hardware description language such as Verilog. After the RTL implementation is complete, the behavioral model is used as a golden reference to verify the RTL module: if both produce the same results, the RTL implementation is believed to be correct.

1.3 Thesis Overview

Chapter 1 gives a brief introduction to what this thesis is about. Some background knowledge and the basic operations of DMA are then discussed in Chapter 2. In Chapter 3, several applications are analyzed first, and the requirement specification is then derived from that analysis. The designed DMA controller has to work together with its host, the Senior DSP, so Chapter 4 describes the interfaces and registers of the DMA controller along with the DMA task. This gives the user of Senior an idea of how the DMA controller works with the Senior DSP.
After the requirement specification and the host interface have been discussed, Chapter 5 describes the detailed hardware architecture of the designed DMA controller, including the microarchitecture of each block. Once the DMA controller hardware is completed, it needs to be integrated into the Senior system; Chapter 6 discusses this integration from both the hardware and the software perspective, and also discusses the DMA controller behavioral model. Chapter 7 discusses the verification of the implemented hardware. Chapter 8 summarizes the results obtained, together with the conclusions and future work.

1.4 Notations

To make the thesis easier to follow, the reader should keep the following notations in mind. A $ or 0x before a number means that the number is hexadecimal. A number without any prefix is decimal. For example, "0x64" means decimal value 100, while "64" means decimal value 64. When discussing specific bits of a word, Verilog syntax is used as far as possible. Three zeros followed by three ones is written as 6’b000111, where 6 denotes the total number of bits and b indicates a binary value. status[10:5] means bits 10 down to 5 of the register status, and bit 3 alone is written as status[3].
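The bit-range notation can be mirrored in ordinary software. The following Python sketch only illustrates the notation and is not part of the thesis tool flow; the helper names `bits` and `bit` and the sample register value are hypothetical.

```python
def bits(value: int, msb: int, lsb: int) -> int:
    """Return value[msb:lsb] in the Verilog sense (both ends inclusive)."""
    if lsb < 0 or msb < lsb:
        raise ValueError("expected msb >= lsb >= 0")
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)


def bit(value: int, index: int) -> int:
    """Return the single bit value[index]."""
    return bits(value, index, index)


# 6'b000111: a 6-bit value, three zeros followed by three ones.
SIX_BIT_LITERAL = 0b000111

# Hypothetical register content, chosen only for illustration.
status = 0x0568

print(int("64", 16), int("64", 10))  # hexadecimal 0x64 next to decimal 64
print(bin(bits(status, 10, 5)))      # the field status[10:5]
print(bit(status, 3))                # the single bit status[3]
```

Shifting right by the lower index and masking with the field width is the usual software counterpart of a Verilog part-select.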
1.5 Abbreviations

3D      3 Dimensional
AGU     Address Generation Unit
ASIC    Application Specific Integrated Circuit
ASIP    Application Specific Instruction set Processor
DCT     Discrete Cosine Transform
DDR     Double Data Rate
DIT     Decimation In Time
DM      Data Memory
DMA     Direct Memory Access
DRAM    Dynamic Random Access Memory
DSP     Digital Signal Processor
FFT     Fast Fourier Transform
FIFO    First In First Out
FPGA    Field Programmable Gate Array
FSM     Finite State Machine
GIO     General I/O
HDTV    High Definition Television
I/O     Input/Output
ISR     Interrupt Service Routine
IP      Intellectual Property
JPEG    Joint Photographic Experts Group
LSB     Least Significant Bit
MB      Macro Block
MP3     MPEG 1 Layer 3
MSB     Most Significant Bit
MUX     Multiplexer
PC      Program Counter
PM      Program Memory
RTL     Register Transfer Level
SDR     Single Data Rate

Chapter 2 Background

With the help of pipelining, the processor core can execute one operation per cycle, including calculation, data load, and data store. In reality, however, optimal application performance cannot be achieved if the processor core has to do the data transfer itself [4, p.75]. This is where the DMA controller can be used to relieve the core of data movement.

2.1 DMA Basics

DMA stands for Direct Memory Access. It is a technique for transferring data blocks between memories directly, without using the processor for data access [6, p.535] [5]. Since the DSP is designed to do computationally intensive work, in most cases a separate peripheral should access the processor memories on behalf of the processor core. While the peripheral is performing memory transactions, the processor can carry out other operations that are not related to those memory transfers. A DMA module, or DMA controller, is by definition a peripheral module of a processor core for direct memory access. The basic work flow of a DMA transaction can be described as follows.
1. The core or another data unit prepares and sends a DMA request to the DMA controller when it wants to transfer a large amount of data.
2. The DMA controller prepares and transfers the data while the core performs other operations.
3. The core may poll the status of the DMA controller to see whether the transfer is complete, or the DMA controller sends an interrupt to the core or other data units when the transaction is finished.
4. The processor core then decides whether to continue processing the data.

A DMA subsystem can consist of a processor core, a DMA module, and several memory modules connected to both the processor core and the DMA module. The DMA module can provide DMA transfers between two memory interfaces. Transfers can also be performed between memories and high-speed I/Os. Figure 2.1 shows a typical DSP subsystem with the DMA module inside.

Figure 2.1. System overview

In this DSP subsystem, the DSP core acts as the system master, and the DMA module is a slave of the DSP core. On the other hand, the DMA module is the master of its connected memory modules, high-speed I/Os, etc. Both the DSP core and the DMA module can access the memory modules, but care must be taken since a memory cannot be accessed by both at the same time. From the DMA controller’s point of view, the master DSP core configures the data format of the transaction and requests the DMA controller to do the data transfer. The configuration is called a DMA channel; it consists of the task priority, the source port and destination port of the transfer, the start addresses of both ports, the data packet size, etc.

2.2 DMA Operations

Usually, the DMA controller should support more than one operation, since different DSP algorithms have quite different access patterns. This section illustrates several transfer options and their operations.

2.2.1 Normal DMA Operation

This is a simple DMA operation performing a block copy.
In this operation, the DMA controller performs a block copy from one location to another, either on the same interface or on different interfaces. The external software running on the processor core is responsible for limiting the access time. Figure 2.2 shows the basic DMA operation performed by the DMA controller.

Figure 2.2. Basic DMA operation to save processor run time.

As Figure 2.2 shows, the processor core is responsible for the DMA transaction. Once the processor needs data, it prepares a DMA request that specifies the basic parameters of the transfer and sends the request through the general I/O to the DMA controller. The DMA controller then transfers the corresponding data from memory location 1 to memory location 2 based on that request. When the transfer is finished, either the processor checks the status register of the DMA controller or an interrupt is sent to the processor. Once the processor knows that the transfer is done, it can use the data provided by the DMA controller. Thus, while the DMA controller is transferring the data, the processor can do other work instead of moving the data itself, which saves run time.

2.2.2 Chain Operation

In this operation, a contiguous set of elements can be transferred when a synchronous event occurs [1] [8]. The DMA controller transfers a chain of data elements that are spaced at equal distances. Once the DMA controller gets the task, it sets up the proper parameters and transfers each element in the chain. Figure 2.3 illustrates this operation. As Figure 2.3 shows, consecutive data elements are separated by a fixed stride. After transferring the first data element, the DMA controller can transfer the next one as if the data elements were chained together. This operation saves the extra time that would otherwise be needed for channel configuration.

Figure 2.3. DMA Chain operation example.
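As a rough software analogy of the chain operation, the sketch below gathers elements separated by a fixed stride. It is a minimal Python model under stated assumptions: memories are plain lists, the destination is written contiguously, and the function name `chain_transfer` and its parameters are hypothetical rather than taken from the DMA controller design.

```python
def chain_transfer(src, src_start, stride, count, dst, dst_start=0):
    """Copy `count` elements spaced `stride` apart in `src` into `dst`.

    Assumption: the destination is filled contiguously from `dst_start`.
    """
    for i in range(count):
        dst[dst_start + i] = src[src_start + i * stride]
    return dst


# Hypothetical 16-word source memory holding the values 0..15.
source = list(range(16))
destination = [None] * 4

# One configuration moves the whole chain: start at word 1, stride of 4.
chain_transfer(source, src_start=1, stride=4, count=4, dst=destination)
print(destination)
```

Only the start address, the stride, and the element count describe the whole chain, which is why a single configuration can stand in for many single-element requests.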
2.2.3 Linking Table Operation

In this operation, multiple data blocks are merged into one large data block of a single DMA transaction. Some DSP algorithms require data blocks located at different places in the main memory; with the help of a linking table, multiple data blocks can be loaded sequentially by one DMA transaction. For example, in a video CODEC application, it is often desirable to compare data from different reference frames [6]. A linking table concatenates several data blocks into one DMA transaction. Figure 2.4 gives an example of a linking table.

Figure 2.4. An example of DMA linking table operation.

The first data block starts at physical address 0x2000, and its length is 256 data words. As soon as the first data block has been loaded, loading of the second data block (block number 2) follows. As Figure 2.4 shows, its start address is 0x4000 and its length is 128. After data block 2 has been loaded, loading of data block 3 is activated immediately. The start address of data block 3 is 0x8000 and its length is 512. When link=0 is reached at the end of data block 3, the DMA transaction is finished. Using the linking table, the transfers of three non-contiguous data blocks are merged into a single DMA transaction.

In fact, the linking table operation is a more flexible form of the chain operation. Since the distance between data elements is not fixed, another parameter is needed to determine the length of each data element. Table 4.7 gives a detailed configuration of the linking table.

Chapter 3 Application Requirements

In this chapter, several application examples are described and analyzed, and the requirement specification is then proposed based on the analysis of these examples.

3.1 Application Analysis

First of all, let us take several application examples into consideration.

Example 3.1: Matrix Transposition

Suppose we want to transpose a matrix.
Original matrix, stored row by row at addresses 0 to 15 (address 0 holds 0x0, address 1 holds 0x1, and so on up to address 15, which holds 0xF):

0x0 0x1 0x2 0x3
0x4 0x5 0x6 0x7
0x8 0x9 0xA 0xB
0xC 0xD 0xE 0xF

Transposed matrix:

0x0 0x4 0x8 0xC
0x1 0x5 0x9 0xD
0x2 0x6 0xA 0xE
0x3 0x7 0xB 0xF

Figure 3.1. Matrix Transposition

The matrix may be stored consecutively in memory, as shown in Figure 3.1. To transpose the matrix, we can simply move each data word from its original address to the desired position. The transfer can thus be expressed with the chain operation discussed in Section 2.2.

Figure 3.2. Transfer decomposition of Example 3.1

The data transfer is represented in Figure 3.2; the whole transfer can be split into four chained transfers. In this example, the source address is discrete with a stride of 4 data words, while the destination address is continuous. This is only a simple example because the matrix is small. In a more complicated application the matrix could be very large, but the basic principle still holds.

Example 3.2: Create a large Matrix

Suppose we want to create a large matrix with 4096 elements, where every element has the same value, 0 or 1. This case is quite common in matrix manipulation in both communication algorithms and video processing algorithms. It is possible to create such a matrix by writing zeros or ones to a series of consecutive addresses, but doing so wastes quite a lot of precious core cycles that the core could spend on more useful tasks. In this case, we can simply use the DMA controller to create the matrix. Supposing the matrix is to be created in DM1, the core first writes one element of the matrix to DM0, and the DMA controller then transfers that same content to DM1. As Figure 3.3 shows, the transfer is quite simple: the source address is fixed, while the destination address is continuous. The amount of data to be transferred equals the size of the matrix.

Figure 3.3. Transfer decomposition of Example 3.2

Let us look at a more complicated and realistic example based on motion estimation algorithms [6, p.585].

Example 3.3: Motion Estimation

In the motion estimation algorithm, each macro block (usually 16 × 16 = 256 pixels) in the current frame is compared with the neighboring area of the reference frame.

[Figure: two 8 × 8 grids of macro blocks numbered row by row, 01 to 08 in the first row. Blocks 18, 19, 20, 26, 28, 34, 35, and 36 are marked in one grid and block 27 in the other.]

Figure 3.4. Neighbor Searching in Motion Estimation

Suppose we divide the picture into 8 × 8 = 64 macro blocks, each containing 256 pixels. We want to estimate the motion vector of macro block 27 in the current frame. According to the algorithm, we need to search the neighboring macro blocks in the reference frame, so macro blocks 18, 19, 20, 26, 28, 34, 35, and 36 in the reference frame are going to be compared. Usually, the data memory of the processor core is not large enough to hold the whole picture, so the desired data must be transferred from the main memory to the data memory of the processor core. The processor can then run the algorithm on the data.

Let’s say the segment address of the current frame in the main memory is 32768 and the address of the reference frame is 32768 + (8 × 8) × (16 × 16) = 49152. We can then specify the data blocks to be transferred as in Table 3.1.
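The address arithmetic behind this example can be checked with a short script. The Python sketch below is an illustration under assumptions: macro blocks are numbered from 1 in raster order with 8 blocks per row, each block occupies 256 consecutive words, the block of interest does not lie on the picture border, and all names (`block_address`, `neighbor_links`, and the constants) are hypothetical.

```python
MB_WORDS = 16 * 16        # one macro block holds 256 pixels
MBS_PER_ROW = 8           # the picture is 8 x 8 macro blocks
CURRENT_BASE = 32768      # segment address of the current frame
REFERENCE_BASE = CURRENT_BASE + (8 * 8) * MB_WORDS


def block_address(frame_base, block_number):
    """Start address of a macro block, numbered from 1 in raster order."""
    return frame_base + (block_number - 1) * MB_WORDS


def neighbor_links(block_number):
    """Return (start address, length) links for one motion estimation task.

    Link 1 is the block itself in the current frame. The remaining links
    cover its eight neighbors in the reference frame; the three blocks
    above and the three blocks below are adjacent in memory, so each row
    of three is merged into a single link.
    """
    above = block_number - MBS_PER_ROW
    below = block_number + MBS_PER_ROW
    return [
        (block_address(CURRENT_BASE, block_number), MB_WORDS),
        (block_address(REFERENCE_BASE, above - 1), 3 * MB_WORDS),
        (block_address(REFERENCE_BASE, block_number - 1), MB_WORDS),
        (block_address(REFERENCE_BASE, block_number + 1), MB_WORDS),
        (block_address(REFERENCE_BASE, below - 1), 3 * MB_WORDS),
    ]


links = neighbor_links(27)
for number, (start, length) in enumerate(links, start=1):
    print(f"Link {number}: start address {start}, length {length}")
print("Total words:", sum(length for _, length in links))
```

Under these assumptions the five pairs correspond to the five links listed in Table 3.1, so one linked transaction replaces five separately configured block copies.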
| Specification | Value | Comment |
|---|---|---|
| DMA task ID | 1 | The identification of the transaction |
| Task priority | 1 | The priority of the transaction |
| Number of links | 5 | |
| Source port | Main memory | |
| Destination port | DM0 | |
| Destination start address | 0 | |
| Link 1 start address | 32768 + 26 × 256 = 39424 | Block 27 in current frame |
| Link 1 length | 256 | |
| Link 2 start address | 49152 + 17 × 256 = 53504 | Block 18 in reference frame |
| Link 2 length | 768 | 3 blocks in row |
| Link 3 start address | 49152 + 25 × 256 = 55552 | Block 26 in reference frame |
| Link 3 length | 256 | |
| Link 4 start address | 49152 + 27 × 256 = 56064 | Block 28 in reference frame |
| Link 4 length | 256 | |
| Link 5 start address | 49152 + 33 × 256 = 57600 | Block 34 in reference frame |
| Link 5 length | 768 | 3 blocks in row |

Table 3.1. Preparing DMA for Motion Estimation

Based on the data block specification in Table 3.1, the transfer decomposition can be drawn as shown in Figure 3.5.

Figure 3.5. Transfer decomposition of Example 3.3

3.2 Requirement Specification

As described in Chapter 2, the DSP core is responsible for configuring the DMA controller, so we need to specify the parameters of the memory transfer. As configured by the DSP core, the DMA controller connects a source port and a destination port. Here, a port is either a data source supplying data or a data sink consuming data. In most cases, a port is a memory location or a data buffer. A DMA data transaction moves data from the source port to the destination port as configured by the DMA task from the master DSP core. To design the DMA controller, we need to specify the following parameters of the memory transaction.

• Number of ports supported by the DMA controller
This specifies the number of channels that can be connected by the DMA controller.
• Address Generator Unit (AGU)
The AGU provides the addresses required for memory access. At least two AGUs are needed: one to provide the source address and the other to provide the destination address of a data transfer.
• Data Width
Since the DMA controller should support different memory modules, the width of the data path should be configurable. We need to specify the data widths supported by the DMA module.
• Memory Organization
There are two different ways to store words in a byte-addressed memory. Storing the least significant byte at the lower address is called “little endian”, while storing the least significant byte at the higher address is called “big endian” [3]. There is no specific reason to choose one over the other, but we still need to specify the format supported during data transfer.
• Linking Table support
As described in Chapter 2, the linking table saves the extra cost of configuring several separate data blocks by concatenating them into one transaction. On the other hand, it costs extra hardware to keep track of several different data blocks [2]. Thus, we need to specify the length of the linking table.

Table 3.2 shows the requirement specification of the DMA controller to be designed for the Senior system.

| No. | Description |
|---|---|
| 1 | 16 Source ports: 8 on-chip memory, 1 off-chip memory, 1 high-speed I/O, others reserved. |
| 2 | 16 Destination ports: 8 on-chip memory, 1 off-chip memory, 1 high-speed I/O, others reserved. |
| 3 | Address Generator Unit (AGU): 1 for Source port, 1 for Destination port, each with a 32-bit address space. |
| 4 | Clock Generator: supplies the clock signal for memory (I/O). Source:Destination ratios 1:1, 1:1/2, 1:1/4, 1:1/8, 1:1/16, 1:1/32, 1:1/64; 1:2, 1:4, 1:8, 1:16, 1:32, 1:64. |
| 5 | Data width. Source port: 8 bits, 16 bits, 32 bits (64 bits not implemented in Senior). Destination port: 8 bits, 16 bits, 32 bits (64 bits not implemented in Senior). |
| 6 | Memory organization: the DMA controller should support both big endian and little endian data. |
| 7 | Linking Table supported; the maximum length of the linking table is 64. |

Table 3.2. Requirement Specification

Chapter 4 Interfaces

The DMA module is controlled by the Senior core.
Thus, when configuring, the Senior core uses its I/O instructions in and out to read and write the registers of the DMA module.

4.1 Host Interface

The host interface of the DMA module conforms to the standard Senior I/O and should be connected through the general I/O of the Senior processor. The interface between the DSP core and the DMA module is listed in Table 4.1. The data buses from and to the DMA module are 32 bits wide. Only the 16 LSBs are used for the current DMA configuration.

| Name | Width | DIR | Description |
|---|---|---|---|
| clk_i | 1 | In | System clock. |
| rst_i | 1 | In | System reset, active low. |
| addr_i | 16 | In | Address input (from DSP core). |
| data_i | 32 | In | Data input (from DSP core). |
| rd_strobe_i | 1 | In | Read strobe signal. |
| wr_strobe_i | 1 | In | Write strobe signal. |
| data_o | 32 | Out | Data output (to DSP core). |

Table 4.1. Host Interface

Table 4.2 gives an overview of the DMA register specification. Reference [7, p.53] gives more detailed information about how to connect a peripheral to the Senior I/O.

| Name | Addr | Width | Written by | Description |
|---|---|---|---|---|
| Status | 00 | 16 | DMA | Shows the status of the DMA. Further details can be found in Table 4.3. |
| Control | 01 | 16 | Senior | Used for configuring and controlling the DMA. Details can be found in Table 4.4. |
| Output Data | 10 | 16 | | Not used in the current implementation. |
| Input Data | 11 | 16 | Senior | The DSP core writes the task packet to this port to configure the DMA channel. |

Table 4.2. DMA Registers specification

4.1.1 Main Status Register

The status register shows the status of DMA transactions. Firmware developers can use this register to handle the DMA transactions.

| Bits | Specification |
|---|---|
| [0] | Idling or busy: Idle = 0, Busy = 1. |
| [1] | When 1, a channel can be configured. When 0, no channel is available. |
| [2] | When 1, the running task is finished. |
| [3] | When 1, an exception has occurred. |
| [4] | When 1, the task queue is full. |
| [15:5] | Reserved |

Table 4.3. Main status register specification

4.1.2 Main Control Register

The control register, as the name suggests, is used to control a DMA transaction.
| Bits | Specification |
|---|---|
| [0] = 1 | Reset DMA, flush the current task. |
| [1] = 1 | Shut down DMA. |
| [2] = 1 | Data rate: always using DMA clock. |
| [9:3] | Reserved |
| [10] = 1 | Activate the task (channel) specified by the task ID. |
| [14:11] | DMA task ID |
| [15] | When [15] = 1, ask for a channel configuration. |

Table 4.4. Main control register specification

4.1.3 Special Memory Control Register

This register does not belong to the general I/O of the Senior core. It is a special purpose register, which is written by the DMA controller and read by the Senior core. By writing the corresponding bit in the register, the DMA controller notifies the Senior core which memory is currently being accessed.

| Bits | Specification |
|---|---|
| [0] = 1 | The DMA controller is accessing DM 0. |
| [1] = 1 | The DMA controller is accessing DM 1. |
| [2] = 1 | The DMA controller is accessing PM. |
| [15:3] | Reserved |

Table 4.5. Special memory control register specification

4.2 Memory Interface

The memory interface connects the DMA module to its slaves. Since the DMA module supports 16 in ports and 16 out ports, we need 32 ports in all. Table 4.6 shows the details of the memory interface needed for the DMA module.

| Name | Width | DIR | Description |
|---|---|---|---|
| src0_data_i | 32 | I | Data input for Source Port 0. |
| src0_addr_o | 16 | O | Address output for Source Port 0. |
| src0_csn_o | 1 | O | Memory chip select enable for Source Port 0, active low. |
| src0_oe_o | 1 | O | Memory output enable for Source Port 0, active low. |
| src1 | | | Interfaces for Source Port 1. |
| ... | | | |
| src15 | | | Interfaces for Source Port 15. |
| dst0_data_o | 32 | O | Data output for Destination Port 0. |
| dst0_addr_o | 16 | O | Address output for Destination Port 0. |
| dst0_csn_o | 1 | O | Memory chip select enable for Destination Port 0, active low. |
| dst0_we_o | 1 | O | Memory write enable for Destination Port 0, active low. |
| dst1 | | | Interfaces for Destination Port 1. |
| ... | | | |
| dst15 | | | Interfaces for Destination Port 15. |

Table 4.6. Memory Interface

4.3 Behavior model of I/O

Since we use only one data I/O both for configuring the DMA module and for writing DMA tasks, we need a protocol to distinguish DMA configuration from task reception. Figure 4.1 illustrates the configuration flow of the DMA module.

Figure 4.1. DMA configuration

Here, the PREAMBLE means the first control vector sent to the control register of the DMA module. Chapter 6 shows several examples of how to program the DMA controller.

4.4 Task Packet Specification

The task packet is used to set up the DMA transfer channel, both for normal DMA operation and for linking table multiple transactions. Since the DSP core has a general I/O with a 16-bit data width, the task packet is also 16 bits wide per data word.

We specify a transaction by configuring a channel. The configuration covers the source, the destination and the transaction. Generally, a basic channel configuration includes the following items:

• Task priority
• Data size: the length of the data block.
• Data from: the name of the source port.
• Data to: the name of the destination port.
• The physical start address of the source port.
• The physical start address of the destination port.
• The endian behavior of the source port: big or little endian.

Besides the software configuration of the DMA transaction, the hardware specifications of transactions are also important for DMA designers and DMA users to know:

• The maximum source clock rate.
• The maximum destination clock rate.
• Data width of the source port: 8 bits, 16 bits, 32 bits or 64 bits.
• Data width of the destination port: 8 bits, 16 bits, 32 bits or 64 bits.
• Data protocol of the source port: error check or not.

Table 4.7 shows a task packet consisting of 2 links, and Tables 4.8 to 4.12 explain each control vector. The length of the task packet depends on the number of links in the linking table.
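As an illustration of the packet layout described in Tables 4.7 to 4.12, the following C++ sketch packs a channel configuration into 16-bit words. It is a minimal sketch and not part of the DMA design: the type and function names (`Link`, `TaskConfig`, `taskPacketLength`, `buildTaskPacket`) are hypothetical, and the preamble word written before the packet is deliberately left out. It assumes five common words followed by three words per link.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper types for illustration only.
struct Link {
    uint32_t srcAddr;  // 32-bit source start address of this link
    uint16_t length;   // length of this link
};

struct TaskConfig {
    uint8_t taskPriority;  // 4 bits
    uint8_t taskID;        // 4 bits
    uint8_t srcWidth;      // 2-bit code: 0 = 8b, 1 = 16b, 2 = 32b, 3 = 64b
    uint8_t dstWidth;      // same coding as srcWidth
    bool srcParity;        // "SRC proc" bit
    bool dstParity;        // "DST proc" bit
    bool srcBigEndian;     // 0 = little endian, 1 = big endian
    bool dstBigEndian;
    uint8_t srcRate;       // 4-bit code: 0 = clk, 1 = clk/2, ...
    uint8_t dstRate;
    uint8_t srcPort;       // 5 bits
    uint8_t dstPort;       // 5 bits
    uint32_t dstAddr;      // 32-bit destination start address
    std::vector<Link> links;
};

// Assumed packet length: 5 common words plus 3 words per link.
std::size_t taskPacketLength(std::size_t numLinks) { return 5 + 3 * numLinks; }

std::vector<uint16_t> buildTaskPacket(const TaskConfig& c) {
    std::vector<uint16_t> p;
    p.reserve(taskPacketLength(c.links.size()));
    auto push = [&p](uint32_t v) { p.push_back(static_cast<uint16_t>(v & 0xFFFFu)); };

    const uint32_t n = static_cast<uint32_t>(c.links.size());
    // Control vector 1: number of links, task priority, task ID.
    push(((n & 0xFFu) << 8) | ((c.taskPriority & 0xFu) << 4) | (c.taskID & 0xFu));
    // Control vector 2: widths, parity, endian and clock rates.
    push(((c.srcWidth & 0x3u) << 14) | ((c.dstWidth & 0x3u) << 12) |
         (static_cast<uint32_t>(c.srcParity) << 11) |
         (static_cast<uint32_t>(c.dstParity) << 10) |
         (static_cast<uint32_t>(c.srcBigEndian) << 9) |
         (static_cast<uint32_t>(c.dstBigEndian) << 8) |
         ((c.srcRate & 0xFu) << 4) | (c.dstRate & 0xFu));
    // Control vector 3: source and destination port numbers.
    push(((c.srcPort & 0x1Fu) << 5) | (c.dstPort & 0x1Fu));
    // Control vectors 4 and 5: destination address, low part first.
    push(c.dstAddr);
    push(c.dstAddr >> 16);
    // Three words per link: source address low, high, then link length.
    for (const Link& l : c.links) {
        push(l.srcAddr);
        push(l.srcAddr >> 16);
        push(l.length);
    }
    return p;
}
```

With three links, 16-bit source and destination widths, source port 3 and destination port 4, the first three words come out as 0x0301, 0x5000 and 0x0064 for task ID 1.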
| Word | Fields, MSB to LSB (width) |
|---|---|
| 1 | Number of Links (8b), Task Priority (4b), Task ID (4b) |
| 2 | SRC width (2b), DST width (2b), SRC proc (1b), DST proc (1b), SRC endian (1b), DST endian (1b), SRC rate (4b), DST rate (4b) |
| 3 | Reserved (6b), Source Port (5b), Destination Port (5b) |
| 4 | Destination Address low part (16b) |
| 5 | Destination Address high part (16b) |
| 6 | Source Address 1 low part (16b) |
| 7 | Source Address 1 high part (16b) |
| 8 | Length of Link 1 (16b) |
| 9 | Source Address 2 low part (16b) |
| 10 | Source Address 2 high part (16b) |
| 11 | Length of Link 2 (16b) |
| ... | ... |

Table 4.7. Task packet specification

| Name | Bits | Description |
|---|---|---|
| Number of Links | [15:8] | Specify the total number of links, up to 64. |
| Task Priority | [7:4] | Specify the priority of the task. (Not yet implemented.) |
| Task ID | [3:0] | Specify the task ID. |

Table 4.8. Control Vector 1

| Name | Bits | Description |
|---|---|---|
| SRC width | [15:14] | Specify the data width of the source port: 2'b00: 8 bits; 2'b01: 16 bits; 2'b10: 32 bits; 2'b11: 64 bits |
| DST width | [13:12] | Specify the data width of the destination port: 2'b00: 8 bits; 2'b01: 16 bits; 2'b10: 32 bits; 2'b11: 64 bits |
| SRC proc | [11] | Specify if the source port uses parity check: 1'b0: don't use; 1'b1: use |
| DST proc | [10] | Specify if the destination port uses parity check: 1'b0: don't use; 1'b1: use |
| SRC endian | [9] | Specify the endian of the source port: 1'b0: little endian; 1'b1: big endian |
| DST endian | [8] | Specify the endian of the destination port: 1'b0: little endian; 1'b1: big endian |
| SRC rate | [7:4] | Clock rate of the source port: 4'b0000: clk; 4'b0001: clk/2; 4'b0010: clk/4; 4'b0011: clk/8; 4'b0100: clk/16; 4'b0101: clk/32; 4'b0110: clk/64 |
| DST rate | [3:0] | Clock rate of the destination port: 4'b0000: clk; 4'b0001: clk/2; 4'b0010: clk/4; 4'b0011: clk/8; 4'b0100: clk/16; 4'b0101: clk/32; 4'b0110: clk/64 |

Table 4.9. Control Vector 2

| Name | Bits | Description |
|---|---|---|
| Reserved | [15:10] | Reserved for future use. |
| Source Port | [9:5] | Specify the source port number. |
| Destination Port | [4:0] | Specify the destination port number. |

Table 4.10. Control Vector 3

| Name | Bits | Description |
|---|---|---|
| Destination Address low part | [15:0] | Low 16-bit part of the destination address. |
| Destination Address high part | [15:0] | High 16-bit part of the destination address. |

Table 4.11. Control Vector 4 & 5

| Name | Bits | Description |
|---|---|---|
| Source Address 1 low part | [15:0] | Specify the low 16-bit part of source address 1. |
| Source Address 1 high part | [15:0] | Specify the high 16-bit part of source address 1. |
| Length of Link 1 | [15:0] | Specify the length of Link 1. |

Table 4.12. Control Vector 6 & 7 & 8

Chapter 5 DMA Hardware

Generally, the DMA controller hardware can be divided into a data path and a control path [6, p.572]. Figure 5.1 shows the basic architecture of the DMA module.

Figure 5.1. DMA Hardware architecture

The DMA data path gets data from the source port using the source address generator, and stores data to the destination port using the destination address generator. In order to handle data with different data rates and formats, source decoding and destination decoding modules are also needed.

The DMA control path consists of the channel configuration FSM (Finite State Machine) and the transaction FSM. The DSP core can request the configuration of a channel. When the DMA is idle, the channel configuration FSM will issue the channel to the transaction FSM module. The transaction FSM is responsible for the control of the data path. When the block is transmitted, the channel configuration FSM will generate an interrupt to the DSP core.

The following sections give more detailed information about the sub-blocks of the DMA controller. Figure 5.2 shows the block diagram of the DMA controller with its main inputs and outputs.

Figure 5.2. DMA Controller Block Diagram

5.1 Host Interface

This is the interface between the Senior DSP core and the DMA controller. It stores the control vectors sent by the DSP core in registers inside the DMA controller and updates the status register, which can be accessed by the Senior DSP core.
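To make the register traffic through the host interface concrete, the bit assignments of Table 4.3 (status) and Table 4.4 (control) can be sketched in C++ as below. This is only an illustrative sketch: `StatusBits`, `decodeStatus`, `requestChannelConfig` and `activateTask` are hypothetical names, not functions of the hardware or of the simulator.

```cpp
#include <cstdint>

// Decoded view of the main status register (bit positions from Table 4.3).
struct StatusBits {
    bool busy;       // [0] Idle = 0, Busy = 1
    bool chReady;    // [1] a channel can be configured
    bool finish;     // [2] running task is finished
    bool exception;  // [3] an exception has occurred
    bool queueFull;  // [4] task queue is full
};

StatusBits decodeStatus(uint16_t reg) {
    StatusBits s;
    s.busy      = ((reg >> 0) & 1u) != 0;
    s.chReady   = ((reg >> 1) & 1u) != 0;
    s.finish    = ((reg >> 2) & 1u) != 0;
    s.exception = ((reg >> 3) & 1u) != 0;
    s.queueFull = ((reg >> 4) & 1u) != 0;
    return s;
}

// Control word asking for a channel configuration (bit [15], Table 4.4).
uint16_t requestChannelConfig() {
    return static_cast<uint16_t>(1u << 15);
}

// Control word activating a task: bit [10] plus the task ID in bits [14:11].
uint16_t activateTask(uint8_t taskID) {
    return static_cast<uint16_t>((1u << 10) | ((taskID & 0xFu) << 11));
}
```

For instance, a status value of 0x0006 decodes to "channel available" and "task finished", while `requestChannelConfig()` and `activateTask(0)` produce the control words 0x8000 and 0x0400.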
5.1.1 Block Diagram

Figure 5.3 shows the block diagram of the Host Interface. The input MUX selects the input I/O data based on the input I/O address. The task FIFO holds the task packet, which will be used by the transaction FSM. The output MUX outputs the desired data based on the I/O address.

Figure 5.3. Block diagram of Host Interface Module

5.1.2 Interface

Table 5.1 gives the detailed interface description of the Host Interface.

| Name | Width | DIR | Description |
|---|---|---|---|
| clk_i | 1 | I | Clock input. |
| rst_i | 1 | I | Synchronous reset, active low. |
| io_data_i | 16 | I | Data input from Host interface. |
| io_addr_i | 16 | I | Address input from Host interface. |
| io_rd_strobe_i | 1 | I | Read strobe from Host interface. |
| io_wr_strobe_i | 1 | I | Write strobe from Host interface. |
| io_data_o | 16 | O | Data output to Host interface. (Reserved) |
| config_reg_addr_i | 8 | I | Read address for Task queue. |
| config_reg_addr_en_i | 1 | I | Read enable signal for Task queue. |
| config_reg_data_o | 16 | O | Task queue data output. |
| contrl_reg_o | 16 | O | DMA control register, output to transaction FSM. |
| status_reg_i | 16 | I | DMA status register, input from transaction FSM. |

Table 5.1. Interface of Host Interface Module

5.2 Source Address Generator

This module generates the address for the source port. It is controlled by the transaction FSM.

5.2.1 Block Diagram

Figure 5.4 shows the block diagram of the source address generator.

Figure 5.4. Block diagram of Source address generator

Once the transaction FSM has decoded the task packet parameters into several control signals, it sends these signals to the source address generator. As shown in Figure 5.4, an adder is used inside the source address generator to produce the output source port address. Two counters are also implemented to count how many words and how many links have been transferred, so that the end link or end transfer signal is asserted once the link or the whole transfer is finished.
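The adder-plus-two-counters structure described above can be modelled behaviorally in a few lines of C++. The sketch below is an assumption-laden illustration, not the RTL: the class name `SourceAddressGenerator` and its methods are hypothetical, the step is passed directly as an address increment (the actual encoding of the 2-bit step input is not specified here), and end-of-link is assumed to be raised on the cycle in which the last word of a link is addressed.

```cpp
#include <cstdint>

// Behavioral sketch of an address generator with one adder and two counters.
class SourceAddressGenerator {
public:
    explicit SourceAddressGenerator(uint8_t totalLinks)
        : totalLinks_(totalLinks) {}

    // Load the start address and length of the next link (the "set address" step).
    void setLink(uint32_t startAddr, uint16_t length) {
        addr_ = startAddr;
        length_ = length;
        wordCount_ = 0;
        endLink_ = false;
    }

    // One enabled cycle: return the address used in this cycle, then advance.
    uint32_t tick(uint32_t step) {
        const uint32_t current = addr_;
        addr_ += step;                 // the adder
        ++wordCount_;                  // word counter
        if (wordCount_ == length_) {
            endLink_ = true;
            ++linkCount_;              // link counter
            if (linkCount_ == totalLinks_) {
                endTransfer_ = true;
            }
        }
        return current;
    }

    bool endLink() const { return endLink_; }
    bool endTransfer() const { return endTransfer_; }

private:
    uint8_t totalLinks_;
    uint8_t linkCount_ = 0;
    uint16_t length_ = 0;
    uint16_t wordCount_ = 0;
    uint32_t addr_ = 0;
    bool endLink_ = false;
    bool endTransfer_ = false;
};
```

In this model the controlling FSM would call `setLink` again after each end-of-link indication until `endTransfer()` becomes true.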
5.2.2 Interface

Table 5.2 gives the interface details of the source address generator.

| Name | Width | DIR | Description |
|---|---|---|---|
| clk_i | 1 | I | Clock input. |
| step_i | 2 | I | Address increment step. |
| enable_i | 1 | I | Enable address increment. |
| set_addr_i | 1 | I | Set start address. |
| end_link_o | 1 | O | Indicate the end of one link. |
| end_transfer_o | 1 | O | Indicate the end of transfer. |
| src_addr_i | 32 | I | Start address of the transfer. |
| src_length_i | 16 | I | Transfer length. |
| src_link_number_i | 8 | I | Total number of links. |
| src_addr_o | 32 | O | Source address output. |

Table 5.2. Interface of Source address generator

5.3 Destination Address Generator

This module generates the address for the destination port. Its control signals are provided by the transaction FSM.

5.3.1 Block Diagram

Figure 5.5 shows the block diagram of the destination address generator.

Figure 5.5. Block diagram of Destination address generator

This module has the same structure as the source address generator. The only difference is that it does not need the counters for counting transferred words or links.

5.3.2 Interface

Table 5.3 gives the detailed interface description of the destination address generator.

| Name | Width | DIR | Description |
|---|---|---|---|
| clk_i | 1 | I | Clock input. |
| step_i | 2 | I | Address increment step. |
| enable_i | 1 | I | Enable address increment. |
| setaddr_i | 1 | I | Set start address. |
| addr_i | 32 | I | Start address of the transfer. |
| addr_o | 32 | O | Address output. |

Table 5.3. Interface of Destination address generator

5.4 Source Decoder

This module decodes the incoming data based on the task packet provided by the transaction FSM. It adapts the data to the internal data format which can be transferred through the channel.

5.4.1 Block Diagram

Figure 5.6 shows the block diagram of the source decoder.

Figure 5.6. Block diagram of Source decoder

The source decoder consists of several MUXs that decode the incoming data based on control signals provided by the transaction FSM.
First, the input data are segmented into 8 bytes, then the MUXs select the right combination of data bytes to form the internal data format.

5.4.2 Interface

Table 5.4 gives the interface details of the source decoder.

| Name | Width | DIR | Description |
|---|---|---|---|
| clk | 1 | I | Clock input. |
| rst | 1 | I | Synchronous reset, active low. |
| src_width | 2 | I | Source data width. |
| src_parity | 1 | I | Source parity check. |
| src_endian | 1 | I | Source endian. |
| channel_din | 64 | I | Data input from source port. |
| channel_dout | 64 | O | Data output to channel FIFO. |

Table 5.4. Interface of Source decoder

5.5 Destination Decoder

This module packages the internal data format into the data format specified by the task packet.

5.5.1 Block Diagram

Figure 5.7 shows the block diagram of the destination decoder.

Figure 5.7. Block diagram of Destination decoder

The destination decoder has a structure similar to the source decoder. The output MUX combines the internal data into the desired data format based on control signals provided by the transaction FSM.

5.5.2 Interface

Table 5.5 gives the detailed interface description of the destination decoder.

| Name | Width | DIR | Description |
|---|---|---|---|
| clk | 1 | I | Clock input. |
| rst | 1 | I | Synchronous reset, active low. |
| dest_width | 2 | I | Destination data width. |
| dest_parity | 1 | I | Destination parity check. |
| dest_endian | 1 | I | Destination endian. |
| channel_din | 64 | I | Data input from channel FIFO. |
| channel_dout | 64 | O | Data output to destination port. |

Table 5.5. Interface of Destination decoder

5.6 Transaction FSM

This FSM controls all transactions based on the task packet provided by the DSP core. It receives the incoming task packet and saves the packet into the DMA internal registers. The transaction FSM decodes the task packet based on the specification in Table 3.2 and then issues different control signals to different sub-blocks of the DMA controller to complete the DMA transaction. Figure 5.8 shows the Finite State Machine of the control logic.

Figure 5.8. Finite State Machine of the control logic

There are eight states in the transaction FSM in the current design. IDLE is the default state when the DMA controller is reset. Once the Senior core requests to configure the DMA controller, the CONFIG1 state is entered, and the transaction FSM decodes the incoming common control vectors until it finishes the first 5 common control vectors. States CONFIG2_1, CONFIG2_2 and CONFIG2_3 continue by configuring the source address and link length of the linking table. Once the channel is configured, state TRANS is entered and the DMA controller starts the data transfer. When the FSM receives the “end of link” signal, state WAIT is entered to wait for the configuration of the next transfer in the linking table. Then the FSM repeats states CONFIG2_1, CONFIG2_2 and CONFIG2_3 to configure the channel. Once the “end of transfer” signal is detected, state FINISH is entered, the interrupt signal is sent to the Senior core and the status register is updated. Then the DMA controller waits for the Senior core to respond either to the status register or to the interrupt signal.

5.6.1 Interface

Table 5.6 gives the detailed interface description of the Transaction FSM.

| Name | Width | DIR | Description |
|---|---|---|---|
| clk_i | 1 | I | Clock input. |
| rst_i | 1 | I | Synchronous reset, active low. |
| src_port_o | 5 | O | Source port number. |
| dst_port_o | 5 | O | Destination port number. |
| config_reg_data_i | 16 | I | Task packet data input. |
| contrl_reg_i | 16 | I | Control register data input. |
| config_reg_addr_o | 8 | O | Task packet read address. |
| config_addr_en_o | 1 | O | Task packet read enable. |
| status_reg_o | 16 | O | Status register data output. |
| src_addr_o | 32 | O | Start address of source port. |
| src_addr_en_o | 1 | O | Enable source port start address. |
| src_addr_incr | 2 | O | Increment step of source port. |
| enable_src_gen_o | 1 | O | Source address generator enable signal. |
| link_length_o | 16 | O | Length of current transfer link. |
| link_num_o | 8 | O | Total link number. |
| end_link_i | 1 | I | End of current link. |
| end_transfer_i | 1 | I | End of current transfer. |
| dst_addr_o | 32 | O | Start address of destination port. |
| dst_addr_en_o | 1 | O | Enable destination port start address. |
| dst_addr_incr | 2 | O | Increment step of destination port. |
| enable_dst_gen_o | 1 | O | Destination address generator enable signal. |
| src_rate_o | 4 | O | Source port data rate. |
| src_parity_o | 1 | O | Source port parity check. |
| src_endian_o | 1 | O | Source port endian. |
| dst_rate_o | 4 | O | Destination port data rate. |
| dst_parity_o | 1 | O | Destination port parity check. |
| dst_endian_o | 1 | O | Destination port endian. |
| src_csn_o | 1 | O | Source port chip select enable, active low. |
| src_oe_o | 1 | O | Source port output enable, active low. |
| dst_csn_o | 1 | O | Destination port chip select enable, active low. |
| dst_we_o | 1 | O | Destination port write enable, active low. |

Table 5.6. Interface of Transaction FSM

Chapter 6 Integration

Since the DMA controller should work together with the Senior DSP core, we need to integrate the DMA controller into the processor core. This chapter introduces the basic flow, which includes the hardware integration and the software integration.

6.1 Hardware Integration

The DMA controller works as a peripheral of the Senior DSP core. As introduced in Chapter 4 and Reference [7], the peripheral can be connected to any available GIO. In the following piece of code, the DMA controller is connected to I/O number 5. The Senior DSP system has other peripherals connected, such as a timer and an interrupt controller.

The memory interface of the DMA controller should also be connected to the current Senior memory sub-system. Since the processor needs to know which memory is being accessed by the DMA controller, to make sure the processor core will not access the same memory module, the Special Memory Control Register of the DMA controller should also be connected to the Senior core.

6.2 Software Integration

In order to make the verification of the DMA controller easier, a behavioral model of the DMA controller is also developed.
Thus, it is necessary to integrate the behavioral model into the simulator. The behavioral model is written in C++. At first, the behavioral model was not exactly cycle accurate. After the simulation of the hardware implementation, the behavioral model was further tuned to meet the timing specification of the actual hardware.

The behavioral model should be compiled together with the Senior simulator. The DMA controller should be instantiated in the header file of the simulator, as in Example 6.1.

Example 6.1: Create DMA Behavioral Model in simulator

    class Senior {
    public:
        ...
        // --------------------------------
        // DMA
        // --------------------------------
        DMAController dma_controller;
        ...
    };

In the Senior simulator, the DMA controller should be connected to the program memory and data memory in the constructor of Senior. It should be connected to a specific I/O address as well. The code is shown in Example 6.2.

Example 6.2: Connect DMA Controller

    Senior::Senior()
    {
        ...
        dma_controller.cycle = &cycle;
        dma_controller.peripherals = &peripherals;
        dma_controller.pm[0] = &pm[0];
        dma_controller.pm[1] = &pm[1];
        for (int i=0; i<4; i++) {
            dma_controller.dm[i] = &dm[i];
        }
        ...
    }

    int SrSim::srmain(int argc, char** argv)
    {
        ...
        // Add DMA peripheral at I/O address 5
        fprintf(stdout, "Loading DMA peripheral at address 5.\n");
        addPeripheral(&(dma_controller),5);
        ...
    }

6.3 DMA Programming

This section lists some sample code by which the Senior DSP core can program the DMA controller.

6.3.1 Initialize the DMA Controller

In Example 6.3, the Senior core configures the DMA controller with a task packet containing 3 links through its I/O instructions.
Example 6.3: Configure the DMA Controller

    ;; Define the address of DMA registers
    ;; DMA is connected to I/O 5
    #define DMA_STATUS   0x05
    #define DMA_CONTRL   0x45
    #define DMA_OUT_DATA 0x85
    #define DMA_IN_DATA  0xC5

    .code
    ;;; DMA task 1
        set r9,$FFFF        ; start symbol, task package preamble
    ;;; number of link = 3, priority = 0, task ID = 1
        set r10,$0301
    ;;; width = 16bit, endian = 0, src / dst rate = 1
        set r11,$5000
    ;;; src port = 3, dst port = 4
        set r12,$0064
        out DMA_IN_DATA,r9
        out DMA_IN_DATA,r10 ; write config vector to config fifo
        out DMA_IN_DATA,r11
        out DMA_IN_DATA,r12

        set r10,$0010       ; dst addr low part
        set r11,$0000       ; dst addr high part
        out DMA_IN_DATA,r10
        out DMA_IN_DATA,r11
    ;; link 1
        set r10,$0000       ; src addr low part
        set r11,$0000       ; src addr high part
        set r12,32          ; link length = 32
        out DMA_IN_DATA,r10
        out DMA_IN_DATA,r11
        out DMA_IN_DATA,r12
    ;; link 2
        set r10,$0030       ; src addr low part
        set r11,$0000       ; src addr high part
        set r12,16          ; link length = 16
    ;; link 3
        set r13,$0060       ; src addr low part
        set r14,$0000       ; src addr high part
        set r15,$40         ; link length = 64
        out DMA_IN_DATA,r10
        out DMA_IN_DATA,r11
        out DMA_IN_DATA,r12
        out DMA_IN_DATA,r13
        out DMA_IN_DATA,r14
        out DMA_IN_DATA,r15

    ;;; wait for channel configuration
    task1_channel_config
        in r1,DMA_STATUS
        nop
        and r1,$0002
        sub r1,$0002
        jump.ne task1_channel_config

    ;; start DMA task 1
    ;; write control register, start config channel
    ;; and start DMA transfer
        set r1,0x8000       ; config a channel
        set r2,0x0400       ; start DMA
        out DMA_CONTRL,r1
        out DMA_CONTRL,r2

6.3.2 Poll the DMA Controller

In Example 6.4, the Senior core polls the status register of the DMA controller to check whether the transfer is completed. Once the transaction is done, the processor leaves the loop and continues with other work.
Example 6.4: Poll the status of DMA Controller

    ;;; wait for DMA task 1 finish
    task1_done
        in r1,DMA_STATUS
        nop
        and r1,$0006
        sub r1,$0006
        jump.ne task1_done
    ;;; Start to do other things

6.3.3 Handle the DMA Interrupt

Example 6.4 shows the big disadvantage of polling the DMA controller: the processor cannot do anything but wait for the DMA controller to complete the transfer. Thus, it is necessary to use the interrupt so that the processor core can do other things while the DMA controller is doing the transfer. In Example 6.5, the entry for the interrupt service routine (ISR) should be set correctly according to the actual hardware connection. The flow of the interrupt can be described as:

[Interrupt Received] → [Push Flags] → [Push PC] → [PC = DM1[SPR(intaddr)]] → [Interrupt service routine] → [Instruction = RETI] → [Pop PC and Start Jump] → [Pop Flags]

Example 6.5: Handle the DMA Interrupt

    .code
        set sp, 0x7000            ; set the stack point
        set intaddr, 0x0000       ; set interrupt BASE address (DM1)
        set r0, INTERRUPT_ROUTINE
        nop
        st1 (0x0008), r0          ; store interrupt address 4 at BASE+8
        jump MAIN_PROGRAM

    INTERRUPT_ROUTINE
        ;;; Here is the interrupt service routine
        reti

    MAIN_PROGRAM
        ;;; Main Program

Chapter 7 Verification

After the hardware is completed, it is always important to verify the correctness of the designed hardware. In the semiconductor industry, it is extremely critical to make sure the design is bug-free before tape out, since the non-recurring engineering (NRE) cost of a tape out in 0.13µm technology was more than 1 million USD in 2004 [10]. More modern technology has an even higher NRE cost.

7.1 Functional Verification

The functional verification of the DMA controller is based on the test bench of the Senior processor. The basic principle of verification is to compare the output from the behavioral model of the DMA controller with the output from the RTL code simulation.
If the results match, the designed hardware is considered correct; otherwise, debugging is required. Figure 7.1 shows the functional verification flow.

Figure 7.1. DMA Functional Verification Flow

Several test cases have been developed to increase the code coverage of the design. Currently, normal DMA operation, linking table operation and large block transfer with interrupt have been tested. The code coverage is 91.7%.

7.2 Hardware Implementation

For a hardware design, it is always exciting to implement the design in real hardware, either on FPGA or on ASIC. It is an honor that Professor Liu offered me the opportunity to turn my design into real hardware. The FPGA implementation targeted a Xilinx Virtex 4 FPGA, while the ASIC implementation targeted Infineon 65nm CMOS technology.

The implementation was straightforward: the logic synthesizer translates the RTL code into a netlist based on the specific technology, either CMOS standard cells or FPGA cells. The backend tool produces the layout based on the floorplan and the synthesized netlist. Some optimization is performed, during which the design hierarchy might be broken. Since the implementation covered the whole Senior system, I will only discuss the results of the DMA module in Chapter 8.

Chapter 8 Conclusion

8.1 Achieved Results

8.1.1 DMA Benchmark

From the RTL simulation, we can see the timing diagram of the DMA controller when it is performing a transaction. The timing diagrams are drawn in Figure 8.1 and Figure 8.2. Note that the extra 4 cycles in Figure 8.2 between 2 links are used to configure the corresponding transfer parameters for the second link.

Figure 8.1. Timing diagram of basic DMA operation.

The DMA controller has also been synthesized in 65nm digital CMOS technology and implemented in a Xilinx Virtex 4 FPGA. Table 8.1 shows the result.
Figure 8.2. Timing diagram of linking table operation.

| Target Technology | ST 65nm CMOS without mem | Infineon 65nm CMOS with mem | Xilinx FPGA Virtex 4 |
|---|---|---|---|
| Working Frequency | 200 MHz | 200 MHz | 88 MHz |
| Estimated Gate Count | 26595 | 18000 | |
| Number of Flip Flops | | | 504 |
| Number of 4 input LUTs | | | 694 |
| Estimated Power | 4.18 mW | 2.48 mW | Not Available |

Table 8.1. Synthesis Result of DMA controller

From Table 8.1, we can see that the estimated gate count for CMOS 65nm technology is relatively high. This is because a dual-port RAM with a depth of 256 words and a 16-bit word width is used as the control FIFO in the DMA controller, and this memory was not optimized in this implementation but synthesized directly. If a memory cell is used in the synthesis, the actual gate count is 18000. The synthesis result for the FPGA implementation is quite comparable to the ASIC implementation without memory.

8.1.2 Comparison

Theoretically, with the help of the DMA controller, the efficiency of memory transfer should be improved, since the DMA controller can read and write the memory in a pipelined fashion as shown in Figure 8.1. It is of course possible for the processor core to read and write memory in a pipelined fashion, but it costs extra register file space and programming tricks. Even then, the transfer is only partially pipelined, because the number of available registers is limited when the desired transfer is large, such as tens of kilobytes.

In order to compare the efficiency of the memory transfer, Table 8.2 compares the clock cycles the Senior core spends when transferring a certain amount of data. Test case 1 includes three different memory transfer tasks from and to different parts of the memory sub-system. Task 1 contains three links with 32, 16 and 64 data words respectively. The transfer is from memory port 3 to port 4. Task 2 is almost the same as task 1, except the destination is memory port 5.
Task 3 transfers 32 data words from memory port 4 to memory port 3. The reader should keep in mind that the benchmark is only a way to estimate the actual performance. Performance benchmarks should always be collected on real-life applications such as FFT or DCT kernels, or even more complicated applications such as a complete JPEG decoder or MP3 decoder.

| Results | without DMA and no optimization | without DMA but with optimization | with DMA |
|---|---|---|---|
| Clock Cycle | 1055 | 543 | 466 |
| Code Size (Bytes) | 212 | 548 | 488 |

Table 8.2. Results Comparison with and without DMA

Note 1: Here, optimization means software optimization such as software pipelining.

8.1.3 Conclusion

The DMA controller can improve the memory transfer efficiency and makes it possible for the processor to do other things while the data transfer is being performed. There is no free lunch: this improvement is paid for with extra hardware cost and extra code size. For some timing critical applications, it is almost impossible for the processor core to do both data calculation and data transfer. Thus, the DMA technique is preferred.

8.2 Future Work

As discussed in Section 8.1.2, the actual improvement from the DMA controller should be measured on more complicated applications such as baseband kernel algorithms or multimedia kernel algorithms. This means the DMA controller together with the Senior processor core should be implemented on either FPGA or ASIC to make a chip, and the whole application should be developed on the platform.

In order to support off-chip memory modules, an external memory interface should also be developed. That would possibly include the commonly used DDR DRAM interface and NAND Flash memory interface.

The behavioral model of the DMA controller is currently statically compiled into the Senior simulator. In order to protect the Intellectual Property (IP) and technical details of the Senior core, it would be better to compile it dynamically.
Bibliography

[1] TMS320C6000 DSP Enhanced Direct Memory Access (EDMA) Controller Reference Guide, March 2005. Literature Number: SPRU234B.

[2] Dave Comisky, Sanjive Agarwala, and Charles Fuoco. A Scalable High-Performance DMA Architecture for DSP Applications. In International Conference on Computer Design, pages 414–419, 2000.

[3] Steve Furber. ARM System-on-Chip Architecture. Addison-Wesley Professional, 2nd edition, August 2000.

[4] David J. Katz and Rick Gentile. Embedded Media Processing. Elsevier, September 2005.

[5] Phil Lapsley, Jeff Bier, Amit Shoham, and Edward A. Lee. DSP Processor Fundamentals: Architectures and Features. Wiley-IEEE Press, February 1997.

[6] Dake Liu. Embedded DSP Processor Design, Volume 2: Application Specific Instruction set Processors (Systems on Silicon). Morgan Kaufmann, June 2008.

[7] Markus Svensson and Thomas Österholm. Optimization and Verification of an Integrated DSP. Master’s thesis, Linköping University, December 2008.

[8] Tongtong Wang. Design of High-performance DMA Controller for Multi-core Platform. Master’s thesis, Linköping University, May 2006.

[9] Lars Wanhammar. DSP Integrated Circuits. Academic Press, 1st edition, May 1999.

[10] Kun-Cheng Wu and Yu-Wen Tsai. Structured ASIC, evolution or revolution? In Proceedings of the 2004 international symposium on Physical design, pages 103–106. ACM, 2004.
59 Appendix A DMA Simulator C++ Header #ifndef DMA_CONTROLLER_HPP #define DMA_CONTROLLER_HPP #include "support.hpp" #include "peripheral.hpp" #include "memory.hpp" #include "data_memory.hpp" #include <map> #include <queue> #include <stdlib.h> #include <stdint.h> #define DMA_LINK_NUM 64 // DMA linking table number #define DMA_TASKQ_SIZE 3 #define DMA_PM1 0 #define DMA_PM2 1 #define DMA_DM0_1 2 #define DMA_DM0_2 3 #define DMA_DM1_1 4 #define DMA_DM1_2 5 struct Links_t{ uint16_t srcAddrL; uint16_t srcAddrH; uint16_t length; }; struct DMATask_t{ uint8_t linkNumber; uint8_t taskPriority; uint8_t taskID; uint8_t srcWidth; uint8_t dstWidth; bool srcProtocol; bool dstProtocol; bool srcEndian; bool dstEndian; uint8_t srcRate; uint8_t dstRate; uint8_t srcPort; uint8_t dstPort; uint16_t dstAddrL; uint16_t dstAddrH; Links_t links[DMA_LINK_NUM]; }; struct DMAStatus_t{ bool busy; 61 62 DMA Simulator C++ Header bool bool bool bool chReady; finish; exception; queueFull; }; struct DMAControl_t{ bool reset; bool shutdown; bool dmaClock; bool start; uint8_t taskID; bool reqChConf; }; class DMAController : public Peripheral { public: cycle_T* cycle; std::map<unsigned int, Peripheral*>* peripherals; //connect to peripheral IO DMAController(void); ~DMAController(void); long ioCommunicate(unsigned int, unsigned long, unsigned long, unsigned int, unsigned long); int GetInterrupt(); int Step(); // Program memory Memory *pm[2]; // Data memory DataMemory *dm[4]; unsigned long clockTag; void start(unsigned long cycle); void configChannel(unsigned long cycle); uint16_t getStatusReg(unsigned long cycle); uint16_t getControlReg(); void setControlReg(uint16_t data); void putTaskQueue(uint16_t data, unsigned long cycle); void shutDown(); void reset(); private: // DMA config DMAStatus_t _status; DMAControl_t _control; DMATask_t _task; uint16_t _statusReg; uint16_t _controlReg; uint16_t _taskQueue[DMA_TASKQ_SIZE][198]; // DMA task queue uint16_t _queuePtr; uint16_t _nextQueuePtr; uint16_t _taskPtr; 
    uint32_t _taskRegAddr;
    std::queue<uint16_t> _taskQ;

    // Task queue function
    void _setTaskReg(uint32_t queID, uint32_t addr, uint16_t data);
    uint16_t _getTaskReg(uint32_t queID, uint32_t addr);

    // DMA data transfer function
    void _trans();
    uint32_t _transCycle();
    void _syncReg();
    void _syncTask();
};

#endif

Appendix B

DMA Simulator C++ Code

#include "dma_controller.hpp"
#include <stdio.h>   // printf, fprintf
#include <stdlib.h>

// Extract a field of 'bits' bits starting at bit 'bitpos' of insn.
static inline int gv(unsigned int insn, int bitpos, int bits)
{
    return ((insn >> bitpos) & ((1<<bits)-1));
}

//----------------------------
// DMA peripheral I/O
//----------------------------
long DMAController::ioCommunicate(unsigned int addr_in, unsigned long data_in,
                                  unsigned long data_in2, unsigned int read_write,
                                  unsigned long cycle)
{
    if (read_write == 1) { // Reading
        switch (gv(addr_in,6,2)) {
        case 0: // Status register
            return getStatusReg(cycle);
        case 1: // Control register
            return getControlReg();
        case 2: // Out port data to DSP core
            fprintf(stderr, "Warning: No data written to DSP core.\n");
            return -1;
        case 3: // In port from DSP core
            fprintf(stderr, "Warning: Can't read In port data.\n");
            return -1;
        default: // Unknown operation
            fprintf(stderr, "Warning: Unknown operation.\n");
            return -1;
        }
    } else if (read_write == 2) { // Writing
        switch (gv(addr_in,6,2)) {
        case 0: // Status register
            fprintf(stderr, "Warning: Trying to write read-only status register.\n");
            return -1;
        case 1: // Control register
            setControlReg((uint16_t)data_in);
            printf("DMA: Cycle(%lu), write DMA_CONTROL, value = 0x%04x.\n",
                   cycle, (uint16_t)data_in);
            if (gv(data_in,0,1)) {
                reset();                // Reset
            } else if (gv(data_in,1,1)) {
                shutDown();             // Shutdown DMA
            } else if (gv(data_in,15,1)) {
                configChannel(cycle);   // Request config DMA task
            } else if (gv(data_in,10,1)) {
                start(cycle);           // Start DMA transaction
            }
            return 0;
        case 2: // Out port data to DSP core
            fprintf(stderr, "Warning: Trying to write OUT port of DMA.\n");
            return -1;
        case 3: // In port data from DSP core
            putTaskQueue((uint16_t)data_in,
                         cycle);
            return 0;
        default: // Unknown operation
            fprintf(stderr, "Warning: Unknown operation.\n");
            return -1;
        }
    }
    fprintf(stderr, "DMA ERROR: Wrong read_write state,read_write=%u.\n", read_write);
    return -1;
}

int DMAController::GetInterrupt()
{
    return 0;
}

int DMAController::Step()
{
    return 0;
}

//----------------------------
// DMA controller behavior
//----------------------------
DMAController::DMAController()
{
    // Give the raw registers defined values before the first _syncReg().
    _statusReg = 0;
    _controlReg = 0;
    _taskRegAddr = 0;
    reset();
}

DMAController::~DMAController()
{
}

void DMAController::start(unsigned long cycle)
{
    printf("DMA: Cycle(%lu), start transfer.\n",cycle);
    _status.busy = 1;
    _status.finish = 0;
    _status.chReady = 0;
    _syncReg();
    // The transfer is modelled as completing one word per cycle.
    clockTag = cycle + _transCycle();
    printf("DMA: update clockTag = %lu.\n",clockTag);
}

uint16_t DMAController::getStatusReg(unsigned long cycle)
{
    // The data is actually moved when the status is first polled
    // after the modelled completion time.
    if (_status.busy && clockTag <= cycle) {
        _trans();
        _status.busy = 0;
        _status.finish = 1;
        _status.chReady = 1;
        printf("DMA: Cycle(%lu), task(%d) finished.\n",cycle,_task.taskID);
    }
    if (_taskQ.size() == DMA_TASKQ_SIZE)
        _status.queueFull = 1;
    else
        _status.queueFull = 0;
    _syncReg();
    return _statusReg;
}

uint16_t DMAController::getControlReg()
{
    _syncReg();
    return _controlReg;
}

void DMAController::setControlReg(uint16_t data)
{
    _controlReg = data;
}

void DMAController::putTaskQueue(uint16_t data, unsigned long cycle)
{
    if ((uint16_t)data == 0xFFFF) {
        // 0xFFFF marks the start of a new task packet.
        _queuePtr = _nextQueuePtr;
        _taskRegAddr = 0;
        _taskQ.push(_queuePtr);
        if (_queuePtr >= DMA_TASKQ_SIZE-1)
            _nextQueuePtr = 0;
        else
            _nextQueuePtr = _queuePtr+1;
        printf("DMA: Cycle(%lu), Senior config task queue[%d].\n",cycle,_queuePtr);
    } else if (_taskRegAddr < 198) { // 198 = words per task queue entry
        _setTaskReg(_queuePtr, _taskRegAddr, data);
        _taskRegAddr++;
    } else {
        fprintf(stderr, "Warning: Task packet exceeds task queue entry size.\n");
    }
}

void DMAController::configChannel(unsigned long cycle)
{
    if (_taskQ.empty()) {
        // front()/pop() on an empty std::queue is undefined behavior.
        fprintf(stderr, "Warning: configChannel() called with an empty task queue.\n");
        return;
    }
    _taskPtr = _taskQ.front();
    _taskQ.pop();
    printf("DMA: Cycle(%lu), configChannel(); _taskPtr = %d, _taskQ.size = %d.\n",
           cycle, _taskPtr, (int)_taskQ.size());
    _syncReg();
    _syncTask();
    printf("DMA task packet: \n");
    printf(" |-Link number %d, Task priority %d, Task ID 0x%04x\n",
           _task.linkNumber,_task.taskPriority,_task.taskID);
    printf(" |------------\n");
    printf(" |-SRC width %d, DST width %d, SRC protocol %d, DST protocol %d\n",
           _task.srcWidth,_task.dstWidth, _task.srcProtocol,_task.dstProtocol);
    printf(" | SRC endian %d, DST endian %d, SRC rate %d, DST rate %d\n",
           _task.srcEndian,_task.dstEndian, _task.srcRate,_task.dstRate);
    printf(" |------------\n");
    printf(" |-SRC port %d, DST port %d\n", _task.srcPort,_task.dstPort);
    printf(" |------------\n");
    printf(" |-DST addr low 0x%04x\n", _task.dstAddrL);
    printf(" |-DST addr high 0x%04x\n", _task.dstAddrH);
    for (int i = 0; i < _task.linkNumber; i++) {
        printf(" |-DMA link %d\n", i);
        printf(" | |-SRC addr low 0x%04x\n", _task.links[i].srcAddrL);
        printf(" | |-SRC addr high 0x%04x\n", _task.links[i].srcAddrH);
        printf(" | |-Link length %d\n", _task.links[i].length);
    }
}

void DMAController::shutDown()
{
    clockTag = 0;
    _queuePtr = 0;
    _nextQueuePtr = 0;
    _taskPtr = 0;
    _status.busy = 1;
    _status.finish = 0;
    _status.chReady = 0;
    _status.exception = 0;
    _status.queueFull = 1;
    setControlReg(0x0000);
    _syncReg();
}

void DMAController::reset()
{
    clockTag = 0;
    _queuePtr = 0;
    _nextQueuePtr = 0;
    _taskPtr = 0;
    _status.busy = 0;
    _status.finish = 1;
    _status.chReady = 1;
    _status.exception = 0;
    _status.queueFull = 0;
}

void DMAController::_trans()
{
    uint16_t tmpData = 0; // stays 0 if the source port is unknown
    uint32_t srcAddr;
    uint32_t dstAddr;

    // The destination is one contiguous block; each link supplies a source block.
    dstAddr = ((uint32_t)_task.dstAddrH << 16) + (uint32_t)_task.dstAddrL;
    for (int link = 0; link < _task.linkNumber; link++) {
        srcAddr = ((uint32_t)_task.links[link].srcAddrH << 16) +
                  (uint32_t)_task.links[link].srcAddrL;
        for (int i = 0; i < _task.links[link].length; i++) {
            // Read data vector from source
            switch(_task.srcPort) {
            case(DMA_PM1):   tmpData = pm[0]->Read((uint16_t)srcAddr); break;
            case(DMA_PM2):   tmpData = pm[1]->Read((uint16_t)srcAddr); break;
            case(DMA_DM0_1): tmpData = dm[0]->dmaRead((uint16_t)srcAddr); break;
            case(DMA_DM0_2): tmpData = dm[1]->dmaRead((uint16_t)srcAddr); break;
            case(DMA_DM1_1): tmpData = dm[2]->dmaRead((uint16_t)srcAddr); break;
            case(DMA_DM1_2): tmpData = dm[3]->dmaRead((uint16_t)srcAddr); break;
            default: break;
            }
            // Write data vector to destination
            switch(_task.dstPort) {
            case(DMA_PM1):   pm[0]->Write(dstAddr,tmpData); break;
            case(DMA_PM2):   pm[1]->Write(dstAddr,tmpData); break;
            case(DMA_DM0_1): dm[0]->dmaWrite(dstAddr,tmpData); break;
            case(DMA_DM0_2): dm[1]->dmaWrite(dstAddr,tmpData); break;
            case(DMA_DM1_1): dm[2]->dmaWrite(dstAddr,tmpData); break;
            case(DMA_DM1_2): dm[3]->dmaWrite(dstAddr,tmpData); break;
            default: break;
            }
            // Update address pointer
            srcAddr++;
            dstAddr++;
        }
    }
}

uint32_t DMAController::_transCycle()
{
    // Total number of words over all links of the current task.
    uint32_t transCycle = 0;
    for (int link = 0; link < _task.linkNumber; link++) {
        transCycle += _task.links[link].length;
    }
    return transCycle;
}

void DMAController::_syncReg()
{
    // Status register
    _statusReg = (_status.queueFull << 4) |
                 (_status.exception << 3) |
                 (_status.finish    << 2) |
                 (_status.chReady   << 1) |
                 (_status.busy);
    // Control register
    _control.reset     = gv(_controlReg,0,1);
    _control.shutdown  = gv(_controlReg,1,1);
    _control.dmaClock  = gv(_controlReg,2,1);
    _control.start     = gv(_controlReg,10,1);
    _control.taskID    = gv(_controlReg,11,4);
    _control.reqChConf = gv(_controlReg,15,1);
}

void DMAController::_syncTask()
{
    // DMA task register
    _task.linkNumber   = gv(_taskQueue[_taskPtr][0],8,8);
    _task.taskPriority = gv(_taskQueue[_taskPtr][0],4,4);
    _task.taskID       = gv(_taskQueue[_taskPtr][0],0,4);
    _task.srcWidth     = gv(_taskQueue[_taskPtr][1],14,2);
    _task.dstWidth     = gv(_taskQueue[_taskPtr][1],12,2);
    _task.srcProtocol  = gv(_taskQueue[_taskPtr][1],11,1);
    _task.dstProtocol  = gv(_taskQueue[_taskPtr][1],10,1);
    _task.srcEndian    = gv(_taskQueue[_taskPtr][1], 9,1);
    _task.dstEndian    = gv(_taskQueue[_taskPtr][1], 8,1);
    _task.srcRate      = gv(_taskQueue[_taskPtr][1], 4,4);
    _task.dstRate      = gv(_taskQueue[_taskPtr][1], 0,4);
    _task.srcPort      = gv(_taskQueue[_taskPtr][2],5,5);
    _task.dstPort      = gv(_taskQueue[_taskPtr][2],0,5);
    _task.dstAddrL     =
        _taskQueue[_taskPtr][3];
    _task.dstAddrH = _taskQueue[_taskPtr][4];
    // Each link occupies three consecutive words after the five header words.
    for (int i=0; i < DMA_LINK_NUM; i++) {
        _task.links[i].srcAddrL = _taskQueue[_taskPtr][5+i*3];
        _task.links[i].srcAddrH = _taskQueue[_taskPtr][5+i*3+1];
        _task.links[i].length   = _taskQueue[_taskPtr][5+i*3+2];
    }
}

uint16_t DMAController::_getTaskReg(uint32_t queID, uint32_t addr)
{
    return _taskQueue[queID][addr];
}

void DMAController::_setTaskReg(uint32_t queID, uint32_t addr, uint16_t data)
{
    _taskQueue[queID][addr] = data;
}

Upphovsrätt (Copyright)

This document is kept available on the Internet, or its future replacement, for 25 years from the date of publication, provided that no extraordinary circumstances arise. Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. A later transfer of the copyright cannot revoke this permission. All other use of the document requires the author's consent. To guarantee authenticity, security, and accessibility, there are solutions of a technical and administrative nature. The author's moral rights include the right to be named as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or distinctive character. For further information about Linköping University Electronic Press, see the publisher's website http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.
The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Guoyou Jiang