A Data Servicing Subsystem for the Chidi Reconfigurable Processor

by Mark C. Lee

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

August 6, 1998

Copyright 1998 Massachusetts Institute of Technology. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 6, 1998
Certified by: V. Michael Bove, Jr., Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

Abstract

Application Specific Integrated Circuits (ASICs) are often used to enhance system performance, especially when a General Purpose Processor (GPP) is too inefficient or ill suited to perform a specialized task. However, the time and hardware costs inherent in the development and implementation of such a solution can be quite expensive. The use of Field Programmable Gate Arrays (FPGAs) to implement a Reconfigurable Processor (RP) can help alleviate some of the overhead encountered with ASIC development. The RP is a dynamic processing node that can be configured in-circuit to compute any realizable function at run-time. After the function has finished execution, the RP can be reconfigured to compute a different function. This concept is illustrated with the reconfigurable, multimedia Chidi Processing System. A network of Chidi boards, each with a closely coupled GPP and RP, is used to execute a sequence of multimedia related functions.
One of the main issues in utilizing an RP efficiently is the ability to provide it with data effectively. The design and implementation of a data servicing subsystem for the Chidi Reconfigurable Processor, in an effort to increase system performance, is the main focus of study. This research is supported by the Digital Life Consortium at the MIT Media Laboratory.

Thesis Supervisor: V. Michael Bove, Jr.
Title: Principal Research Scientist, MIT Media Laboratory

Acknowledgements

The work documented in this thesis could not have been done without the guidance, assistance, and support of many people. I would like to thank Dr. V. M. Bove for giving me the opportunity to work on the Chidi project. His faith, supervision, and understanding have made this an extremely rewarding experience.

John Watlington's patience in answering all of my questions and his teaching me the ropes of hardware design, implementation, and debugging have been invaluable. Much of what I have learned about hardware can be directly attributed to Wad's expert tutelage. Wad's influence can be seen in all aspects of the Chidi project, and the work documented in this thesis is no exception.

A hearty thank you goes out to Yuan-Min Liu for being a fellow Chidi hardware engineer. Always available to discuss any hardware, software, or basketball related issue, Min helped keep morale high and development moving even in the darkest moments, and contributed greatly to my sanity and progress.

Chris McEniry was directly involved in much of the design for FPGA interaction in the RP subsystem. As chief engineer for the SAG, Chris deserves credit for a lot of what is documented here.

Dr. Thomas Nwodoh was chief board engineer for the Chidi project and deserves credit for much of the system-level work, in addition to the RP clock implementation documented in section 6.2.3.

Ken Kung assisted in all areas, from FPGA device details to software tools.
His background and experience always provided a unique perspective to the problem at hand. His patience in answering my questions is greatly appreciated.

Thank you's are in order for Josh Stults, Peter Yang, Chris Yang, and Peggy Chen for their continuing development of Chidi hardware. Specifically, Josh and Peter deserve credit for their extensive contributions in the areas of RP Configuration and RP subsystem FIFO implementation and debugging.

My parents, Paul and Lily, deserve more gratitude, appreciation, and love than I could ever give for their encouragement, understanding, and unconditional support. None of my accomplishments could have been possible without them and I can only hope that my completion of this thesis and subsequent graduation gives them some of the happiness that they deserve.

Finally, I would like to thank Ms. Jenny Huang for giving me a reason to work as hard as I possibly could these last five years.

I could not have completed the work documented here without these people, and countless others that I may have forgotten to mention here. For this, I owe you a debt of gratitude. Thank you and good night.

Table of Contents

1 Purpose and Scope
2 Reconfigurable Computing Overview
  2.1 Background
  2.2 Applications and Research Areas
  2.3 Cheops Overview
3 The Chidi Multimedia Processing System
  3.1 Overview
  3.2 Architecture and Bus Specification
  3.3 Functional Blocks
    3.3.1 PowerPC604e Microprocessor
    3.3.2 MPC106 PCI Bridge/Memory Controller
    3.3.3 Reconfigurable Processor (RP)
    3.3.4 Stream Address Generator (SAG)
    3.3.5 Data Shuffler (DS)
    3.3.6 External Interfaces
    3.3.7 FIFOs
4 FPGA Device Description and Design Process
  4.1 Generic FPGA Overview
  4.2 Altera FLEX 10K Device Family
    4.2.1 Overview
    4.2.2 Logic Element (LE)
    4.2.3 Logic Array Block (LAB)
    4.2.4 Embedded Array Block
    4.2.5 FastTrack Interconnect
  4.3 Design Process
    4.3.1 Functional Design/Design Entry
    4.3.2 Compilation and Functional Simulation
    4.3.3 Logic Synthesis
    4.3.4 Place and Route
    4.3.5 Timing Analysis
    4.3.6 Post-Synthesis Simulation
    4.3.7 Device Configuration
5 RP Subsystem
  5.1 Overview
  5.2 Physical and Virtual Channels
    5.2.1 Physical Channels
    5.2.2 Virtual Channels
  5.3 Data Processing
    5.3.1 Data Request Mechanism
    5.3.2 Interrupt Mechanism
  5.4 Register Interface
6 Reconfigurable Processor
  6.1 Overview
  6.2 Functional Behavior
    6.2.1 Functional Blocks
      6.2.1.1 RP
      6.2.1.2 SRAM
      6.2.1.3 High-Speed I/O Port
    6.2.2 Interfaces
      6.2.2.1 RP/SAG Interface
      6.2.2.2 RP/External FIFO Interface
        6.2.2.2.1 Read0 Physical Channel Interface
        6.2.2.2.2 Read1 Physical Channel Interface
        6.2.2.2.3 Write0 Physical Channel Interface
    6.2.3 RP Clock Circuitry
    6.2.4 Device Configuration
  6.3 Implementation
    6.3.1 High Speed I/O Port
    6.3.2 RP
    6.3.3 RP Configuration
      6.3.3.1 Design Details
      6.3.3.2 Configuration Time
      6.3.3.3 Performance
7 Data Shuffler
  7.1 Overview
  7.2 Functional Description
    7.2.1 Operations Supported
    7.2.2 Interfaces
      7.2.2.1 DS/SAG Interface
        7.2.2.1.1 Control Registers
        7.2.2.1.2 Status Registers
      7.2.2.2 DS/PPC Bus Interface
      7.2.2.3 DS/External FIFO Interface
        7.2.2.3.1 Read0 External FIFO Interface
        7.2.2.3.2 Read1 External FIFO Interface
        7.2.2.3.3 Write0 External FIFO Interface
    7.2.3 Input Channels - Read0 and Read1
      7.2.3.1 Data Path
      7.2.3.2 Control Logic
        7.2.3.2.1 Register Control Logic
        7.2.3.2.2 Multiplexer Control Logic
    7.2.4 Output Channel - Write0
  7.3 Implementation
    7.3.1 Functionality and Performance
    7.3.2 Implementation Details
      7.3.2.1 DS/SAG Interface
      7.3.2.2 Read Address Generator FSMs
    7.3.3 Optimization
      7.3.3.1 16-to-1 Multiplexer
      7.3.3.2 Register and Logic Replication
      7.3.3.3 EAB Pipelining
    7.3.4 Device Configuration
8 Future Work
  8.1 DS Development
  8.2 RP Development
  8.3 Application Development
9 Works Cited
10 Appendix
  10.1 Acronyms
  10.2 Data Shuffler Patterns
  10.3 RP Subsystem Parts List

Tables and Figures

Table 1: FLEX 10K50/10K100 Device Features [10]
Table 2: External Register Interface Control Signal Truth Table
Table 3: SRAM Control Signal Descriptions
Table 4: SRAM Control Signal Truth Table
Table 5: High-Speed I/O Port Signal Descriptions
Table 6: SAG/RP Interface - Signal Descriptions
Table 7: RP Read0 Physical Channel FIFO Signal Descriptions
Table 8: RP Read0 Physical Channel FIFO Function Table
Table 9: RP Read1 Physical Channel FIFO Signal Descriptions
Table 10: RP Write0 Physical Channel FIFO Signal Descriptions
Table 11: RP CLK Frequencies
Table 12: Configuration EPROM Scheme Timing Parameters [6]
Table 13: Operations Supported
Table 14: SAG/DS Interface - Control Registers
Table 15: SAG/DS Interface - Request Mechanism
Table 16: DS/PPC Bus Interface Signal Descriptions
Table 17: DS Read0 Physical Channel Signal Descriptions
Table 18: DS Read0 Physical Channel Function Table
Table 19: DS Read1 Physical Channel Signal Descriptions
Table 20: DS Write0 Physical Channel Signal Descriptions
Table 21: Register Control Logic Signal Descriptions
Table 22: Multiplexer Control Logic Signal Descriptions
Table 23: Write0 Physical Channel Signal Descriptions
Table 24: Passive Serial Configuration Scheme Timing Parameters [6]
Table 25: Read0 LUT for Pattern1 (Straight Through), Offsets 0-7
Table 26: Read0 LUT for Pattern2 (Decimate Bytes by 2), Offsets 0-7
Table 27: Read1 LUT for Pattern2 (Decimate Bytes by 2), Offsets 0-7
Table 28: Read0 LUT for Pattern3 (Decimate Bytes by 3/Extract One Channel), Offsets 0-7
Table 29: Read1 LUT for Pattern3 (Decimate Bytes by 3/Extract One Channel), Offsets 0-7
Table 30: Read0 LUT for Pattern4 (Decimate Bytes by 4), Offsets 0-7
Table 31: Read1 LUT for Pattern4 (Decimate Bytes by 4), Offsets 0-7
Table 32: Read0 LUT for Pattern5 (Decimate Bytes by 6/Channels by 2), Offsets 0-3
Table 33: Read0 LUT for Pattern5 (Decimate Bytes by 6/Channels by 2), Offsets 4-7
Table 34: Read1 LUT for Pattern5 (Decimate Bytes by 6/Channels by 2), Offsets 0-3
Table 35: Read1 LUT for Pattern5 (Decimate Bytes by 6/Channels by 2), Offsets 4-7
Table 36: Read0 LUT for Pattern6 (Decimate Shorts by 2), Offsets 0-7
Table 37: Read1 LUT for Pattern6 (Decimate Shorts by 2), Offsets 0-7
Table 38: Read0 LUT for Pattern7 (Decimate Shorts by 4), Offsets 0-7
Table 39: Read1 LUT for Pattern7 (Decimate Shorts by 4), Offsets 0-7
Table 40: Read0 LUT for Pattern8 (Extract Two Channels), Offsets 0-7
Table 41: Read1 LUT for Pattern8 (Extract Two Channels), Offsets 0-7
Table 42: Read0 LUT for Pattern9 (Select Every Pixel), Offsets 0-3
Table 43: Read0 LUT for Pattern9 (Select Every Pixel), Offsets 4-7
Table 44: Read1 LUT for Pattern9 (Select Every Pixel), Offsets 0-3
Table 45: Read1 LUT for Pattern9 (Select Every Pixel), Offsets 4-7
Table 46: Read0 LUT for Pattern10 (Decimate Pixels by 2), Offsets 0-7
Table 47: Read1 LUT for Pattern10 (Decimate Pixels by 2), Offsets 0-7
Table 48: Read0 LUT for Pattern11 (Decimate Pixels by 4), Offsets 0-3
Table 49: Read0 LUT for Pattern11 (Decimate Pixels by 4), Offsets 4-7
Table 50: Read1 LUT for Pattern11 (Decimate Pixels by 4), Offsets 0-3
Table 51: Read1 LUT for Pattern11 (Decimate Pixels by 4), Offsets 4-7
Table 52: Read0 LUT for Pattern12 (Decimate Words by 2), Offsets 0-7
Table 53: Read1 LUT for Pattern12 (Decimate Words by 2), Offsets 0-7
Table 54: Read0 LUT for Pattern13 (Decimate Double Words by 2), Offsets 0-7
Table 55: Read1 LUT for Pattern13 (Decimate Double Words by 2), Offsets 0-7
Table 56: RP Subsystem Parts List

Figure 1: Cheops Block Diagram [1]
Figure 2: Chidi Block Diagram [3]
Figure 3: Altera FLEX 10K Family Architecture [9]
Figure 4: Altera FLEX 10K Family Logic Element [10]
Figure 5: Altera FLEX 10K Family Logic Array Block [10]
Figure 6: Altera FLEX 10K Family Embedded Array Block [10]
Figure 7: LAB Connections to FastTrack Interconnect [7]
Figure 8: FPGA Design Flow
Figure 9: RP Subsystem
Figure 10: RP Subsystem Data Flow
Figure 11: RP Block Diagram
Figure 12: High-Speed I/O Port
Figure 13: RP Configuration Timing Diagram [6]
Figure 14: RP Configuration Block Diagram
Figure 15: Data Shuffler Block Diagram
Figure 16: Read0/Read1 Datapath
Figure 17: Read0/Read1 Multiplexer Configuration and 16-to-1 Multiplexer
Figure 18: Read Control Logic
Figure 19: Read Address Generator FSMs
Figure 20: Optimized 16-to-1 Multiplexer
Figure 21: Configuration EPROM Scheme Circuit Diagram [6]
Figure 22: Configuration EPROM Scheme Timing Waveform [6]

1 Purpose and Scope

This Master's Thesis concentrates on the issues related to designing, implementing, and debugging of a data-servicing subsystem, or Data Shuffler (DS), for a Reconfigurable Processor (RP) in the Chidi multimedia system. The main goal is to design an RP subsystem that has the proper architecture so that it can operate efficiently and effectively in conjunction with a microprocessor, or General Purpose Processor (GPP). Many systems that use RPs as coprocessors do not utilize both the RP and GPP at the same time, giving up the available parallelism that could be used to improve system performance. This parallelism is something that will be taken advantage of in Chidi.

The work for this thesis can be divided into four sections. First, the Data Shuffler (DS) for the RP and the RP itself must be specified at a behavioral level. This step includes setting performance and design goals for the two blocks. Second, the interfaces for the DS must be defined down to the signal-level in order for it to interact with the RP and the system in an efficient manner. Third, since the DS will be implemented in an FPGA, the design must be specified down to the gate level and meet the functional and performance goals that were set at the beginning of the design process.
The design is to be verified first in software through the use of simulations. Finally, the design is to be downloaded onto a physical board and verified to function properly in a laboratory setting.

Work must also be done at the system level. This includes preparing schematics and layout notes, generating a netlist, and specifying the physical components that will comprise the Chidi system. This information is given to a contractor that is responsible for layout, fabrication, and assembly of the Chidi boards.

In order to understand the motivation behind designing such a system, an introduction to Reconfigurable Computing (RC) needs to be provided. An overview of RC and some of the research that has already been done in this area is provided in Section 2. Special consideration is given in that section to the Cheops Imaging System, the predecessor to Chidi. Section 3 provides a system-level perspective on the Chidi Multimedia Processing System. Functional and performance design goals are presented in that section, along with a brief overview of each functional block found in Chidi. Section 4 provides background information on both FPGAs and the FPGA design process. This information is necessary in order to put the discussion of implementation issues for the Data Shuffler in the proper context. Section 5 gives an overview of the RP subsystem and the characteristics that allow it to implement an RC element effectively. Section 6 discusses in more detail the design and implementation of the RP. Section 7 provides the design and implementation details for the Data Shuffler, which is the main focus of this thesis. Finally, Section 8 discusses some potential areas of future research that build on the ideas and work outlined in this thesis.
2 Reconfigurable Computing Overview

2.1 Background

Application Specific Integrated Circuits (ASICs) are commonly used to enhance system performance when a microprocessor, or General Purpose Processor (GPP), is too inefficient or ill suited to execute a time-constrained task. ASICs can be implemented as a co-processor or as an independent processing node. ASICs have been extremely successful in application areas such as digital signal processing (DSP) and computer graphics, among others. However, the costs associated with the development and implementation of an ASIC solution are usually quite high, in both time and money.

One solution to the problem of the high cost associated with ASIC development is the use of Field Programmable Gate Arrays (FPGAs). The FPGA architecture of programmable logic elements and a programmable switching network, together with in-circuit programmability, allows changes to be made quickly and with little impact on the design process. Traditionally, these attributes have made FPGAs ideal for prototyping an ASIC design. The design would first be implemented on an FPGA and, once it became stable and mature enough, would then be ported onto an ASIC. However, until recently, FPGAs have not had high enough densities, fast enough speeds, and short enough configuration times to move into other application areas.

Currently, the FPGA market is one of the fastest growing segments of the semiconductor industry. Vendors such as Altera, Xilinx, and Lucent, among others, are designing and manufacturing many variations of these devices, all targeting different application areas and different markets. The increase in the level of competition for this market has resulted in significant improvements in density, speed, and configuration times. The devices have improved in these three important areas to the point that using FPGAs as processing elements is now a reality, giving birth to the relatively new field of Reconfigurable Computing (RC).
Research into using FPGAs in processing and co-processing systems has increased dramatically in both academia and industry in an effort to determine how effective these devices are in supplementing and/or replacing microprocessors and ASICs.

2.2 Applications and Research Areas

The applications for which FPGAs are being used vary widely. The Transmogrifier-2 (TM-2) [13], being developed at the University of Toronto, is a powerful prototyping system. Although prototyping is a more traditional application for FPGA systems, a full TM-2 system contains over one million usable gates, which is far from traditional. A full TM-2 system contains 16 boards, each containing two Altera 10K50 FPGAs, each of which has approximately 35K user gates (this paper does not account for the embedded RAM found on Altera's 10K family of devices). This number of available gates dwarfs that found in previous prototyping systems and allows for a much more rapid and powerful prototyping process.

But there is still the question of whether FPGA-based systems are suitable for applications outside of prototyping. Researchers at the Queen's University of Belfast [20] showed that the FPGA architecture is quite capable of implementing DSP applications, in their case a 2D DCT. Their implementation used a single Xilinx XC6264 device and operated at 25 frames per second at VGA resolution. Singh and Bellec [19], at the University of Glasgow, showed that FPGAs are quite good at implementing both simple and complex computer graphics algorithms. They found that FPGA systems performed worse than specialized graphics chips, but better than general-purpose processors with specialized graphics instruction sets. They concluded that the advantage of using an FPGA-based system is the ability to execute many different algorithms on the same piece of hardware, a gain that outweighs the speed advantage of specialized graphics hardware.
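As a rough check on the figures quoted above, the following Python sketch reproduces the arithmetic behind the TM-2 gate count and estimates the pixel rate implied by the 2D DCT result. The 640 x 480 frame size is an assumption (the text says only "VGA resolution"), and all names in the sketch are illustrative rather than taken from the cited work.

```python
# Back-of-the-envelope checks for the figures quoted in Section 2.2.
# All names are illustrative; only the input numbers come from the text.

# Transmogrifier-2: 16 boards, two Altera 10K50 FPGAs per board,
# roughly 35K user gates per device (embedded RAM not counted).
TM2_BOARDS = 16
FPGAS_PER_BOARD = 2
GATES_PER_FPGA = 35_000


def tm2_total_gates() -> int:
    """Approximate number of user gates in a full TM-2 system."""
    return TM2_BOARDS * FPGAS_PER_BOARD * GATES_PER_FPGA


# 2D DCT on a single XC6264: 25 frames per second at VGA resolution.
# ASSUMPTION: "VGA" is taken to mean 640 x 480 pixels per frame.
VGA_WIDTH = 640
VGA_HEIGHT = 480
FRAMES_PER_SECOND = 25


def dct_pixel_rate() -> int:
    """Pixels per second the DCT implementation must sustain."""
    return VGA_WIDTH * VGA_HEIGHT * FRAMES_PER_SECOND


if __name__ == "__main__":
    print(f"TM-2 user gates: {tm2_total_gates():,}")
    print(f"DCT pixel rate:  {dct_pixel_rate():,} pixels/s")
```

Under these assumptions the gate total comes to 1,120,000, consistent with the "over one million" figure, and the DCT must sustain on the order of 7.7 million pixels per second.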
2.3 Cheops Overview

The Cheops Imaging System, the predecessor to Chidi, developed at the MIT Media Lab by the Information and Entertainment Group, also investigates the use of reconfigurable processors [2]. Cheops uses one general-purpose processor, an Intel i960, and many specialized stream processors to implement a modular platform for acquisition, processing, and display of digital video sequences and model-based representations of moving scenes. Some of the functions that these processors implement include transposition, filtering, DCT, motion estimation, color space conversion, remapping, superposition, and sequencing operations.

In Cheops, the data to be processed is stored in one of the eight blocks of VRAM. The data is then streamed out of the VRAM into one or more specialized stream processors that perform some manipulation of or computation on the data stream. The data is then streamed into the appropriate destination in VRAM. Below is a simplified version of the system-level block diagram.

Figure 1: Cheops Block Diagram [1]

As the system matured and the demand for time-constrained applications increased, more and more specialized stream processors were needed. Each time a function that required a hardware implementation was found, a new processor board had to be designed and manufactured. In order to reduce the need for so many different specialized processors, a reconfigurable processor, called the State Machine, was designed and built. The State Machine's goal was to realize many functions on one piece of hardware, thereby implementing functions for which a specialized processor is not available and removing the need for so many different specialized processors.

In order to realize its main goals of being specialized and flexible, the State Machine was designed using FPGAs as well as a microprocessor [1] [23].
It utilizes two 40,000 gate SRAM-based FPGAs, a pair of AT&T 2c40 ORCA devices from Lucent Technologies, as well as a microprocessor, a PowerPC603 from IBM Microelectronics. Each of the three processing elements has 1 Mbyte of SRAM that is closely coupled to it, but this memory block can also be accessed by the other two processing elements.

Cheops is very much a data-flow processing system and the State Machine conforms to this model of processing. Data is streamed into the State Machine through the FPGAs into one of the SRAM blocks. Next, the data is processed by one or more of the processing elements on board. The State Machine then signals to the main Cheops processor that it has completed the operation and the data is streamed out into the appropriate VRAM location or to another stream processor. The State Machine was designed to be a very versatile processing node with the ability to handle functions that were suitable for both a microprocessor and specialized hardware.

Although it met most of its design goals, the State Machine was plagued by two major problems. The first problem was that the different processing elements were not being used efficiently. For example, consider when the contents of one SRAM needed to be accessed by multiple processing elements. On the State Machine, interrupting the FPGAs during the processing of stream data is expensive. During these times, the microprocessor is essentially idle if it needs to access data in the same physical block of SRAM. Therefore, maximizing processor utilization for all three processing units on the State Machine is difficult, which is actually a common trait of many FPGA-based co-processing systems. The second problem with Cheops and the State Machine is that they support only two basic data types: 16-bit and 24-bit data.
Although Cheops is able to handle other data types (32-bit integers and floating point numbers) when necessary, the microprocessor is required; there is no hardware support for such data types. Therefore, dealing with data that is neither 16-bit nor 24-bit is expensive in Cheops and on the State Machine in terms of processing time and effort. One of the main goals of the data processing subsystem on Chidi is to eliminate this constraint and support a much wider range of data types.

3 The Chidi Multimedia Processing System

3.1 Overview

Chidi is a reconfigurable, multimedia processor being developed by the Information and Entertainment Group at the MIT Media Laboratory under the supervision of Dr. V. Michael Bove, Jr. The main motivations behind designing the Chidi system are to:

1) Investigate the incorporation of specialized/reconfigurable processing elements into otherwise general purpose computing systems.
2) Design a system that is able to process large data sets, like those found in audio, video, and holographic applications, in real time.
3) Design a system that scales easily using existing network infrastructure [3].

Chidi couples a microprocessor, or General Purpose Processor (GPP), with a Reconfigurable Processor (RP) subsystem, both sharing a single bus and a single block of DRAM memory. Also on board are units that assist in this coupling and in the transfer of data to and from the RP: the Stream Address Generator (SAG) and the Data Shuffler (DS). The board communicates through any one of three possible interfaces: the PCI interface, the FireWire or IEEE 1394 interface, or the High-Speed I/O port. Below is a simplified block diagram for a single Chidi board.

Figure 2: Chidi Block Diagram [3]

Being PCI-compliant, Chidi will plug into any UNIX workstation, Macintosh, or Linux PC. A simple Chidi system will only include one host with one Chidi board.
Usually, Chidi will obtain data through the PCI interface, although for certain applications, the FireWire or High-Speed I/O port will be used. Chidi scales by adding multiple Chidi boards and multiple networked host systems.

Chidi is designed to be a high-bandwidth processing node. The main Chidi data bus is 64 bits wide and runs at 66 MHz. The GPP operates at speeds equal to or greater than 266 MHz. The RP processor speed is design dependent, but most designs will run between 32 MHz and 66 MHz. The PCI and FireWire interfaces are both 32 bits wide and run at 32 MHz, while the High-Speed I/O Port is 32 bits wide and runs at up to 65 MHz.

3.2 Architecture and Bus Specification

Architecturally, Chidi adheres to the common hardware reference platform (CHRP), jointly specified by Apple, IBM, and Motorola. This reference platform serves as the foundation for all PowerPC-based system design. A subset of this specification is the PowerPC bus interface for 32-bit microprocessors. Four of the features of this bus specification contribute to Chidi's ability to process data efficiently at extremely high rates.

First, the PowerPC bus interface supports decoupled address and data busses. In systems that have coupled address and data busses, arbitration for the busses occurs once. As a result, the address transaction does not complete until the data transaction corresponding to that address has completed. By decoupling the two busses, the address and data busses are arbitrated for independently. Therefore, arbitration for a second address transaction can begin while the data corresponding to the first address transaction is processed, making the bus more efficient for bus transactions with multiple masters.

Second, the bus interface supports pipelined bus transactions.
Therefore, instead of just arbitrating for the second address transaction while the data corresponding to the first address is being processed, a bus grant can actually be issued and a second address transaction can begin. The bus interface supports pipelining of two addresses before the completion of one data transaction. Third, the PowerPC bus interface supports multiprocessor configurations. This allows up to four PowerPC microprocessors or PowerPC microprocessor emulators, to share the same address and data busses. Since the RP is purely a slave device, not having multiprocessor support would imply that the microprocessor would have to manage all memory accesses for the RP, an extremely inefficient use of the GPP. Chidi takes advantage of this feature of the bus interface by implementing a PowerPC bus interface on the SAG, allowing it to service all memory accesses for the RP. This allows the microprocessor to be used much more efficiently even when the RP is processing large data sets. Finally, the bus interface supports both single word and four-beat burst transactions. In a normal single word transaction, one address returns a 64-bit data word as the result from memory. In four-beat burst transactions, one address returns not only the 64-bit word at that address, but also the next three 64-bit words in sequential order in memory. Burst transactions reduce the overhead needed for sequential memory address access, the type of access used for processing large data streams, thereby increasing the bandwidth of the main data bus. In addition, this allows for large data transactions to occur (such as frames of video data) without giving sole possession of the bus to any one master. In the Chidi system, this allows one processing element to still execute useful tasks (which most likely involve accessing main memory) while another performs data-intensive computations. 
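
The effect of four-beat bursts on address-phase overhead can be illustrated with a short sketch. This is a simplified model rather than a description of the PowerPC bus protocol: it assumes that a burst returns four 64-bit beats at ascending addresses, as described above, and the function names are hypothetical.

```python
BEAT_BYTES = 8        # one 64-bit data word per beat
BEATS_PER_BURST = 4   # four-beat burst, as described in the text


def burst_beat_addresses(start_address):
    """Return the byte addresses of the four beats of one burst.

    Assumption: beats are returned in ascending sequential order starting
    at a 64-bit aligned address.
    """
    if start_address % BEAT_BYTES != 0:
        raise ValueError("start address must be 64-bit aligned")
    return [start_address + i * BEAT_BYTES for i in range(BEATS_PER_BURST)]


def address_transactions_needed(num_bytes, use_burst):
    """Count the address transactions needed to move num_bytes of data."""
    beats = BEATS_PER_BURST if use_burst else 1
    bytes_per_transaction = BEAT_BYTES * beats
    # Ceiling division: a partial final transfer still needs an address.
    return -(-num_bytes // bytes_per_transaction)


if __name__ == "__main__":
    print([hex(a) for a in burst_beat_addresses(0x1000)])
    print(address_transactions_needed(1024, use_burst=False))
    print(address_transactions_needed(1024, use_burst=True))
```

Under these assumptions, moving the same block of sequential data with bursts requires one quarter as many address transactions as single word transfers, which is the overhead reduction the text refers to.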

3.3 Functional Blocks

3.3.1 PowerPC 604e Microprocessor

Each Chidi board contains one GPP that can perform general data processing operations as well as execute other tasks such as running the Chidi operating system, configuring the RP, and managing the 1394 Interface. The GPP is implemented with a PowerPC 604e microprocessor from Motorola in a 255-pin Ball Grid Array (BGA) package. The 604e is an implementation of the PowerPC family of reduced instruction set computing (RISC) microprocessors [17]. The 604e microprocessor implements the 32-bit addressed version of the PowerPC architecture. This version supports 32-bit effective, or logical, addresses, integer data types of 8, 16, and 32 bits, and floating-point data types of 32 and 64 bits, providing single and double precision.

The PowerPC 604e is also a superscalar processor. It can issue four instructions simultaneously and as many as seven instructions can be executed in parallel. A 64-bit data bus and a 32-bit address bus provide the external interface. In addition, the 604e supports single-beat as well as burst data transfers for main memory and memory-mapped I/O accesses. Chidi systems are available with microprocessors that execute at 266 or 300 MHz.

Aside from the features that the PowerPC 604e offers, one of the main reasons that this processor was chosen is that it conforms to the same programming model and bus interface as the PowerPC603, the microprocessor used in the State Machine. Familiarity with these two aspects of the microprocessor helps minimize mistakes during development as well as reduce the design time.

3.3.2 MPC106 PCI Bridge/Memory Controller

The MPC106 provides a PowerPC common hardware reference platform (CHRP) compliant bridge between the Chidi local/internal bus and the Peripheral Component Interconnect (PCI) [15].
For the Chidi local bus, this means that the MPC106 supports multiple (up to four) 604 processors, a 32-bit address bus and a 64-bit data bus, full memory coherency, 604 local bus slave support, and decoupled address and data busses for pipelining of 604 accesses. For the PCI bus, the MPC106 supports all accesses to the PCI address space, big or little-endian operation, and bus speeds up to 33 MHz.

In addition to being a Chidi local to PCI bus interface, the MPC106 also serves as the memory controller to Chidi main memory. The MPC106 supports 1 Gbyte of RAM, 16 Mbytes of ROM, and fast page mode or extended data out (EDO) DRAMs. By utilizing the MPC106 to implement the memory controller as well as the PCI bus interface, more of the design effort can be concentrated on developing the RP and investigating the other aspects of the Chidi system.

3.3.3 Reconfigurable Processor (RP)

The main goal of incorporating a Reconfigurable Processor in the Chidi Multimedia System is to provide a processing element that can perform computations in hardware and still possess the flexibility of being able to perform more than one function. RP configuration is initiated by the microprocessor to allow for dynamic reprogramming. In addition, each time the RP is configured for a different application, a different clock frequency can be used based on the speed at which that particular design runs. This allows the RP to run at frequencies at or below 66 MHz independently of the rest of the system. In addition, the RP has 2 Mbytes of SRAM for local storage.

The RP is implemented with a FLEX 10K100 FPGA, a 503-pin PGA packaged device from Altera Corporation. The FLEX 10K100 is a 100,000 gate FPGA, which consists of 4992 programmable Logic Elements (LEs) and 12 Embedded Array Blocks (EABs). Unlike the PowerPC 604e, the RP is purely a slave device. A discussion of general FPGA architecture and of the Altera FLEX 10K family specifically is presented in Section 4.1.
Design goals and details for the RP are discussed in Section 6.

3.3.4 Stream Address Generator (SAG)

As mentioned above, the RP is a slave device, meaning that it cannot make accesses to main Chidi memory to load or store data itself. The RP only processes data when data is made available to it. Therefore, the Stream Address Generator handles memory accesses for the Reconfigurable Processor. After the RP is configured by the microprocessor, the RP notifies the SAG that it is ready to begin processing data for a particular stream. The SAG then generates the address for that particular stream and writes the appropriate information to the Data Shuffler via the SAG/DS interface. In addition to handling the addressing needs of the RP, the SAG also processes all register interface accesses for itself, the DS, the RP, the 1394 Interface, and the two four-character alphanumeric displays. All 1394 control mechanisms also reside within the SAG.

The SAG is implemented with a FLEX 10K50 FPGA from Altera. The purpose of using an FPGA for this functional block is different from that of the RP. In the SAG case, an FPGA is used for the traditional application of prototyping, not that of reconfigurable computing. The FLEX 10K50 is a 50,000 gate FPGA, which consists of 2880 LEs and 10 EABs and comes in a 356-pin BGA package.

3.3.5 Data Shuffler (DS)

The Data Shuffler is similar to the SAG in that it services the RP's memory accesses. However, the Data Shuffler handles the data phase instead of the address phase of the transaction for both RP stream and register accesses. The main purpose of the DS is to manipulate the data as it comes out of memory and present it in a format that the RP can process. The DS is designed to handle byte, short (16-bit), packed RGB (24-bit), word (32-bit), and double word (64-bit) data. In addition, the DS can handle data offsets from 64-bit boundaries.
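
The kind of manipulation described above can be pictured with a small software model. This is only an illustrative sketch, not the DS design: it assumes big-endian byte order within each 64-bit bus word and sequentially packed elements, and the function name `extract_elements` and its parameters are hypothetical.

```python
def extract_elements(bus_words, element_bytes, byte_offset=0):
    """Unpack fixed-size elements from a sequence of 64-bit bus words.

    bus_words     -- list of integers, each one 64-bit word from memory
    element_bytes -- 1 (byte), 2 (short), 3 (packed RGB), 4 (word), 8 (double word)
    byte_offset   -- number of bytes the data is offset from the 64-bit boundary

    Assumption: bytes are taken in big-endian order and any trailing bytes
    that do not form a whole element are dropped.
    """
    if not 0 <= byte_offset < 8:
        raise ValueError("offset must lie within one 64-bit word")
    # Flatten the bus words into a byte string, then skip the offset.
    raw = b"".join(word.to_bytes(8, "big") for word in bus_words)[byte_offset:]
    usable = len(raw) - len(raw) % element_bytes
    return [
        int.from_bytes(raw[i:i + element_bytes], "big")
        for i in range(0, usable, element_bytes)
    ]


if __name__ == "__main__":
    words = [0x0011223344556677, 0x8899AABBCCDDEEFF]
    print([hex(v) for v in extract_elements(words, 2)])
    print([hex(v) for v in extract_elements(words, 3, byte_offset=1)])
```

The second call shows why offset handling matters for packed RGB data: 24-bit pixels do not divide evenly into 64-bit words, so a pixel can straddle two bus words.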
As mentioned above, when the SAG generates the address for the current transaction it writes certain values to the SAG/DS interface registers. These registers include information about the type of transfer (single or burst), which channel the data is destined for, how many bytes the data is offset from the 64-bit boundary, and what type of manipulation, or pattern, the DS should employ for the current transfer. After the DS has completed processing the data appropriately, it writes the data to the DS/RP external FIFOs. After this operation, the RP can read the data from these FIFOs, using its own internal clock frequency.

The DS is also implemented with a FLEX 10K50 FPGA from Altera for prototyping purposes. Design goals and details for the DS are discussed in Section 7.

3.3.6 External Interfaces

In addition to the PCI interface, Chidi has two external interfaces that allow for data transactions to occur. The first of these interfaces is the IEEE 1394 FireWire Interface. This serial interface can transfer data at 200 Mbits/sec from other 1394-compliant devices, such as digital cameras and digital video camcorders [4]. The second interface allows for data streams directly in and out of the RP. Utilizing LVDS (Low Voltage Differential Signaling) technology, the High-Speed I/O Port Interface provides a 32-bit data bus that can be clocked at up to 65 MHz. In addition, this interface provides a synchronization signal back into the RP, allowing for multiple Chidi boards to be synchronized for the processing of large data sets, for applications such as Holovideo.

3.3.7 FIFOs

Although FIFOs are not technically a functional block, they are useful because they decouple the data transmitter and receiver. This is important in Chidi because some of the functional blocks operate at different frequencies. FIFOs are used for input and output for both the RP and the FireWire Interface.
This means that the RP, which runs at a variable clock speed, and the FireWire Interface, which runs at 32 MHz, will not affect the bandwidth of the 66 MHz data bus. Burst reads and writes will actually be to the FIFOs, allowing the RP and the FireWire Interface to process that data using their own clock rates. This allows the FIFOs to handle the clock boundary that exists between the main data bus and the RP and FireWire Interfaces.

4 FPGA Device Description and Design Process

4.1 Generic FPGA Overview

In order to discuss the issues involved with implementing a complex design on an FPGA in the proper context, an understanding of the target device must first be developed. The discussion begins with one of the simplest and most familiar Programmable Logic Devices (PLDs), the PAL (Programmable Array Logic), designed and manufactured by AMD (Advanced Micro Devices). Architecturally, these devices have a programmable AND-plane, a fixed OR-plane, and programmable registers.

Capitalizing on the success of PALs, another class of PLDs, Complex Programmable Logic Devices (CPLDs), was introduced. CPLDs can be thought of as an array of PAL-like structures, or cells, that are connected via some type of programmable interconnect. The cells of these CPLDs usually offer more flexibility and more logic capacity than a standard PAL, but still have essentially the same architecture.

SRAM-based FPGAs are fundamentally different from CPLDs. FPGAs contain a two-dimensional array of logic cells or elements. However, unlike CPLDs, FPGAs do not contain programmable logic planes. Instead, FPGA cells are built around N-input Look-Up Tables (LUTs) that can be configured to implement any N-input function. Usually, these cells also contain programmable registers, multiplexers, and random logic in addition to the LUT. FPGA logic cells are always connected by some type of programmable interconnect.
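
The statement that an N-input LUT can implement any N-input function follows from the fact that a LUT is simply a stored truth table with 2^N entries. The sketch below models this in software; the helper names `make_lut` and `lut_read` are hypothetical and are used only for illustration.

```python
def make_lut(function, num_inputs):
    """Build the 2**num_inputs entry truth table for a Boolean function.

    This plays the role of the configuration data loaded into the LUT:
    one stored bit for every possible combination of input values.
    """
    table = []
    for index in range(2 ** num_inputs):
        # Bit i of the table index is the value of input i.
        bits = [(index >> i) & 1 for i in range(num_inputs)]
        table.append(function(*bits) & 1)
    return table


def lut_read(table, *inputs):
    """Evaluate the LUT: the inputs form an address into the stored table."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return table[index]


if __name__ == "__main__":
    # Two different four-input functions realized by the same structure.
    parity = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d, 4)
    and_or = make_lut(lambda a, b, c, d: (a & b) | (c & d), 4)
    print(len(parity), lut_read(parity, 1, 0, 1, 1), lut_read(and_or, 1, 1, 0, 0))
```

Reconfiguring the cell amounts to loading a different table, with no change to the surrounding structure.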

Although there are many manufacturers of FPGAs, the three main companies in the market are Xilinx, Altera Corporation, and Lucent Technologies. Not surprisingly, each company implements the logic cells and the programmable interconnect differently. The Altera FLEX 10K devices were selected over devices from the other FPGA vendors for three main reasons. First, the FLEX 10K FPGAs are the most inexpensive devices in terms of gates per dollar. Minimizing cost is always desirable, especially in a system with one 100,000 gate FPGA and two 50,000 gate FPGAs. Second, designers for the Chidi project already have experience using Altera's software package, Max+Plus II. Having a good understanding of the FPGA software development tools allows design, debugging, and optimization to be completed in a more timely manner. Finally, the FLEX 10K devices offer something architecturally not found in other devices in the form of Embedded Array Blocks (EABs), the advantages of which are discussed below.

4.2 Altera FLEX 10K Device Family

4.2.1 Overview

The FLEX 10K family of devices from Altera Corporation features 10,000 to 250,000 gate FPGAs. Each member of the device family, regardless of size, possesses the same architecture. A column of Embedded Array Blocks (EABs) serves as the spine of the device that segments the Logic Array Blocks (LABs) into two halves. These components are connected to each other and the Input/Output Elements by the FastTrack programmable interconnect system. Figure 3 illustrates the architecture of the FLEX 10K family by presenting the relationship between the different components for a portion of the device.

Figure 3: Altera FLEX 10K Family Architecture [9]

In the Chidi system, the SAG and the DS are implemented using FLEX 10K50s while the RP is implemented using a 10K100 device. The SAG and DS were targeted to implement a set of specified functions. With the use of some rough preliminary estimates, the 10K50 devices were selected because they provided the appropriate number of user I/O pins as well as enough gates to design with comfortably. In contrast, the RP is used as an RC device and not for prototyping. Therefore, it must be large enough to be able to implement complex algorithms and designs that will be specified at a later time. Although Altera now manufactures 250,000 gate FPGAs, at the time Chidi was designed the 10K100 was the largest device in the FLEX 10K family, hence its selection to implement the RP. Table 1 summarizes some of the features that differentiate these two devices.

Feature            10K50      10K100
Typical Gates      50,000     100,000
LEs                2,880      4,992
LABs               360        624
EABs               10         12
Total RAM bits     20,480     24,576
User I/O Pins      274        406

Table 1: FLEX 10K50/10K100 Device Features [10]

A discussion of the general building blocks for FLEX 10K devices is given in the sections below.

4.2.2 Logic Element (LE)

The logic cell is the basic building block for an FPGA. It normally contains some type of programmable LUT, a programmable register or registers, and some random logic. The Altera logic cell is called a Logic Element (LE) and is depicted in Figure 4.

Figure 4: Altera FLEX 10K Family Logic Element [10]

As can be seen in Figure 4, the LE contains a four-input LUT, which can compute any four-input function. In addition, the LE contains a programmable register that can be configured as a D, T, JK, or SR flip-flop.
For combinational functions, the register can be bypassed, allowing the LUT to drive the output of the LE. The LE can be routed to and from LEs in the same LAB, adjacent LABs, and the chip-wide row and column interconnects.

Another feature of the FLEX 10K LEs is the carry and cascade chains. These chains provide high-speed interconnectivity between adjacent LEs without using the local LAB interconnect resources. These chains are useful when implementing high-speed adders, counters, and wide fan-in functions.

4.2.3 Logic Array Block (LAB)

The LEs in a FLEX 10K device are arranged in groups of eight in what are called Logic Array Blocks (LABs). A two-dimensional array of LABs provides the architectural structure for FLEX 10K devices. The LAB provides fast local routing between the 8 resident LEs in addition to routing to the row and column interconnects and adjacent LABs. Figure 5 below shows the block diagram of a LAB.

Figure 5: Altera FLEX 10K Family Logic Array Block [10]
Notes: (1) EPF10K50 devices have 22 inputs to the LAB local interconnect channel from the row; EPF10K100 devices have 26. (2) EPF10K50 devices have 30 LAB local interconnect channels; EPF10K100 devices have 34.

4.2.4 Embedded Array Block

One of the features that differentiate the FLEX 10K family of devices from other FPGAs is the use of what Altera calls Embedded Array Blocks (EABs). Most FPGAs only contain the two-dimensional array of LABs described in the previous section. The FLEX 10K family also has additional configurable RAM cells embedded in each device. The use of these embedded RAM blocks frees up more LEs for other functions and increases performance for most designs. The most common use of EABs is to implement large LUTs.
With LEs, implementing logic functions with a large number of inputs does not scale efficiently due to the routing associated with connecting a large number of LEs together. Using EABs to implement functions such as multipliers or those found in DSP applications is much more efficient in terms of both speed and area.

EABs are extremely flexible. Each EAB can be configured as one block of 256x8, 512x4, 1024x2, or 2048x1 RAM. In addition, EABs can be connected together to form deeper and wider RAM blocks. As can be seen from Figure 6 below, each EAB contains not only a RAM block, but also registers and bypass paths. This allows the EAB to be configured as a synchronous or asynchronous RAM block, with registers on the data outputs only, the control signals only, or some combination.

Figure 6: Altera FLEX 10K Family Embedded Array Block [10]
Note: (1) EPF10K50 devices have 22 EAB local interconnect channels; EPF10K100 devices have 26.

4.2.5 FastTrack Interconnect

The FastTrack Interconnect system provides the ability to connect any device component (LE, EAB, or I/O element) to any other. One of the most important differences between FastTrack and other programmable interconnects is that FastTrack uses row and column interconnects that span the whole device. Most other programmable interconnects contain segmented resources that must then pass through a series of programmable switching matrices. By using continuous interconnects, FastTrack provides predictable routing delays, even for complex designs. A dedicated row channel serves each row of LABs, while a dedicated column channel serves each column of LABs. Figure 7 shows the interconnections between the rows and columns of the FastTrack Interconnect as well as to and from the LAB.

Figure 7: LAB Connections to FastTrack Interconnect [7]
Notes: (1) EPF10K50 devices have 216 channels per row; EPF10K100 devices have 312. (2) EPF10K50 and EPF10K100 devices both have 24 channels per column.

4.3 Design Process

Taking a design from the conceptual state to actually configuring an FPGA with that particular design is a challenging, although sometimes tedious, process. Figure 8 shows the general design flow used for FPGAs. The following sections elaborate on each aspect of the design process in more detail. Discussion of general design techniques is presented as well as those aspects that are particular to FPGA design for the Chidi project in the Information and Entertainment Group at the MIT Media Lab.

Figure 8: FPGA Design Flow

4.3.1 Functional Design/Design Entry

The first step in developing designs for FPGAs is to specify the design in a way that is appropriate for a particular design environment. Specifically, this involves translating the design from state diagrams, state tables, or other design descriptions into a format that can be processed by the particular EDA tools or software packages used by the designer. This specification can take the form of a text-based Hardware Description Language (HDL), usually Verilog or VHDL (VHSIC, or Very High Speed Integrated Circuit, HDL), or the more visual format found in schematic entry. For the Chidi project, designers write VHDL-87 compliant code to specify their FPGA designs.

4.3.2 Compilation and Functional Simulation

After design entry, the design needs to be compiled and simulated at a functional level.
This step in the design process is completed with the aid of an EDA software package. This package checks for syntax errors in the design code and allows the designer to test the design functionally. Since the EDA tool does not have any information about the target device or how the design is placed in this step of the design process, the designer can only verify that the design behaves as he or she expects by providing the appropriate test vectors. Incorrect simulation results are usually caused by inherent design flaws or by inappropriate test vectors. Designers targeting Chidi FPGAs use the Synopsys VHDL/FPGA Design Analyzer for compilation and the Synopsys VHDL/FPGA Simulator and Debugger for functional simulation.

4.3.3 Logic Synthesis

After functional simulation is completed, the next step in the design flow is logic synthesis. Synthesis is the process of translating a design description into the actual registers, multiplexers, and gates needed for implementation. Logic synthesis is done automatically using an EDA software package. The designer's role is to assign the appropriate timing and area constraints so that the synthesizer can obtain the best design. The result of this step is some type of netlist that describes how the components of the design are wired together. Chidi designers use Synopsys, recognized as the industry leader in logic synthesis, to generate the netlist in EDIF format.

4.3.4 Place and Route

The next step in the design process is to map the logic generated during synthesis onto a particular device. Since each device has its particular architecture, logic cell, routing, and delay characteristics, this is usually done automatically with software provided by the FPGA chip vendor. The designer's role during place and route is to provide the appropriate parameters, such as the logic synthesis style, optimizations for speed or area, and device selection.
Since the FLEX 10K Family devices are used in the Chidi system, designers use the Max+Plus II software from Altera for place and route.

4.3.5 Timing Analysis

After obtaining an initial mapping onto the desired device, the speed at which the design functions needs to be determined. Combinational delays and registered performance both need to be determined for each design. This information is obtained automatically from the place and route information generated by the vendor specific tool. With the speed demands of the Chidi system (the DS and SAG both need to operate at 66.66 MHz, a clock period of 15.0 ns), initial designs usually fall short of the timing requirements. Chidi designers use the Timing Analyzer tool in the Altera Max+Plus II software package to determine the critical path of the design. The designer must then optimize the design by minimizing the critical path delay in an effort to get the design to run at a faster speed.

One strategy to obtain a faster design is to pipeline the critical path. This involves going back to the VHDL design and inserting registers appropriately. This is usually done if the design inherently has too much combinational logic between registers. A second approach is to place timing constraints on the design. This can be done globally for the entire design or locally for one particular path. A third approach is to assign placement constraints. In this case, the designer specifies to the place and route tool that certain LEs are to be placed together in close proximity. In the FLEX 10K architecture, the delay on the same row is much smaller than that between different rows. Optimizations usually include employing a combination of these strategies in order to meet the timing requirements. After these constraints or design changes are made, the process must begin again, either at the place and route stage or the compilation stage.
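
The relationship between the clock requirement, the critical path, and pipelining can be made concrete with a little arithmetic. The sketch below is a simplified model with hypothetical delay values; it assumes that pipelining splits the combinational delay evenly between stages and that each register adds a fixed overhead, neither of which is guaranteed in a real design.

```python
def period_ns(frequency_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / frequency_mhz


def max_frequency_mhz(critical_path_ns):
    """Highest clock frequency a given critical path delay allows."""
    return 1000.0 / critical_path_ns


def pipelined_path_ns(combinational_ns, register_overhead_ns, stages):
    """Critical path after splitting the logic into equal pipeline stages.

    Assumption: the combinational delay divides evenly and each stage pays
    a fixed register overhead (clock-to-output plus setup time).
    """
    return combinational_ns / stages + register_overhead_ns


if __name__ == "__main__":
    target = period_ns(66.66)                      # required period, about 15.0 ns
    unpipelined = pipelined_path_ns(24.0, 2.0, 1)  # hypothetical 26.0 ns path
    pipelined = pipelined_path_ns(24.0, 2.0, 2)    # hypothetical 14.0 ns path
    print(round(target, 2), unpipelined <= target, pipelined <= target)
    print(round(max_frequency_mhz(pipelined), 2))
```

With these illustrative numbers, the unpipelined path misses the 15.0 ns requirement while the two-stage version meets it, at the cost of one additional clock cycle of latency.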

4.3.6 Post-Synthesis Simulation

After the timing constraints for the design are met, the next step is to execute post-synthesis simulation. This is to verify that the design functions properly after all of the timing information for the target device is taken into account. The timing information includes setup and hold times for registers, propagation delays for registers and gates, and interconnect delays, among others. Since functional simulation performed earlier in the design process does not take this information into account, post-synthesis simulation is necessary to confirm that the design will operate as expected after it is configured into a device.

4.3.7 Device Configuration

After post-synthesis simulation has been completed and the design has been verified to meet all the timing requirements of the target device, the device must then be configured with that particular design. Device configuration can be done in two ways, one using some type of configuration ROM, the other using a microprocessor. For Chidi, both the SAG and DS are configured automatically at power up using the Altera EPC1PC8 Configuration EPROMs. However, since the RP is a dynamic processing element, it is configured by the microprocessor, via the SAG, at run-time. Device configuration details for the DS and RP can be found in their respective sections.

5 RP Subsystem

5.1 Overview

One of the main areas of study in the Chidi system is the coupling of an RP and a GPP to form an efficient computing element. As the State Machine project illustrated, two of the main problems usually encountered when designing a system that incorporates RC along with general computing elements are:

1) Keeping the RC element supplied with enough data.
2) Utilizing both the GPP and RP in an efficient manner.

Chidi attempts to solve these problems by incorporating an RP subsystem that not only includes a reconfigurable processor, but address and data servicing entities as well.
This architecture is used in the hope that an infrastructure can be established to make the development of applications for the RP easy, while providing both flexibility and the necessary data throughput to make such a system useful. Below is a simplified block diagram for the RP subsystem found in Chidi.

Figure 9: RP Subsystem

As can be seen from the figure above, the SAG serves as the main interface for the RP subsystem to the rest of Chidi. The SAG handles the address phase of all transactions to and from main memory for the RP. For stream processing transactions, the SAG arbitrates for the main address bus, generates the appropriate address in main memory, and notifies the DS how it should process the corresponding data. The SAG is able to generate the appropriate addresses for each stream because it contains registers that hold the mapping information between streams and their locations. This stream to address mapping information is managed by the microprocessor based on how applications are scheduled to be configured on the RP and how the streams are scheduled to be processed by each application. In addition, the SAG handles all register interface accesses for the RP, itself, and the DS.

On the other hand, the DS is responsible for handling the data portion of any RP transactions. For input streams, the DS manipulates the data appropriately as it is received by the subsystem from main memory and presents it to the RP in a format that it can process correctly. For register writes, the DS passes the data along to the RP so that it can process it internally. For output streams or register reads, the DS waits for the SAG to complete arbitration for the data bus before asserting the proper values onto the main data bus.
There are several characteristics of the RP subsystem that allow it to process data in an efficient manner. First, the use of virtual channels allows multiple streams of data to be multiplexed over a fixed physical channel. Second, the RP subsystem can act as a second PowerPC 604 as seen by the main bus arbiter, the MPC106. This allows the RP subsystem to take advantage of the features of the PowerPC bus interface. Finally, the RP subsystem provides control mechanisms back to the microprocessor so that it may instruct and manage the subsystem efficiently.

5.2 Physical and Virtual Channels

5.2.1 Physical Channels

The RP Subsystem contains one 64-bit input path and one 64-bit output path through which data flows into and out of the RP. These input and output paths are implemented using both the Data Shuffler and the external FIFOs between the DS and RP. The 64-bit input path can be configured as one 64-bit path or as two independent 32-bit paths called Read0 and Read1. The output path is strictly a 64-bit output path called Write0. These paths are referred to as physical channels.

The Data Shuffler portion of the input physical channel, or channels, performs manipulations on data coming from memory destined for the RP. These functions include data realignment, data extraction, or no data manipulation at all. The output physical channel does not alter the data in any way. The DS portion of the physical channel operates at 66.66 MHz and interfaces with the Chidi local bus.

The physical channels also consist of external FIFOs, which reside between the DS and RP. The FIFOs' main purpose is to serve as a buffer between two entities that operate at different clock frequencies. Each FIFO can be written to and read from using two independent clock signals. This allows the DS to write to the FIFO at 66.66 MHz while the RP processes the data from it at whatever speed the application dictates.
In most applications, the DS should keep the 64-bit wide, 64-word deep FIFOs mostly filled so that RP idle time is minimized.

5.2.2 Virtual Channels

For some applications, the RP only requires one or two input data streams in order to generate an output data stream. In these situations, the input physical channel can be configured so that each input data stream has a dedicated channel through which data can be transmitted. However, for other applications, more than two input data streams are required to generate an output data stream. For these cases, virtual channels are used to deliver multiple streams to the RP for processing.

The RP supports eight virtual input channels and four virtual output channels. This allows the RP to implement applications that process up to eight streams and generate up to four streams. Virtual channels are time multiplexed over physical channels, with up to four virtual channels mapping to one physical channel. Virtual channels are implemented using FIFOs internal to the RP. Flow control mechanisms that provide the appropriate mapping between physical and virtual channels must also be provided. RP application designers will choose from a library of interface designs, which differ in the number of input and output streams and in the mapping mechanism, in order to provide data to their designs correctly.

5.3 Data Processing

The main function of the RP subsystem is to efficiently process large data sets, or stream data, at a high rate. The RP subsystem can process stream data using two different methods, depending on the application. For time-constrained applications, data is read in, processed immediately, and written to the appropriate destination. In this case, portions of the stream will already have been processed while other portions are still being read into the RP. For less time-constrained applications, the entire data stream (or up to 2 Mbytes) can be "flooded" into the RP SRAM.
The RP will then process this data and write the result back into the SRAM, after which the stream can be "flooded" out to the appropriate destination.

For both of these cases, data can be obtained over the Chidi local bus. This would be done if the source data resides in Chidi main memory or in host memory (the latter would also require data transactions over the PCI bus interface). Obtaining data over the Chidi local bus requires the RP subsystem to emulate a PowerPC 604 processor for bus arbitration purposes. Data can also be obtained from an off-board source and streamed directly into the RP via the High-Speed I/O Port, thereby bypassing the Chidi local bus, the SAG, and the DS.

5.3.1 Data Request Mechanism

Providing data to the RP in order to maximize utilization is the primary goal of the RP subsystem. The DS provides the means to format the data appropriately while the SAG generates addresses for memory accesses for the streams that are to be processed by the RP. The data request mechanism provides the necessary information to the SAG for it to decide which streams it should retrieve from or write to main memory. The signals used for the data request mechanism are implemented using the SAG/RP and SAG/DS interfaces.

For applications where only one or two input streams and one output stream are required, only physical channel information is used by the SAG to determine when the next memory access will occur. For input physical channels, the DS notifies the SAG whenever it is able to process another four-beat data burst transaction. For the output channel, the DS notifies the SAG whenever there is data that needs to be written back into main memory.

When virtual channels are required, the request mechanism becomes more complicated. Each input virtual channel must notify the SAG when it is able to accommodate another four-beat data burst transaction.
However, the SAG can only issue an address for that virtual channel, or stream, if the corresponding physical channel also indicates that it is available to process the same amount of data. Similarly, output virtual channels can only be serviced if the output physical channel is able to process the memory write. Figure 10 below shows channel relationships and request mechanism entities for the RP Subsystem.

Figure 10: RP Subsystem Data Flow

5.3.2 Interrupt Mechanism

As discussed before, for applications that are not time constrained, data is flooded into the SRAM, processed by the RP, and the results written back into the SRAM. In order to notify the microprocessor that it has finished processing data, the RP subsystem utilizes an interrupt mechanism. When the RP has completed a data processing task, it asserts an interrupt signal to the SAG. The SAG then writes an interrupt register with the value corresponding to an RP interrupt and masks all future interrupts until the current one is cleared. Masking is required because the SAG services interrupts not only for the RP, but also for itself and the 1394 interface. The SAG then in turn interrupts the microprocessor. The interrupt handler on the PowerPC 604 reads the interrupt register on the SAG to determine what caused the interrupt, processes the interrupt appropriately, and finally clears the interrupt and unmasks interrupts on the SAG.

5.4 Register Interface

Before the RP subsystem can be used for data processing, certain control information must be specified by the microprocessor. This information might include:

1) The location of the RP configuration file
2) When to begin RP configuration
3) Source and destination information for various data streams
4) Which patterns are to be used by the DS for which streams

In addition, control registers are also used for flow control during stream processing.
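The interrupt sequence of Section 5.3.2 can be illustrated with a small behavioral model. This is only a sketch in Python, not the SAG's VHDL: the class name, the method names, and the cause codes are hypothetical, and the model assumes that a request arriving while interrupts are masked is held until the handler unmasks, which the text does not specify.

```python
from collections import deque

# Hypothetical cause codes; the actual register values are not given in the text.
CAUSE_NONE, CAUSE_RP, CAUSE_SAG, CAUSE_1394 = 0, 1, 2, 3


class SagInterruptModel:
    """Behavioral sketch of the SAG interrupt register (Section 5.3.2)."""

    def __init__(self):
        self.interrupt_register = CAUSE_NONE  # cause currently being serviced
        self.masked = False                   # True while an interrupt is outstanding
        self.cpu_interrupt = False            # interrupt line to the microprocessor
        self._pending = deque()               # assumption: masked requests wait here

    def raise_interrupt(self, cause):
        """A source (RP, SAG, or 1394 interface) requests service."""
        if self.masked:
            self._pending.append(cause)
            return
        self.interrupt_register = cause       # record what caused the interrupt
        self.masked = True                    # mask all further interrupts
        self.cpu_interrupt = True             # interrupt the microprocessor

    def handler_read_cause(self):
        """Handler step: read the interrupt register to find the cause."""
        return self.interrupt_register

    def handler_clear_and_unmask(self):
        """Handler final step: clear the interrupt and unmask interrupts."""
        self.interrupt_register = CAUSE_NONE
        self.masked = False
        self.cpu_interrupt = False
        if self._pending:
            self.raise_interrupt(self._pending.popleft())


sag = SagInterruptModel()
sag.raise_interrupt(CAUSE_RP)      # RP has finished processing its data
sag.raise_interrupt(CAUSE_1394)    # arrives while masked, so it is held
first_cause = sag.handler_read_cause()
sag.handler_clear_and_unmask()     # the held 1394 request is now presented
second_cause = sag.handler_read_cause()
```

In this model the microprocessor always sees one cause at a time, which is the purpose of the masking step described above.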
The Register Interface provides the mechanism for writing control registers on the SAG, DS, and RP. The 1394 interface and alphanumeric display registers are also serviced by the Register Interface; since they physically reside in the SAG, they are considered SAG control registers for the purposes of this discussion. Unlike stream processing mode, the RP subsystem is considered a slave device on the Chidi local bus when processing Register Interface transactions.

The control logic that processes all Register Interface transactions resides physically on the SAG. If the register being accessed is also located on the SAG, the SAG processes the transaction internally. The SAG generates independent read, write, and chip select signals for each entity's control registers (SAG, 1394, and alphanumeric registers). However, for DS and RP registers, the SAG must assert the proper control signals on the external Register Interface with the appropriate address. Since the DS and RP share read, write, and chip select signals, the SAG must generate dependent signals that conform to the truth table given below.

/CS   /WR   /RD   Function
0     0     0     Invalid
0     0     1     Write DS Registers
0     1     0     Read DS Registers
0     1     1     No function
1     0     0     Invalid
1     0     1     Write RP Registers
1     1     0     Read RP Registers
1     1     1     No function

Table 2: External Register Interface Control Signal Truth Table

The Data Shuffler Register Interface consists of the three control signals mentioned above and 17 address signals. This provides addressing to 131,072 locations, more than enough for the 2,800 LEs and the 20,480 RAM bits within the DS.

Unlike the DS, the RP Register Interface consists of four control signals and 20 address bits. The four control signals include the /CS, /WR, and /RD signals mentioned previously, as well as a signal called /REGAACK. The DS runs on the same 66.66 MHz clock as the SAG, allowing it to latch the signals for a register transaction in one clock cycle.
However, the RP runs on a variable clock signal that is application dependent. Therefore, depending on the clock speed of the RP, it might require more than one clock period to process Register Interface transactions. The /REGAACK signal is the acknowledgement indicating to the SAG that the Register Interface transaction has been completed. Until the SAG detects that this signal has been asserted, it must maintain the validity of all control and address signals for the RP Register Interface. The 20 address bits provide 64-bit word addressing for the 2 Mbyte SRAM as well as bit addressing for the 4,992 LEs and the 24,576 RAM bits in the 10K100 device.

6 Reconfigurable Processor

6.1 Overview

By definition, the RP is a dynamic processing element. Unlike most processors implemented using dedicated hardware, the RP can be configured by the microprocessor at run-time to compute any function any number of times. The main purpose of the RP block is to provide an infrastructure that allows application engineers to develop functions in VHDL with only a limited understanding of the underlying Chidi system. By providing an added layer of abstraction, applications can be developed at a much faster rate, reducing both development time and cost. Therefore, functional and behavioral specifications for the RP must be application independent. The goal is to provide a structure under which all foreseeable applications can be executed, given reasonable constraints.
The following is a summary of the design requirements the RP block must support:

1) 64-bit input and output data busses
2) a variable clock frequency for different designs of various applications
3) interfaces to the DS and SAG
4) two physical input channels and up to eight virtual input channels
5) one physical output channel and up to four virtual output channels
6) addressing to all registers and RAM bits inside the RP as well as to the 2 MByte SRAM
7) serial configuration mode via the SAG
8) a 2 MByte SRAM that supports byte and word writes, burst transactions, and a 64-bit data bus
9) a 1-2 Gbit/sec High-Speed I/O Port
10) additional signals for debugging purposes

Below is a block diagram for the RP block, which includes the RP, SRAM, High-Speed I/O Port, and input/output FIFOs. It illustrates the interconnection between the RP sub-blocks, the number of signals necessary for inter-block communication, and the number of signals required to interface with other parts of the Chidi system.

Figure 11: RP Block Diagram

6.2 Functional Behavior

A description of the RP can be divided into functional blocks and interface descriptions. As can be seen in Figure 11, the RP consists of not only the 100,000 gate FPGA, but also a 2 Mbyte SRAM and a High-Speed I/O Port. In addition, the RP provides interfaces to the SAG, external FIFOs, and the Register Interface. Below is a description of each block and interface for the RP.

6.2.1 Functional Blocks

6.2.1.1 RP

The RP itself is a large 100,000 gate FPGA.
At any one time, it needs to support not only the application it is running but also all of the logic necessary to implement the many different interfaces between itself and the SRAM, the High-Speed I/O Port, the SAG, the external FIFOs, and the Register Interface. The RP is implemented using a FLEX 10K100 FPGA from Altera and is configured by the microprocessor, via the SAG, at run-time. Device details can be found in Section 4.2, while device configuration information is located in Section 6.2.4.

6.2.1.2 SRAM

The SRAM is a 2 Mbyte synchronous memory element that interfaces directly with the RP. Its main purpose is to store incoming or outgoing data in some stream processing applications. The interface to the SRAM supports a 64-bit data bus and 17-bit addressing (64-bit word addressing) with 8 byte enables (byte addressing). Table 3 below gives a brief description of the signals that comprise the RP/SRAM interface.

Signal Name   Description
DATA<63:0>    64-bit data bus
ADDR<16:0>    17-bit address bus
/WE           Write Enable
/WEH#         High Byte Write Enable
/WEL#         Low Byte Write Enable
/CE           Chip Enable
/OE           Output Enable
/ADSC         Address Status Controller. Initiates and extends single and burst reads and writes
/ADV          Address Advance. Extends burst reads and writes
ZZ            Snooze Enable

Note: # = 1, 2, 3, or 4

Table 3: SRAM Control Signal Descriptions

The signals described above are used to control the SRAM in a variety of ways. The SRAM supports power-saving modes as well as burst capabilities for read and write transactions. Burst capabilities are extremely useful when moving data into and out of the SRAM for stream processing applications. The table below summarizes the operations supported by the SRAM.
Operation                     Address Used   ZZ   /ADSC   /ADV   /CE   /WE   /OE   Data
Deselected Cycle, Power-down  None           L    L       X      H     X     X     High-Z
Snooze Mode, Power-down       None           H    X       X      X     X     X     High-Z
Write Cycle, Begin Burst      External       L    L       X      L     L     X     WRITE Data
Read Cycle, Begin Burst       External       L    L       X      L     H     L     READ Result
Write Cycle, Continue Burst   Next           L    H       L      X     L     X     WRITE Data
Read Cycle, Continue Burst    Next           L    H       L      X     H     L     READ Result
Write Cycle, Suspend Burst    Current        L    H       H      X     L     X     WRITE Data
Read Cycle, Suspend Burst     Current        L    H       H      X     H     L     READ Result

Note: Values only valid on rising edge of CLK signal

Table 4: SRAM Control Signal Truth Table

6.2.1.3 High-Speed I/O Port

The High-Speed I/O Port serves as an interface directly into and out of the RP from an off-board source. This port will mostly be used when the overhead and non-deterministic latency of the Chidi local bus and the PCI bus are unacceptable for certain data-intensive applications. The High-Speed I/O Port supports two independent 32-bit data paths with four control/status signals in and out of the RP. This interface has an operating frequency of up to 65 MHz, allowing for up to 2.34 Gbits/sec throughput in both directions. In addition, the High-Speed I/O Port supports two special synchronization signals that allow multiple Chidi boards to process different parts of large data sets and still remain synchronized. Table 5 gives a description of the signals that make up the RP/High-Speed I/O interface.

Signal Name      I/O   Description
HSOUT<31..0>     O     High-Speed 32-bit data output
HSOUTCLK         O     Output clock
HSOUTC/S<9..0>   O     Output Control and Status Signals
HSSYNC<1..0>     I     Synchronization signals
HSIN<31..0>      I     32-bit data input
HSINCLK<1..0>    I     Input clock
HSINC/S<9..0>    I     Input Control and Status Signals

Table 5: High-Speed I/O Port Signal Descriptions

The High-Speed I/O Port is implemented using the LVDS chip set from National Semiconductor for many reasons.
First, as the name suggests, LVDS (Low Voltage Differential Signaling) technology takes advantage of differential signaling techniques to transfer large amounts of data at high rates and with low power [14]. Differential signaling is less susceptible to common mode noise and can migrate to lower supply voltages with greater ease than traditional single-ended signaling schemes. A second reason that the LVDS signaling convention, as specified by the IEEE 1596.3 standard, was chosen over other standards was that it did not impose any data type constraints on our system. Chip sets from other manufacturers implemented standards whose bit sizes are not easily compatible with the Chidi word or double word (such as the 10 or 20 bit format required by Fibre Channel). Finally, termination in LVDS technology consists of only one 100 ohm resistor between each signal pair, simplifying implementation.

Below is a simple diagram illustrating how the High-Speed I/O Port is implemented using LVDS transmitter and receiver chips from National Semiconductor.

Figure 12: High-Speed I/O Port (Note: terminating resistors have a value of 100 ohms)

The transmitters take in TTL/CMOS signals and convert them into LVDS signals. Six TTL/CMOS signals are transmitted over one LVDS signal pair while one clock signal is transported over one LVDS signal pair. Similarly, the receivers convert LVDS signals into TTL/CMOS signals. In the figure above, the SYNC chip is just a smaller version of the receiver. As a result, 16 LVDS signals (8 pairs) come into the High-Speed I/O Port from an on-board connector, while 20 LVDS signals (10 pairs) go to the connector.

6.2.2 Interfaces

The RP communicates with the rest of the Chidi system through two main interfaces, the RP/SAG interface and the RP/External FIFO interface.
Since the RP operates at a variable clock frequency, which is application dependent, it does not communicate directly with entities outside the RP subsystem. The two RP interfaces are described in detail in the sections that follow.

6.2.2.1 RP/SAG Interface

The RP/SAG Interface is actually two independent sets of signals. Signals going to the SAG from the RP indicate which virtual channels are in need of data for processing. The signals going to the RP from the SAG specify the destination of the next data transfer as well as how many words (either 32-bit or 64-bit words) are in the transfer. The table below gives a more detailed signal description.

Signal Name         I/O   Description
REQCHAN<3..0>       O     Indicates the virtual channel making the data request
REQSIZE             O     Indicates the size of the request
REQSTROBE           O     Request valid signal
WRITECHAN<3..0>     I     Virtual channel destination
TRANSFERSIZE<4..0>  I     Size of transfer in words (32 or 64-bit words, depending on the mode)

Table 6: SAG/RP Interface - Signal Descriptions

6.2.2.2 RP/External FIFO Interface

The RP receives and processes stream and register interface data through three physical channels. The interfaces that allow the RP to control data flow to and from these physical channels are detailed below.

6.2.2.2.1 Read0 Physical Channel Interface

The Read0 Physical Channel is one of the two 32-bit channels used to receive stream input data. It can be configured as one independent 32-bit channel or as the lower half of a 64-bit channel. In addition, the Read0 channel also supports Register Interface transactions.

Register Interface reads and writes are supported through the use of two "mailbox" registers. These mailbox registers are located physically within the FIFO. They provide a way to use the same physical channel to perform reads and writes that can be accessed independently of the contents of the FIFO. This allows stream processing to continue after it has been interrupted for a Register Interface transaction.
Another way of supporting Register Interface transactions is to use the Read0 Physical Channel to process RP register writes (input data) and the Write0 Physical Channel to process register reads (output data). This would allow the Read0 channel to be used only for data flowing into the RP, while the Write0 channel would service only data flowing out of the RP. The only reason that this solution was not implemented involves pin-usage efficiency. Configuring the Read0 and Write0 channels to support input and output mailbox registers, respectively, requires more signals overall than using just the Read0 channel for both input and output. Therefore, the Read0 Physical Channel supports all Register Interface transactions, regardless of type. Table 7 below gives the signal descriptions while Table 8 provides the function table for the Read0 Physical Channel.

Signal Name     I/O   Signal Description
R0DATA<31..0>   I/O   32-bit data bus for stream input data and register input/output data.
/R0EF           I     Read0 Empty Flag. When /R0EF is LOW, the FIFO is empty and reads from its memory are disabled. /R0EF is forced LOW when the device is reset and is set HIGH by the second LOW-to-HIGH transition of RPCLK after data is loaded into empty FIFO memory.
/R0W/RB         O     Read0 Write/Read Select. See table below for usage details.
R0MBB           O     Read0 Mailbox Select. See table below for usage details.
/R0MBF1         I     Read0 Mailbox Flag. LOW when data is valid in the Mailbox Register (implies the end of an RP Register WRITE). HIGH-to-LOW transition, synchronous to DSCLK66. LOW-to-HIGH transition, synchronous to RPCLK.

Table 7: RP Read0 Physical Channel FIFO Signal Descriptions

R0W/RB   MBB   Function
H        H     Mail2 Write
L        L     FIFO Read
L        H     Mail1 Read
H        L     High-Z

Table 8: RP Read0 Physical Channel FIFO Function Table

6.2.2.2.2 Read1 Physical Channel Interface

The Read1 Physical Channel is the other 32-bit channel that is used to receive stream input data.
Like the Read0 channel, it can be configured as one independent 32-bit channel or as the upper half of a 64-bit channel. The table below gives the signal descriptions for the Read1 Physical Channel. Since the Read1 channel is purely an input channel and does not support Register Interface transactions, the only function that is required is the ability to read from the FIFO, which is controlled by the R1W/RB signal.

Signal Name     I/O   Signal Description
R1DATA<31..0>   I     32-bit data bus for stream input data.
R1W/RB          O     Read1 Write/Read Select. LOW indicates a read from the FIFO, HIGH indicates High-Z.
/R1EF           I     Read1 Empty Flag. When /R1EF is LOW, the FIFO is empty and reads from its memory are disabled. /R1EF is forced LOW when the device is reset and is set HIGH by the second LOW-to-HIGH transition of RPCLK after data is loaded into empty FIFO memory.

Table 9: RP Read1 Physical Channel FIFO Signal Descriptions

6.2.2.2.3 Write0 Physical Channel Interface

The Write0 Physical Channel is the 64-bit output channel used for stream processing. The table below gives the signal descriptions for this channel. Since the Write0 Physical Channel is purely an output channel and does not support any Register Interface transactions, the only function that is required is the ability to write to the FIFO, which is controlled by the W0ENA signal.

Signal Name     I/O   Signal Description
W0DATA<63..0>   O     64-bit data bus for stream output data.
W0ENA           O     Write0 Enable. HIGH indicates a FIFO Write, LOW indicates High-Z.
/W0AF1          I     Write0 Almost Full Flag. Programmable signal synchronized to DSCLK66. LOW when the number of empty locations in the FIFO is less than or equal to the value in the offset register.
/W0FF1          I     Write0 Full Flag. LOW when the FIFO is full and writes to the FIFO are disabled. Forced LOW when the device is reset and is set HIGH by the second LOW-to-HIGH transition of DSCLK66 after reset.
/W0AF2          I     Used for debugging purposes only to verify that both FIFOs are in the same state.
/W0FF2          I     Used for debugging purposes only to verify that both FIFOs are in the same state.

Table 10: RP Write0 Physical Channel FIFO Signal Descriptions

6.2.3 RP Clock Circuitry

In order to support many possible algorithms with different performance specifications, the RP requires a variable chip clock. This allows the application engineer to decide at what speed functions targeted for the RP should run. In addition, it allows the application engineer to decide how much effort should be put into design optimization. A register in the SAG that is written by the microprocessor sets the RP clock frequency. The following table provides the mapping of control bits to RP clock frequencies.

RP CLK Frequency/Period (MHz/ns)   SETRPCLK<3:0>
66.66/15.00                        0010
60.00/16.66                        0001
55.00/18.18                        0011
50.00/20.00                        0000
48.00/20.83                        01xx
33.33/30.00                        1110
30.00/33.33                        1101
27.50/36.36                        1111
25.00/40.00                        1100
14.32/69.82                        10xx

Table 11: RP CLK Frequencies

Essentially, this allows the RP to operate at 10 different possible frequencies, relatively evenly distributed between 66.66 and 14.32 MHz. As a result, the application designer has much more flexibility in deciding how fast a particular algorithm should run and how much time should be budgeted for and spent on optimizations.

6.2.4 Device Configuration

Unlike the SAG and DS, the RP is not configured at power-up. Since the RP is a dynamically programmable device, it must be programmable by the microprocessor at any time, with any configuration, an indefinite number of times after power-up.

Although the microprocessor initiates the RP configuration process, it is the SAG that contains all of the logic necessary to actually program the RP. The microprocessor provides the starting address of the configuration file in memory and instructs the SAG to begin RP configuration. Then, the SAG handles the remainder of the configuration details and interrupts the microprocessor when the process is complete.
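The two control steps described in Sections 6.2.3 and 6.2.4, selecting an RP clock and starting configuration, can be sketched from the microprocessor's point of view. The following Python fragment is only an illustration: the SETRPCLK patterns are taken from Table 11, but the `SagModel` class, its attribute names, the example address, and the order of the register writes are assumptions made for the sketch, not details given in the text.

```python
# SETRPCLK<3:0> patterns from Table 11; 'x' marks a don't-care bit.
RPCLK_TABLE = [
    ("0010", 66.66), ("0001", 60.00), ("0011", 55.00), ("0000", 50.00),
    ("01xx", 48.00), ("1110", 33.33), ("1101", 30.00), ("1111", 27.50),
    ("1100", 25.00), ("10xx", 14.32),
]


def decode_setrpclk(code):
    """Return the RP clock frequency (MHz) selected by a 4-bit SETRPCLK value."""
    bits = format(code & 0xF, "04b")
    for pattern, mhz in RPCLK_TABLE:
        if all(p in ("x", b) for p, b in zip(pattern, bits)):
            return mhz
    raise ValueError("no matching pattern")  # not reached: all 16 codes are covered


def encode_setrpclk(max_mhz):
    """Pick the fastest Table 11 setting that does not exceed max_mhz."""
    candidates = [(mhz, pattern) for pattern, mhz in RPCLK_TABLE if mhz <= max_mhz]
    if not candidates:
        raise ValueError("no RP clock setting at or below %.2f MHz" % max_mhz)
    mhz, pattern = max(candidates)
    return int(pattern.replace("x", "0"), 2), mhz


class SagModel:
    """Toy stand-in for the SAG control registers (all names hypothetical)."""

    def __init__(self):
        self.setrpclk = 0
        self.config_addr = 0
        self.config_start = False
        self.interrupt = False

    def run_configuration(self):
        """Stands in for the SAG logic that actually programs the RP."""
        if self.config_start:
            self.config_start = False
            self.interrupt = True  # SAG interrupts the microprocessor when done


def configure_rp(sag, rbf_address, max_mhz):
    """Microprocessor side: choose a clock, point at the file, start the SAG."""
    code, mhz = encode_setrpclk(max_mhz)
    sag.setrpclk = code              # assumption: clock is selected first
    sag.config_addr = rbf_address    # starting address of the configuration file
    sag.config_start = True          # instruct the SAG to begin configuration
    sag.run_configuration()
    return mhz if sag.interrupt else None


sag = SagModel()
selected_mhz = configure_rp(sag, rbf_address=0x00100000, max_mhz=40.0)
```

With a 40 MHz ceiling the sketch selects the 33.33 MHz setting (pattern 1110), the fastest Table 11 entry that does not exceed the requested limit.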
A FLEX 10K family device can be configured in a number of different ways. In order to perform configuration via the microprocessor and the SAG, the MSEL<1:0> and nCE pins must be tied to ground. In this configuration scheme, the configuration file must be in the RBF format (Raw Binary File) instead of the POF format (Programming Object File) used in the configuration EPROM scheme.

The RP itself only requires a few signals for the configuration process. To initiate configuration, the RP requires a low-to-high transition on the nCONFIG signal. Then, data is clocked in serially (at or below 10 MHz) until the CONF_DONE signal is pulled high. After CONF_DONE goes high, the RP requires another 10 periods of the configuration clock (DCLK) before passing from configuration to device initialization. Figure 13 below provides the timing diagram for RP configuration while Table 12 gives a description of the timing parameters.

Figure 13: RP Configuration Timing Diagram [6]

Symbol    Parameter                                    Min    Max   Units
tCF2CD    nCONFIG low to CONF_DONE low                        1     us
tCF2ST    nCONFIG low to nSTATUS low                          1     us
tCFG      nCONFIG low pulse width                      2            us
tSTATUS   nSTATUS low pulse width                      2.5          us
tCF2CK    nCONFIG high to first rising edge on DCLK    5            us
tDSU      Data setup time before rising edge on DCLK   30           ns
tDH       Data hold time after rising edge on DCLK     0            ns
tCH       DCLK high time                               50           ns
tCL       DCLK low time                                50           ns
tCLK      DCLK period                                  100          ns
fMax      DCLK maximum frequency                              10    MHz

Table 12: Configuration EPROM Scheme Timing Parameters [6]

After device initialization has been completed, the RP enters what is called user mode. At this time, the RP can begin to execute the function or algorithm for which it was configured.

6.3 Implementation

At the time of writing, the Chidi multimedia system has completed the layout, fabrication, and assembly process.
Debugging has commenced in all areas and is progressing. This means that all physical RP specifications have been implemented. A parts list that summarizes each physical part used to implement not only the RP, but also the entire RP subsystem, is given in the Appendix, Section 10.3.

6.3.1 High Speed I/O Port

Special attention was given to the LVDS signals, used in the High Speed I/O Port, during the layout process. Since LVDS signals are differential in nature, it was necessary to route them in matching pairs and to use heavy chamfering. These steps were necessary to reduce skew and phase differences between signal pairs, as well as to minimize signal reflection and EMI. In addition, the board layout and fabrication engineers were instructed to control LVDS signal trace impedances to 100 ohms +/- 10 ohms. Along with 100 ohm termination resistors and 106 ohm cables, there is an approximate 100 ohm impedance throughout the LVDS signal path. Finally, placing the transmitters and receivers extremely close to the off-board connector minimized LVDS signal trace lengths. These design techniques were employed during the layout process, as specified by the "LVDS Owner's Manual and Design Guide" [14], in order to implement LVDS technology correctly.

6.3.2 RP

Besides the physical design outlined in the previous sections, design must also be done for the FLEX 10K100 FPGA that is the centerpiece of the RP block. There are many roles that engineers must assume when designing and developing the RP. RP development can be viewed at three levels:

1) External interface design - specification of how the signals are connected between blocks.
2) Internal interface design - VHDL-level specification of interface behavior.
3) Application design - VHDL-level specification of particular functions or algorithms.

Although these three levels are listed separately, the work done in one area must incorporate some level of understanding of the other two areas.
External interface design has been completed for the RP and has been documented in this chapter. However, it could not have been done correctly without specifying the internal interface design at a behavioral level and taking into account how applications will use the RP.

An internal interface design engineer must then take the physical and timing constraints as given and create a type of "wrapper" that will be used in applications. This wrapper will actually be an entire library of wrappers that engineers will select from based on the number of virtual channels that are required for the application. Different wrappers can also be chosen based on whether the SRAM or High-Speed I/O Port is needed for a particular algorithm.

Finally, when an application engineer targets a design for the RP, he/she must select the appropriate wrapper. In addition, he/she should have some level of understanding of how data is transferred to the RP from Chidi main memory or host memory in order to specify other information vital to successful implementation. This information might include which patterns the DS is to use for manipulating which streams, among others. In this manner, the burden of implementing system-specific logic in order to obtain data correctly is lifted from the application designer. He/she only has to have enough of an understanding to pick the appropriate interface logic and set the correct parameters, thereby making application development easier and more efficient.

6.3.3 RP Configuration

6.3.3.1 Design Details

RP Configuration has been implemented on the SAG. Figure 14 below provides a simple block diagram of the logic required for RP configuration.
Figure 14: RP Configuration Block Diagram (labeled elements include the PPC Bus Interface, Address Generator, Configuration FSM, and Register Bank & Control Logic within the SAG, and the CONF_DONE, nSTATUS, INIT_DONE, nCONFIG, DCLK, nCE, and MSEL signals of the RP)

Since the final versions of the Address Generator and Register Interface are being developed independently by other engineers and were not yet available, the design for RP configuration uses its own scaled-down versions of these entities. The design was partitioned so that it would require minimal or no redesign when the final versions of the Address Generator and Register Interface are completed.

As mentioned previously, before configuration can begin the microprocessor must first write two RP Configuration registers. The first register contains the starting address of the configuration file. The second register signals to the RP Configuration design to start configuration. After configuration is complete, the SAG is supposed to interrupt the microprocessor, signaling that the RP has been configured. In this implementation of the design, the interrupt mechanism was not available. Therefore, the RP Configuration design instead resets the register that indicates that configuration is to begin. The microprocessor then polls this register in order to determine when configuration has finished.

The design that actually configures the RP is made up of three main blocks. The first block, the clock generation circuit, is a simple clock divider that generates the 8.33 MHz (120 ns) signal necessary for configuration from the 66.66 MHz (15 ns) system clock. The second block is the Configuration FSM, which generates the control signals and regulates the data necessary for configuration. Finally, the third block is the register bank and corresponding control logic.
One register is a 64-bit shifter that allows each byte of data to be serially shifted, LSB first, into the RP. The second 64-bit register holds the next eight bytes of data for configuration. Therefore, the only constraint placed on the design is that the next eight bytes of data have to be fetched within 512 system clock cycles (each bit is clocked into the RP every 120 ns, so 64 bits take 7.68 us, or 512 periods of the 15 ns system clock), which is more than enough time in most conceivable cases.

6.3.3.2 Configuration Time

Knowing the configuration time is extremely important for scheduling purposes. The configuration time can be calculated as follows:

$$t_{CONFIG} = t_{CFG} + t_{CF2CK} + t_{CLK}\,(\text{bytes of configuration data})(8\ \text{bits/byte}) + 10\,t_{CLK}$$

The theoretical minimum time, using timing and configuration file size information provided by Altera, is:

$$t_{CONFIG,MIN} = 2\ \mu\text{s} + 5\ \mu\text{s} + (100\ \text{ns})(149{,}134\ \text{bytes})(8\ \text{bits/byte}) + 10\,(100\ \text{ns}) = 119.3152\ \text{ms}$$

Since an 8.33 MHz (120 ns) clock was used in this implementation, the expected configuration time is:

$$t_{CONFIG,THEOR} = 2\ \mu\text{s} + 5\ \mu\text{s} + (120\ \text{ns})(149{,}134\ \text{bytes})(8\ \text{bits/byte}) + 10\,(120\ \text{ns}) = 143.17684\ \text{ms}$$

Experimentally, a configuration time of 143.2 ms was observed, which closely matches the expected value (the HP1660AS logic analyzer used has 2 ns precision, but only 0.1 ms display precision when measuring on a millisecond scale).

6.3.3.3 Performance

One of the constraints for any design implemented on the SAG is that it must run at 66.66 MHz, since it interfaces with the Chidi local bus. One trouble spot involved integrating all of the different modules together. It became necessary to insert registers in long combinational delay paths between the modules to improve performance. Another problem area dealt with fan out. In particular, when dealing with a large number of registers, like those found in the register bank, it became necessary to replicate the control logic. FSMs that drove a high number of outputs also experienced fan out problems.
In these cases, the state bits were replicated so that one set was used to feed state transitions, while the other set or sets were used to drive the outputs. Finally, large FSMs also posed performance problems. Because each LE in a FLEX 10K device can only accommodate four inputs, FSMs with more than three state bits (eight states) almost always require at least two LEs to compute the next state (the next state depends on the current state and inputs). For this reason, the configuration FSM is actually divided into two smaller FSMs. One FSM handles the beginning of configuration: asserting the nCONFIG signal for at least 2 us, checking that the nSTATUS signal transitions from low to high while the CONF_DONE signal remains low, and waiting at least 5 us before notifying the second FSM that it should begin actual device configuration. The second FSM then handles the configuration data and completes the process. By utilizing the above three strategies, the design was able to run at 68.49 MHz (14.6 ns).

7 Data Shuffler

7.1 Overview

The Data Shuffler services the data portion of any transaction going to or coming from the Reconfigurable Processor. These transactions include main memory accesses for stream processing as well as register interface accesses. For register interface transactions, the data is passed directly to the RP via mailbox registers, while for stream processing, the DS performs some type of manipulation on the data before loading it into the external FIFOs so that it may be processed by the RP accordingly. A block diagram for the Data Shuffler is given below.

[Figure 15: Data Shuffler Block Diagram. Blocks: SAG/DS Interface, PPC Bus Interface, Read0 and Read1 Physical Channel Control with FIFO interfaces, and the 64-bit data bus.]

The DS will need to "shuffle" the stream data going from main memory to the RP for two main reasons.
First, the data stream might be offset from a 64-bit word boundary coming out of main memory. The DS will align this data before allowing the RP to begin processing. Second, the data might need to be parsed in some particular manner based on the application that is running in the RP. Examples include extracting every other piece of data (decimate by 2) and extracting every third piece of data (decimate by 3, or extracting one particular channel from packed RGB data). The Data Shuffler supports byte, short (16-bit), packed RGB (24-bit), word (32-bit), and double word (64-bit) data types.

7.2 Functional Description

For stream processing, the Data Shuffler can operate in one of two modes. The first mode implements two independent 32-bit wide input physical channels. Named with respect to the RP, these channels are called Read0 and Read1. Each input physical channel is made up of a section that resides internally to the DS and a section implemented using external FIFOs. The DS can also be configured for 64-bit mode. In this mode, the two physical channels are used as one 64-bit input channel. For either mode, there is a 64-bit output channel, Write0, which holds the output streams generated by the RP.

7.2.1 Operations Supported

Conceivably, the Data Shuffler should be able to perform any type of ordering or selecting of the input data stream for the RP. However, for the first version of the design, only thirteen of these manipulations, or patterns, are supported. These patterns were selected based on the frequency of probable use in stream processing. Table 13 below lists the initial thirteen patterns that are supported by the DS.
| Pattern # | Operation |
|---|---|
| 1 | Straight Through |
| 2 | Decimate bytes by 2 |
| 3 | Decimate bytes by 3 / Extract one channel |
| 4 | Decimate bytes by 4 |
| 5 | Decimate bytes by 6 / Channels by 2 |
| 6 | Decimate shorts by 2 |
| 7 | Decimate shorts by 4 |
| 8 | Extract two channels |
| 9 | Extract every pixel |
| 10 | Decimate pixels by 2 |
| 11 | Decimate pixels by 4 |
| 12 | Decimate words by 2 |
| 13 | Decimate double words by 2 |

Table 13: Operations Supported

Due to implementation details, the DS can only support eight different patterns at one time (or fewer, depending on the pattern). However, this pattern map can be loaded with different patterns by the microprocessor via the register interface. Theoretically, this allows the DS to support an unlimited number of patterns for input data streams. Practically, due to area constraints, the number of patterns cannot increase indefinitely.

The Data Shuffler internals can be categorized into interfaces and input and output physical channels. The interfaces allow the DS to receive relevant status signals from other parts of the system as well as assert any necessary control signals. The physical channels perform manipulations on the data and serve as intermediate storage between system elements. Descriptions for each are given in the sections below.

7.2.2 Interfaces

7.2.2.1 DS/SAG Interface

The DS/SAG Interface consists of both control and status registers. The following two sections present the needs and uses of these two sets of registers in detail.

7.2.2.1.1 Control Registers

The control registers hold information from the SAG about the next data transfer to or from memory. The SAGCTLWR signal indicates when new values are being written into the control registers. A physical channel controller will only latch these new values if the channel (CHAN) register contains the value for that physical channel. The /RESET register is used to clear a physical channel of any residual data values that are no longer deemed necessary by the SAG.
This occurs at the end of a stream, when more bytes were retrieved than necessary because accesses did not end on a 64-bit boundary. The Read channels use the MODE, PATTERN, and OFFSET registers in order to determine how to properly manipulate incoming stream data. The PPC Bus Interface uses the BURSTSIZE register in order to complete data tenure transactions appropriately. Finally, the TRANSFERSIZE register is used to write the correct number of words or double words (depending on the mode) into the Read channel FIFOs. This is useful when the SAG has deemed that a burst read from memory is more efficient in terms of transaction time, but only wants a portion of the entire four double words to be processed by the RP. This is often the case at the end of a given input stream. The table below summarizes the control registers and gives a brief description of each.

| Register Name | Description |
|---|---|
| PATTERN<7:0> | Specifies where in the pattern map a pattern is stored (bits 7-5) and what type of pattern it is: straight-through, decimate data-type by N, etc. (bits 4-0) |
| OFFSET<2:0> | Specifies by how many bytes the next data transfer is offset |
| CHAN<1:0> | Specifies for which channel the next data transfer is destined. 00=None, 01=Write0, 10=Read0, 11=Read1 |
| BURSTSIZE | HIGH indicates a burst transfer, LOW indicates a single-beat transfer |
| TRANSFERSIZE<4:0> | The number of valid 32- or 64-bit data words (depending on the current mode) after any data shuffling operations have been executed |
| /RESET | Negatively asserted reset bit. Applies only to the channel specified by the CHAN<1:0> register |
| SAGCTLWR | Indicates that the SAG is writing the above control registers |

Table 14: SAG/DS Interface - Control Registers

7.2.2.1.2 Status Registers

The Data Shuffler must provide the SAG with status information on the three physical channels. The SAG uses this data to determine for which channel it should make the next memory transaction.
A Read channel signals that a burst transaction can be accommodated when it has processed the preceding data transaction, written the results into its FIFO, and the FIFO indicates that it still has room to hold the data from a burst. The Write0 channel indicates that there is enough data for a memory write when any of its internal registers contain valid data. When all of its internal registers are valid, it signals that there is enough data for a burst write to memory. The table below summarizes the status registers for the DS/SAG Interface.

| Signal Name | Description |
|---|---|
| R0AVAIL | Indicates that the Read0 physical channel is available for a burst write |
| R1AVAIL | Indicates that the Read1 physical channel is available for a burst write |
| W0AVAIL<1:0> | Indicates how much data the Write0 physical channel has: 00=no data, 01=single, 10=burst, 11=invalid |

Table 15: SAG/DS Interface - Request Mechanism

7.2.2.2 DS/PPC Bus Interface

The DS/PPC Bus Interface manages data flow in and out of the DS to and from memory. It monitors the value of the CHAN register along with PPC Bus signals in order to determine whether input data is to be loaded into the Read Physical Channels or data is to be read from the Write0 Channel. Below is a signal summary for the DS/PPC Bus Interface.

| Signal Name | I/O | Description |
|---|---|---|
| /TA | I/O | Transfer Acknowledge. As an input, indicates that the DS should latch the data on the PPC Data Bus. As an output, indicates that the DS is driving valid data onto the PPC Data Bus. |
| /SAGDBG | I | SAG Data Bus Grant. Indicates when the SAG has been granted the local data bus. |
| /604DBB | I | 604 Data Bus Busy. Indicates when the PPC604 is using the data bus. |
| CHAN<1:0> | I | Indicates the channel for the next transaction |
| W0_DAV<3:0> | I | Indicates the amount of valid data in the Write0 Physical Channel |
| R0_LATCH_DATA | O | Indicates to the Read0 Channel that the data latched from the PPC Data Bus is for it. |
| R1_LATCH_DATA | O | Indicates to the Read1 Channel that the data latched from the PPC Data Bus is for it. |
| W0_LATCHED | O | Indicates to the Write0 Channel that data has been read from its register banks. |

Table 16: DS/PPC Bus Interface Signal Descriptions

7.2.2.3 DS/External FIFO Interface

The DS transmits and receives data to and from the RP through three physical channels. As mentioned before, these physical channels consist of both DS internals and external FIFOs. The three interfaces that allow the DS to control data flow to and from these FIFOs are detailed below.

7.2.2.3.1 Read0 External FIFO Interface

The Read0 External FIFO Interface supports both stream and register interface transactions. Status signals indicate the state of the FIFO and of the output mailbox register, while control signals allow the DS to write the FIFO and input mailbox or read the output mailbox. Table 17 provides signal descriptions, while Table 18 gives the Read0 External FIFO Interface function table.

| Signal Name | I/O | Description |
|---|---|---|
| R0DATA<31:0> | I/O | 32-bit data path for the Read0 Physical Channel. Used for both stream and register interface transactions. |
| /R0MAILFLAG | I | Read0 Mailbox Flag Input. Indicates when data is available in the Mailbox 2 register. |
| /R0AF | I | Read0 Almost Full Flag. Indicates that the Read0 Physical Channel FIFO has between 1 and 8 available blocks. |
| /R0FF | I | Read0 Full Flag. Indicates when the Read0 Physical Channel FIFO is completely full. |
| R0ENA | O | Read0 Enable. Indicates that the Read0 Physical Channel FIFO is enabled. |
| R0W/R | O | Read0 Write/Read signal. Indicates a Write or Read of the Read0 Physical Channel FIFO. |
| R0MAILENA | O | Read0 Mailbox Enable. Indicates that the Read0 Physical Channel Mailbox Registers are enabled. |

Table 17: DS Read0 Physical Channel Signal Descriptions

| R0W/R | R0Ena | R0MailEna | Function |
|---|---|---|---|
| H | H | L | FIFO Write |
| H | H | H | Mail1 Write |
| L | H | H | Mail2 Read |
| H | L | X | High Z |

Table 18: DS Read0 Physical Channel Function Table

7.2.2.3.2 Read1 External FIFO Interface

Unlike the Read0 interface, the Read1 External FIFO Interface only supports stream-processing transactions. As a result, the Read1 interface is much less complex. The table below provides the signal descriptions for the Read1 External FIFO Interface.

| Signal Name | I/O | Description |
|---|---|---|
| R1DATA<31:0> | O | 32-bit data path used for stream transactions. |
| R1ENA | O | Read1 Enable. Indicates that the Read1 Physical Channel FIFO is enabled. |
| /R1AF | I | Read1 Almost Full Flag. Indicates that the Read1 Physical Channel FIFO has between 1 and 8 available blocks. |
| /R1FF | I | Read1 Full Flag. Indicates that the Read1 Physical Channel FIFO is completely full. |

Table 19: DS Read1 Physical Channel Signal Descriptions

7.2.2.3.3 Write0 External FIFO Interface

The Write0 External FIFO Interface allows the DS to read the stream processing results from the RP out of the FIFO in preparation for writing the data to the appropriate destination in memory. The table below provides the signal descriptions for the Write0 External FIFO Interface.

| Signal Name | I/O | Description |
|---|---|---|
| W0DATA<63:0> | I | 64-bit data path used for stream transactions. |
| W0ENA | O | Write0 Enable. Indicates that the Write0 Physical Channel FIFO is enabled. |
| /W0AE | I | Write0 Almost Empty Flag. Indicates that the Write0 Physical Channel FIFO has between 1 and X blocks of valid data. |
| /W0EF | I | Write0 Empty Flag. Indicates that the Write0 Physical Channel FIFO is completely empty. |

Table 20: DS Write0 Physical Channel Signal Descriptions

7.2.3 Input Channels - Read0 and Read1

7.2.3.1 Data Path

The main input data path for the Data Shuffler allows any four bytes of two consecutive 64-bit words to be selected in any order.
This type of flexibility is needed in order to implement the patterns necessary to support stream-processing applications in the RP. Both physical channels contain identical data paths. This is necessary in order to implement two independent physical channels for the Data Shuffler. The data path itself contains four sets of 64-bit registers. Four sets of registers were used because the control logic requires a four-clock-cycle delay before it can compute the select bits for the four large multiplexers correctly. Conveniently, this also supports four-beat burst transactions into the Data Shuffler. Figure 16 below illustrates the main data path for the two input physical channels.

[Figure 16: Read0/Read1 Datapath. Four 64-bit register banks (Reg0 through Reg3) in series; Reg2 and Reg3 feed four 128-to-8 multiplexers whose outputs go to the external FIFO interface.]

As can be seen above, four large 128-to-8 multiplexers are used in order to perform the appropriate selections on the input data stream. Each of these multiplexers takes as input the sixteen bytes from the last two register banks, Reg2 and Reg3. Each multiplexer then has the ability to select any one of the sixteen bytes. Used together, they create a 4-byte (32-bit) word that is passed on to the external FIFO interface.

Each of these large 128-to-8 multiplexers is actually made up of eight smaller 16-to-1 multiplexers. Each of these 16-to-1 multiplexers selects one bit from the sixteen corresponding bits (depending on which bit lane it occupies). Since these eight 16-to-1 multiplexers all use the same select bits, together they select one of the 16 bytes held in the Reg2 and Reg3 64-bit register banks.
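The byte selection described above can be modelled behaviorally in a few lines. The following is only a sketch under stated assumptions, not the thesis implementation: the function names are hypothetical, byte 0 is assumed to be the least significant byte of each register bank, and select values 0-7 are assumed to address Reg2 while 8-15 address Reg3.

```python
def bytes_of(word64):
    """Split a 64-bit word into eight bytes (byte 0 = least significant; an assumption)."""
    return [(word64 >> (8 * i)) & 0xFF for i in range(8)]


def mux16_bit(bits16, select):
    """One 16-to-1 multiplexer: pick one bit out of sixteen candidates."""
    return bits16[select]


def mux128to8(reg2, reg3, select):
    """One 128-to-8 multiplexer built from eight 16-to-1 bit multiplexers.

    All eight bit multiplexers share the same 4-bit select value, so together
    they pass exactly one of the sixteen candidate bytes.
    """
    candidates = bytes_of(reg2) + bytes_of(reg3)  # assumed order: Reg2 first, then Reg3
    out = 0
    for lane in range(8):  # one 16-to-1 multiplexer per bit lane
        bits16 = [(b >> lane) & 1 for b in candidates]
        out |= mux16_bit(bits16, select) << lane
    return out


def select_word(reg2, reg3, selects):
    """Four 128-to-8 multiplexers assemble one 32-bit word (Mux0 = low byte; an assumption)."""
    word = 0
    for n, sel in enumerate(selects):
        word |= mux128to8(reg2, reg3, sel) << (8 * n)
    return word


if __name__ == "__main__":
    reg2 = 0x0706050403020100
    reg3 = 0x0F0E0D0C0B0A0908
    # Selecting every other byte of Reg2 resembles a "decimate bytes by 2" step.
    print(hex(select_word(reg2, reg3, [0, 2, 4, 6])))
```

With these assumptions, a pattern amounts to a sequence of four-entry select vectors applied to successive contents of Reg2 and Reg3.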
Figure 17 shows how the eight smaller 16-to-1 multiplexers are wired together to implement the larger 128-to-8 multiplexer, as well as the details of each 16-to-1 multiplexer at a bit level.

[Figure 17: Read0/Read1 Multiplexer Configuration and 16-to-1 Multiplexer. Eight 16-to-1 multiplexers (16MUXN_0 through 16MUXN_7) share the MuxN select<3..0> bits. Bit multiplexer n takes bits n, n+8, n+16, ..., n+56 of Reg2 and of Reg3 as inputs and drives bit n of the byte sent to the external FIFOs.]

7.2.3.2 Control Logic

The control logic for the Read datapaths can be divided into register and multiplexer control. Figure 18 below gives a simple block diagram that illustrates the different parts of the Read channel control logic.

[Figure 18: Read Control Logic. Inputs from the SAG/DS Interface, the PPC Bus Interface, and internal registers feed the Address Generator FSMs, RAM, Register Control Logic, and Assertion Control Logic; outputs go to the datapath registers and the datapath multiplexers.]

7.2.3.2.1 Register Control Logic

The Register Control Logic is responsible for maintaining the state of the four 64-bit register banks for the given physical channel. It generates the register enables for the four banks and maintains the corresponding valid bits. The Register Control Logic takes in three input signals: /RESET, ALL_CLEAR, and LATCH_DATA. When /RESET is asserted, all of the register banks are invalidated. The ALL_CLEAR signal comes from the multiplexer control logic and indicates whether valid data in Reg3 needs to be held for an extra cycle. If the data must be held, new data cannot be clocked into Reg3, but all other banks can still be clocked if their contents are invalid. In this way, bubbles in the datapath can be eliminated, making the pipeline more efficient. Finally, the LATCH_DATA signal indicates that there is new data to be latched. Table 21 below summarizes the Register Control Logic signals.
| Signal Name | I/O | Description |
|---|---|---|
| /RESET | I | Indicates that all data is to be invalidated |
| ALL_CLEAR | I | Indicates that data in Reg3 does not need to be held for an extra cycle |
| LATCH_DATA | I | Indicates that there is data for the corresponding channel |
| REGENA<3:0> | O | Register enables for the four 64-bit register banks |
| VALID<3:0> | O | Indicates which register banks contain valid data |

Table 21: Register Control Logic Signal Descriptions

7.2.3.2.2 Multiplexer Control Logic

The Multiplexer Control Logic lies at the heart of the Data Shuffler. Controlling how and when data is selected is what implements the different modes, patterns, and offsets used to manipulate data streams for the RP.

One way of designing this control logic would be to create an FSM that decodes all the different modes, patterns, and offsets and asserts the appropriate control signals. However, this type of FSM would become extremely large and complex very quickly. As more and more patterns are implemented, the FSM would require more and more states. As the number of state bits required to implement such a large FSM increased, the number of LEs needed to compute the next state and the output signals would also increase. The performance of this FSM would diminish as computations for the next state and output signals became more complex, since the combinational path between registers lengthens. Therefore, such a design would increase in complexity, perform poorly, and scale poorly.

One alternative would be to take advantage of the embedded RAM blocks in the FLEX10K device. As illustrated in Figure 18, the design consists of three parts: address generator FSMs, RAM, and some assertion logic. In this type of design, the RAM contains the values of the control signals for each of the different patterns. In addition, these values can be loaded to support different patterns depending on the application.
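A loadable pattern map of this kind can be sketched as follows. The sketch relies on the PATTERN<7:0> split given in Table 14 (bits 7-5 locate the pattern in the map, bits 4-0 give the pattern type) and on the 256-location RAM divided into eight 32-location blocks described in the next paragraph. The names `split_pattern`, `PatternMap`, `load_pattern`, and `lookup` are hypothetical, as is the idea of indexing a block with a simple step counter; the actual addresses come from the Address Generator FSMs.

```python
RAM_SIZE = 256   # locations in the control RAM (from the text)
SLOT_SIZE = 32   # locations per pattern block (256 / 8)


def split_pattern(pattern):
    """Split PATTERN<7:0> into (slot, kind): bits 7-5 and bits 4-0."""
    slot = (pattern >> 5) & 0x7
    kind = pattern & 0x1F
    return slot, kind


class PatternMap:
    """Hypothetical model of the RAM holding control values for up to eight patterns."""

    def __init__(self):
        self.ram = [0] * RAM_SIZE

    def load_pattern(self, slot, values):
        """Overwrite one 32-location block, as the microprocessor might via the register interface."""
        if not 0 <= slot < RAM_SIZE // SLOT_SIZE:
            raise ValueError("slot must be 0-7")
        if len(values) > SLOT_SIZE:
            raise ValueError("a pattern block holds at most 32 values")
        base = slot * SLOT_SIZE
        self.ram[base:base + len(values)] = values

    def lookup(self, pattern, step):
        """Read the control value for a given step within the pattern's block.

        'step' stands in for the address produced by an Address Generator FSM (an assumption).
        """
        slot, _kind = split_pattern(pattern)
        return self.ram[slot * SLOT_SIZE + (step % SLOT_SIZE)]


if __name__ == "__main__":
    pm = PatternMap()
    pm.load_pattern(5, [11, 22, 33])      # placeholder control values
    print(pm.lookup(0b10100011, 1))       # slot 5, kind 3, step 1
```

Because where a pattern is stored (the slot) is separate from how it is stepped through (the kind), reloading a block changes the behavior without touching the addressing logic.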
The top three bits of the PATTERN register determine where the pattern to be used is stored in RAM. Since the RAM consists of 256 locations, this segments the memory into eight 32-location blocks. The lower five bits determine how a particular block is to be accessed. In this way, flexibility is gained by decoupling how the RAM is accessed from where the values are actually stored. This gives the low-level software more freedom in determining which pattern values will be loaded and how.

In order to access the RAM correctly, Address Generator FSMs are required. These FSMs decode the mode, pattern, and offset information, monitor the state of the datapath registers, and generate the appropriate address and control signals. These Address Generator FSMs are less complex than the single large FSM described previously, making them easier to design and implement. In addition, some patterns that store different control signal values in RAM share the same RAM access pattern. This property allows one FSM to decode multiple patterns, making the design more scalable.

Finally, some assertion control logic is required. This logic selects between the outputs of the RAM depending on the mode of the Read physical channels. In addition, it controls when the enable signals are actually sent to the multiplexers, as determined by the Address Generator FSMs.

By partitioning the design into these three sub-blocks, the multiplexer control logic gains flexibility and scalability over a more traditional design approach. The table below gives a signal description for the Address Generator FSM.
| Signal Name | I/O | Signal Description |
|---|---|---|
| /RESET | I | Indicates that all control logic/FSMs are to be reset |
| MODE | I | Indicates the current mode: 1=64-bit mode, 0=32-bit mode |
| PATTERN<7:0> | I | Indicates the current pattern, 0-13 |
| OFFSET<2:0> | I | Indicates the current offset, 0-7 |
| VALID<3:0> | I | Indicates the validity of the data in each register bank |
| LATCH_DATA | I | Indicates that data will be latched by the Register Control Logic on the next clock cycle |
| DAV | O | Indicates to the FIFO Interface that there will be data available two clock cycles later |
| MUXENA<3:0> | O | Multiplexer enables for Mux3-0 |
| MUXSEL0<3:0> | O | Mux0 select bits |
| MUXSEL1<3:0> | O | Mux1 select bits |
| MUXSEL2<3:0> | O | Mux2 select bits |
| MUXSEL3<3:0> | O | Mux3 select bits |
| ALL_CLEAR | O | Indicates that data in Reg3 does not need to be held for an extra cycle |

Table 22: Multiplexer Control Logic Signal Descriptions

7.2.4 Output Channel - Write0

Since the Write0 Physical Channel does not have to support any type of data manipulation, it is much less complex than its Read channel counterparts. It supports four 64-bit register banks for burst transactions on the PPC bus. In essence, the Write0 Physical Channel has the same register banks and control logic as the Read channels. The table below summarizes the relevant signals for the Write0 channel.

| Signal Name | I/O | Description |
|---|---|---|
| W0DATA_IN<63:0> | I | 64-bit data bus from the Write0 FIFO |
| ALL_CLEAR | I | Indicates that the PPC Bus Interface has latched the data in the Reg3 register bank |
| DAV | I | Indicates that there is data in the Write0 FIFO to be read |
| W0DATA_OUT<63:0> | O | 64-bit data bus to the PPC Bus Interface |
| VALID<3:0> | O | Indicates the state of the register banks |

Table 23: Write0 Physical Channel Signal Descriptions

7.3 Implementation

7.3.1 Functionality and Performance

The first phase of implementation for the Data Shuffler has been completed.
The following functionality has been implemented and optimized to run at 66.66 MHz with the appropriate pin constraints:

1) Read0 and Read1 datapaths (register banks and multiplexers)
2) Read0 and Read1 register control logic
3) Read0 and Read1 multiplexer control logic (for pattern only)
4) DS/SAG Interface

In addition, simplified versions of the Read0 and Read1 FIFO interfaces and the Register Interface have been implemented. The sections that follow detail some of the issues encountered during implementation and optimization.

7.3.2 Implementation Details

7.3.2.1 DS/SAG Interface

To understand why the DS/SAG Interface is implemented the way it is, a brief summary of DS evolution is in order. During schematic entry, early estimates targeted the DS for a FLEX10K30 device. It was recognized at that time that since the 10K30 and 10K50 devices both had the same package and pin out, upgrading later to a larger device, if necessary, would not be difficult. Therefore, the schematics were created based on a 10K30 and the board went to layout. As DS implementation progressed, it was realized that due to the use of EABs and the system timing constraint, more LE and EAB resources were necessary. Therefore, during the assembly process a 10K50 was used to house the DS.

Upgrading to a 10K50 increased the LE (from 1728 to 2880), EAB (from 6 to 10), and user I/O (from 246 to 274) resources. However, the increase in user I/O could not be taken advantage of because the pin constraints were made during schematic entry based on a 10K30, not a 10K50. Since layout had already been completed by the time the decision to upgrade devices was made, the board was already routed and the extra user I/Os could not be used. Therefore, given the number of signals that must be routed in and out of the Data Shuffler, the FLEX10K30 device was extremely pin constrained.
The three 64-bit input, Read, and Write0 data busses, in addition to the corresponding control signals, accounted for a large part of the 246 user I/Os available on a 10K30 device. After accounting for all other signals as well (Register Interface, clock signals, global reset, etc.), only nine pins were available to implement the DS/SAG Interface. During schematic entry, this was deemed enough to implement the desired functionality. However, as both the DS and SAG designs matured, it became apparent that more signals were required. Because Register Interface and stream-processing transactions are not allowed to occur on the same clock cycle (register transactions take precedence), the Register Interface address bus can provide fourteen more signals between the DS and SAG. Signals REGADDR<1:0> are not used because they are shared with the hexadecimal displays on the Chidi board. The write cycle for these displays is extremely long. Using these signals would force DS/SAG transactions to wait until writes to the displays had completed ("normal" Register Interface transactions complete in just a few clock cycles), an unnecessary delay.

7.3.2.2 Read Address Generator FSMs

In order to achieve the necessary performance, the idea of using one large FSM for address generation had to be abandoned. Instead, the design must be partitioned into smaller pieces that implement the same functionality in order to meet the system timing requirement. Figure 19 below gives a block diagram of the design used for the address generation FSMs.

[Figure 19: Read Address Generator FSMs. Inputs from the Register Control Logic, the SAG/DS Interface registers, and the internal registers and Assertion Logic feed a series of small FSMs; outputs go to the Register Control Logic and Assertion Logic, and address values go to the RAM.]

As can be seen above, a series of small FSMs is used, each of which decodes a particular mode and pattern, making them mutually exclusive.
Although all the FSMs generate the same output signals, only one FSM drives these signals at any one time. There are two main advantages to such a design. First, it allows functionality to be added in a conceptually easy manner. Since the modes and patterns are mutually exclusive, adding new access patterns does not affect the functionality of those that have already been implemented. Second, achieving the desired performance is also easier. Since each FSM decodes only one mode and one pattern, the FSMs are compact, making them much easier to optimize. Both of these advantages speed up the design, implementation, and debugging processes for the Address Generator FSMs.

7.3.3 Optimization

7.3.3.1 16-to-1 Multiplexer

One of the time-constrained paths in the Data Shuffler design involved the 16-to-1 multiplexers found in the Read0 and Read1 physical channels. On its own, a 16-to-1 multiplexer can run at up to 81.96 MHz (12.2 ns clock period) and occupies only 10 LEs. In this case, the inputs, outputs, select bits, and the logic for the multiplexer all reside on the same row. Therefore, the row delays for LEs are small, allowing the combinational path through the multiplexer to be fast. When the LEs of a 16-to-1 multiplexer are not placed on the same row, the performance drops below the system-imposed timing constraint of 66.66 MHz (15.0 ns).

With this in mind, when the entire datapath is considered, the multiplexer can no longer be clocked at such a high rate. There are two principal constraints that come into play. First, as described in section 7.2.3.1, each register in the Reg2 and Reg3 blocks actually fans out to four different multiplexers, muxN_c, where 0<=N<=3, 0<=c<=7, and c is constant. This increased fan out contributes to a slower path through the multiplexers. More important, however, is the fact that this interdependency on input registers among multiplexers also complicates the layout process.
In order for all the components of a multiplexer to be placed on the same row, sixteen input registers, four output registers, sixteen select bits (four for each multiplexer), and forty logic elements (to implement the multiplexers) now must all be placed on the same row. Based on this interdependency, a total of 76 logic elements (out of a possible 288 per row) must be placed on the same row.

The second constraint makes the number of LEs that must be placed on the same row impossibly high. It arises because the select bits are also shared among multiplexers. However, unlike the input registers, which are shared between muxN_c multiplexers, the select bits are shared among muxC_n multiplexers, where 0<=C<=3, 0<=n<=7, and C is constant. This constraint requires that the eight multiplexers that form each resulting output byte from the Read0 or Read1 datapath all be placed in the same row. Together, these two orthogonal constraints require all 16-to-1 multiplexers, the select bits, and the input and output registers to be placed on the same row, a physical impossibility, in order to obtain performance at or above 66.66 MHz (15.0 ns). In order to meet this system-imposed timing constraint, the 16-to-1 multiplexers had to be pipelined, as seen in the figure below.

[Figure 20: Optimized 16-to-1 Multiplexer. Signals: sel<1:0>, sel<3:2>, data_in<7:0>, data_in<15:8>, data_out.]

By inserting pipeline registers so that the 16-to-1 multiplexer is divided into two stages of 4-to-1 multiplexers, the combinational path is reduced by a factor of two. As a result, the pipelined version of the 16-to-1 multiplexer can be clocked at 125 MHz (8.0 ns), the fastest frequency at which a -3 speed grade FLEX10K device can be clocked. In addition, the optimized version can be placed across different rows, which greatly eases placement constraints while allowing the 66.66 MHz timing constraint to be met when the entire datapath is placed and routed.
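The two-stage structure can be illustrated with a small cycle-level model. This is a behavioral sketch only: the class and method names are hypothetical, and it assumes that sel<1:0> drives four first-stage 4-to-1 multiplexers while a registered copy of sel<3:2> drives a single second-stage 4-to-1 multiplexer, which is one reading of the signal grouping in Figure 20.

```python
def mux16(data_in, select):
    """Reference model: a plain (unpipelined) 16-to-1 multiplexer."""
    return data_in[select & 0xF]


class PipelinedMux16:
    """Two stages of 4-to-1 multiplexers separated by pipeline registers (assumed structure)."""

    def __init__(self):
        self.stage1 = [0, 0, 0, 0]  # registered outputs of the four first-stage multiplexers
        self.sel_hi = 0             # sel<3:2>, delayed one cycle to stay aligned with the data

    def clock(self, data_in, select):
        """Advance one clock cycle.

        Returns the selection made from the inputs presented on the PREVIOUS
        cycle, modelling the extra cycle of latency added by the pipeline.
        """
        # Second stage: choose among the four registered first-stage results.
        out = self.stage1[self.sel_hi]
        # First stage: each 4-to-1 multiplexer picks within its own group of four inputs.
        lo = select & 0x3
        self.stage1 = [data_in[4 * group + lo] for group in range(4)]
        self.sel_hi = (select >> 2) & 0x3
        return out


if __name__ == "__main__":
    data = list(range(100, 116))
    mux = PipelinedMux16()
    mux.clock(data, 13)          # result for select=13 appears one cycle later
    print(mux.clock(data, 0))    # selection made with select=13
```

Each stage needs only a 4-to-1 selection between registers, which is why the combinational path is shorter than in the single-stage version.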
Of course, nothing is gained without a price. As with many digital design problems, optimizing the 16-to-1 multiplexer was a speed vs. area trade-off. The optimized multiplexer is implemented with 18 LEs (versus 10 LEs for the unoptimized version). In addition, the data incurs an additional cycle of latency. However, given that the 15.0 ns clock period is a hard constraint, these costs are necessary and therefore justified.

7.3.3.2 Register and Logic Replication

In order to meet the system timing constraint of 66.66 MHz, replication of both registers and logic was required in some cases. There are two main reasons for replication: routing delays and fan out considerations.

First, routing delays can make up the bulk of any combinational path, while setup, hold, and propagation delay times for LEs are small in comparison. As mentioned previously, same-row and same-column delays on a FLEX10K device are relatively small, while diagonal routing is expensive. However, as more and more chip resources are used, it becomes increasingly difficult, if not impossible, to route all the LEs associated with a particular function on the same row. One reason is that multiple modules often fan in on shared input registers. In such cases, it becomes necessary to replicate the input register so that different copies of the same signal may be placed on separate rows.

Second, fan out can also slow down time-constrained paths. In the FLEX 10K data sheet [10], worst-case timing values are given for routing between the different types of resources found on the device. However, only in a separate application note, "Understanding FLEX10K Timing" [21], are these numbers qualified with the statement that they are only valid for resources with a fan out of four loads. Therefore, for high fan out paths such as those found in the Register Control and Address Generation modules, logic and registers were replicated in order to decrease fan out and increase performance.
7.3.3.3 EAB Pipelining

In order to meet the system timing constraint, the RAM used to store the assertion values for the multiplexer and control logic has to be fully synchronous. This means that both the inputs and outputs of the RAM have to be registered. This is because both tEABAA (EAB Address Access Delay) and tEABRCCOM (EAB Asynchronous Read Cycle) are rated at 13.7 ns. Adding even a single tSAMEROW delay of 3.3 ns on either the inputs or the outputs gives 17.0 ns, which already violates the 15.0 ns timing constraint. Therefore, it is necessary to use the input and output registers located on the EAB unit to meet the system timing requirement. One result of this pipelining is that the results of a RAM read are not available until two clock cycles later.

7.3.4 Device Configuration

The Data Shuffler is configured serially using a configuration EPROM from Altera Corporation. Below is the circuit diagram showing how the EPROM is wired to the FLEX 10K device.

Figure 21: Configuration EPROM Scheme Circuit Diagram [6]

Upon power-up, the Data Shuffler senses the low-to-high transition on the nCONFIG signal, which initiates the configuration process. Then, the DS drives CONF_DONE low. Next, nSTATUS is released by the FLEX 10K device and is then pulled high to enable the configuration EPROM. The configuration EPROM then uses its internal oscillator to serially clock data into the DS. To summarize the configuration process, the timing diagram and a corresponding timing parameters table are given below.
Figure 22: Configuration EPROM Scheme Timing Waveform [6] (signals shown include nCONFIG, OE/nSTATUS, and CONF_DONE)

Symbol  Parameter                                      Min   Max   Units
tOEZX   OE high to DATA output enabled                       160   ns
tCH     DCLK high time                                 50    250   ns
tCL     DCLK low time                                  50    250   ns
tDSU    Data setup time before rising edge on DCLK     30          ns
tDH     Data hold time after rising edge on DCLK       0           ns
tCO     DCLK to DATA out                                     30    ns
tOEW    OE low pulse width to guarantee counter reset  100         ns
tCSH    nCS low hold time after DCLK rising edge       0           ns
fMAX    DCLK frequency                                 2     10    MHz

Table 24: Passive Serial Configuration Scheme Timing Parameters [6]

8 Future Work

8.1 DS Development

Although development for the DS is well underway and a solid foundation has been established, there are still many portions of the design that need to be implemented. These additions can be categorized into three main areas.

The first area of continued development is increasing the number of patterns the DS can support. Currently the RAM values for all thirteen patterns have been generated and can be referenced in Appendix 10.2. However, only the Address Generator FSM for Pattern1 has been implemented. Functionally, adding the remaining patterns is not an extremely difficult task. However, managing area and speed considerations will be somewhat challenging.

The second area of continued development is the implementation of the various interfaces. Most of the development completed so far has involved the Data Shuffler internals. The internals were designed and implemented first because they lie at the core of the DS. However, the interfaces are also important in that the DS needs to be able to interact correctly with the other modules on the Chidi board.

The final area of future work involves debugging the DS on the Chidi board. Debugging for the physical channel FIFOs has already commenced and should be followed by the inclusion of the Data Shuffler internals.
8.2 RP Development

Since RP Configuration has been completed, RP development can commence. This includes implementing the interfaces for the SAG, SRAM, and High-Speed I/O Port.

RP Configuration itself can be improved from a performance standpoint. Currently, configuration is done with a clock with a 120 ns period. Configuration should be tried with a clock period closer to the theoretical limit of 100 ns. In this way, the overhead for configuring the RP can be cut by almost 24 ms, or over 1.5 million system clock cycles. By reducing the configuration overhead, applications targeted for the RP will be more likely to achieve a substantial performance increase over general computing solutions.

8.3 Application Development

As the underlying interfaces for the RP are designed, implemented, and debugged, RP application development can commence. RP applications are the means to discover how feasible Reconfigurable Computing is as an alternative for general purpose computing systems.

9 Works Cited

[1] Acosta, Edward K., V. Michael Bove, Jr., John A. Watlington, and Ross A. Yu, "Reconfigurable Processor for a Data-Flow Video Processing System," Proc. SPIE FPGAs for Fast Board Development and Reconfigurable Computing, 2607, October 1995, pp. 83-91.
[2] Bove, V. Michael, Jr. and John A. Watlington, "Cheops: A Reconfigurable Data-Flow System for Video Processing," IEEE Transactions on Circuits and Systems for Video Technology, 5, April 1995, pp. 140-149.
[3] "Chidi: The Flexible Media Processor," MIT Media Lab, Information and Entertainment Group, http://chidi.www.media.mit.edu/projects/chidi/index.html, 1997.
[4] "Chidi 1394 Interface," Yuan-Min Liu, MIT Media Lab, Information and Entertainment Group, http://chidi.www.media.mit.edu/projects/chidi/ 394/1394.html, 1998.
[5] "CMOS SyncFIFO 64 X 36: IDT723611," Integrated Device Technology, Inc., 1997.
[6] "Configuring FLEX 10K Devices," Altera Corporation, Application Note 59, Ver. 1, December 1995.
[7] Dally, William J., "Virtual Channel Flow Control," IEEE Transactions on Parallel and Distributed Systems, 1992, pp. 194-205.
[8] "DS90C363/DS90CF364: +3.3V Programmable LVDS Transmitter/Receiver 18-Bit Flat Panel Display (FPD) Link," National Semiconductor, July 1997.
[9] "FLEX 10K Device Family," Altera Corporation, http://www.altera.com/html/products/fl0k.html, 1998.
[10] "FLEX 10K: Embedded Programmable Logic Family," Altera Data Book 1996, Altera Corporation, Version 2, June 1996.
[11] Hauser, John R. and John Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," Proc. Symposium on Field-Programmable Custom Computing Machines (FCCM), April 16-18, 1997, Napa Valley, CA.
[12] Huq, Syed B., "An Overview of LVDS Technology," National Semiconductor, Application Note 971, November 1994.
[13] Lewis, D., D. Galloway, M. van Ierssel, J. Rose, and P. Chow, "The Transmogrifier-2: A 1 Million Gate Rapid Prototyping System," in FPGA '97, ACM Symp. on FPGAs, Feb. 1997, pp. 53-61.
[14] "LVDS Owner's Manual and Design Guide," National Semiconductor, Spring 1997.
[15] "MPC106 PCI Bridge/Memory Controller Technical Summary," Motorola, Rev. 1, August 1996.
[16] "PowerPC Microprocessor Family: The Bus Interface for 32-bit Microprocessors," IBM and Motorola, Rev. 0, March 1997.
[17] "PowerPC 604e RISC Microprocessor Family: PID9q-604e Hardware Specifications," IBM Microelectronics and Motorola, August 1997.
[18] "PowerPC 604 RISC Microprocessor Technical Summary," IBM Microelectronics and Motorola, Rev. 1, May 1994.
[19] Singh, Satnam and Pierre Bellec, "Virtual Hardware for Graphics Applications Using FPGAs," The University of Glasgow.
[20] Trainor, D.W., J.P. Heron, and R.F. Woods, "Implementation of the 2D DCT Using a Xilinx XC6264 FPGA," The Queen's University of Belfast.
[21] "Understanding FLEX10K Timing," Altera Corporation, Application Note 91, Ver. 1, January 1998.
[22] Watlington, John A. and V. Michael Bove, Jr., "Stream-Based Computing and Future Television," Proc. 137th SMPTE Technical Conference, September 1995, pp. 69-79.
[23] Yu, Ross A., "A Field Programmable Gate Array Based Stream Processor for the Cheops Imaging System," Master's Thesis, Massachusetts Institute of Technology, 1996.

10 Appendix

10.1 Acronyms

ASIC - Application Specific Integrated Circuit
BGA - Ball Grid Array
CHRP - Common Hardware Reference Platform
CPLD - Complex Programmable Logic Devices
DRAM - Dynamic Random Access Memory
DS - Data Shuffler
DSP - Digital Signal Processing
EAB - Embedded Array Block
EDA - Engineering Design Automation
EDO - Extended Data Out
EMI - Electromagnetic Interference
EPROM - Electrically Programmable Read Only Memory
FPGA - Field Programmable Gate Array
GPP - General Purpose Processor
HDL - Hardware Description Language
LAB - Logic Array Block
LE - Logic Element
LUT - Look Up Table
LVDS - Low Voltage Differential Signaling
PAL - Programmable Array Logic
PGA - Pin Grid Array
PLD - Programmable Logic Devices
RAM - Random Access Memory
RC - Reconfigurable Computing
ROM - Read Only Memory
RP - Reconfigurable Processor
SAG - Stream Address Generator
SRAM - Static Random Access Memory
VHDL - VHSIC Hardware Description Language
VHSIC - Very High Speed Integrated Circuit

10.2 Data Shuffler Patterns

This section lists the values to be loaded into the pattern map for each type of data manipulation. Each pattern occupies 32 locations in the pattern map. Values for both the Read0 and Read1 Physical Channels are given.
In the table below, Muxena and Mux sel0 through Mux sel3 are 4-bit fields (3:0).

        64-bit mode              32-bit mode              Both modes (Mux sel)
Addr    Dav  Valid3only  Muxena  Dav  Valid3only  Muxena  sel0  sel1  sel2  sel3
00000   1    1           1111    1    1           1111    0000  0001  0010  0011
00001   1    1           1111    1    1           1111    0100  0101  0110  0111
00010   1    1           1111    1    0           1111    0001  0010  0011  0100
00011   1    0           1111    1    0           1111    0101  0110  0111  1000
00100   1    1           1111    1    0           1111    0010  0011  0100  0101
00101   1    0           1111    1    0           1111    0110  0111  1000  1001
00110   1    1           1111    1    0           1111    0011  0100  0101  0110
00111   1    0           1111    1    0           1111    0111  1000  1001  1010
01000   1    1           1111    1    0           1111    0100  0101  0110  0111
01001   1    0           1111    1    0           1111    1000  1001  1010  1011
01010   1    0           1111    1    0           1111    0101  0110  0111  1000
01011   1    0           1111    1    0           1111    1001  1010  1011  1100
01100   1    0           1111    1    0           1111    0110  0111  1000  1001
01101   1    0           1111    1    0           1111    1010  1011  1100  1101
01110   1    0           1111    1    0           1111    0111  1000  1001  1010
01111   1    0           1111    1    0           1111    1011  1100  1101  1110

Table 25: Read0 LUT for Pattern1 (Straight Through), offsets 0-7

The values for the Read1 LUT for Pattern1 are exactly the same as those for Read0. In addition, the Straight Through pattern requires only 16 locations to implement. The other 16 locations (10000-11111) contain all zeros. The Decimate Bytes by 2 pattern, given in the next two tables, also requires only 16 locations to implement. Similarly, the other 16 locations (10000-11111) contain all zeros.
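The Mux sel columns of Table 25 follow a regular progression, which the sketch below regenerates as a cross-check. It is an illustration only, under an assumed reading of the table: that each byte offset 0-7 occupies two consecutive pattern-map addresses, and that the second address of each pair selects the bytes four positions further on. The function name straight_through_selects is hypothetical and is not part of the design.

```python
def straight_through_selects(num_offsets=8, width=4):
    """Regenerate (addr, [sel0, sel1, sel2, sel3]) rows for Straight Through.

    Assumed layout (inferred from the table, not stated in the text):
    addr = 2 * offset + half, where half = 0 selects the first four bytes
    at that offset and half = 1 selects the next four.
    """
    rows = []
    for addr in range(2 * num_offsets):
        offset, half = divmod(addr, 2)
        base = offset + width * half
        # Output byte k takes the input byte k positions past the base.
        rows.append((addr, [base + k for k in range(width)]))
    return rows


for addr, sels in straight_through_selects():
    print(f"{addr:05b}  " + "  ".join(f"{s:04b}" for s in sels))
```

Generating the values this way makes it easier to spot transcription errors when the pattern map is loaded by hand.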
In the tables below, Muxena and Mux sel0 through Mux sel3 are 4-bit fields (3:0).

        64-bit mode              32-bit mode              Both modes (Mux sel)
Addr    Dav  Valid3only  Muxena  Dav  Valid3only  Muxena  sel0  sel1  sel2  sel3
00000   1    1           1111    1    1           1111    0000  0010  0100  0110
00001   0    1           0000    1    1           1111    0000  0010  0100  0110
00010   1    1           1111    1    1           1111    0001  0011  0101  0111
00011   0    1           0000    1    1           1111    0001  0011  0101  0111
00100   1    0           1111    1    0           1111    0010  0100  0110  1000
00101   0    0           0000    1    0           1111    0010  0100  0110  0001
00110   1    0           1111    1    0           1111    0011  0101  0111  1001
00111   0    0           0000    1    0           1111    0011  0101  0111  1001
01000   1    0           1111    1    0           1111    0100  0110  1000  1010
01001   0    0           0000    1    0           1111    0100  0110  1000  1010
01010   1    0           1111    1    0           1111    0101  0111  1001  1011
01011   0    0           0000    1    0           1111    0101  0111  1001  1011
01100   1    0           1111    1    0           1111    0110  1000  1010  1100
01101   0    0           0000    1    0           1111    0110  1000  1010  1100
01110   1    0           1111    1    0           1111    0111  1001  1011  1101
01111   0    0           0000    1    0           1111    0111  1001  1011  1101

Table 26: Read0 LUT for Pattern2 (Decimate bytes by 2), offsets 0-7

        64-bit mode              32-bit mode              Both modes (Mux sel)
Addr    Dav  Valid3only  Muxena  Dav  Valid3only  Muxena  sel0  sel1  sel2  sel3
00000   0    1           0000    1    1           1111    0000  0010  0100  0110
00001   1    1           1111    1    1           1111    0000  0010  0100  0110
00010   0    1           0000    1    1           1111    0001  0011  0101  0111
00011   1    1           1111    1    1           1111    0001  0011  0101  0111
00100   0    0           0000    1    0           1111    0010  0100  0110  1000
00101   1    0           1111    1    0           1111    0010  0100  0110  0001
00110   0    0           0000    1    0           1111    0011  0101  0111  1001
00111   1    0           1111    1    0           1111    0011  0101  0111  1001
01000   0    0           0000    1    0           1111    0100  0110  1000  1010
01001   1    0           1111    1    0           1111    0100  0110  1000  1010
01010   0    0           0000    1    0           1111    0101  0111  1001  1011
01011   1    0           1111    1    0           1111    0101  0111  1001  1011
01100   0    0           0000    1    0           1111    0110  1000  1010  1100
01101   1    0           1111    1    0           1111    0110  1000  1010  1100
01110   0    0           0000    1    0           1111    0111  1001  1011  1101
01111   1    0           1111    1    0           1111    0111  1001  1011  1101

Table 27: Read1 LUT for Pattern2 (Decimate bytes by 2), offsets 0-7

        64-bit mode              32-bit mode              Both modes (Mux sel)
Addr    Dav  Valid3only  Muxena  Dav  Valid3only  Muxena  sel0  sel1  sel2  sel3
00000   1    0           1111    1    0           1111    0000  0011  0110  1001
00001   0    0           0000    1    0           1111    0100  0111  1010  1101
00010   0    0           0000    0    0           0000    xxxx  xxxx  xxxx  xxxx
00011   1    0           1111    1    0           1111    0001  0100  0111  1010
00100   0    0           0000    1    0           1111    0101  1000  1011  1110
00101   0    0           0000    0    0           0000    xxxx  xxxx  xxxx  xxxx
00110   1    0           1111    1    0           1111    0010  0101  1000  1011
00111   0    0           0000    1    0           1111    0110  1001  1100  1111
01000   0    0           0000    0    0           0000    xxxx  xxxx  xxxx  xxxx
01001   1    0           1111    1    0           1111    0011  0110  1001  1100
01010   0    0           0000    0    0           1110    0111  1010  1101  xxxx
01011   0    0           0000    1    0           0001    xxxx  xxxx  xxxx  1000
01100   1    0           1111    1    0           1111    0100  0111  1010  1101
01101   0    0           0000    1    0           1111    0000  0010  0110  1001
01110   0    0           0000    0    0           0000    xxxx  xxxx  xxxx  xxxx
01111   1    0           1111    1    0           1111    0101  1000  1011  1110
10000   0    0           0000    1    0           1111    0001  0100  0111  1010
10001   0    0           0000    0    0           0000    xxxx  xxxx  xxxx  xxxx
10010   1    0           1111    1    0           1111    0110  1001  1100  1111
10011   0    0           0000    1    0           1111    0010  0101  1000  1011
10100   0    0           0000    0    0           0000    xxxx  xxxx  xxxx  xxxx
10101   0    0           1110    0    0           1110    0111  1010  1101  xxxx
10110   1    0           0001    1    0           0001    xxxx  xxxx  xxxx  1000
10111   0    0           0000    1    0           1111    0011  0110  1001  1100

Table 28: Read0 LUT for Pattern3 (Decimate bytes by 3/Extract one channel), offsets 0-7

Pattern3 requires only 24 locations. The other 8 locations (11000-11111) are never accessed. This applies to both the Read0 and Read1 LUTs for the Decimate bytes by 3/Extract one channel pattern.
64-bit mode 32-bit mode Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0000 1111 0000 0000 1111 0000 0000 1111 0000 0000 1110 0001 0000 1111 0000 0000 1111 0000 0000 1111 0000 0000 0000 1111 1 1 0 1 1 0 1 1 0 1 0 1 1 1 0 1 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1111 1111 0000 1111 1111 0000 1111 1111 0000 1111 1110 0001 1111 1111 0000 1111 1111 0000 1111 1111 0000 1110 0001 1111 Both modes Mux selO Mux sell Mux sel2 Mux sel3 (3:0) (3:0) (3:0) (3:0) 0000 0100 xxxx 0001 0101 xxxx 0010 0110 xxxx 0011 0111 xxxx 0100 0000 xxxx 0101 0001 xxxx 0110 0010 xxxx 0111 xxxx 0011 0011 0111 xxxx 0100 1000 xxxx 0101 1001 xxxx 0110 1010 xxxx 0111 0010 xxxx 1000 0100 xxxx 1001 0101 xxxx 1010 xxxx 0110 0110 1010 xxxx 0111 1011 xxxx 1000 1100 xxxx 1001 1101 xxxx 1010 0110 xxxx 1011 0111 xxxx 1100 1000 xxxx 1101 xxxx 1001 1001 1101 xxxx 1010 1110 xxxx 1011 1111 xxxx 1100 xxxx 1000 1101 1001 xxxx 1110 1010 xxxx 1111 1011 xxxx xxxx 1000 1100 Table 29: Readl LUT for Pattern3 (Decimate bytes by 3/Extract one channel), offsets 0-7 Addr 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111 Dav 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 Both modes 32-bit mode 64-bit mode Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 Mux selO Mux sell Mux sel2 Mux sel3 (3:0) (3:0) (3:0) (3:0) 0000 xxxx 0000 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 1000 xxxx 1000 xxxx 1001 xxxx 1001 xxxx 1010 xxxx 1010 xxxx 1011 xxxx 1011 xxxx xxxx 0000 xxxx 0000 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 1000 xxxx 1000 xxxx 1001 xxxx 1001 xxxx 1010 xxxx 1010 xxxx 1011 xxxx 1011 Table 30: ReadO LUT for Pattern4 (Decimate bytes by 4), offsets 0-7 64-bit mode 32-bit mode Addr Dav Valid 3 only Muxena (3:0) Dav Valid 3only Muxena (3:0) 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 Both modes Mux SelO Mux sell Mux sel2 (3:0) (3:0) (3:0) (3:0) 0000 xxxx 0000 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 
xxxx 0111 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 1000 xxxx 1000 xxxx 1001 xxxx 1001 xxxx 1010 xxxx 1010 xxxx 1011 xxxx 1011 xxxx xxxx 0000 xxxx 0000 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 1000 xxxx 1000 xxxx 1001 xxxx 1001 xxxx 1010 xxxx 1010 xxxx 1011 xxxx 1011 Table 31: Readl LUT for Pattern4 (Decimate bytes by 4), offsets 0-7 Mux sel3 Addr 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 Dav 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 Valid 3only 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Muxena (3:0) 1100 0010 0001 0000 0000 0000 1100 0010 0001 0000 0000 0000 1000 0110 0001 0000 0000 0000 1000 0110 0001 0000 0000 0000 Both modes 32-bit mode 64-bit mode Dav 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 Valid 3only Muxena (3:0) Mux selO Mux sell Mux sel2 Mux sel3 (3:0) (3:0) (3:0) (3:0) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1100 0010 0001 1100 0010 0001 1100 0010 0001 1100 0010 0001 1000 0110 0001 1000 0110 0001 1000 0110 0001 1000 0110 0001 0000 xxxx xxxx 0000 xxxx xxxx 0001 xxxx xxxx 0001 xxxx xxxx 0010 xxxx xxxx 0010 xxxx xxxx 0011 xxxx xxxx 0011 xxxx xxxx 0110 xxxx xxxx 0110 xxxx xxxx 0111 xxxx xxxx 0111 xxxx xxxx xxxx 0000 xxxx xxxx 0000 xxxx xxxx 0001 xxxx xxxx 0001 xxxx xxxx 0100 xxxx xxxx 0100 xxxx xxxx 0101 xxxx xxxx 0101 xxxx xxxx 0110 xxxx xxxx 0110 xxxx xxxx 0111 xxxx xxxx 0111 xxxx xxxx xxxx 0010 xxxx xxxx 0010 xxxx xxxx 0011 xxxx xxxx 0011 xxxx xxxx 0100 xxxx xxxx 0100 xxxx xxxx 0101 xxxx xxxx 0101 Table 32: ReadO LUT for Pattern5 (Decimate bytes by 6/channels by 2), offset 0-3 64-bit mode 32-bit mode Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) 00000 00001 00010 00011 00100 00101 00110 
00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1000 0100 0011 0000 0000 0000 1000 0100 0011 0000 0000 0000 1000 0100 0011 0000 0000 0000 1000 0100 0011 0000 0000 0000 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 0 1000 0100 0011 1000 0100 0011 1000 0100 0011 1000 0100 0011 1000 0100 0011 1000 0100 0011 1000 0100 0011 1000 0100 0011 Both modes Mux selO Mux sell Mux sel2 Mux sel3 (3:0) (3:0) (3:0) (3:0) 0100 xxxx xxxx 0100 xxxx xxxx 0101 xxxx xxxx 0101 xxxx xxxx 0110 xxxx xxxx 0110 xxxx xxxx 0111 xxxx xxxx 0111 xxxx xxxx xxxx 0010 xxxx xxxx 0010 xxxx xxxx 0011 xxxx xxxx 0011 xxxx xxxx 0100 xxxx xxxx 0100 xxxx xxxx 0101 xxxx xxxx 0101 xxxx xxxx xxxx 0000 xxxx xxxx 0000 xxxx xxxx 0001 xxxx xxxx 0001 xxxx xxxx 0010 xxxx xxxx 0010 xxxx xxxx 0011 xxxx xxxx 0011 xxxx xxxx 0110 xxxx xxxx 0110 xxxx xxxx 0111 xxxx xxxx 0111 xxxx xxxx 1000 xxxx xxxx 1000 xxxx xxxx 1001 xxxx xxxx 1001 Table 33: ReadO LUT for Pattern5 (Decimate bytes by 6/channels by 2), offsets 4-7 Both modes 32-bit mode 64-bit mode Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0000 0000 0000 1100 0010 0001 0000 0000 0000 1100 0010 0001 0000 0000 0000 1000 0110 0001 0000 0000 0000 1000 0110 0001 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1100 0010 0001 1100 0010 0001 1100 0010 0001 1100 0010 0001 1000 0110 0001 1000 0110 0001 1000 0110 0001 1000 0110 0001 Mux sel3 Mux selO Mux sell Mux sel2 (3:0) (3:0) (3:0) (3:0) 0000 xxxx xxxx 0000 xxxx xxxx 0001 xxxx xxxx 0001 xxxx xxxx 0010 xxxx xxxx 0010 xxxx xxxx 0011 
xxxx xxxx 0011 xxxx xxxx 0110 xxxx xxxx 0110 xxxx xxxx 0111 xxxx xxxx 0111 xxxx xxxx xxxx 0000 xxxx xxxx 0000 xxxx xxxx 0001 xxxx xxxx 0001 xxxx xxxx 0100 xxxx xxxx 0100 xxxx xxxx 0101 xxxx xxxx 0101 xxxx xxxx 0110 xxxx xxxx 0110 xxxx xxxx 0111 xxxx xxxx 0111 xxxx xxxx xxxx 0010 xxxx xxxx 0010 xxxx xxxx 0011 xxxx xxxx 0011 xxxx xxxx 0100 xxxx xxxx 0100 xxxx xxxx 0101 xxxx xxxx 0101 Table 34: Readl LUT for Pattern5 (Decimate bytes by 6/channels by 2), offsets 0-3 64-bit mode 32-bit mode Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 0111 10000 10001 10010 10011 10100 10101 10110 10111 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0000 0000 0000 1000 0100 0011 0000 0000 0000 1000 0100 0011 0000 0000 0000 1000 0100 0011 0000 0000 0000 1000 0100 0011 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 0 1000 0100 0011 1000 0100 0011 1000 0100 0011 1000 0100 0011 1000 0100 0011 1000 0100 0011 1000 0100 0011 1000 0100 0011 Both modes Mux selO Mux sell Mux sel2 Mux sel3 (3:0) (3:0) (3:0) (3:0) 0100 xxxx xxxx 0100 xxxx xxxx 0101 xxxx xxxx 0101 xxxx xxxx 0110 xxxx xxxx 0110 xxxx xxxx 0111 xxxx xxxx 0111 xxxx xxxx xxxx 0010 xxxx xxxx 0010 xxxx xxxx 0011 xxxx xxxx 0011 xxxx xxxx 0100 xxxx xxxx 0100 xxxx xxxx 0101 xxxx xxxx 0101 xxxx xxxx xxxx 0000 xxxx xxxx 0000 xxxx xxxx 0001 xxxx xxxx 0001 xxxx xxxx 0010 xxxx xxxx 0010 xxxx xxxx 0011 xxxx xxxx 0011 xxxx xxxx 0110 xxxx xxxx 0110 xxxx xxxx 0111 xxxx xxxx 0111 xxxx xxxx 1000 xxxx xxxx 1000 xxxx xxxx 1001 xxxx xxxx 1001 Table 35: Readl LUT for PatternS (Decimate bytes by 6/channels by 2), offsets 4-7 Both modes 32-bit mode 64-bit mode Mux sel2 Mux sel3 Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) Mux selO Mux sell (3:0) (3:0) (3:0) (3:0) 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 
01101 01110 01111 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 1100 1100 Table 36: ReadO LUT for Pattern6 (Decimate shorts by 2), offsets 0-7 Both modes 32-bit mode 64-bit mode Mux sel3 Addr Day Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) Mux selO Mux sell Mux sel2 (3:0) (3:0) (3:0) (3:0) 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 1100 1100 Table 37: Readl LUT for Pattern6 (Decimate shorts by 2), offsets 0-7 64-bit mode 32-bit mode Both modes Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) Muxs el0 Mux sell Mux sel2 (3:0) (3:0) (3:0) (3:0) 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 
0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 0000 xxxx 0000 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 1000 xxxx 1000 xxxx xxxx 0000 xxxx 0000 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 1000 xxxx 1000 Table 38: ReadO LUT for Pattern7 (Decimate shorts by 4), offsets 0-7 Mux sel3 Addr Dav 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 Valid 3only 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 Both modes 32-bit mode 64-bit mode Muxena (3:0) Dav Valid 3only Muxena (3:0) 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0000 0000 1100 0011 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1100 0011 1100 0011 1100 0011 
1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 1100 0011 Mux sel3 Mux selO Mux sell Mux sel2 (3:0) (3:0) (3:0) (3:0) 0000 xxxx 0000 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 1000 xxxx 1000 xxxx xxxx 0000 xxxx 0000 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 0001 xxxx 0001 xxxx 0010 xxxx 0010 xxxx 0011 xxxx 0011 xxxx 0100 xxxx 0100 xxxx 0101 xxxx 0101 xxxx 0110 xxxx 0110 xxxx 0111 xxxx 0111 xxxx 1000 xxxx 1000 Table 39: Readl LUT for Pattern7 (Decimate shorts by 4), offsets 0-7 64-bit mode 32-bit mode Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 1 1 0 0 1 1 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 Both modes Mux selO Mux sell (3:0) 0000 0110 0100 0010 0001 0111 0101 0011 0010 0000 0110 0100 0011 0001 0111 0101 0100 0010 0000 0110 0101 0011 0001 0111 0110 0100 0010 0000 0111 0101 0011 0001 Mux sel2 Mux sel3 (3:0) (3:0) (3:0) 0001 0111 0101 
0011 0010 1000 0110 0100 0011 0001 0111 0101 0100 0010 1000 0110 0101 0011 0001 0111 0110 0100 0010 1000 0111 0101 0011 0001 1000 0110 0100 0010 0011 1001 0111 0101 0100 1010 1000 0110 0101 0011 1001 0111 0110 0100 1010 1000 0111 0101 0011 1001 1000 0110 0100 1010 1001 0111 0101 0011 1010 1000 0110 0100 0100 1010 1000 0110 0101 1011 1001 0111 0110 0100 1010 1000 0111 0101 1011 1001 1000 0110 0100 1010 1001 0111 0101 1011 1010 1000 0110 0100 1011 1001 0111 0101 Table 40: ReadO LUT for Pattern8 (Extract two channels), offsets 0-7 Both modes 32-bit mode 64-bit mode Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 1 1 0 0 1 1 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 Mux selO (3:0) 0000 0110 0100 0010 0001 0111 0101 0011 0010 0000 0110 0100 0011 0001 0111 0101 0100 0010 0000 0110 0101 0011 0001 0111 0110 0100 0010 0000 0111 0101 0011 0001 Mux sell (3:0) 0001 0111 0101 0011 0010 1000 0110 0100 0011 0001 0111 0101 0100 0010 1000 0110 0101 0011 0001 0111 0110 0100 0010 1000 0111 0101 0011 0001 1000 0110 0100 0010 Mux sel2 (3:0) 0011 1001 0111 0101 0100 1010 1000 0110 0101 0011 1001 0111 0110 0100 1010 1000 0111 0101 0011 1001 1000 0110 0100 1010 1001 0111 0101 0011 1010 1000 0110 0100 Table 41: Readl LUT for PatternS (Extract two channels), offsets 0-7 Mux 
sel3 (3:0) 0100 1010 1000 0110 0101 1011 1001 0111 0110 0100 1010 1000 0111 0101 1011 1001 1000 0110 0100 1010 1001 0111 0101 1011 1010 1000 0110 0100 1011 1001 0111 0101 64-bit mode 32-bit mode Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 0 x 1 x 1 x 1 x 0 x 1 x 1 x 1 x 1 x 0 x 1 x 1 x 1 x 0 x 1 x 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 0 1 1 1 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 Both modes Mux selO Mux sell Mux sel2 Mux sel3 (3:0) (3:0) (3:0) (3:0) 0000 0011 0110 0001 0100 0111 0010 0101 0001 0100 0111 0010 0101 0000 0011 0110 0010 0101 0000 0011 0110 0001 0100 0111 0011 0110 0001 0100 0111 0010 0101 0000 0001 0100 0111 0010 0101 1000 0011 0110 0010 0101 1000 0011 0110 0001 0100 0111 0011 0110 0001 0100 0111 0010 0101 1000 0100 0111 0010 0101 1000 0011 0110 0001 0010 0101 1000 0011 0110 1001 0100 0111 0011 0110 1001 0100 0111 0010 0101 1000 0100 0111 0010 0101 1000 0011 0110 1001 0101 1000 0011 0110 1001 0100 0111 0010 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx Table 42: ReadO LUT for Pattern9 (Select Every Pixel), offsets 0-3 Addr 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 
11111
Dav 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x
Valid 3only 1 x 1 x 1 x 0 x 1 x 1 x 1 x 0 x 0 x 1 x 1 x 1 x 0 x 1 x 1 x 1 x
Muxena (3:0) 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx
Both modes 32-bit mode 64-bit mode
Dav 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Valid 3only 1 0 1 1 1 1 0 1 1 1 1 0 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1 1
Muxena (3:0) 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110
Mux sel2 Mux sel3 (3:0) (3:0) (3:0)
0101 1000 0011 0110 0001 0100 0111 0010 0110 0001 0100 0111 0010 0101 1000 0011 0111 0010 0101 1000 0011 0110 0001 0100 1000 0011 0110 0001 0100 0111 0010 0101
0110 1001 0100 0111 0010 0101 1000 0011 0111 0010 0101 1000 0011 0110 1001 0100 1000 0011 0110 1001 0100 0111 0010 0101 1001 0100 0111 0010 0101 1000 0011 0110
xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
Mux sel0 Mux sel1 (3:0)
0100 0111 0010 0101 0000 0011 0110 0001 0101 0000 0011 0110 0001 0100 0111 0010 0110 0001 0100 0111 0010 0101 0000 0011 0111 0010 0101 0000 0011 0110 0001 0100

Table 43: Read0 LUT for Pattern9 (Select every pixel), offsets 4-7

64-bit mode 32-bit mode Both modes Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) Mux sel0 Mux sel1 Mux sel2 (3:0) (3:0) (3:0) (3:0)
00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111
x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1
x 1 x 1 x 0 x 1 x 1 x 1 x 1 x 0 x 1 x 1 x 1 x 0 x 0 x 1 x 1 x 1
xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx
1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 0 1 1 1
1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110
0000 0011 0110 0001 0100 0111 0010 0101 0001 0100 0111 0010 0101 0000 0011 0110 0010 0101 0000 0011 0110 0001 0100 0111 0011 0110 0001 0100 0111 0010 0101 0000
0001 0100 0111 0010 0101 1000 0011 0110 0010 0101 1000 0011 0110 0001 0100 0111 0011 0110 0001 0100 0111 0010 0101 1000 0100 0111 0010 0101 1000 0011 0110 0001
0010 0101 1000 0011 0110 1001 0100 0111 0011 0110 1001 0100 0111 0010 0101 1000 0100 0111 0010 0101 1000 0011 0110 1001 0101 1000 0011 0110 1001 0100 0111 0010
xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

Table 44: Read1 LUT for Pattern9 (Select Every Pixel), offsets 0-3

Mux sel3
Addr 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111
Dav x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1
Valid 3only x 0 x 1 x 1 x 1 x 1 x 0 x 1 x 0 x 0 x 1 x 1 x 1 x 0 x 1 x 0 x 1
Muxena (3:0) xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110
Both modes 32-bit mode 64-bit mode
Dav 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Valid 3only 1 0 1 1 1 1 0 1 1 1 1 0 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1 1
Muxena (3:0) 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110
Mux sel1 Mux sel2 Mux sel3 (3:0) (3:0) (3:0) (3:0)
0100 0111 0010 0101 0000 0011 0110 0001 0101 0000 0011 0110 0001 0100 0111 0010 0110 0001 0100 0111 0010 0101 0000 0011 0111 0010 0101 0000 0011 0110 0001 0100
0101 1000 0011 0110 0001 0100 0111 0010 0110 0001 0100 0111 0010 0101 1000 0011 0111 0010 0101 1000 0011 0110 0001 0100 1000 0011 0110 0001 0100 0111 0010 0101
0110 1001 0100 0111 0010 0101 1000 0011 0111 0010 0101 1000 0011 0110 1001 0100 1000 0011 0110 1001 0100 0111 0010 0101 1001 0100 0111 0010 0101 1000 0011 0110
xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
Mux sel0

Table 45: Read1 LUT for Pattern9 (Select every pixel), offsets 4-7

64-bit mode 32-bit mode Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0)
00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111
1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x
1 x 1 x 1 x 1 x 1 x 0 x 1 x 0 x 1 x 1 x 1 x 0 x 1 x 0 x 1 x 1 x
1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1
1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110
Both modes Mux sel0 Mux sel1 Mux sel2 (3:0) (3:0) (3:0) (3:0)
0000 0110 0100 0010 0001 0111 0101 0011 0010 0000 0110 0100 0011 0001 0111 0101 0100 0010 0000 0110 0101 0011 0001 0111 0110 0100 0010 0000 0111 0101 0011 0001
0001 0111 0101 0011 0010 1000 0110 0100 0011 0001 0111 0101 0100 0010 1000 0110 0101 0011 0001 0111 0110 0100 0010 1000 0111 0101 0011 0001 1000 0110 0100 0010
0010 1001 0110 0100 0011 1001 0111 0101 0100
0010 1000 0110 0101 0011 1001 0111 0110 0100 0010 1000 0111 0101 0011 1001 1000 0110 0100 0010 1001 0111 0101 0011
xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

Table 46: Read0 LUT for Pattern10 (Decimate pixels by 2), offsets 0-7

Mux sel3
Addr 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111
Dav x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1
Both modes 32-bit mode 64-bit mode Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0)
x 0 x 1 x 0 x 1 x 1 x 1 x 1 x 1 x 1 x 0 x 1 x 1 x 1 x 1 x 1 x 1
xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110 xxxx 1110
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1
1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110
Mux sel0 Mux sel1 Mux sel2 (3:0) (3:0) (3:0) (3:0)
0000 0110 0100 0010 0001 0111 0101 0011 0010 0000 0110 0100 0011 0001 0111 0101 0100 0010 0000 0110 0101 0011 0001 0111 0110 0100 0010 0000 0111 0101 0011 0001
0001 0111 0101 0011 0010 1000 0110 0100 0011 0001 0111 0101 0100 0010 1000 0110 0101 0011 0001 0111 0110 0100 0010 1000 0111 0101 0011 0001 1000 0110 0100 0010
0010 1001 0110 0100 0011 1001 0111 0101 0100 0010 1000 0110 0101 0011 1001 0111 0110 0100 0010 1000 0111 0101 0011 1001 1000 0110 0100 0010 1001 0111 0101 0011
xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

Table 47: Read1 LUT for Pattern10 (Decimate pixels by 2), offsets 0-7

Mux sel3
64-bit mode 32-bit mode Both modes Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) Mux sel0 (3:0) (3:0) (3:0) (3:0)
00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111
1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1
1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000
1 1 0 x x x 1 1 0 x x x 1 1 0 x x x 1 1 0 x x x
1 1 1 x x x 1 1 1 x x x 1 0 1 x x x 1 0 1 x x x
1110 1110 0000 xxxx xxxx xxxx 1110 1110 0000 xxxx xxxx xxxx 1110 1110 0000 xxxx xxxx xxxx 1110 1110 0000 xxxx xxxx xxxx
0000 0100 xxxx 0000 0100 xxxx 0001 0101 xxxx 0001 0101 xxxx 0010 0110 xxxx 0010 0110 xxxx 0011 0111 xxxx 0011 0111 xxxx
0001 0101 xxxx 0001 0101 xxxx 0010 0110 xxxx 0010 0110 xxxx 0011 0111 xxxx 0011 0111 xxxx 0100 1000 xxxx 0100 1000 xxxx
0010 0110 xxxx 0010 0110 xxxx 0011 0111 xxxx 0011 0111 xxxx 0100 1000 xxxx 0100 1000 xxxx 0101 1001 xxxx 0101 1001 xxxx
xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
Mux sel1 Mux sel2

Table 48: Read0 LUT for Pattern11 (Decimate Pixels by 4), offsets 0-3

Mux sel3
Addr 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111
Both modes 32-bit mode 64-bit mode Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0)
1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0
1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1
1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000
1 1 0 x x x 1 1 0 x x x 1 1 0 x x x 1 1 0 x x x
1 0 1 x x x 1 0 1 x x x 0 0 1 x x x 0 0 1 x x x
1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000
Mux sel2 Mux sel3 (3:0) (3:0) (3:0)
0101 1001 xxxx 0101 1001 xxxx 0110 1010 xxxx 0110 1010 xxxx 0111 1011 xxxx 0111 1011 xxxx 1000 1100 xxxx 1000 1100 xxxx
0110 1010 xxxx 0110 1010 xxxx 0111 1011 xxxx 0111 1011 xxxx 1000 1100 xxxx 1000 1100 xxxx 1001 1101 xxxx 1001 1101 xxxx
xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
Mux sel0 Mux sel1 (3:0)
0100 1000 xxxx 0100 1000 xxxx 0101 1001 xxxx 0101 1001 xxxx 0110 1010 xxxx 0110 1010 xxxx 0111 1011 xxxx 0111 1011 xxxx

Table 49: Read0 LUT for Pattern11 (Decimate pixels by 4), offsets 4-7

Addr 64-bit mode Dav Valid 3only Muxena (3:0) 32-bit mode Dav Valid 3only Muxena (3:0)
00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111
0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0
1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 0 1
1 1 0 x x x 1 1 0 x x x 1 1 0 x x x 1 1 0 x x x
1 1 1 x x x 1 1 1 x x x 1 0 1 x x x 1 0 1 x x x
0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000
1110 1110 0000 xxxx xxxx xxxx 1110 1110 0000 xxxx xxxx xxxx 1110 1110 0000 xxxx xxxx xxxx 1110 1110 0000 xxxx xxxx xxxx
Both modes Mux sel0 Mux sel1 Mux sel2 (3:0) (3:0) (3:0) (3:0)
0000 0100 xxxx 0000 0100 xxxx 0001 0101 xxxx 0001 0101 xxxx 0010 0110 xxxx 0010 0110 xxxx 0011 0111 xxxx 0011 0111 xxxx
0001 0101 xxxx 0001 0101 xxxx 0010 0110 xxxx 0010 0110 xxxx 0011 0111 xxxx 0011 0111 xxxx 0100 1000 xxxx 0100 1000 xxxx
0010 0110 xxxx 0010 0110 xxxx 0011 0111 xxxx 0011 0111 xxxx 0100 1000 xxxx 0100 1000 xxxx 0101 1001 xxxx 0101 1001 xxxx
xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

Table 50: Read1 LUT for Pattern11 (Decimate pixels by 4), offsets 0-3

Mux sel3
Addr 00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100
10101 10110 10111
Dav 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0
64-bit mode Valid 3only Muxena (3:0)
1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1
0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000
32-bit mode Dav Valid 3only Muxena (3:0)
1 1 0 x x x 1 1 0 x x x 1 1 0 x x x 1 1 0 x x x
1 0 1 x x x 1 0 1 x x x 0 0 1 x x x 0 0 1 x x x
1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000 1110 1110 0000 0000 0000 0000
Both modes Mux sel0 Mux sel1 Mux sel2 (3:0) (3:0) (3:0) (3:0)
0100 1000 xxxx 0100 1000 xxxx 0101 1001 xxxx 0101 1001 xxxx 0110 1010 xxxx 0110 1010 xxxx 0111 1011 xxxx 0111 1011 xxxx
0101 1001 xxxx 0101 1001 xxxx 0110 1010 xxxx 0110 1010 xxxx 0111 1011 xxxx 0111 1011 xxxx 1000 1100 xxxx 1000 1100 xxxx
0110 1010 xxxx 0110 1010 xxxx 0111 1011 xxxx 0111 1011 xxxx 1000 1100 xxxx 1000 1100 xxxx 1001 1101 xxxx 1001 1101 xxxx
xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

Table 51: Read1 LUT for Pattern11 (Decimate pixels by 4), offsets 4-7

Mux sel3
64-bit mode 32-bit mode Both modes Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0) Mux sel0 Mux sel1 (3:0) (3:0) (3:0) (3:0)
00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000
1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x
1 x 1 x 1 x 1 x 1 x 0 x 0 x 0 x
1111 xxxx 1111 xxxx 1111 xxxx 1111 xxxx 1111 xxxx 1111 xxxx 1111 xxxx 1111 xxxx
0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111
0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000
0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001
0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010
1010
Mux sel2 Mux sel3

Table 52: Read0 LUT for Pattern12 (Decimate words by 2), offsets 0-7

64-bit mode 32-bit mode Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0)
00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111
1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x
1 x 1 x 1 x 1 x 1 x 0 x 0 x 0 x
1111 xxxx 1111 xxxx 1111 xxxx 1111 xxxx 1111 xxxx 1111 xxxx 1111 xxxx 1111 xxxx
Both modes Mux sel0 Mux sel1 Mux sel2 (3:0) (3:0) (3:0) (3:0)
0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111
0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000
0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001
0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010

Table 53: Read1 LUT for Pattern12 (Decimate words by 2), offsets 0-7

Mux sel3
Both modes 32-bit mode 64-bit mode Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0)
00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111
1 x 0 1 x 0 1 x 0 1 x 0 1 x 0 1 x 0 1 x 0 1 x 0
1 x 1 1 x 1 1 x 1 1 x 1 1 x 1 0 x 1 0 x 1 0 x 1
1111 xxxx 0000 1111 xxxx 0000 1111 xxxx 0000 1111 xxxx 0000 1111 xxxx 0000 1111 xxxx 0000 1111 xxxx 0000 1111 xxxx 0000
1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1
1111 1111 0000 1111 1111 0000 1111 1111 0000 1111 1111 0000 1111 1111 0000 1111 1111 0000 1111 1111 0000 1111 1111 0000
Mux sel2 Mux sel1 (3:0) (3:0) (3:0) (3:0)
0000 0100 xxxx 0001 0101 xxxx 0010 0110 xxxx 0011 0111 xxxx 0100 1000 xxxx 0101 1001 xxxx 0110 1010 xxxx 0111 1011 xxxx
0001 0101 xxxx 0010 0110 xxxx 0011 0111 xxxx 0100 1000 xxxx 0101 1001 xxxx 0110 1010 xxxx 0111 1011 xxxx 1000 1100 xxxx
0010 0110 xxxx
0011 0111 xxxx 0100 1000 xxxx 0101 1001 xxxx 0110 1010 xxxx 0111 1011 xxxx 1000 1100 xxxx 1001 1101 xxxx
0011 0111 xxxx 0100 1000 xxxx 0101 1001 xxxx 0110 1010 xxxx 0111 1011 xxxx 1000 1100 xxxx 1001 1101 xxxx 1010 1110 xxxx

Table 54: Read0 LUT for Pattern13 (Decimate double words by 2), offsets 0-7

Mux sel3 Mux sel0
64-bit mode 32-bit mode Addr Dav Valid 3only Muxena (3:0) Dav Valid 3only Muxena (3:0)
00000 00001 00010 00011 00100 00101
x 1 0 x 1 0 x 1 1 x 1 1
xxxx 1111 0000 xxxx 1111 0000
1 1 0 1 1 0 1 1 1 1 1 1
1111 1111 0000 1111 1111 0000
00110 x x xxxx 1 1
00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111
1 0 x 1 0 x 1 0 x 1 0 x 1 0 x 1 0
1 1 x 1 1 x 1 1 x 1 1 x 1 1 x 1 1
1111 0000 xxxx 1111 0000 xxxx 1111 0000 xxxx 1111 0000 xxxx 1111 0000 xxxx 1111 0000
1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0
1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1
Both modes Mux sel0 Mux sel1 Mux sel2 Mux sel3 (3:0) (3:0) (3:0) (3:0)
0000 0100 xxxx 0001 0101 xxxx
0001 0101 xxxx 0010 0110 xxxx
0010 0110 xxxx 0011 0111 xxxx
0011 0111 xxxx 0100 1000 xxxx
1111 0010 0011 0100 0101
1111 0000 1111 1111 0000 1111 1111 0000 1111 1111 0000 1111 1111 0000 1111 1111 0000
0110 xxxx 0011 0111 xxxx 0100 1000 xxxx 0101 1001 xxxx 0110 1010 xxxx 0111 1011 xxxx
0111 xxxx 0100 1000 xxxx 0101 1001 xxxx 0110 1010 xxxx 0111 1011 xxxx 1000 1100 xxxx
1000 xxxx 0101 1001 xxxx 0110 1010 xxxx 0111 1011 xxxx 1000 1100 xxxx 1001 1101 xxxx
1001 xxxx 0110 1010 xxxx 0111 1011 xxxx 1000 1100 xxxx 1001 1101 xxxx 1010 1110 xxxx

Table 55: Read1 LUT for Pattern13 (Decimate double words by 2), offsets 0-7

10.3 RP Subsystem Parts List

Name                Manufacturer  Part Number        Quantity  Package Type
RP                  Altera        EPF10K100GC503-3   1         PGA
SAG                 Altera        EPF10K50BC356-3    1         BGA
DS                  Altera        EPF10K50BC356-3    1         BGA
SRAM                Micron        MT58LC256K16/18B3  4         TQFP
FIFOs               IDT           IDT723611          4         PQFP
LVDS Transmitter    National      DS90C363           2         TSSOP
LVDS Receiver       National      DS90CF364          2         TSSOP
LVDS Sync/Receiver  National      DS90C402           1         SOIC

Table 56: RP Subsystem Parts List