A Data Servicing Subsystem for the Chidi

A Data Servicing Subsystem
for the Chidi
Reconfigurable Processor
By
Mark Lee
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
August 6, 1998
© Copyright 1998 Massachusetts Institute of Technology. All rights reserved.
Author
Department of Electrical Engineering and Computer Science
August 6, 1998
Certified by
V. Michael Bove, Jr.
Thesis Supervisor
Accepted by
Arthur C. Smith
Chairman, Department Committee on Graduate Theses
A Data Servicing Subsystem
for the Chidi
Reconfigurable Processor
by
Mark C. Lee
Submitted to the
Department of Electrical Engineering and Computer Science
August 6, 1998
In Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
Application Specific Integrated Circuits (ASICs) are often used to enhance system performance,
especially when a General Purpose Processor (GPP) is too inefficient or ill suited to perform a
specialized task. However, the time and hardware costs inherent in the development and
implementation of such a solution can be quite high. The use of Field Programmable Gate
Arrays (FPGAs) to implement a Reconfigurable Processor (RP) can help alleviate some of the
overhead encountered with ASIC development. The RP is a dynamic processing node that can be
configured in-circuit to compute any realizable function at run-time. After the function has
finished execution, the RP can be reconfigured to compute a different function. This concept is
illustrated with the reconfigurable, multimedia Chidi Processing System. A network of Chidi
boards, each with a closely coupled GPP and RP, is used to execute a sequence of
multimedia-related functions. One of the main issues in utilizing an RP efficiently is the ability to provide it
with data effectively. The design and implementation of a data servicing subsystem for the Chidi
Reconfigurable Processor, in an effort to increase system performance, is the main focus of this study.
This research is supported by the Digital Life Consortium at the MIT Media Laboratory.
Thesis Supervisor: V. Michael Bove, Jr.
Title: Principal Research Scientist, MIT Media Laboratory
Acknowledgements
The work documented in this thesis could not have been done without the guidance, assistance,
and support of many people.
I would like to thank Dr. V. M. Bove for giving me the opportunity to work on the Chidi project.
His faith, supervision, and understanding have made this an extremely rewarding experience.
John Watlington's patience in answering all of my questions, in addition to his teaching me the
ropes of hardware design, implementation, and debugging, has been invaluable. Much of what I
have learned about hardware can be directly attributed to Wad's expert tutelage. Wad's influence
can be seen in all aspects of the Chidi project, and the work documented in this thesis is no
exception.
A hearty thank you goes out to Yuan-Min Liu for being a fellow Chidi hardware engineer.
Always available to discuss any hardware, software, or basketball-related issue, Min helped keep
morale high and development moving even in the darkest moments and contributed greatly to my sanity
and progress.
Chris McEniry was directly involved in much of the design for FPGA interaction in the RP
subsystem. As chief engineer for the SAG, Chris deserves credit for a lot of what is documented
here.
Dr. Thomas Nwodoh was chief board engineer for the Chidi project and deserves credit for much
of the system-level work, in addition to the RP clock implementation documented in section
6.2.3.
Ken Kung assisted in all areas, from FPGA device details to software tools. His background and
experience always provided a unique perspective to the problem at hand. His patience in
answering my questions is greatly appreciated.
Thanks are due to Josh Stults, Peter Yang, Chris Yang, and Peggy Chen for their
continuing development of Chidi hardware. Specifically, Josh and Peter deserve credit for their
extensive contributions in the areas of RP Configuration and RP subsystem FIFO implementation
and debugging.
My parents, Paul and Lily, deserve more gratitude, appreciation, and love than I could ever give
for their encouragement, understanding, and unconditional support. None of my accomplishments
could have been possible without them and I can only hope that my completion of this thesis and
subsequent graduation gives them some of the happiness that they deserve.
Finally, I would like to thank Ms. Jenny Huang for giving me a reason to work as hard as I
possibly could these last five years.
I could not have completed the work documented here without these people, and countless others
that I may have forgotten to mention here. For this, I owe you a debt of gratitude. Thank you and
good night.
Table of Contents
1 PURPOSE AND SCOPE
2 RECONFIGURABLE COMPUTING OVERVIEW
2.1 BACKGROUND
2.2 APPLICATIONS AND RESEARCH AREAS
2.3 CHEOPS OVERVIEW
3 THE CHIDI MULTIMEDIA PROCESSING SYSTEM
3.1 OVERVIEW
3.2 ARCHITECTURE AND BUS SPECIFICATION
3.3 FUNCTIONAL BLOCKS
3.3.1 PowerPC604e Microprocessor
3.3.2 MPC106 PCI Bridge/Memory Controller
3.3.3 Reconfigurable Processor (RP)
3.3.4 Stream Address Generator (SAG)
3.3.5 Data Shuffler (DS)
3.3.6 External Interfaces
3.3.7 FIFOs
4 FPGA DEVICE DESCRIPTION AND DESIGN PROCESS
4.1 GENERIC FPGA OVERVIEW
4.2 ALTERA FLEX 10K DEVICE FAMILY
4.2.1 Overview
4.2.2 Logic Element (LE)
4.2.3 Logic Array Block (LAB)
4.2.4 Embedded Array Block
4.2.5 FastTrack Interconnect
4.3 DESIGN PROCESS
4.3.1 Functional Design/Design Entry
4.3.2 Compilation and Functional Simulation
4.3.3 Logic Synthesis
4.3.4 Place and Route
4.3.5 Timing Analysis
4.3.6 Post-Synthesis Simulation
4.3.7 Device Configuration
5 RP SUBSYSTEM
5.1 OVERVIEW
5.2 PHYSICAL AND VIRTUAL CHANNELS
5.2.1 Physical Channels
5.2.2 Virtual Channels
5.3 DATA PROCESSING
5.3.1 Data Request Mechanism
5.3.2 Interrupt Mechanism
5.4 REGISTER INTERFACE
6 RECONFIGURABLE PROCESSOR
6.1 OVERVIEW
6.2 FUNCTIONAL BEHAVIOR
6.2.1 Functional Blocks
6.2.1.1 RP
6.2.1.2 SRAM
6.2.1.3 High-Speed I/O Port
6.2.2 Interfaces
6.2.2.1 RP/SAG Interface
6.2.2.2 RP/External FIFO Interface
6.2.2.2.1 Read0 Physical Channel Interface
6.2.2.2.2 Read1 Physical Channel Interface
6.2.2.2.3 Write0 Physical Channel Interface
6.2.3 RP Clock Circuitry
6.2.4 Device Configuration
6.3 IMPLEMENTATION
6.3.1 High Speed I/O Port
6.3.2 RP
6.3.3 RP Configuration
6.3.3.1 Design Details
6.3.3.2 Configuration Time
6.3.3.3 Performance
7 DATA SHUFFLER
7.1 OVERVIEW
7.2 FUNCTIONAL DESCRIPTION
7.2.1 Operations Supported
7.2.2 Interfaces
7.2.2.1 DS/SAG Interface
7.2.2.1.1 Control Registers
7.2.2.1.2 Status Registers
7.2.2.2 DS/PPC Bus Interface
7.2.2.3 DS/External FIFO Interface
7.2.2.3.1 Read0 External FIFO Interface
7.2.2.3.2 Read1 External FIFO Interface
7.2.2.3.3 Write0 External FIFO Interface
7.2.3 Input Channels - Read0 and Read1
7.2.3.1 Data Path
7.2.3.2 Control Logic
7.2.3.2.1 Register Control Logic
7.2.3.2.2 Multiplexer Control Logic
7.2.4 Output Channel - Write0
7.3 IMPLEMENTATION
7.3.1 Functionality and Performance
7.3.2 Implementation Details
7.3.2.1 DS/SAG Interface
7.3.2.2 Read Address Generator FSMs
7.3.3 Optimization
7.3.3.1 16-to-1 Multiplexer
7.3.3.2 Register and Logic Replication
7.3.3.3 EAB Pipelining
7.3.4 Device Configuration
8 FUTURE WORK
8.1 DS DEVELOPMENT
8.2 RP DEVELOPMENT
8.3 APPLICATION DEVELOPMENT
9 WORKS CITED
10 APPENDIX
10.1 ACRONYMS
10.2 DATA SHUFFLER PATTERNS
10.3 RP SUBSYSTEM PARTS LIST
Tables and Figures
TABLE 1: FLEX 10K50/10K100 DEVICE FEATURES [10]
TABLE 2: EXTERNAL REGISTER INTERFACE CONTROL SIGNAL TRUTH TABLE
TABLE 3: SRAM CONTROL SIGNAL DESCRIPTIONS
TABLE 4: SRAM CONTROL SIGNAL TRUTH TABLE
TABLE 5: HIGH-SPEED I/O PORT SIGNAL DESCRIPTIONS
TABLE 6: SAG/RP INTERFACE - SIGNAL DESCRIPTIONS
TABLE 7: RP READ0 PHYSICAL CHANNEL FIFO SIGNAL DESCRIPTIONS
TABLE 8: RP READ0 PHYSICAL CHANNEL FIFO FUNCTION TABLE
TABLE 9: RP READ1 PHYSICAL CHANNEL FIFO SIGNAL DESCRIPTIONS
TABLE 10: RP WRITE0 PHYSICAL CHANNEL FIFO SIGNAL DESCRIPTIONS
TABLE 11: RP CLK FREQUENCIES
TABLE 12: CONFIGURATION EPROM SCHEME TIMING PARAMETERS [6]
TABLE 13: OPERATIONS SUPPORTED
TABLE 14: SAG/DS INTERFACE - CONTROL REGISTERS
TABLE 15: SAG/DS INTERFACE - REQUEST MECHANISM
TABLE 16: DS/PPC BUS INTERFACE SIGNAL DESCRIPTIONS
TABLE 17: DS READ0 PHYSICAL CHANNEL SIGNAL DESCRIPTIONS
TABLE 18: DS READ0 PHYSICAL CHANNEL FUNCTION TABLE
TABLE 19: DS READ1 PHYSICAL CHANNEL SIGNAL DESCRIPTIONS
TABLE 20: DS WRITE0 PHYSICAL CHANNEL SIGNAL DESCRIPTIONS
TABLE 21: REGISTER CONTROL LOGIC SIGNAL DESCRIPTIONS
TABLE 22: MULTIPLEXER CONTROL LOGIC SIGNAL DESCRIPTIONS
TABLE 23: WRITE0 PHYSICAL CHANNEL SIGNAL DESCRIPTIONS
TABLE 24: PASSIVE SERIAL CONFIGURATION SCHEME TIMING PARAMETERS [6]
TABLE 25: READ0 LUT FOR PATTERN1 (STRAIGHT THROUGH), OFFSETS 0-7
TABLE 26: READ0 LUT FOR PATTERN2 (DECIMATE BYTES BY 2), OFFSETS 0-7
TABLE 27: READ1 LUT FOR PATTERN2 (DECIMATE BYTES BY 2), OFFSETS 0-7
TABLE 28: READ0 LUT FOR PATTERN3 (DECIMATE BYTES BY 3/EXTRACT ONE CHANNEL), OFFSETS 0-7
TABLE 29: READ1 LUT FOR PATTERN3 (DECIMATE BYTES BY 3/EXTRACT ONE CHANNEL), OFFSETS 0-7
TABLE 30: READ0 LUT FOR PATTERN4 (DECIMATE BYTES BY 4), OFFSETS 0-7
TABLE 31: READ1 LUT FOR PATTERN4 (DECIMATE BYTES BY 4), OFFSETS 0-7
TABLE 32: READ0 LUT FOR PATTERN5 (DECIMATE BYTES BY 6/CHANNELS BY 2), OFFSETS 0-3
TABLE 33: READ0 LUT FOR PATTERN5 (DECIMATE BYTES BY 6/CHANNELS BY 2), OFFSETS 4-7
TABLE 34: READ1 LUT FOR PATTERN5 (DECIMATE BYTES BY 6/CHANNELS BY 2), OFFSETS 0-3
TABLE 35: READ1 LUT FOR PATTERN5 (DECIMATE BYTES BY 6/CHANNELS BY 2), OFFSETS 4-7
TABLE 36: READ0 LUT FOR PATTERN6 (DECIMATE SHORTS BY 2), OFFSETS 0-7
TABLE 37: READ1 LUT FOR PATTERN6 (DECIMATE SHORTS BY 2), OFFSETS 0-7
TABLE 38: READ0 LUT FOR PATTERN7 (DECIMATE SHORTS BY 4), OFFSETS 0-7
TABLE 39: READ1 LUT FOR PATTERN7 (DECIMATE SHORTS BY 4), OFFSETS 0-7
TABLE 40: READ0 LUT FOR PATTERN8 (EXTRACT TWO CHANNELS), OFFSETS 0-7
TABLE 41: READ1 LUT FOR PATTERN8 (EXTRACT TWO CHANNELS), OFFSETS 0-7
TABLE 42: READ0 LUT FOR PATTERN9 (SELECT EVERY PIXEL), OFFSETS 0-3
TABLE 43: READ0 LUT FOR PATTERN9 (SELECT EVERY PIXEL), OFFSETS 4-7
TABLE 44: READ1 LUT FOR PATTERN9 (SELECT EVERY PIXEL), OFFSETS 0-3
TABLE 45: READ1 LUT FOR PATTERN9 (SELECT EVERY PIXEL), OFFSETS 4-7
TABLE 46: READ0 LUT FOR PATTERN10 (DECIMATE PIXELS BY 2), OFFSETS 0-7
TABLE 47: READ1 LUT FOR PATTERN10 (DECIMATE PIXELS BY 2), OFFSETS 0-7
TABLE 48: READ0 LUT FOR PATTERN11 (DECIMATE PIXELS BY 4), OFFSETS 0-3
TABLE 49: READ0 LUT FOR PATTERN11 (DECIMATE PIXELS BY 4), OFFSETS 4-7
TABLE 50: READ1 LUT FOR PATTERN11 (DECIMATE PIXELS BY 4), OFFSETS 0-3
TABLE 51: READ1 LUT FOR PATTERN11 (DECIMATE PIXELS BY 4), OFFSETS 4-7
TABLE 52: READ0 LUT FOR PATTERN12 (DECIMATE WORDS BY 2), OFFSETS 0-7
TABLE 53: READ1 LUT FOR PATTERN12 (DECIMATE WORDS BY 2), OFFSETS 0-7
TABLE 54: READ0 LUT FOR PATTERN13 (DECIMATE DOUBLE WORDS BY 2), OFFSETS 0-7
TABLE 55: READ1 LUT FOR PATTERN13 (DECIMATE DOUBLE WORDS BY 2), OFFSETS 0-7
TABLE 56: RP SUBSYSTEM PARTS LIST
FIGURE 1: CHEOPS BLOCK DIAGRAM [1]
FIGURE 2: CHIDI BLOCK DIAGRAM [3]
FIGURE 3: ALTERA FLEX 10K FAMILY ARCHITECTURE [9]
FIGURE 4: ALTERA FLEX 10K FAMILY LOGIC ELEMENT [10]
FIGURE 5: ALTERA FLEX 10K FAMILY LOGIC ARRAY BLOCK [10]
FIGURE 6: ALTERA FLEX 10K FAMILY EMBEDDED ARRAY BLOCK [10]
FIGURE 7: LAB CONNECTIONS TO FASTTRACK INTERCONNECT [7]
FIGURE 8: FPGA DESIGN FLOW
FIGURE 9: RP SUBSYSTEM
FIGURE 10: RP SUBSYSTEM DATA FLOW
FIGURE 11: RP BLOCK DIAGRAM
FIGURE 12: HIGH-SPEED I/O PORT
FIGURE 13: RP CONFIGURATION TIMING DIAGRAM [6]
FIGURE 14: RP CONFIGURATION BLOCK DIAGRAM
FIGURE 15: DATA SHUFFLER BLOCK DIAGRAM
FIGURE 16: READ0/READ1 DATAPATH
FIGURE 17: READ0/READ1 MULTIPLEXER CONFIGURATION AND 16-TO-1 MULTIPLEXER
FIGURE 18: READ CONTROL LOGIC
FIGURE 19: READ ADDRESS GENERATOR FSMS
FIGURE 20: OPTIMIZED 16-TO-1 MULTIPLEXER
FIGURE 21: CONFIGURATION EPROM SCHEME CIRCUIT DIAGRAM [6]
FIGURE 22: CONFIGURATION EPROM SCHEME TIMING WAVEFORM [6]
1 Purpose and Scope
This Master's thesis concentrates on the design, implementation, and debugging of a
data-servicing subsystem, or Data Shuffler (DS), for a Reconfigurable Processor
(RP) in the Chidi multimedia system. The main goal is to design an RP subsystem whose
architecture allows it to operate efficiently and effectively in conjunction with a
microprocessor, or General Purpose Processor (GPP). Many systems that use RPs as
co-processors do not utilize the RP and the GPP at the same time, giving up
parallelism that could be used to improve system performance. Chidi is designed to take
advantage of this parallelism.
The work for this thesis can be divided into four sections. First, the Data Shuffler (DS) for the RP
and the RP itself must be specified at a behavioral level. This step includes setting performance
and design goals for the two blocks. Second, the interfaces for the DS must be defined down to
the signal-level in order for it to interact with the RP and the system in an efficient manner. Third,
since the DS will be implemented in an FPGA, the design must be specified down to the gate
level and meet the functional and performance goals that were set at the beginning of the design
process. The design is first to be verified in software through simulation. Finally, the
design is to be downloaded onto a physical board and verified to function properly in a
laboratory setting.
Work must also be done at the system level. This includes preparing schematics
and layout notes, generating a netlist, and specifying the physical components that will comprise
the Chidi system. This information is given to a contractor who is responsible for the layout,
fabrication, and assembly of the Chidi boards.
In order to understand the motivation behind designing such a system, an introduction to
Reconfigurable Computing (RC) needs to be provided. An overview of RC and some of the
research that has already been done in this area is provided in Section 2. Special consideration is
given to the Cheops Imaging System, the predecessor to Chidi, in this section. Section 3 provides
a system-level perspective on the Chidi Multimedia Processing System. Design goals, both
functional and in terms of performance, are presented in this section as well as a brief overview of
each functional block found in Chidi. Section 4 provides background information on both FPGAs
and the FPGA design process. This information is necessary in order to put the discussion on
implementation issues for the Data Shuffler in the proper context. Section 5 gives an overview of
the RP subsystem and the characteristics that allow it to implement an RC element effectively.
Section 6 discusses in more detail the design and implementation of the RP. Section 7 provides
the design and implementation details for the Data Shuffler, which is the main focus of this thesis.
Finally, a discussion of some potential areas of research that can be pursued in the future utilizing
the ideas and work outlined in this thesis is given in Section 8.
2 Reconfigurable Computing Overview
2.1 Background
Application Specific Integrated Circuits (ASICs) are commonly used to enhance system
performance when a microprocessor, or General Purpose Processor (GPP), is too inefficient or ill
suited to execute a time-constrained task. ASICs can be implemented as a co-processor or as an
independent processing node. ASICs have been extremely successful in application areas such as
digital signal processing (DSP) and computer graphics, among others. However, the costs associated
with the development and implementation of an ASIC solution are usually quite high, in both
time and money.
One solution to the problem of the high cost associated with ASIC development is the use of
Field Programmable Gate Arrays (FPGAs). The FPGA architecture of programmable logic
elements and a programmable switching network, together with in-circuit programmability,
allows changes to be made quickly and with little impact on the design process. Traditionally, these
attributes have made FPGAs ideal for prototyping an ASIC design. The design would first be
implemented on an FPGA and, once it became stable and mature enough, would then be ported
to an ASIC. However, until recently, FPGAs have not had high enough densities, fast enough
speeds, and short enough configuration times to move into other application areas.
Currently, the FPGA market is one of the fastest growing segments of the semiconductor
industry. Vendors such as Altera, Xilinx, and Lucent, among others, are designing and
manufacturing many variations of these devices, all targeting different application areas and
different markets. The increase in the level of competition for this market has resulted in
significant improvements in density, speed, and configuration times. The devices have improved
in these three important areas to the point that using FPGAs as processing elements is now a
reality, giving birth to the relatively new field of Reconfigurable Computing (RC). Research into
using FPGAs in processing and co-processing systems has increased dramatically in both
academia and industry in an effort to determine how effective these devices are in supplementing
and/or replacing microprocessors and ASICs.
2.2 Applications and Research Areas
The applications for which FPGAs are being used vary widely. The Transmogrifier-2
(TM-2) [13], being developed at the University of Toronto, is a powerful prototyping system.
Although prototyping is a more traditional application for FPGA systems, a full TM-2 system
contains over one million usable gates, which is far from traditional. A full TM-2 system
contains 16 boards, each containing two Altera 10K50 FPGAs, each of which has approximately 35K
usable gates (a figure that does not account for the embedded RAM found on Altera's 10K family of
devices). This number of available gates dwarfs that found in previous prototyping systems
and allows for a much more rapid and powerful prototyping process.
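As a rough check of the "over one million usable gates" figure, the board and device counts quoted above can simply be multiplied out. The sketch below is only back-of-the-envelope arithmetic. It assumes the approximately 35K figure applies to each 10K50 device and ignores embedded RAM; the constant and function names are illustrative and do not come from the TM-2 documentation.

```python
# Back-of-the-envelope gate count for a full TM-2 system.
# Assumption: the ~35K usable-gate figure is per 10K50 device,
# and embedded RAM is not counted.
BOARDS_PER_SYSTEM = 16
FPGAS_PER_BOARD = 2
GATES_PER_FPGA = 35_000


def total_usable_gates(boards=BOARDS_PER_SYSTEM,
                       fpgas_per_board=FPGAS_PER_BOARD,
                       gates_per_fpga=GATES_PER_FPGA):
    """Return the approximate usable gate count of the whole system."""
    return boards * fpgas_per_board * gates_per_fpga


total = total_usable_gates()
print(f"{total:,} usable gates")  # 1,120,000 usable gates
```

Under these assumptions the total comes to roughly 1.12 million gates, which is consistent with the "over one million" figure.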
But there is still the question of whether FPGA-based systems are suitable for applications
outside of prototyping. Researchers at the Queen's University of Belfast [20] showed that the
FPGA architecture is quite capable of implementing DSP applications, in their case a 2D DCT.
Their implementation used a single Xilinx XC6264 device and operated at 25 frames per second
at VGA resolution. Singh and Bellec [19], at the University of Glasgow, showed that FPGAs
are quite good at implementing both simple and complex computer graphics algorithms. They
found that FPGA systems performed worse than specialized graphics chips, but better than
general-purpose processors with specialized graphics instruction sets. They concluded that the
advantage of an FPGA-based system is the ability to execute many different algorithms on
the same piece of hardware, a gain that outweighs the higher speed offered by
specialized graphics hardware.
2.3 Cheops Overview
The Cheops Imaging System, the predecessor to Chidi, developed at the MIT Media Lab by the
Information and Entertainment Group, also investigates the use of reconfigurable processors [2].
Cheops uses one general-purpose processor, an Intel i960, and many specialized stream
processors to implement a modular platform for acquisition, processing, and display of digital
video sequences and model-based representations of moving scenes. Some of the functions that
these processors implement include transposition, filtering, DCT, motion estimation, color space
conversion, remapping, superposition, and sequencing operations. In Cheops, the data to be
processed is stored in one of the eight blocks of VRAM. The data is then streamed out of the
VRAM into one or more specialized stream processors that perform some manipulation of or
computation on the data stream. The data is then streamed into the appropriate destination in
VRAM. Figure 1 shows a simplified version of the system-level block diagram.
[Figure: system-level block diagram; surviving label: Nile Buses (block transfers)]
Figure 1: Cheops Block Diagram [1]
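The data flow described above (VRAM block, through one or more stream processors, back into VRAM) can be sketched in software. This is a minimal illustration of the processing model only, not of the Cheops hardware or its software interface. The function names and the two stand-in processors are hypothetical, and each VRAM block is modeled as a plain Python list.

```python
from typing import Callable, Dict, List, Sequence

Stream = List[int]
StreamProcessor = Callable[[Stream], Stream]


def run_stream_operation(vram: Dict[int, Stream],
                         src_block: int,
                         dst_block: int,
                         processors: Sequence[StreamProcessor]) -> Stream:
    """Stream data out of one VRAM block, through each processor in turn,
    and into the destination VRAM block."""
    stream = list(vram[src_block])      # stream out of the source block
    for process in processors:          # one or more stream processors
        stream = process(stream)
    vram[dst_block] = stream            # stream into the destination block
    return stream


# Hypothetical stand-ins for specialized stream processors.
def decimate_by_two(stream: Stream) -> Stream:
    return stream[::2]


def scale_by_two(stream: Stream) -> Stream:
    return [2 * sample for sample in stream]


# Eight VRAM blocks, as in the text; the contents are made-up sample data.
vram = {block: [] for block in range(8)}
vram[0] = [1, 2, 3, 4, 5, 6]

result = run_stream_operation(vram, src_block=0, dst_block=1,
                              processors=[decimate_by_two, scale_by_two])
print(result)  # [2, 6, 10]
```

Chaining processors as a list mirrors the way a stream can pass through more than one processor before returning to VRAM.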
As the system matured and the demand for time-constrained applications increased, more and
more specialized stream processors were needed. Each time a function that required a hardware
implementation was found, a new processor board had to be designed and manufactured. To reduce
the need for so many different specialized processors, a reconfigurable processor called the
State Machine was designed and built. The State Machine's goal was to
realize many functions on one piece of hardware, thereby implementing functions for which a
specialized processor is not available and removing the need for so many different specialized
processors.
In order to realize its main goals of being specialized and flexible, the State Machine was
designed using FPGAs as well as a microprocessor [1] [23]. It utilizes two 40,000 gate
SRAM-based FPGAs, a pair of AT&T 2c40 ORCA devices from Lucent Technologies, as well as a
microprocessor, a PowerPC603 from IBM Microelectronics. Each of the three processing
elements has 1 Mbyte of SRAM that is closely coupled to it, but this memory block can also be
accessed by the other two processing elements. Cheops is very much a data-flow processing
system and the State Machine conforms to this model of processing. Data is streamed into the
State Machine through the FPGAs into one of the SRAM blocks. Next, the data is processed by
one or more of the processing elements on board. The State Machine then signals to the main
Cheops processor that it has completed the operation and the data is streamed out into the
appropriate VRAM location or to another stream processor. The State Machine was designed to
be a very versatile processing node with the ability to handle functions that were suitable for both
microprocessor and specialized hardware.
Although it met most of its design goals, the State Machine was plagued by two major problems.
The first problem was that the different processing elements were not being used efficiently. For
example, consider when the contents of one SRAM needed to be accessed by multiple processing
elements. On the State Machine, interrupting the FPGAs during the processing of stream data is
expensive. During these times, the microprocessor is essentially idle if it needs to access data in
the same physical block of SRAM. Therefore, maximizing processor utilization for all three
processing units on the State Machine is difficult, which is actually a common trait of many
FPGA-based co-processing systems.
The second problem with Cheops and the State Machine is that they support only two basic data
types: 16-bit and 24-bit data. Cheops can handle other data types (32-bit integers and
floating-point numbers) when necessary, but only by using the microprocessor; there is no
hardware support for such data types. Therefore, dealing with data that is neither 16-bit nor
24-bit is expensive in Cheops and on the State Machine in terms of processing time and effort.
One of the main goals of the data processing subsystem on Chidi is to eliminate this constraint
and support a much wider range of data types.
3 The Chidi Multimedia Processing System
3.1 Overview
Chidi is a reconfigurable, multimedia processor being developed by the Information and
Entertainment Group at the MIT Media Laboratory under the supervision of Dr. V. Michael
Bove, Jr. The main motivations behind designing the Chidi system are to:
1) Investigate the incorporation of specialized/reconfigurable processing elements into
otherwise general purpose computing systems.
2) Design a system that is able to process large data sets, like those found in audio, video,
and holographic applications, in real time.
3) Design a system that scales easily using existing network infrastructure [3].
Chidi couples a microprocessor, or General Purpose Processor (GPP), with a Reconfigurable
Processor (RP) subsystem, both sharing a single bus and a single block of DRAM memory. Also
on board are units that assist in this coupling and in the transfer of data to and from the RP, the
Stream Address Generator (SAG) and the Data Shuffler (DS). The board communicates through
any one of three possible interfaces, the PCI interface, the FireWire or IEEE 1394 interface, or
the High-Speed I/O port. Below is a simplified block diagram for a single Chidi board.
Figure 2: Chidi Block Diagram [3]
Being PCI-compliant, Chidi will plug into any UNIX workstation, Macintosh, or Linux PC. A
simple Chidi system includes only one host with one Chidi board. Usually, Chidi will obtain
data through the PCI interface, although for certain applications, the FireWire or High-Speed I/O
port will be used. Chidi scales by adding multiple Chidi boards and multiple networked host
systems.
Chidi is designed to be a high-bandwidth processing node. The main Chidi data bus is 64 bits
wide and runs at 66 MHz. The GPP operates at speeds equal to or greater than 266 MHz. The RP
processor speed is design dependent, but most designs will run between 32 MHz and 66 MHz.
The PCI and FireWire interfaces are both 32 bits wide and run at 32 MHz, while the High-Speed
I/O Port is 32 bits wide and runs at up to 65 MHz.
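The peak transfer rates implied by these figures can be estimated with a short calculation. The sketch below assumes one transfer per clock cycle and uses the bus widths and clock rates quoted above; the function name is illustrative only, and sustained rates on real hardware will be lower because of arbitration and address overhead.

```python
def peak_bandwidth_mbytes(width_bits, clock_mhz):
    """Peak rate in Mbytes/s (10^6 bytes/s), assuming one transfer per clock."""
    return width_bits / 8 * clock_mhz


# Widths (bits) and clock rates (MHz) as quoted in the text above.
INTERFACES = {
    "Main data bus": (64, 66),
    "PCI interface": (32, 32),
    "High-Speed I/O port": (32, 65),
}

for name, (width, clock) in INTERFACES.items():
    print(f"{name}: {peak_bandwidth_mbytes(width, clock):.0f} Mbytes/s peak")
```

Under these assumptions the main data bus has roughly four times the peak rate of the PCI interface, which is consistent with treating the local bus as the high-bandwidth path for stream data.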
3.2 Architecture and Bus Specification
Architecturally, Chidi adheres to the common hardware reference platform (CHRP), jointly
specified by Apple, IBM, and Motorola. This reference platform serves as the foundation for all
PowerPC-based system design. A subset of this specification is the PowerPC bus interface for
32-bit microprocessors. Four of the features of this bus specification contribute to Chidi's ability to
process data efficiently at extremely high rates.
First, the PowerPC bus interface supports decoupled address and data busses. In systems that
have coupled address and data busses, arbitration for the busses occurs once. As a result, the
address transaction does not complete until the data transaction corresponding to that address has
completed. By decoupling the two busses, the address and data busses are arbitrated for
independently. Therefore, arbitration for a second address transaction can begin while the data
corresponding to the first address transaction is processed, making the bus more efficient for bus
transactions with multiple masters.
Second, the bus interface supports pipelined bus transactions. Therefore, instead of just
arbitrating for the second address transaction while the data corresponding to the first address is
being processed, a bus grant can actually be issued and a second address transaction can begin.
The bus interface supports pipelining of two addresses before the completion of one data
transaction.
Third, the PowerPC bus interface supports multiprocessor configurations. This allows up to four
PowerPC microprocessors, or PowerPC microprocessor emulators, to share the same address and
data busses. Since the RP is purely a slave device, not having multiprocessor support would
imply that the microprocessor would have to manage all memory accesses for the RP, an
extremely inefficient use of the GPP. Chidi takes advantage of this feature of the bus interface by
implementing a PowerPC bus interface on the SAG, allowing it to service all memory accesses
for the RP. This allows the microprocessor to be used much more efficiently even when the RP is
processing large data sets.
Finally, the bus interface supports both single word and four-beat burst transactions. In a normal
single word transaction, one address returns a 64-bit data word as the result from memory. In
four-beat burst transactions, one address returns not only the 64-bit word at that address, but also
the next three 64-bit words in sequential order in memory. Burst transactions reduce the overhead
needed for sequential memory address access, the type of access used for processing large data
streams, thereby increasing the bandwidth of the main data bus. In addition, this allows for large
data transactions to occur (such as frames of video data) without giving sole possession of the bus
to any one master. In the Chidi system, this allows one processing element to still execute useful
tasks (which most likely involve accessing main memory) while another performs data-intensive
computations.
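The reduction in address overhead from burst transactions can be illustrated with a small calculation. The sketch below is a simplified model, not a description of the bus protocol: it assumes burst start addresses are aligned to 32 bytes and that the four beats are returned in ascending order, as described above. The frame size in the usage example is hypothetical.

```python
WORD_BYTES = 8    # one 64-bit data beat
BURST_BEATS = 4   # beats returned by one burst transaction


def address_transactions(n_bytes, burst):
    """Number of address transactions needed to move n_bytes sequentially."""
    bytes_per_transaction = WORD_BYTES * (BURST_BEATS if burst else 1)
    return -(-n_bytes // bytes_per_transaction)  # ceiling division


def burst_beat_addresses(start):
    """Addresses of the four 64-bit words returned by a burst at `start`."""
    return [start + beat * WORD_BYTES for beat in range(BURST_BEATS)]


# Hypothetical example: one 640 x 480 frame of 16-bit samples.
frame_bytes = 640 * 480 * 2
print(address_transactions(frame_bytes, burst=False))  # single-word accesses
print(address_transactions(frame_bytes, burst=True))   # four-beat bursts
print([hex(a) for a in burst_beat_addresses(0x1000)])
```

In this model a burst transfer needs one quarter of the address transactions of single-word transfers for the same data, which is the overhead reduction the paragraph above refers to.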
3.3 Functional Blocks
3.3.1 PowerPC 604e Microprocessor
Each Chidi contains one GPP that can perform general data processing operations as well as
execute other tasks such as running the Chidi operating system, configuring the RP, and
managing the 1394 Interface. The GPP is implemented with a PowerPC 604e microprocessor
from Motorola in a 255-pin Ball Grid Array (BGA) package.
The 604e is an implementation of the PowerPC family of reduced instruction set computing
(RISC) microprocessors [17]. The 604e microprocessor implements the 32-bit addressed version
of the PowerPC architecture. This version supports 32-bit effective, or logical, addresses, integer
data types of 8, 16, and 32 bits, and floating-point data types of 32 and 64 bits, providing single
and double precision.
The PowerPC 604e is also a superscalar processor. It can issue four instructions simultaneously
and as many as seven instructions can be executed in parallel. A 64-bit data bus and a 32-bit
address bus provide the external interface. In addition, the 604e supports single-beat as well as
burst data transfers for main memory and memory-mapped I/O accesses. Chidi systems are
available with microprocessors that execute at 266 or 300 MHz.
Aside from the features that the PowerPC 604e offers, one of the main reasons that this processor
was chosen is that it conforms to the same programming model and bus interface as the
PowerPC603, the microprocessor used in the State Machine. Familiarity with these two aspects of
the microprocessor helps minimize mistakes during development as well as reduce the design
time.
3.3.2 MPC106 PCI Bridge/Memory Controller
The MPC106 provides a PowerPC common hardware reference platform (CHRP) compliant
bridge between the Chidi local/internal bus and the Peripheral Component Interconnect (PCI)
[15]. For the Chidi local bus, this means that the MPC106 supports multiple, up to four, 604
processors, a 32-bit address bus and a 64-bit data bus, full memory coherency, 604 local bus slave
support, and decoupled address and data busses for pipelining of 604 accesses. For the PCI bus,
the MPC106 supports all accesses to the PCI address space, big or little-endian operation, and bus
speeds up to 33 MHz.
In addition to being a Chidi local to PCI bus interface, the MPC106 also serves as the memory
controller to Chidi main memory. The MPC106 supports 1 Gbyte of RAM, 16 Mbytes of ROM,
and fast page mode or extended data out (EDO) DRAMs. By utilizing the MPC106 to implement
the memory controller as well as the PCI bus interface, more of the design effort can be
concentrated on developing the RP and investigating the other aspects of the Chidi system.
3.3.3 Reconfigurable Processor (RP)
The main goal of incorporating a Reconfigurable Processor in the Chidi Multimedia System is to
provide a processing element that can perform computations in hardware and still possess the
flexibility of being able to perform more than one function. RP configuration is initiated by the
microprocessor to allow for dynamic reprogramming. In addition, each time the RP is configured
for a different application, a different clock frequency can be used based on the speed at which
that particular design runs. This allows the RP to run at frequencies at or below 66 MHz
independently of the rest of the system. In addition, the RP has 2 Mbytes of SRAM for local
storage. The RP is implemented with a FLEX 10K100 FPGA, a 503-pin PGA-packaged device
from Altera Corporation. The FLEX 10K100 is a 100,000 gate FPGA, which consists of 4,992
programmable Logic Elements (LEs) and 12 Embedded Array Blocks (EABs). Unlike the
PowerPC 604e, the RP is purely a slave device. A discussion of general FPGA architecture and of
the Altera FLEX 10K family is presented in Sections 4.1 and 4.2. Design goals and details for the RP
are discussed in Section 6.
3.3.4 Stream Address Generator (SAG)
As mentioned above, the RP is a slave device, meaning that it cannot make accesses to main
Chidi memory to load or store data itself. The RP only processes data when data is made
available to it. Therefore, the Stream Address Generator handles memory accesses for the
Reconfigurable Processor. After the RP is configured by the microprocessor, it notifies the SAG
that it is ready to begin processing data for a particular stream. The SAG then generates the
address for that particular stream and writes the appropriate information to the Data Shuffler via
the SAG/DS interface.
In addition to handling the addressing needs of the RP, the SAG also processes all register
interface accesses for itself, the DS, the RP, the 1394 Interface, and the two four-character
alphanumeric displays. All 1394 control mechanisms also reside within the SAG.
The SAG is implemented with a FLEX 10K50 FPGA from Altera. The purpose of using an
FPGA for this functional block is different from that of the RP. In the SAG case, an FPGA is
used for the traditional application of prototyping, not that of reconfigurable computing. The
FLEX 10K50 is a 50,000 gate FPGA, which consists of 2880 LEs and 10 EABs and comes in a
356-pin BGA package.
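The address-generation role of the SAG can be pictured with a small software model. The sketch below is illustrative only: the class and field names are hypothetical, the table layout is an assumption rather than the SAG register map, and each request is assumed to be one four-beat burst of 32 bytes.

```python
from dataclasses import dataclass

BURST_BYTES = 32  # four 64-bit beats per burst (assumed request size)


@dataclass
class StreamEntry:
    base: int        # start address of the stream in main memory
    length: int      # stream length in bytes
    offset: int = 0  # bytes already requested


class StreamAddressTable:
    """Hypothetical model of a stream-to-address mapping."""

    def __init__(self):
        self._streams = {}

    def map_stream(self, stream_id, base, length):
        # In Chidi this mapping would be set up by the microprocessor.
        self._streams[stream_id] = StreamEntry(base, length)

    def next_address(self, stream_id):
        """Return the next burst address for a stream, or None when done."""
        entry = self._streams[stream_id]
        if entry.offset >= entry.length:
            return None
        address = entry.base + entry.offset
        entry.offset += BURST_BYTES
        return address


table = StreamAddressTable()
table.map_stream(0, base=0x1000, length=96)
while (address := table.next_address(0)) is not None:
    print(hex(address))
```

The point of the model is only that the requesting device names a stream, while the mapping from stream to memory address is held and advanced elsewhere.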
3.3.5 Data Shuffler (DS)
The Data Shuffler is similar to the SAG in that it services the RP's memory accesses. However,
the Data Shuffler handles the data instead of the address phase of the transaction for both RP
stream and register accesses. The main purpose of the DS is to manipulate the data as it comes
out of memory and present it in a format that the RP can process. The DS is designed to
handle byte, short (16-bit), packed RGB (24-bit), word (32-bit), and double word (64-bit)
data. In addition, the DS can handle data offsets from 64-bit boundaries.
As mentioned above, when the SAG generates the address for the current transaction it writes
certain values to the SAG/DS interface registers. These registers include information about the
type of transfer (single or burst), which channel the data is destined for, how many bytes the data
is offset from the 64-bit boundary, and what type of manipulation, or pattern, the DS should
employ for the current transfer. After the DS has completed processing the data appropriately, it
writes the data to the DS/RP external FIFOs. After this operation, the RP can read the data from
these FIFOs, using its own internal clock frequency.
The DS is also implemented with a FLEX 10K50 FPGA from Altera for prototyping purposes.
Design goals and details for the DS are discussed in Section 7.
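The kind of manipulation the DS performs can be sketched in software. The example below is a behavioral illustration under stated assumptions, not the DS design: the function and dictionary names are hypothetical, memory is modeled as a byte string, and elements are assumed to be stored big-endian, as is conventional on PowerPC systems.

```python
# Element sizes in bytes for the data types the DS is designed to handle.
ELEMENT_BYTES = {"byte": 1, "short": 2, "rgb": 3, "word": 4, "dword": 8}


def shuffle(memory, offset, kind, count):
    """Extract `count` elements of type `kind`, starting `offset` bytes
    past a 64-bit boundary, from a byte string read out of memory."""
    size = ELEMENT_BYTES[kind]
    elements = []
    for i in range(count):
        chunk = memory[offset + i * size: offset + (i + 1) * size]
        if len(chunk) < size:
            raise ValueError("not enough data for the requested elements")
        elements.append(int.from_bytes(chunk, "big"))  # big-endian assumed
    return elements


# Two 64-bit words as they might arrive from a burst read.
data = bytes(range(16))
print([hex(v) for v in shuffle(data, 2, "short", 3)])
print([hex(v) for v in shuffle(data, 1, "rgb", 2)])
```

Note that a packed 24-bit element can straddle a 64-bit word boundary, which is one reason the offset and pattern information must accompany each transfer.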
3.3.6 External Interfaces
In addition to the PCI interface, Chidi has two external interfaces that allow for data transactions
to occur. The first of these interfaces is the IEEE 1394 FireWire Interface. This serial interface
can transfer data at 200 Mbits/sec from other 1394-compliant devices, such as digital cameras and
digital video camcorders [4]. The second interface allows for data streams directly in and out of
the RP. Utilizing LVDS (Low Voltage Differential Signaling) technology, the High-Speed I/O
Port Interface provides a 32-bit data bus that can be clocked at up to 65 MHz. In addition, this
interface provides a synchronization signal back into the RP, allowing for multiple Chidi boards
to be synchronized for the processing of large data sets, for applications such as Holovideo.
3.3.7 FIFOs
Although FIFOs are not technically a functional block, they are useful because they decouple the
data transmitter and receiver. This is important in Chidi because some of the functional blocks
operate at different frequencies. FIFOs are used for input and output for both the RP and the
FireWire Interface. This means that the RP, which runs at a variable clock speed, and the
FireWire Interface, which runs at 32 MHz, will not affect the bandwidth of the 66 MHz data bus.
Burst reads and writes will actually be to the FIFOs, allowing the RP and the FireWire Interface
to process that data using their own clock rate. This allows the FIFOs to handle the clock
boundary that exists between the main data bus and the RP and FireWire Interfaces.
4 FPGA Device Description and Design Process
4.1 Generic FPGA Overview
In order to discuss the issues involved with implementing a complex design on an FPGA in the
proper context, an understanding of the target device must first be developed. The discussion
begins with one of the simplest and most familiar Programmable Logic Devices (PLDs), the PAL
(Programmable Array Logic), designed and manufactured by AMD (Advanced Micro Devices).
Architecturally, these devices have a programmable AND-plane, a fixed OR-plane, and
programmable registers. Capitalizing on the success of PALs, another class of PLDs, Complex
Programmable Logic Devices (CPLDs), were introduced. CPLDs can be thought of as an array of
PAL-like structures, or cells, that are connected via some type of programmable interconnect. The
cells of these CPLDs usually offer more flexibility and more logic capacity than a standard PAL,
but still have essentially the same architecture.
SRAM-based FPGAs are fundamentally different from CPLDs. FPGAs contain a two-dimensional
array of logic cells or elements. However, unlike CPLDs, FPGAs do not contain
programmable logic planes. Instead, FPGA cells are built around N-input Look-Up Tables
(LUTs) that can be configured to implement any N-input function. Usually, these cells also
contain programmable registers, multiplexers, and random logic in addition to the LUT. FPGA
logic cells are always connected by some type of programmable interconnect.
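The reason an N-input LUT can implement any N-input function is that it simply stores the function's truth table. The following sketch models this behavior in software; the function names are illustrative, and the 16-entry table stands in for the configuration bits of a four-input LUT.

```python
from itertools import product


def make_lut(func, n_inputs):
    """Build the 2**n-entry truth table for an n-input Boolean function."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]


def lut_eval(table, *inputs):
    """Look up the output: the inputs form the table index (first input = MSB)."""
    index = 0
    for bit in inputs:
        index = (index << 1) | (bit & 1)
    return table[index]


# Configure a four-input LUT as a parity (XOR) function.
parity_table = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d, 4)
print(len(parity_table))                  # 16 configuration bits
print(lut_eval(parity_table, 1, 0, 1, 1))  # 1
```

Reconfiguring the cell for a different function only requires loading a different table; the lookup hardware itself does not change.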
Although there are many manufacturers of FPGAs, the three main companies in the market are
Xilinx, Altera Corporation, and Lucent Technologies. Not surprisingly, each company
implements the logic cells and the programmable interconnect differently. The Altera FLEX 10K
devices were selected over devices from the other FPGA vendors for three main reasons. First,
the FLEX 10K FPGAs offer the most gates per dollar. Minimizing
cost is always desirable, especially in a system with one 100,000 gate FPGA and two 50,000 gate
FPGAs. Second, designers for the Chidi project already have experience using Altera's software
package, Max+Plus II. Having a good understanding of the FPGA software development tools
allows design, debugging, and optimization to be completed in a more timely manner. Finally, the
FLEX 10K devices offer something architecturally not found in other devices in the form of
Embedded Array Blocks (EABs), the advantages of which are discussed below.
4.2 Altera FLEX 10K Device Family
4.2.1 Overview
The FLEX 10K family of devices from Altera Corporation features 10,000 to 250,000 gate
FPGAs. Each member of the device family, regardless of size, possesses the same architecture. A
column of Embedded Array Blocks (EABs) serves as the spine of the device that segments the
Logic Array Blocks (LABs) into two halves. These components are connected to each other and
the Input/Output Elements by the FastTrack programmable interconnect system. Figure 3
illustrates the architecture of the FLEX 10K family by presenting the relationship between the
different components for a portion of the device.
Figure 3: Altera FLEX 10K Family Architecture [9]
In the Chidi system, the SAG and the DS are implemented using FLEX 10K50s while the RP is
implemented using a 10K100 device. The SAG and DS were targeted to implement a set of
specified functions. With the use of some rough preliminary estimates, the 10K50 devices were
selected because they provided the appropriate number of user I/O pins as well as enough gates to
design with comfortably. In contrast, the RP is used as an RC device and not for prototyping.
Therefore, it must be large enough to be able to implement complex algorithms and designs that
will be specified at a later time. Although Altera now manufactures 250,000 gate FPGAs, at the
time Chidi was designed the 10K100 was the largest device in the FLEX 10K family, hence its
selection to implement the RP. Table 1 summarizes some of the features that differentiate these
two devices.
Feature           10K50     10K100
Typical Gates     50,000    100,000
LEs               2,880     4,992
LABs              360       624
EABs              10        12
Total RAM bits    20,480    24,576
User I/O Pins     274       406
Table 1: FLEX 10K50/10K100 Device Features [10]
A discussion of the general building blocks for FLEX 10K devices is given in the sections below.
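The entries in Table 1 are related to one another through two figures given in the sections that follow: each LAB contains eight LEs, and each EAB provides 2,048 bits of RAM. The short check below confirms that the table is internally consistent with those figures; the dictionary layout and function name are illustrative only.

```python
LES_PER_LAB = 8      # LEs per Logic Array Block (Section 4.2.3)
BITS_PER_EAB = 2048  # RAM bits per Embedded Array Block (Section 4.2.4)

# Values taken from Table 1.
DEVICES = {
    "10K50": {"les": 2880, "labs": 360, "eabs": 10, "ram_bits": 20480},
    "10K100": {"les": 4992, "labs": 624, "eabs": 12, "ram_bits": 24576},
}


def is_consistent(device):
    """Check LE and RAM totals against the per-LAB and per-EAB figures."""
    return (device["les"] == device["labs"] * LES_PER_LAB
            and device["ram_bits"] == device["eabs"] * BITS_PER_EAB)


for name, device in DEVICES.items():
    print(name, is_consistent(device))
```

One observation from the table is that the 10K100 offers about 73 percent more LEs than the 10K50 but only 20 percent more embedded RAM.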
4.2.2 Logic Element (LE)
The logic cell is the basic building block for an FPGA. It normally contains some type of
programmable LUT, a programmable register or registers, and some random logic. The Altera
logic cell is called a Logic Element (LE) and is depicted in Figure 4.
Figure 4: Altera FLEX 10K Family Logic Element [10]
As can be seen in Figure 4, the LE contains a four-input LUT, which can compute any four-input
function. In addition, the LE contains a programmable register that can be configured as a D, T,
JK, or SR flip-flop. For combinational functions, the register can be bypassed, allowing the LUT
to drive the output of the LE. The LE can be routed to and from other LEs in the same LAB,
adjacent LABs, and the chip-wide row and column interconnects.
Another feature of the FLEX 10K LEs is the carry and cascade chains. These chains provide
high-speed interconnectivity between adjacent LEs without using the local LAB interconnect
resources. These chains are useful when implementing high-speed adders, counters, and wide
fan-in functions.
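The role of the carry chain in an adder can be illustrated with a bit-level model. In the sketch below, each loop iteration stands for one LE that computes a sum bit and passes its carry directly to the next LE; this is a generic ripple-carry adder written for illustration, not a description of how the Altera tools map a particular design.

```python
def add_with_carry_chain(a, b, width):
    """Ripple-carry addition, one 'LE' per bit, carry passed along a chain."""
    carry = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        sum_bit = x ^ y ^ carry                # sum output of this bit position
        carry = (x & y) | (carry & (x ^ y))    # carry-out feeds the next position
        result |= sum_bit << i
    return result, carry


print(add_with_carry_chain(5, 9, 8))            # (14, 0)
print(add_with_carry_chain(0b1011, 0b0110, 4))  # (1, 1): 17 overflows 4 bits
```

Because the carry must ripple through every bit position, the speed of the dedicated carry path between adjacent LEs largely determines how fast a wide adder or counter can run.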
4.2.3 Logic Array Block (LAB)
The LEs in a FLEX 10K device are arranged in groups of eight in what are called Logic Array
Blocks (LABs). A two-dimensional array of LABs provides the architectural structure for FLEX
10K devices. The LAB provides fast local routing between its eight resident LEs in addition to
routing to the row and column interconnects and adjacent LABs. Figure 5 below shows the block
diagram of a
LAB.
Notes:
(1) EPF10K50 devices have 22 inputs to the LAB local interconnect channel from the row; EPF10K100 devices have 26.
(2) EPF10K50 devices have 30 LAB local interconnect channels; EPF10K100 devices have 34.
Figure 5: Altera FLEX 10K Family Logic Array Block [10]
4.2.4 Embedded Array Block
One of the features that differentiate the FLEX 10K family of devices from other FPGAs is the
use of what Altera calls Embedded Array Blocks (EABs). Most FPGAs only contain the
two-dimensional array of LABs described in the previous section. The FLEX 10K family also has
additional configurable RAM cells embedded in each device. The use of these embedded RAM
blocks frees up more LEs for other functions and increases performance for most designs.
The most common use of EABs is to implement large LUTs. With LEs, implementing logic
functions with a large number of inputs does not scale efficiently due to the routing associated
with connecting a large number of LEs together. Using EABs to implement functions such as
multipliers or those found in DSP applications is much more efficient in terms of both speed and
area.
EABs are extremely flexible. Each EAB can be configured as one block of 256x8, 512x4,
1024x2, or 2048x1 RAM. In addition, EABs can be connected together to form deeper and wider
RAM blocks. As can be seen from Figure 6 below, each EAB contains not only a RAM
block, but also registers and bypass paths. This allows the EAB to be configured as a
synchronous or asynchronous RAM block, with registers on the data outputs only, on the control
signals only, or on some combination of its inputs and outputs.
Note: EPF10K50 devices have 22 EAB local interconnect channels; EPF10K100 devices have 26.
Figure 6: Altera FLEX 10K Family Embedded Array Block [10]
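All four EAB configurations listed above hold the same 2,048 bits, and larger memories are formed by combining EABs. The sketch below estimates how many EABs a given RAM would need if it were tiled from identically configured blocks. This is a simplifying assumption made for illustration: the function name is hypothetical, and the vendor tools may choose a different arrangement and may use additional logic for address decoding.

```python
from math import ceil

# (depth, width) configurations of a single EAB, as listed above.
EAB_CONFIGS = [(256, 8), (512, 4), (1024, 2), (2048, 1)]


def eabs_needed(depth, width):
    """Fewest EABs needed to tile a depth x width RAM from one configuration."""
    return min(ceil(depth / d) * ceil(width / w) for d, w in EAB_CONFIGS)


print(sorted({d * w for d, w in EAB_CONFIGS}))  # every configuration is 2048 bits
print(eabs_needed(1024, 8))   # 4
print(eabs_needed(256, 16))   # 2
print(eabs_needed(4096, 1))   # 2
```

By this estimate, a 1024x8 RAM would consume 4 of the 10 EABs on a 10K50, which shows how quickly the embedded memory can be used up by even modest buffers or lookup tables.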
4.2.5 FastTrack Interconnect
The FastTrack Interconnect system provides the ability to connect any device component (LE,
EAB, or I/O element) to any other. One of the most important differences between FastTrack and
other programmable interconnects is that FastTrack uses row and column interconnects that span
the whole device. Most other programmable interconnects contain segmented resources, so signals
must pass through a series of programmable switching matrices. By using continuous
interconnects, FastTrack provides predictable routing delays, even for complex designs. A
dedicated row channel serves each row of LABs, while a dedicated column channel serves each
column of LABs. Figure 7 shows the interconnections between the rows and columns of the
FastTrack Interconnect as well as to and from the LAB.
Notes:
(1) EPF10K50 devices have 216 channels per row; EPF10K100 devices have 312.
(2) EPF10K50 and EPF10K100 devices both have 24 channels per column.
Figure 7: LAB Connections to FastTrack Interconnect [7]
4.3 Design Process
Taking a design from the conceptual state to actually configuring an FPGA with that particular
design is a challenging, although sometimes tedious, process. Figure 8 shows the general design
flow used for FPGAs. The following sections elaborate on each aspect of the design process in
more detail. Discussion of general design techniques is presented as well as those aspects that are
particular to FPGA design for the Chidi project in the Information and Entertainment Group at
the MIT Media Lab.
Figure 8: FPGA Design Flow
4.3.1 Functional Design/Design Entry
The first step in developing designs for FPGAs is to specify the design in a way that is
appropriate for a particular design environment. Specifically, this involves translating the design
from state diagrams, state tables, or other design descriptions into a format that can be processed
by the particular EDA tools or software packages used by the designer. This specification can
take the form of a text-based Hardware Description Language (HDL), usually Verilog or VHDL
(VHSIC, or Very High Speed Integrated Circuit, HDL), or the more visual format found in
schematic entry. For the Chidi project, designers write VHDL-87 compliant code to specify their
FPGA designs.
4.3.2 Compilation and Functional Simulation
After design entry, the design needs to be compiled and simulated at a functional level. This step
in the design process is completed with the aid of an EDA software package. This package checks
for syntax errors in the design code and allows the designer to test the design functionally. Since
the EDA tool does not have any information about the target device or how the design is placed in
this step of the design process, the designer can only verify that the design behaves as he or she
expects by providing the appropriate test vectors. Incorrect simulation results are usually caused
by inherent design flaws or by inappropriate test vectors. Designers targeting Chidi FPGAs use
the Synopsys VHDL/FPGA Design Analyzer for compilation and the Synopsys VHDL/FPGA
Simulator and Debugger for functional simulation.
4.3.3 Logic Synthesis
After functional simulation is completed, the next step in the design flow is logic synthesis.
Synthesis is the process of translating a design description into the actual registers, multiplexers,
and gates needed for implementation. Logic synthesis is done automatically using an EDA
software package. The designer's role is to assign the appropriate timing and area constraints for
the synthesizer so that it produces the best possible design. The result of this step is some type of netlist
that describes how the components of the design are wired together. Chidi designers use
Synopsys, recognized as the industry leader in logic synthesis, to generate the netlist in EDIF
format.
4.3.4 Place and Route
The next step in the design process is to map the logic generated during synthesis onto a
particular device. Since each device has its particular architecture, logic cell, routing, and delay
characteristics, this is usually done automatically with software provided from the FPGA chip
vendor. The designer's role during place and route is to provide the appropriate parameters, such
as the logic synthesis style, optimizations for speed or area, and device selection. Since the FLEX
10K Family devices are used in the Chidi system, designers use the Max+Plus II software from
Altera for place and route.
4.3.5 Timing Analysis
After obtaining an initial mapping onto the desired device, the speed at which the design
functions needs to be determined. Combinational delays and registered performance both need to
be determined for each design. This information is obtained automatically from the place and
route information generated by the vendor specific tool. With the speed demands of the Chidi
system (the DS and SAG both need to operate at 66.66 MHz, a clock period of 15.0 ns), initial designs usually
fall short of the timing requirements. Chidi designers use the Timing Analyzer tool in the Altera
Max+Plus II software package to determine the critical path of the design. The designer must
then optimize the design by minimizing the critical path delay in an effort to get the design to run
at a faster speed.
One strategy to obtain a faster design is to pipeline the critical path. This involves going back to
the VHDL design and inserting registers appropriately. This is usually done if the design
inherently has too much combinational logic between registers. A second approach is to place
timing constraints on the design. This can be done globally for the entire design or locally for one
particular path. A third approach is to assign placement constraints. In this case, the designer
specifies to the place and route tool that certain LEs are to be placed together in close proximity.
In the FLEX 10K architecture, the delay on the same row is much smaller than that between
different rows. Optimizations usually include employing a combination of these strategies in
order to meet the timing requirements. After these constraints or design changes are made, the
process must begin again, either at the place and route stage or the compilation stage.
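The relationship between critical path delay and operating frequency, and the effect of pipelining, can be shown with simple arithmetic. The sketch below is a first-order model: the path delays in the example are hypothetical, and register setup and clock-to-output times, which reduce the achievable frequency in a real device, are ignored.

```python
TARGET_MHZ = 66.66  # required operating frequency for the DS and SAG


def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def fmax_mhz(critical_path_ns):
    """Maximum clock frequency for a given register-to-register delay."""
    return 1000.0 / critical_path_ns


def pipelined_fmax_mhz(stage_delays_ns):
    """After pipelining, the slowest stage sets the clock rate."""
    return fmax_mhz(max(stage_delays_ns))


print(f"{period_ns(TARGET_MHZ):.1f} ns")               # 15.0 ns
print(f"{fmax_mhz(22.0):.1f} MHz")                     # hypothetical 22 ns path
print(f"{pipelined_fmax_mhz([12.0, 10.0]):.1f} MHz")   # same path split in two
```

In this hypothetical case a 22 ns path cannot meet the 15.0 ns requirement, while splitting it into 12 ns and 10 ns stages can, at the cost of one additional clock cycle of latency.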
4.3.6 Post-Synthesis Simulation
After the timing constraints for the design are met, the next step is to execute post-synthesis
simulation. This is to verify that the design functions properly after all of the timing information
for the target device is taken into account. The timing information includes setup and hold times
for registers, propagation delays for registers and gates, and interconnect delays, among others.
Since functional simulation performed earlier in the design process does not take this information
into account, post-synthesis simulation is necessary to confirm that the design will operate as
expected after it is configured into a device.
4.3.7 Device Configuration
After post-synthesis simulation has been completed and the design has been verified to meet all
the timing requirements of the target device, the device must then be configured with that
particular design. Device configuration can be done in two ways, one using some type of
configuration ROM, the other using a microprocessor. For Chidi, both the SAG and DS are
configured automatically at power up using the Altera EPC1PC8 Configuration EPROMs.
However, since the RP is a dynamic processing element, it is configured by the microprocessor,
via the SAG, at run-time. Device configuration details for the DS and RP can be found in their
respective sections.
5 RP Subsystem
5.1 Overview
One of the main areas of study in the Chidi system is the coupling of a RP and GP to form an
efficient computing element. As the State Machine project illustrated, two of the main problems
usually encountered when designing a system that incorporates RC along with general computing
elements are:
1) Keeping the RC element supplied with enough data.
2) Utilizing both the GP and RP in an efficient manner.
Chidi attempts to solve these problems by incorporating an RP subsystem that not only includes a
reconfigurable processor, but address and data servicing entities as well. This architecture is used
in the hope that an infrastructure can be established to make the development of applications for
the RP easy, while providing both flexibility and the necessary data throughput to make such a
system useful. Below is a simplified block diagram for the RP subsystem found in Chidi.
[Figure 9 block diagram: the SAG, SRAM, and FIFOs, with connections labeled PowerPC Bus Control, Main Addr, Main Data, RP Config, SAG/RP Interface, and Register Interface.]
Figure 9: RP Subsystem
As can be seen from the figure above, the SAG serves as the main interface for the RP subsystem
to the rest of Chidi. The SAG handles the address phase of all transactions to and from main
memory for the RP. For stream processing transactions, the SAG arbitrates for the main address
bus, generates the appropriate address in main memory, and notifies the DS how it should process the
corresponding data. The SAG is able to generate the appropriate addresses for each stream
because it contains registers that hold the mapping information between streams and their
locations. This stream to address mapping information is managed by the microprocessor based
on how applications are scheduled to be configured on the RP and how the streams are scheduled
to be processed by each application. In addition, the SAG handles all register interface accesses for the
RP, itself, and the DS.
On the other hand, the DS is responsible for handling the data portion of any RP transactions. For
input streams, the DS manipulates the data appropriately as it is received by the subsystem from
main memory and presents it to the RP in a format that it can process correctly. For register
writes, the DS passes the data along to the RP so that it can process it internally. For output
streams or register reads, the DS waits for the SAG to complete arbitration for the data bus before
asserting the proper values onto the main data bus.
There are several characteristics of the RP subsystem that allow it to process data in an efficient
manner. First, the use of virtual channels allows multiple streams of data to be multiplexed over a
fixed physical channel. Second, the RP subsystem can act as a second PowerPC 604 as seen by
the main bus arbiter, the MPC106. This allows the RP subsystem to take advantage of the features
of the PowerPC bus interface. Finally, the RP subsystem provides control mechanisms back to the
microprocessor so that it may instruct and manage the subsystem efficiently.
5.2 Physical and Virtual Channels
5.2.1 Physical Channels
The RP Subsystem contains one 64-bit input and one 64-bit output path through which data flows
in and out of the RP. These input and output paths are implemented using both the Data Shuffler
and the external FIFOs between the DS and RP. The 64-bit input path can be configured as one
64-bit or two independent 32-bit paths called Read0 and Read1. The output path is strictly a 64-bit output path called Write0. These paths are referred to as physical channels.
The Data Shuffler portion of the input physical channel, or channels, performs manipulations on
data coming from memory destined for the RP. These functions include data realignment, data
extraction, or no data manipulation at all. The output physical channel does not alter the data in
any way. The DS portion of the physical channel operates at 66.66 MHz and interfaces with the
Chidi local bus.
The physical channels also consist of external FIFOs, which reside between the DS and RP. The
FIFOs' main purpose is to serve as a buffer between two entities that operate at different clock
frequencies. The FIFO can be written to and read from using two independent clock signals. This
allows the DS to write to the FIFO at 66.66 MHz while the RP processes the data from it at
whatever speed the application dictates. In most applications, the DS should keep the 64-bit wide,
64 word deep FIFOs mostly filled so that the RP idle time can be minimized.
5.2.2 Virtual Channels
For some applications, the RP only requires one or two input data streams in order to generate an
output data stream. In these situations, the input physical channel can be configured so that each
input data stream has a dedicated channel through which data can be transmitted. However, for
other applications, more than two input data streams are required to generate an output data
stream. For these cases, virtual channels are used to deliver multiple streams to the RP for
processing.
The RP supports eight virtual input channels and four virtual output channels. This allows the RP
to implement applications that process up to eight streams and generate up to four streams.
Virtual channels are time multiplexed over physical channels, with up to four virtual channels
mapping to one physical channel.
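The channel limits above can be expressed as a small consistency check. The following Python sketch is illustrative only: the function name and the dictionary-based mapping are assumptions made for this example and are not part of the Chidi design. Only the limits themselves (eight virtual input channels, four virtual output channels, and at most four virtual channels per physical channel) are taken from the text.

```python
from collections import Counter

MAX_VIRTUAL_INPUTS = 8        # virtual input channels supported by the RP
MAX_VIRTUAL_OUTPUTS = 4       # virtual output channels supported by the RP
MAX_VIRTUAL_PER_PHYSICAL = 4  # virtual channels multiplexed on one physical channel

INPUT_PHYSICAL = {"Read0", "Read1"}
OUTPUT_PHYSICAL = {"Write0"}


def mapping_is_valid(inputs: dict, outputs: dict) -> bool:
    """Check a hypothetical virtual-to-physical channel mapping against the limits.

    `inputs` and `outputs` map a virtual channel number to a physical channel name.
    """
    if len(inputs) > MAX_VIRTUAL_INPUTS or len(outputs) > MAX_VIRTUAL_OUTPUTS:
        return False
    if not set(inputs.values()) <= INPUT_PHYSICAL:
        return False
    if not set(outputs.values()) <= OUTPUT_PHYSICAL:
        return False
    # Count how many virtual channels share each physical channel.
    load = Counter(inputs.values()) + Counter(outputs.values())
    return all(count <= MAX_VIRTUAL_PER_PHYSICAL for count in load.values())
```

With two 32-bit input channels carrying four virtual channels each, all eight virtual input channels can be mapped, which is consistent with the limits quoted above.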
Virtual channels are implemented using FIFOs internal to the RP. Flow control mechanisms that
provide the appropriate mapping between physical and virtual channels must also be provided.
RP application designers will choose from a library of interface designs, that differ depending on
the number of input and output streams and the mapping mechanism, in order to provide data to
their designs correctly.
5.3 Data Processing
The main function of the RP subsystem is to efficiently process large data sets, or stream data, at
a high rate. The RP subsystem can process stream data using one of two methods, depending on the
application. For time-constrained applications, data is read in, processed immediately,
and written to the appropriate destination. In this case, portions of the stream will already have
been processed while other portions are still being read into the RP. For less time-intensive
applications, the entire data stream (or up to 2 Mbytes) can be "flooded" into the RP SRAM. The
RP will then process this data and write the result back into the SRAM after which the stream can
be "flooded" out to the appropriate destination. For both of these cases, data can be obtained over
the Chidi local bus. This would be done if the source data resides in Chidi main memory or in
host memory (this would also require data transactions over the PCI bus interface). Obtaining
data over the Chidi local bus requires the RP subsystem to emulate a PowerPC 604 processor for
bus arbitration purposes. Data can also be obtained from an off-board source and streamed
directly into the RP via the High-Speed I/O Port, thereby bypassing the Chidi local bus, the SAG,
and the DS.
5.3.1 Data Request Mechanism
Providing data to the RP in order to maximize utilization is the primary goal of the RP subsystem.
The DS provides the means to format the data appropriately while the SAG generates addresses
for memory accesses for the streams that are to be processed by the RP. The data request
mechanism provides the necessary information to the SAG for it to decide which streams it
should retrieve from or write to main memory. The signals used for the data request mechanism are
implemented using the SAG/RP and SAG/DS interfaces.
For applications where only one or two input streams and one output stream are required, only
physical channel information is used by the SAG to determine when the next memory access will
occur. For input physical channels, the DS notifies the SAG whenever it is able to process another
four-beat data burst transaction. For the output channel, the DS notifies the SAG whenever there
is data that needs to be written back into main memory.
When virtual channels are required, the request mechanism becomes more complicated. Each
input virtual channel must notify the SAG when it is able to accommodate another four-beat data
burst transaction. However, the SAG can only issue an address for that virtual channel, or stream,
if the corresponding physical channel also indicates that it is available to process the same
amount of data. Similarly, output virtual channels can only be serviced if the output physical
channel is able to process the memory write. Figure 10 below shows channel relationships and
request mechanism entities for the RP Subsystem.
[Figure 10 diagram: data flow through the physical channel FIFOs of the RP Subsystem.]
Figure 10: RP Subsystem Data Flow
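The gating rule for virtual channel requests can be sketched in a few lines of Python. This is a minimal illustration, not the SAG logic itself: the `ChannelState` class and its `free_beats` field are hypothetical bookkeeping introduced for the example. The only facts taken from the text are the four-beat burst size and the requirement that both the virtual channel and its physical channel be ready.

```python
from dataclasses import dataclass

BURST_BEATS = 4  # the text describes four-beat data burst transactions


@dataclass
class ChannelState:
    free_beats: int  # hypothetical: beats of buffer space currently free

    def ready_for_burst(self) -> bool:
        return self.free_beats >= BURST_BEATS


def sag_may_issue_read(virtual: ChannelState, physical: ChannelState) -> bool:
    """An input stream is serviced only when both of its channels can take a burst."""
    return virtual.ready_for_burst() and physical.ready_for_burst()
```

For example, a virtual channel with room for a burst is still not serviced while its physical channel has fewer than four free beats.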
5.3.2 Interrupt Mechanism
As discussed before, for applications that are not time constrained, data is flooded into the
SRAM, processed by the RP, and the results written back into the SRAM. In order to notify the
microprocessor that it has finished processing data, the RP subsystem utilizes an interrupt
mechanism.
When the RP has completed a data processing task, it will enable an interrupt signal to the SAG.
The SAG will then write an interrupt register with the value corresponding to an RP interrupt and
mask all future interrupts until the current one is cleared. Masking the interrupt register is
required because the SAG services interrupts not only for the RP, but also for itself and the 1394
interface as well. The SAG will then in turn interrupt the microprocessor. The interrupt handler
on the PowerPC 604 will then read the interrupt register on the SAG to determine what caused
the interrupt, process the interrupt appropriately, and finally clear the interrupt and unmask
interrupts on the SAG.
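The interrupt sequence described above can be modeled as a small state machine. The Python sketch below is a toy model under stated assumptions: the class and method names are invented for illustration, and the text does not say what happens to a request that arrives while interrupts are masked, so the model simply reports it as not accepted.

```python
class SagInterruptModel:
    """Toy model of the SAG interrupt register behaviour described in the text."""

    SOURCES = ("RP", "SAG", "1394")

    def __init__(self):
        self.register = None          # cause of the interrupt being serviced
        self.masked = False           # True while an interrupt is outstanding
        self.cpu_interrupted = False  # interrupt line to the microprocessor

    def raise_interrupt(self, source: str) -> bool:
        if source not in self.SOURCES:
            raise ValueError(f"unknown interrupt source: {source}")
        if self.masked:
            return False  # assumption: a masked request is not accepted here
        self.register = source        # record the cause
        self.masked = True            # mask further interrupts
        self.cpu_interrupted = True   # interrupt the microprocessor
        return True

    def handler_read_cause(self):
        """Step taken by the interrupt handler: read the interrupt register."""
        return self.register

    def handler_clear(self):
        """Final handler step: clear the interrupt and unmask interrupts."""
        self.register = None
        self.masked = False
        self.cpu_interrupted = False
```

A typical sequence is `raise_interrupt("RP")`, then `handler_read_cause()` on the microprocessor side, then `handler_clear()`, after which a new interrupt from any source can be accepted.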
5.4 Register Interface
Before the RP subsystem can be used for data processing, certain control information must be
specified by the microprocessor. This information might include:
1) The location of the RP configuration file
2) When to begin RP configuration
3) Source and destination information for various data streams
4) Which patterns are to be used by the DS for which streams
In addition, control registers are also used for flow control during stream processing.
The Register Interface provides the mechanism for writing control registers on the SAG, DS, and
RP (the 1394 interface and alphanumeric display registers are also serviced by the Register
Interface. For the purposes of this discussion, since they physically reside in the SAG, they are
considered SAG control registers). Unlike stream processing mode, the RP subsystem is
considered a slave device on the Chidi local bus when processing Register Interface transactions.
The control logic that processes all Register Interface transactions resides physically on the SAG.
If the register being accessed is located on the SAG also, the SAG processes the transaction
internally. The SAG generates independent read, write, and chip select signals for each entity's
control registers (SAG, 1394, and alphanumeric registers). However, for DS and RP registers, the
SAG must assert the proper control signals on the external Register Interface with the appropriate
address. Since the DS and RP share read, write, and chip select signals, the SAG must generate
dependent signals that conform to the truth table given below.
/CS   /WR   /RD   Function
0     0     0     Invalid
0     0     1     Write DS Registers
0     1     0     Read DS Registers
0     1     1     No function
1     0     0     Invalid
1     0     1     Write RP Registers
1     1     0     Read RP Registers
1     1     1     No function
Table 2: External Register Interface Control Signal Truth Table
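The truth table can be read as a simple decode: /CS selects the target (DS when low, RP when high), while /WR and /RD select the action. The Python sketch below illustrates that reading; the function name is an assumption made for this example, and the actual decode is implemented in hardware on the DS and RP.

```python
def decode_register_access(cs_n: int, wr_n: int, rd_n: int) -> str:
    """Decode the /CS, /WR, and /RD levels (0 or 1) according to Table 2."""
    for name, level in (("/CS", cs_n), ("/WR", wr_n), ("/RD", rd_n)):
        if level not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1, got {level!r}")
    if wr_n == 0 and rd_n == 0:
        return "Invalid"        # write and read asserted together
    if wr_n == 1 and rd_n == 1:
        return "No function"    # neither write nor read asserted
    target = "DS" if cs_n == 0 else "RP"
    action = "Write" if wr_n == 0 else "Read"
    return f"{action} {target} Registers"
```

For instance, `decode_register_access(1, 0, 1)` corresponds to the "Write RP Registers" row.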
The Data Shuffler Register Interface consists of the three control signals mentioned above and 17
address signals. This provides addressing to 131,072 locations, more than enough for the 2,800
LEs and the 20,480 RAM bits within the DS.
However, unlike the DS, the RP Register Interface consists of four control signals and 20 address
bits. The four control signals include the /CS, /WR, and /RD signals mentioned previously, as
well as a signal called /REGAACK. The DS runs on the same 66.66 MHz clock as the SAG,
allowing it to latch the signals for a register transaction in one clock cycle. However, the RP runs
on a variable clock signal that is application dependent. Therefore, depending on the clock speed
of the RP, it might require more than one clock period to process Register Interface transactions.
The /REGAACK signal is the acknowledgement indicating to the SAG that the Register Interface
transaction has been completed. Until the SAG detects that this signal has been asserted, it must
maintain the validity of all control and address signals for the RP Register Interface. The 20
address bits provide 64-bit word addressing for the 2 Mbyte SRAM as well as bit addressing for
the 4,992 LEs and the 24,576 RAM bits in the 10K100 device.
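The address-width claims for the two Register Interfaces can be checked with simple arithmetic. The Python sketch below assumes one address per LE, one per RAM bit, and one per 64-bit SRAM word, and takes 1 Mbyte to mean 2^20 bytes; these are assumptions made for the check, not statements from the design.

```python
DS_ADDRESS_BITS = 17
RP_ADDRESS_BITS = 20

# Data Shuffler: 17 address signals versus LEs plus RAM bits.
ds_addressable = 2 ** DS_ADDRESS_BITS        # 131,072 locations
ds_required = 2_800 + 20_480                 # 23,280 locations

# RP: 20 address bits versus SRAM words plus LEs plus RAM bits.
sram_words = (2 * 1024 * 1024) // 8          # 2 Mbyte as 64-bit words: 262,144
rp_addressable = 2 ** RP_ADDRESS_BITS        # 1,048,576 locations
rp_required = sram_words + 4_992 + 24_576    # 291,712 locations

print(ds_addressable, ds_required)   # 131072 23280
print(rp_addressable, rp_required)   # 1048576 291712
```

Under these assumptions both interfaces have address space to spare.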
6 Reconfigurable Processor
6.1 Overview
By definition, the RP is a dynamic processing element. Unlike most processors implemented
using dedicated hardware, the RP can be configured to compute any function any number of
times by the microprocessor at run time. The main purpose of the RP block is to provide an
infrastructure that allows application engineers to develop functions in VHDL with only a
limited understanding of the underlying Chidi system. By providing an added layer of abstraction,
applications can be developed at a much faster rate, reducing both development time and cost.
Therefore, functional and behavioral specifications for the RP must be application independent.
The goal is to provide a structure under which all foreseeable applications can be executed, given
reasonable constraints. The following is a summary of the design requirements the RP block must
support:
1) 64-bit input and output data busses
2) a variable clock frequency for different designs of various applications
3) interface capabilities between it and DS and SAG
4) two physical input channels and up to eight virtual channels
5) one physical output channel and up to four virtual output channels
6) addressing to all registers and RAM bits inside the RP as well as to the 2 MByte SRAM
7) serial configuration mode via the SAG
8) a 2 MByte SRAM that supports byte and word writes, burst transactions, and a 64-bit data
bus
9) a 1-2 Gbit/sec High-Speed I/O Port
10) additional signals for debugging purposes
Below is a block diagram for the RP block, which includes the RP, SRAM, High-Speed I/O Port,
and input/output FIFOs. It illustrates the interconnection between the RP sub-blocks, the number
of signals necessary for inter-block communication, and the number of signals required to
interface with other parts of the Chidi system.
[Figure 11 block diagram: the Reconfigurable Processor with its input and output FIFOs and its configuration, data, and control/status signal groups.]
Figure 11: RP Block Diagram
6.2 Functional Behavior
A description of the RP can be divided into functional blocks and interface descriptions. As can
be seen in Figure 11, the RP consists of not only the 100,000 gate FPGA, but also a 2 Mbyte
SRAM and a High-Speed I/O Port. In addition, the RP provides interfaces to the SAG, external
FIFOs, and the Register Interface. Below is a description of each block and interface for the RP.
6.2.1 Functional Blocks
6.2.1.1 RP
The RP itself is a large 100,000 gate FPGA. At any one time, it needs to support not only the
application it is running but also all of the logic necessary to implement the many different
interfaces between itself and the SRAM, High-Speed I/O Port, the SAG, external FIFOs, and
Register Interface. The RP is implemented using a FLEX 10K100 FPGA from Altera and is
configured by the microprocessor, via the SAG, at run-time. Device details can be found in
Section 4.2 while device configuration information is located in Section 6.2.5.
6.2.1.2 SRAM
The SRAM is a 2 Mbyte synchronous memory element that interfaces directly with the RP. Its
main purpose is to store incoming or outgoing data in some stream processing applications. The
interface to the SRAM supports a 64-bit data bus and 17-bit addressing (64-bit word addressing)
with 8 byte enables (byte addressing). Table 3 below gives a brief description for the signals that
comprise the RP/SRAM interface.
Signal Name   Description
DATA<63:0>    64-bit data bus
ADDR<16:0>    17-bit address bus
/WE           Write Enable
/WEH#         High Byte Write Enable
/WEL#         Low Byte Write Enable
/CE           Chip Enable
/OE           Output Enable
/ADSC         Address Status Controller. Initiates and extends single and burst reads and writes
/ADV          Address Advance. Extends burst reads and writes
ZZ            Snooze Enable
Note: #=1, 2, 3, or 4
Table 3: SRAM Control Signal Descriptions
The signals described above are used to control the SRAM in a variety of ways. The SRAM
supports power-saving modes as well as burst capabilities for read and write transactions. Burst
capabilities are extremely useful when accessing data into and out of the SRAM for stream
processing applications. The table below summarizes the operations supported by the SRAM.
Operation                      Address Used   ZZ   /ADSC   /ADV   /CE   /WE   /OE   Data
Deselected Cycle, Power-down   None           L    L       X      H     X     X     High-Z
Snooze Mode, Power-down        None           H    X       X      X     X     X     High-Z
Write Cycle, Begin Burst       External       L    L       X      L     L     X     WRITE Data
Read Cycle, Begin Burst        External       L    L       X      L     H     L     READ Result
Write Cycle, Continue Burst    Next           L    H       L      X     L     X     WRITE Data
Read Cycle, Continue Burst     Next           L    H       L      X     H     L     READ Result
Write Cycle, Suspend Burst     Current        L    H       H      X     L     X     WRITE Data
Read Cycle, Suspend Burst      Current        L    H       H      X     H     L     READ Result
Note: Values only valid on rising edge of CLK signal
Table 4: SRAM Control Signal Truth Table
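The truth table can also be expressed as a pattern match over the six control inputs. The Python sketch below is illustrative: it assumes the column order ZZ, /ADSC, /ADV, /CE, /WE, /OE given in the table header, treats "X" as a don't-care, and uses a function name invented for this example. Input combinations that do not appear in Table 4 return `None` rather than a guessed operation.

```python
# Each row: (ZZ, /ADSC, /ADV, /CE, /WE, /OE) pattern, operation, address used.
# "X" marks a don't-care input, as in Table 4.
SRAM_OPERATIONS = [
    ("LLXHXX", "Deselected Cycle, Power-down", "None"),
    ("HXXXXX", "Snooze Mode, Power-down", "None"),
    ("LLXLLX", "Write Cycle, Begin Burst", "External"),
    ("LLXLHL", "Read Cycle, Begin Burst", "External"),
    ("LHLXLX", "Write Cycle, Continue Burst", "Next"),
    ("LHLXHL", "Read Cycle, Continue Burst", "Next"),
    ("LHHXLX", "Write Cycle, Suspend Burst", "Current"),
    ("LHHXHL", "Read Cycle, Suspend Burst", "Current"),
]


def decode_sram_operation(zz, adsc_n, adv_n, ce_n, we_n, oe_n):
    """Return (operation, address used) for levels sampled on a rising CLK edge."""
    levels = (zz, adsc_n, adv_n, ce_n, we_n, oe_n)
    if any(level not in ("L", "H") for level in levels):
        raise ValueError("each level must be 'L' or 'H'")
    for pattern, operation, address in SRAM_OPERATIONS:
        if all(p in ("X", level) for p, level in zip(pattern, levels)):
            return operation, address
    return None  # combination not listed in Table 4
```

A burst read would then appear as one "Begin Burst" cycle using the external address, followed by "Continue Burst" cycles using the next address.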
6.2.1.3 High-Speed I/O Port
The High-Speed I/O Port serves as an interface directly in and out of the RP from an off-board
source. This port will mostly be used when the overhead required and non-deterministic latency
for the Chidi local bus and the PCI bus are unacceptable for certain data-intensive applications.
The High-Speed I/O Port supports two independent 32-bit data paths with four control/status
signals in and out of the RP. This interface has an operating frequency of up to 65 MHz, allowing
for up to 2.34 Gbits/sec throughput in both directions. In addition, the High-Speed I/O port
supports two special synchronization signals that will allow multiple Chidi boards to be used to process
different parts of large data sets and still remain synchronized. Table 5 gives a description for the
signals that make up the RP/High-Speed I/O interface.
Signal Name       I/O   Description
HSOUT<31..0>      O     High-Speed 32-bit data output
HSOUTCLK          O     Output clock
HSOUTC/S<9..0>    O     Output Control and Status Signals
HSSYNC<1..0>      I     Synchronization signals
HSIN<31..0>       I     32-bit data input
HSINCLK<1..0>     I     Input clock
HSINC/S<9..0>     I     Input Control and Status Signals
Table 5: High-Speed I/O Port Signal Descriptions
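The 2.34 Gbits/sec figure quoted for the High-Speed I/O Port can be reproduced with a short calculation. The Python sketch below shows one reading that matches it: counting the four control/status signals together with the 32 data bits in each direction. This interpretation is an assumption; the text does not state how the figure was derived.

```python
DATA_BITS = 32              # width of each independent data path
CONTROL_STATUS_BITS = 4     # control/status signals per direction, per the text
CLOCK_HZ = 65_000_000       # maximum operating frequency of the interface

data_only_bps = DATA_BITS * CLOCK_HZ
with_control_bps = (DATA_BITS + CONTROL_STATUS_BITS) * CLOCK_HZ

print(data_only_bps / 1e9)     # 2.08
print(with_control_bps / 1e9)  # 2.34
```

Counting data bits alone gives 2.08 Gbits/sec per direction, so the payload rate available to an application may be closer to that value.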
The High-Speed I/O Port is implemented using the LVDS chip set from National Semiconductor
for many reasons. First, as the name suggests, LVDS (Low Voltage Differential Signaling)
technology takes advantage of differential signaling techniques to transfer large amounts of data
at high rates and with low power [14]. Differential signaling is less susceptible to common mode
noise and can migrate to lower supply voltages with greater ease than traditional single-ended
signaling schemes. A second reason that the LVDS signaling convention, as specified by the
IEEE 1596.3 standard, was chosen over other standards was that it did not impose any data type
constraints on our system. Chip sets from other manufacturers supported standards with
bit sizes not easily compatible with the Chidi word or double word (such as the 10 or
20 bit format required by Fibre Channel). Finally, termination consists of only one 100 ohm
resistor between each signal pair in LVDS technology, simplifying implementation. Below is a
simple diagram illustrating how the High-Speed I/O is implemented using LVDS transmitter and
receiver chips from National Semiconductor.
[Figure 12 diagram: the RP connected to the LVDS transmitter, receiver, and SYNC chips.]
Note: Terminating resistors have a value of 100 ohms
Figure 12: High-Speed I/O Port
The transmitters take in TTL/CMOS signals and convert them into LVDS signals. Six
TTL/CMOS signals are transmitted over one LVDS signal pair while one clock signal is
transported over one LVDS signal pair. Similarly, the receivers convert LVDS signals into
TTL/CMOS signals. In the figure above, the SYNC chip is just a smaller version of the receiver.
As a result, 16 LVDS signals (8 pairs) come into the High-Speed I/O Port from an on-board
connector, while 20 LVDS signals (10 pairs) go to the connector.
6.2.2 Interfaces
The RP communicates with the rest of the Chidi system through two main interfaces, the RP/SAG
interface and the RP/External FIFO interface. Since the RP operates at a variable clock frequency,
which is application dependent, it does not communicate directly with entities not in the RP
subsystem. The two RP interfaces are described in detail in the sections that follow.
6.2.2.1 RP/SAG Interface
The RP/SAG Interface is actually two independent sets of signals. Signals going to the SAG from
the RP indicate which virtual channels are in need of data for processing. The signals going to the
RP from the SAG specify the destination of the next data transfer as well as how many words
(either 32-bit or 64-bit words) are in the transfer. The table below gives a more detailed signal
description.
Signal Name          I/O   Description
REQCHAN<3..0>        O     Indicates the virtual channel making the data request
REQSIZE              O     Indicates the size of the request
REQSTROBE            O     Request valid signal
WRITECHAN<3..0>      I     Virtual channel destination
TRANSFERSIZE<4..0>   I     Size of transfer in words (32 or 64-bit words, depending on the mode)
Table 6: SAG/RP Interface - Signal Descriptions
6.2.2.2 RP/External FIFO Interface
The RP receives and processes stream and register interface data through three physical channels.
The interfaces that allow the RP to control data flow to and from these physical channels are
detailed below.
6.2.2.2.1 Read0 Physical Channel Interface
The Read0 Physical Channel is one of the two 32-bit channels used to receive stream input
data. It can be configured as one independent 32-bit channel or as the lower half of a 64-bit
channel.
In addition, the Read0 channel also supports Register Interface transactions. Register Interface
reads and writes are supported through the use of two "mailbox" registers. These mailbox registers
are located physically within the FIFO. They provide a way to use the same physical channel to
perform reads and writes that can be accessed independently of the contents of the FIFO. This
allows stream processing to continue after it has been interrupted for a Register Interface
transaction.
Another way of supporting Register Interface transactions is to use the Read0 Physical Channel to
process RP register writes (input data) and the Write0 Physical Channel to process register reads
(output data). This allows the Read0 channel to be used only for data flowing into the RP, while
the Write0 channel only services data flowing out of the RP. This solution was not implemented
solely for reasons of pin-usage efficiency: configuring the Read0 and Write0 channels to support
input and output mailbox registers, respectively, requires more signals overall than using the
Read0 channel alone for both input and output. Therefore, the Read0 Physical Channel
supports all Register Interface transactions, regardless of type. Table 7 below gives the signal
descriptions while Table 8 provides the function table for the Read0 Physical Channel.
Signal Name     I/O   Signal Description
R0DATA<31..0>   I/O   32-bit data bus for stream input data and register input/output data.
/R0EF           I     Read0 Empty Flag. When /R0EF is LOW, the FIFO is empty and reads
                      from its memory are disabled. /R0EF is forced LOW when the device
                      is reset and is set HIGH by the second LOW-to-HIGH transition of
                      RPCLK after data is loaded into empty FIFO memory.
/R0W/RB         O     Read0 Write/Read Select. See table below for usage details.
R0MBB           O     Read0 Mailbox Select. See table below for usage details.
/R0MBF1         I     Read0 Mailbox Flag. LOW when data is valid in the Mailbox Register
                      (implies the end of a RP Register WRITE). HIGH-to-LOW transition,
                      synchronous to DSCLK66. LOW-to-HIGH transition, synchronous to
                      RPCLK.
Table 7: RP Read0 Physical Channel FIFO Signal Descriptions
R0W/RB   MBB   Function
H        H     Mail2 Write
L        L     FIFO Read
L        H     Mail1 Read
H        L     High-Z
Table 8: RP Read0 Physical Channel FIFO Function Table
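The function table amounts to a two-input lookup. The Python sketch below encodes it directly; the dictionary and function names are assumptions made for this example, and the levels are written as "H" and "L" as in the table.

```python
# (W/RB level, MBB level) -> function selected on the Read0 physical channel.
READ0_FUNCTIONS = {
    ("H", "H"): "Mail2 Write",
    ("L", "L"): "FIFO Read",
    ("L", "H"): "Mail1 Read",
    ("H", "L"): "High-Z",
}


def read0_function(w_rb: str, mbb: str) -> str:
    """Look up the Read0 channel function for the given control levels."""
    try:
        return READ0_FUNCTIONS[(w_rb, mbb)]
    except KeyError:
        raise ValueError("levels must each be 'H' or 'L'") from None
```

In this reading, the mailbox select input chooses between the FIFO and a mailbox register, while the write/read select input chooses the direction.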
6.2.2.2.2 Read1 Physical Channel Interface
The Read1 Physical Channel is the other 32-bit channel that is used to receive stream input data.
Like the Read0 channel, it can be configured as one independent 32-bit channel or as the upper
half of a 64-bit channel. The table below gives the signal descriptions for the Read1 Physical
Channel. Since the Read1 channel is purely an input channel and does not support Register
Interface transactions, the only function that is required is the ability to read from the FIFO,
which is controlled by the R1W/RB signal.
Signal Name     I/O   Signal Description
R1DATA<31..0>   I     32-bit data bus for stream input data.
R1W/RB          O     Read1 Write/Read Select. LOW indicates a read from the FIFO, HIGH
                      indicates High-Z.
/R1EF           I     Read1 Empty Flag. When /R1EF is LOW, the FIFO is empty and reads
                      from its memory are disabled. /R1EF is forced LOW when the device
                      is reset and is set HIGH by the second LOW-to-HIGH transition of
                      RPCLK after data is loaded into empty FIFO memory.
Table 9: RP Read1 Physical Channel FIFO Signal Descriptions
6.2.2.2.3 Write0 Physical Channel Interface
The Write0 Physical Channel is the 64-bit output channel used for stream processing. The table
below gives the signal descriptions for this channel. Since the Write0 Physical Channel is purely
an output channel and does not support any Register Interface transactions, the only function that
is required is the ability to write to the FIFO, which is controlled by the W0ENA signal.
Signal Name     I/O   Signal Description
W0DATA<63..0>   O     64-bit data bus for stream output data.
W0ENA           O     Write0 Enable. HIGH indicates a FIFO Write, LOW indicates High-Z.
/W0AF1          I     Write0 Almost Full Flag. Programmable signal synchronized to
                      DSCLK66. LOW when the number of empty locations in the FIFO is
                      less than or equal to the value in the offset register.
/W0FF1          I     Write0 Full Flag. LOW when the FIFO is full and writes to the FIFO
                      are disabled. Forced LOW when the device is reset and is set HIGH
                      by the second LOW-to-HIGH transition of DSCLK66 after reset.
/W0AF2          I     Used for debugging purposes only to verify that both FIFOs are in the
                      same state.
/W0FF2          I     Used for debugging purposes only to verify that both FIFOs are in the
                      same state.
Table 10: RP Write0 Physical Channel FIFO Signal Descriptions
6.2.3 RP Clock Circuitry
In order to support many possible algorithms with different performance specifications, the RP
requires a variable chip clock. This allows the application engineer to decide at what speed
functions targeted for the RP should run. In addition, it allows the application engineer to decide
how much effort should be put into design optimization.
A register in the SAG that is written by the microprocessor sets the RP clock frequency. The
following table provides the mapping of control bits to RP clock frequencies.
SETRPCLK<3:0>   RP CLK Frequency/Period (MHz/ns)
0010            66.66/15.00
0001            60.00/16.66
0011            55.00/18.18
0000            50.00/20.00
01xx            48.00/20.83
1110            33.33/30.00
1101            30.00/33.33
1111            27.50/36.36
1100            25.00/40.00
10xx            14.32/69.82
Table 11: RP CLK Frequencies
Essentially, this allows the RP to operate at 10 different possible frequencies, relatively evenly
distributed between 66.66 and 14.32 MHz. As a result, this gives the application designer much
more flexibility in deciding how fast a particular algorithm should run and how much time should
be budgeted for and spent on optimizations.
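The mapping from the SETRPCLK control bits to a clock frequency can be sketched as a pattern lookup. The Python example below is illustrative: it assumes the frequencies and bit patterns of Table 11 pair up in the order listed, treats "x" as a don't-care bit, and uses a function name invented for this example.

```python
# (SETRPCLK<3:0> pattern, RP clock frequency in MHz); "x" is a don't-care bit.
RP_CLOCK_TABLE = [
    ("0010", 66.66), ("0001", 60.00), ("0011", 55.00), ("0000", 50.00),
    ("01xx", 48.00), ("1110", 33.33), ("1101", 30.00), ("1111", 27.50),
    ("1100", 25.00), ("10xx", 14.32),
]


def rp_clock_mhz(setrpclk: int) -> float:
    """Return the RP clock frequency selected by a 4-bit SETRPCLK value."""
    if not 0 <= setrpclk <= 0b1111:
        raise ValueError("SETRPCLK is a 4-bit field")
    bits = format(setrpclk, "04b")
    for pattern, mhz in RP_CLOCK_TABLE:
        if all(p in ("x", b) for p, b in zip(pattern, bits)):
            return mhz
    raise AssertionError("unreachable: the table covers all 16 codes")
```

Under this pairing, each 11ab code selects half the frequency of the corresponding 00ab code, for example 0010 gives 66.66 MHz and 1110 gives 33.33 MHz.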
6.2.4 Device Configuration
Unlike the SAG and DS, the RP is not configured at power-up. Since the RP is a dynamically
programmable device, it needs to have the ability to be programmed by the microprocessor at any
time with any configuration an indefinite number of times after power-up. Although the
microprocessor initiates the RP configuration process, it is the SAG that contains all of the logic
necessary to actually program the RP. The microprocessor provides the starting address of the
configuration file in memory and instructs the SAG to begin RP configuration. Then, the SAG
handles the remainder of the configuration details and interrupts the microprocessor when the
process is complete.
A FLEX 10K family device can be configured in a number of different ways. In order to perform
configuration via the microprocessor and the SAG, the MSEL<1:0> and nCE bits must be tied to
ground. In this configuration scheme, the configuration file must be in the RBF format (Raw
Binary File) instead of the POF format (Programming Object File) used in the configuration
EPROM scheme.
The RP itself only requires a few signals for the configuration process. To initiate configuration,
the RP requires a low-to-high transition on the nCONFIG signal. Then, data is clocked in serially
(at or below 10 MHz) until the CONF_DONE signal is pulled high. After CONF_DONE goes
high, the RP requires another 10 periods of the configuration clock (DCLK), before passing from
configuration to device initialization. Figure 13 below provides the timing diagram for RP
configuration while Table 12 gives a description of the timing parameters.
[Figure 13 timing diagram: nCONFIG, nSTATUS, CONF_DONE, DCLK, and DATA waveforms annotated with the timing parameters listed in Table 12.]
Figure 13: RP Configuration Timing Diagram [6]
Symbol    Parameter                                    Min   Max   Units
tCF2CD    nCONFIG low to CONF_DONE low                       1     us
tCF2ST    nCONFIG low to nSTATUS low                         1     us
tCFG      nCONFIG low pulse width                      2           us
tSTATUS   nSTATUS low pulse width                      2.5         us
tCF2CK    nCONFIG high to first rising edge on DCLK    5           us
tDSU      Data setup time before rising edge on DCLK   30          ns
tDH       Data hold time after rising edge on DCLK     0           ns
tCH       DCLK high time                               50          ns
tCL       DCLK low time                                50          ns
tCLK      DCLK period                                  100         ns
fMax      DCLK maximum frequency                             10    MHz
Table 12: Configuration EPROM Scheme Timing Parameters [6]
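The DCLK constraints can be turned into a simple check on a proposed configuration clock. The Python sketch below is illustrative and uses only the DCLK high time, low time, and period values of 50, 50, and 100 ns, read here as minimums, together with the 10 additional DCLK periods required after CONF_DONE goes high. The function names are assumptions made for this example.

```python
T_CH_MIN_NS = 50             # DCLK high time
T_CL_MIN_NS = 50             # DCLK low time
T_CLK_MIN_NS = 100           # DCLK period (10 MHz maximum frequency)
POST_CONF_DONE_CYCLES = 10   # DCLK periods required after CONF_DONE goes high


def dclk_violations(high_ns: int, low_ns: int) -> list:
    """Return the names of the DCLK parameters violated by a proposed clock."""
    problems = []
    if high_ns < T_CH_MIN_NS:
        problems.append("tCH")
    if low_ns < T_CL_MIN_NS:
        problems.append("tCL")
    if high_ns + low_ns < T_CLK_MIN_NS:
        problems.append("tCLK")
    return problems


def post_conf_done_ns(high_ns: int, low_ns: int) -> int:
    """Time spent clocking the extra DCLK periods after CONF_DONE goes high."""
    return POST_CONF_DONE_CYCLES * (high_ns + low_ns)
```

At the maximum rate of 10 MHz with a 50/50 duty cycle, the check passes and the ten trailing clock periods take 1 us.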
After device initialization has been completed, the RP enters what is called user mode. At this
time, the RP can begin to execute the function or algorithm for which it was configured.
6.3 Implementation
At the time of writing, the Chidi multimedia system has completed the layout, fabrication, and
assembly process. Debugging has commenced in all areas and is progressing. This
means that all physical RP specifications have been implemented. A parts list that
summarizes each physical part used to implement not only the RP, but also the entire RP
subsystem is given in the Appendix, Section 10.3.
6.3.1 High Speed I/O Port
Special attention was given to the LVDS signals, used in the High Speed I/O Port, during the
layout process. Since LVDS signals are differential in nature, it was necessary to route them in
matching pairs and to use heavy chamfering. These steps were necessary to reduce skew and
phase differences between signal pairs, as well as minimize signal reflection and EMI. In
addition, the board layout and fabrication engineers were instructed to control LVDS signal trace
impedances to 100 ohms +/- 10 ohms. Along with 100 ohm termination resistors and 106 ohm
cables, there is an approximate 100 ohm impedance throughout the LVDS signal path. Finally,
placing the transmitters and receivers extremely close to the off-board connector minimized
LVDS signal trace lengths. These design techniques were employed during the layout process, as
specified by the "LVDS Owner's Manual and Design Guide" [14], in order to implement LVDS
technology correctly.
6.3.2 RP
Besides the physical design outlined in the previous sections, design must also be done for the
FLEX 10K100 FPGA that is the centerpiece of the RP block. There are many roles that engineers
must assume when designing and developing the RP. RP development can be viewed at three
levels:
1) External interface design - this specifies how the signals are connected between
blocks
2) Internal interface design - VHDL-level specification of interface behavior.
3) Application design - VHDL-level specification of particular functions or algorithms.
Although these three levels are listed separately, the work done in one area must incorporate
some level of understanding of the other two areas. External interface design has been completed
for the RP and has been documented in this chapter. However, it could not have been done
correctly without specifying at a behavioral level the internal interface design and taking into
account how applications will use the RP.
An internal interface design engineer must then take the physical and timing constraints as given
and create a type of "wrapper" that will be used in applications. This wrapper will actually be an
entire library of wrappers that engineers will select from based on the number of virtual channels
that are required for the application. Different wrappers can also be chosen based on whether the
SRAM or High Speed I/O Port is needed for a particular algorithm.
Finally, when an application engineer targets a design for the RP, he/she must select the
appropriate wrapper. In addition, he/she should have some level of understanding of how data is
transferred to the RP from Chidi main memory or host memory in order to specify other
information vital to successful implementation. This information might include which patterns the
DS is to use for manipulating which streams, among others. In this manner, the burden of
implementing system specific logic in order to obtain data correctly is lifted from the application
designer. He/she only has to have enough of an understanding to pick the appropriate interface
logic and set the correct parameters, thereby making application development easier and more
efficient.
6.3.3 RP Configuration
6.3.3.1 Design Details
RP Configuration has been implemented on the SAG. Figure 14 below provides a simple block
diagram of the logic required for RP configuration.
[Figure 14 block labels: SAG, PowerPC Bus Control Signals, PPC Bus Interface, Address Generator,
RP Config/PPC Bus Interface, Configuration FSM, Register Bank (64x4) & Control Logic, RP;
configuration signals: CONF_DONE, nSTATUS, INIT_DONE, nCONFIG, DCLK, nCE, MSEL]
Figure 14: RP Configuration Block Diagram
Since the final versions of the Address Generator and Register Interface are being developed
independently by other engineers and were not available, the design for RP configuration uses its
own scaled-down versions of these entities. The design was partitioned so that it would require
minimal or no redesign when the final versions of the Address Generator and Register Interface
are completed.
As mentioned previously, before configuration can begin the microprocessor must first write two
RP Configuration registers. The first register contains the starting address of the configuration
file. The second register signals to the RP Configuration design to start configuration. After
configuration is complete, the SAG is supposed to interrupt the microprocessor, signaling that the
RP has been configured. In this implementation of the design, the interrupt mechanism was not
available. Therefore, the RP Configuration design resets the register that indicates that
configuration is to begin. The microprocessor then polls this register in order to determine when
configuration has finished.
The design that actually configures the RP is made up of three main blocks. The first block, the
clock generation circuit, is a simple clock divider that generates the 8.33 MHz (120 ns) signal
necessary for configuration from the 66.66 MHz (15 ns) system clock. The second block is the
Configuration FSM, which generates the control signals and regulates the data necessary for
configuration. Finally, the third block is the register bank and corresponding control logic. One
register is a 64-bit shift register that allows each byte of data to be shifted serially, LSB first,
into the RP. The second 64-bit register holds the next eight bytes of data for configuration.
Therefore, the only constraint placed on the design is that the next eight bytes of data must be
fetched within 512 system clock cycles (64 bits at one bit every 120 ns take 7.68 us, which is 512
cycles of the 15 ns system clock), which is more than enough time in most conceivable cases.
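The fetch budget and the LSB-first shift order described above can be checked with a short Python sketch. The constant and function names are illustrative assumptions rather than names from the Chidi design, and the order in which the eight bytes leave the 64-bit register is assumed to be simple sequence order.

```python
SYSTEM_CLOCK_NS = 15   # 66.66 MHz system clock period, from the text
DCLK_DIVIDER = 8       # divide-by-8 yields the 120 ns (8.33 MHz) DCLK
REGISTER_BITS = 64     # width of the shift register and of the holding register


def dclk_period_ns():
    """Period of the divided configuration clock in nanoseconds."""
    return SYSTEM_CLOCK_NS * DCLK_DIVIDER


def fetch_budget_cycles():
    """System clock cycles that elapse while one 64-bit register is shifted out."""
    return REGISTER_BITS * dclk_period_ns() // SYSTEM_CLOCK_NS


def serialize_lsb_first(data):
    """Yield the bits of each byte in turn, least significant bit first.

    Byte order within the register is an assumption made for illustration.
    """
    for byte in data:
        for bit_index in range(8):
            yield (byte >> bit_index) & 1


if __name__ == "__main__":
    print(dclk_period_ns(), fetch_budget_cycles())
    print(list(serialize_lsb_first(b"\x01\x80")))
```

The sketch reproduces the 120 ns DCLK period and the 512-cycle window quoted in the text.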
6.3.3.2 Configuration Time
Knowing the configuration time is extremely important for scheduling purposes. The
configuration time can be calculated as follows:
tCONFIG = tCFG + tCF2CK + tCLK(# of bytes of configuration data)(8 bits/byte) + 10(tCLK)
The theoretical minimum time, using timing and configuration file size information provided by
Altera, is:
tCONFIG,MIN = 2 us + 5 us + (100 ns)(149,134 bytes)(8 bits/byte) + 10(100 ns)
            = 119.3152 ms
Since an 8.33 MHz (120 ns) clock was used in this implementation, the expected configuration
time is:
tCONFIG,THEOR = 2 us + 5 us + (120 ns)(149,134 bytes)(8 bits/byte) + 10(120 ns)
              = 143.17684 ms
Experimentally, a configuration time of 143.2 ms was observed, which closely matches the
expected value (the HP1660AS logic analyzer used has 2 ns precision, but only 0.1 ms display
precision when measuring on a millisecond scale).
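The two configuration-time figures above follow directly from the formula. The Python sketch below evaluates it in integer nanoseconds; the function and parameter names are illustrative assumptions, while the numeric defaults are the values quoted in the text.

```python
def configuration_time_ns(t_clk_ns, num_bytes=149_134, t_cfg_ns=2_000, t_cf2ck_ns=5_000):
    """tCONFIG = tCFG + tCF2CK + tCLK * bytes * 8 + 10 * tCLK, in nanoseconds."""
    return t_cfg_ns + t_cf2ck_ns + t_clk_ns * num_bytes * 8 + 10 * t_clk_ns


if __name__ == "__main__":
    for label, t_clk_ns in (("minimum, 100 ns DCLK", 100), ("as implemented, 120 ns DCLK", 120)):
        total_ms = configuration_time_ns(t_clk_ns) / 1e6
        print(f"{label}: {total_ms:.5f} ms")
```

Because the serial data term dominates, configuration time scales almost linearly with the DCLK period and the configuration file size.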
6.3.3.3 Performance
One of the constraints for any design implemented on the SAG is that it must run at 66.66 MHz,
since it interfaces with the Chidi local bus. One trouble spot involved integrating all of the
different modules together. It became necessary to insert registers in long combinational delay
paths between the modules to improve performance.
Another problem area dealt with fan out. In particular, when dealing with a large number of
registers, like those found in the register bank, it became necessary to replicate the control logic.
FSMs that drove a large number of outputs also experienced fan out problems. In these
cases, the state bits were replicated so that one set was used to feed state transitions, while the
other set or sets were used to drive the outputs.
Finally, large FSMs also posed performance problems. Because each LE in a FLEX 10K device
can only accommodate four inputs, FSMs with more than three state bits (eight states) almost
always require at least two LEs to compute the next state (the next state depends on the current
state and inputs). For this reason, the configuration FSM is actually divided into two smaller
FSMs. One FSM handles the beginning of configuration, such as asserting the nCONFIG signal
for at least 2 us, checking that the nSTATUS signal transitions from low-to-high while the
CONF_DONE signal remains low, and waiting at least 5 us before notifying the second FSM that it
should begin actual device configuration. The second FSM then handles the configuration data
and completes the process. By utilizing the above three strategies, the design was able to run at
68.49 MHz (14.6 ns).
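As a rough behavioral illustration of the first of the two FSMs, the following Python sketch steps through the start-of-configuration sequence one system clock at a time. The state names, the error return, and the choice to start the 5 us wait when nSTATUS goes high are assumptions made for illustration; the actual design is written in VHDL and may differ.

```python
import math

SYSTEM_CLOCK_NS = 15  # 66.66 MHz system clock period, from the text


def cycles_for(duration_ns, period_ns=SYSTEM_CLOCK_NS):
    """Smallest whole number of clock cycles that covers duration_ns."""
    return math.ceil(duration_ns / period_ns)


T_CFG_CYCLES = cycles_for(2_000)    # hold nCONFIG low for at least 2 us
T_CF2CK_CYCLES = cycles_for(5_000)  # wait at least 5 us before the first DCLK edge


def run_start_fsm(samples):
    """samples: iterable of (n_status, conf_done) pairs, one per system clock.

    Returns (result, cycle). The result is "START_SECOND_FSM" on success,
    "ERROR" if CONF_DONE goes high early, or the last state if samples run out.
    """
    state, counter = "ASSERT_NCONFIG", 0
    for cycle, (n_status, conf_done) in enumerate(samples):
        if state == "ASSERT_NCONFIG":
            counter += 1
            if counter == T_CFG_CYCLES:
                state, counter = "WAIT_NSTATUS", 0
        elif state == "WAIT_NSTATUS":
            if conf_done:
                return ("ERROR", cycle)
            if n_status:
                state = "WAIT_CF2CK"
        elif state == "WAIT_CF2CK":
            counter += 1
            if counter == T_CF2CK_CYCLES:
                return ("START_SECOND_FSM", cycle)
    return (state, None)


if __name__ == "__main__":
    stimulus = [(0, 0)] * 137 + [(1, 0)] * 400
    print(T_CFG_CYCLES, T_CF2CK_CYCLES, run_start_fsm(stimulus))
```

Keeping this sequencing in its own small state machine mirrors the partitioning argument made above: each FSM needs only a few state bits.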
7 Data Shuffler
7.1 Overview
The Data Shuffler services the data portion of any transaction going to or coming from the
Reconfigurable Processor. These transactions include main memory accesses for stream
processing as well as register interface accesses. For register interface transactions, the data is
passed directly to the RP via mailbox registers, while for stream processing, the DS performs
some type of manipulation on the data before loading it into the external FIFOs so that it may be
processed by the RP accordingly. A block diagram for the Data Shuffler is given below.
[Figure 15 block labels: SAG/DS Interface (to/from SAG), PPC Bus I/F (64-bit data bus), Read0
Physical Channel, Read1 Physical Channel, Physical Channel Control, FIFO I/F, external FIFOs]
Figure 15: Data Shuffler Block Diagram
The DS will need to "shuffle" the stream data going from main memory to the RP for two main
reasons. First, the data stream might be offset from a 64-bit word boundary coming out of main
memory. The DS will align this data before allowing the RP to begin processing. Second, the data
might need to be parsed in some particular manner based on the application that is running in the
RP. Some examples include extracting every other piece of data (decimate by 2) and extracting
every third piece of data (decimate by 3, or extracting one particular channel from packed RGB
data), among others. The Data Shuffler supports byte, short (16-bit), packed RGB (24-bit), word
(32-bit), and double word (64-bit) data types.
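A purely software view of the two shuffling tasks, alignment and decimation, may help make them concrete. The Python sketch below is an illustrative model only; the function name and its parameters are assumptions, and the hardware works on 64-bit words through multiplexers rather than on byte strings.

```python
def shuffle(stream, offset=0, element_size=1, decimate=1):
    """Drop `offset` leading bytes, then keep every `decimate`-th element.

    stream       -- bytes as they arrive from main memory
    offset       -- number of bytes by which the stream is misaligned
    element_size -- bytes per element (1 = byte, 2 = short, 3 = packed RGB pixel, ...)
    decimate     -- keep one element out of every `decimate`
    """
    aligned = stream[offset:]
    usable = len(aligned) - len(aligned) % element_size  # ignore a trailing partial element
    elements = [aligned[i:i + element_size] for i in range(0, usable, element_size)]
    return b"".join(elements[::decimate])


if __name__ == "__main__":
    print(shuffle(b"RGBRGBRGB", decimate=3))            # one channel of packed RGB data
    print(shuffle(b"RGBRGBRGB", offset=1, decimate=3))  # a different channel, via the offset
```

In this model, extracting one channel from packed RGB data is the same operation as decimating bytes by 3, with the offset choosing the channel.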
7.2 Functional Description
For stream processing, the Data Shuffler can operate in one of two modes. The first mode
implements two independent 32-bit wide input physical channels. Named with respect to the RP,
these channels are called Read0 and Read1. Each input physical channel is made up of a section
that resides internally to the DS and a section implemented using external FIFOs. The DS can
also be configured for 64-bit mode. In this mode, the two physical channels are used as one 64-bit
input channel. For either mode, there is a 64-bit output channel, Write0, which holds the output
streams generated by the RP.
7.2.1 Operations Supported
Conceivably, the Data Shuffler should be able to perform any type of ordering or selecting of the
input data stream for the RP. However, for the first version of the design, only thirteen of these
manipulations, or patterns, are supported. These patterns were selected based on the frequency of
probable use in stream processing. Table 13 below lists the initial thirteen patterns that are
supported by the DS.
Operation                                    Pattern #
Straight Through                             1
Decimate bytes by 2                          2
Decimate bytes by 3/Extract one channel      3
Decimate bytes by 4                          4
Decimate bytes by 6/Channels by 2            5
Decimate shorts by 2                         6
Decimate shorts by 4                         7
Extract two channels                         8
Extract every pixel                          9
Decimate pixels by 2                         10
Decimate pixels by 4                         11
Decimate words by 2                          12
Decimate double words by 2                   13
Table 13: Operations Supported
Due to implementation details, the DS can only support eight different patterns at one time (or
fewer, depending on the pattern). However, this pattern map can be loaded with different patterns
by the microprocessor via the register interface. Theoretically, this allows the DS to support an
unlimited number of patterns for input data streams. Practically, due to area constraints, the
number of patterns obviously cannot increase indefinitely.
The Data Shuffler internals can be categorized into interfaces and input and output physical
channels. The interfaces allow the DS to receive relevant status signals from other parts of the
system as well as assert any necessary control signals. The physical channels perform
manipulations on the data as well as serving as intermediate storage between system elements.
Descriptions for each are given in the sections below.
7.2.2 Interfaces
7.2.2.1 DS/SAG Interface
The DS/SAG Interface consists of both control and status registers. The following two sections
present the needs and uses of these two sets of registers in detail.
7.2.2.1.1 Control Registers
The control registers hold information from the SAG about the next data transfer to or from
memory. The SAGCTLWR signal indicates when new values are being written into the control
registers. The physical channel controllers will only latch these new values if the channel
registers contain the corresponding physical channel values. The /RESET register is used to clear
a physical channel of any residual data values that are no longer deemed necessary by the SAG.
This occurs at the end of a stream when more bytes were retrieved than necessary if accesses did
not end on a 64-bit boundary. The Read channels use the MODE, PATTERN, and OFFSET
registers in order to determine how to properly manipulate incoming stream data. The PPC Bus
Interface uses the BURSTSIZE register in order to complete data tenure transactions
appropriately. Finally, the TRANSFERSIZE register is used to write the correct number of words
or double words (depending on the mode) into the Read channel FIFOs. This is useful in the case
when the SAG has deemed that a burst read from memory is more efficient in terms of
transaction time, but only wants a portion of the entire four double words to be processed by the
RP. This is often the case at the end of a given input stream. The table below summarizes the
control registers and gives a brief description of each.
Register Name       Description
PATTERN<7:0>        Specifies where in the pattern map a pattern is stored (bits 7-5) and
                    what type of pattern it is: straight-through, decimate data-type by N,
                    etc. (bits 4-0)
OFFSET<2:0>         Specifies by how many bytes the next data transfer is offset
CHAN<1:0>           Specifies for which channel the next data transfer is destined.
                    00=None, 01=Write0, 10=Read0, 11=Read1.
BURSTSIZE           HIGH indicates a burst transfer, LOW indicates a single-beat transfer
TRANSFERSIZE<4:0>   The number of valid 32 or 64 bit data (depending on the current
                    mode) after any data shuffling operations have been executed.
/RESET              Negatively asserted reset bit. Applies only to the channel specified by
                    the CHAN<1:0> register
SAGCTLWR            Indicates that the SAG is writing the above control registers
Table 14: SAG/DS Interface - Control Registers
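The field layout of the PATTERN and CHAN registers can be illustrated with a small decoder. This is a sketch in Python under the bit assignments stated in the table; the function names and the dictionary are hypothetical and do not come from the design itself.

```python
# CHAN<1:0> encoding as listed in the control register table.
CHANNELS = {0b00: "None", 0b01: "Write0", 0b10: "Read0", 0b11: "Read1"}


def decode_pattern(pattern):
    """Split PATTERN<7:0> into (pattern map location, pattern type).

    Bits 7-5 give the location in the pattern map, bits 4-0 the pattern type.
    """
    location = (pattern >> 5) & 0b111
    pattern_type = pattern & 0b11111
    return location, pattern_type


def decode_channel(chan):
    """Return the name of the channel selected by CHAN<1:0>."""
    return CHANNELS[chan & 0b11]


if __name__ == "__main__":
    print(decode_pattern(0b101_00011), decode_channel(0b10))
```

Three location bits give the eight pattern map entries mentioned earlier, and five type bits leave room for up to 32 pattern types.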
7.2.2.1.2 Status Registers
The Data Shuffler must provide the SAG with status information on the three physical channels.
The SAG uses this data to determine for which channel it should make the next memory
transaction. A Read channel signals that it can accommodate a burst transaction when it has
processed the preceding data transaction and written the results into the FIFOs, and the
corresponding FIFO indicates that it still has room to hold the data from a burst. The Write0
channel indicates that there is enough data for a memory write when any of its internal registers
contains valid data.
When all of its internal registers are valid, then it signals that there is enough data for a burst
write to memory. The table below summarizes the status registers for the DS/SAG Interface.
Signal Name      Description
R0AVAIL          Indicates that the Read0 physical channel is available for a burst write
R1AVAIL          Indicates that the Read1 physical channel is available for a burst write
W0AVAIL<1:0>     Indicates how much data the Write0 physical channel has: 00=no data,
                 01=single, 10=burst, 11=invalid
Table 15: SAG/DS Interface - Request Mechanism
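The status encodings above can be expressed as simple functions of the channel state. The Python sketch below assumes that the Write0 register validity is available as a 4-bit value with one bit per internal register bank, and that the two Read channel conditions are available as booleans; these inputs and the function names are illustrative assumptions.

```python
def w0avail(valid):
    """Encode W0AVAIL<1:0> from a 4-bit register-valid value.

    00 = no data, 01 = enough for a single write, 10 = enough for a burst write.
    """
    valid &= 0b1111
    if valid == 0b1111:
        return 0b10  # all four registers valid: burst write possible
    if valid:
        return 0b01  # at least one register valid: single write possible
    return 0b00


def read_avail(previous_transfer_done, fifo_has_burst_room):
    """R0AVAIL/R1AVAIL: the channel can accept a burst of new data."""
    return previous_transfer_done and fifo_has_burst_room


if __name__ == "__main__":
    print([w0avail(v) for v in (0b0000, 0b0100, 0b1111)])
```

The code 11 is never produced, which matches its listing as invalid.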
7.2.2.2 DS/PPC Bus Interface
The DS/PPC Bus Interface manages data flow in and out of the DS to and from memory. It
monitors the values of the CHANNEL register along with PPC Bus signals in order to determine
if input data is to be loaded into the Read Physical Channels or if data is to be read from the
Write0 Channel. Below is a signal summary for the DS/PPC Bus Interface.
Signal Name       I/O   Description
/TA               I/O   Transfer Acknowledge. As an input, indicates that the DS should
                        latch the data on the PPC Data Bus. As an output, indicates that the
                        DS is driving valid data onto the PPC Data Bus.
/SAGDBG           I     SAG Data Bus Grant. Indicates when the SAG has been granted
                        the local data bus.
/604DBB           I     604 Data Bus Busy. Indicates when the PPC604 is using the data
                        bus.
CHAN<1:0>         I     Indicates the channel for the next transaction
W0_DAV<3:0>       I     Indicates the amount of valid data in the Write0 Physical Channel
R0_LATCH_DATA     O     Indicates to the Read0 Channel that the data latched from the PPC
                        Data Bus is for it.
R1_LATCH_DATA     O     Indicates to the Read1 Channel that the data latched from the PPC
                        Data Bus is for it.
W0_LATCHED        O     Indicates to the Write0 Channel that data has been read from its
                        register banks.
Table 16: DS/PPC Bus Interface Signal Descriptions
7.2.2.3 DS/External FIFO Interface
The DS transmits and receives data to and from the RP through three physical channels. As
mentioned before, these physical channels consist of both DS internals and external FIFOs. The
three interfaces that allow the DS to control data flow to and from these FIFOs are detailed
below.
7.2.2.3.1 Read0 External FIFO Interface
The Read0 External FIFO Interface supports both stream and register interface transactions.
Status signals indicate the state of the FIFO and of the output mailbox register, while control
signals allow the DS to write the FIFO and input mailbox or read the output mailbox. The table
below provides signal descriptions while Table 18 gives the Read0 External FIFO Interface
function table.
Signal Name      I/O   Description
R0DATA<31:0>     I/O   32-bit data path for the Read0 Physical Channel. Used for both stream
                       and register interface transactions.
/R0MAILFLAG      I     Read0 Mailbox Flag Input. Indicates when data is available in the
                       Mailbox 2 register.
/R0AF            I     Read0 Almost Full Flag. Indicates that the Read0 Physical Channel
                       FIFO has between 1 and 8 available blocks.
/R0FF            I     Read0 Full Flag. Indicates when the Read0 Physical Channel FIFO is
                       completely full.
R0ENA            O     Read0 Enable. Indicates that the Read0 Physical Channel FIFO is
                       enabled.
R0W/R            O     Read0 Write/Read signal. Indicates a Write or Read of the Read0
                       Physical Channel FIFO.
R0MAILENA        O     Read0 Mailbox Enable. Indicates that the Read0 Physical Channel
                       Mailbox Registers are enabled.
Table 17: DS Read0 Physical Channel Signal Descriptions
R0W/R    R0Ena    R0MailEna    Function
H        H        L            FIFO Write
H        H        H            Mail1 Write
L        H        H            Mail2 Read
H        L        X            High Z
Table 18: DS Read0 Physical Channel Function Table
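The function table above can be read as a lookup with one don't-care entry. The Python sketch below encodes exactly the four listed rows; the names are illustrative, and signal combinations that the table does not list return None rather than a guessed function.

```python
# Rows of the Read0 function table: (R0W/R, R0Ena, R0MailEna) -> function.
# "X" marks a don't-care input.
FUNCTION_TABLE = [
    (("H", "H", "L"), "FIFO Write"),
    (("H", "H", "H"), "Mail1 Write"),
    (("L", "H", "H"), "Mail2 Read"),
    (("H", "L", "X"), "High Z"),
]


def read0_function(w_r, ena, mail_ena):
    """Return the function selected by the three control levels ("H" or "L")."""
    levels = (w_r, ena, mail_ena)
    for pattern, function in FUNCTION_TABLE:
        if all(p in ("X", level) for p, level in zip(pattern, levels)):
            return function
    return None  # combination not listed in the table


if __name__ == "__main__":
    print(read0_function("H", "H", "L"), read0_function("H", "L", "H"))
```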
7.2.2.3.2 Read1 External FIFO Interface
Unlike the Read0 interface, the Read1 External FIFO Interface only supports stream-processing
transactions. As a result, the Read1 interface is much less complex. The table below provides the
signal descriptions for the Read1 External FIFO Interface.
Signal Name      I/O   Description
R1DATA<31:0>     O     32-bit data path used for stream transactions.
R1ENA            O     Read1 Enable. Indicates that the Read1 Physical Channel FIFO is
                       enabled.
/R1AF            I     Read1 Almost Full Flag. Indicates that the Read1 Physical Channel
                       FIFO has between 1 and 8 available blocks.
/R1FF            I     Read1 Full Flag. Indicates that the Read1 Physical Channel FIFO is
                       completely full.
Table 19: DS Read1 Physical Channel Signal Descriptions
7.2.2.3.3 Write0 External FIFO Interface
The Write0 External FIFO Interface allows the DS to read the stream processing results from the
RP out of the FIFO in preparation for writing the data to the appropriate destination in memory.
The table below provides the signal descriptions for the Write0 External FIFO Interface.
Signal Name      I/O   Description
W0DATA<63:0>     I     64-bit data path used for stream transactions.
W0ENA            O     Write0 Enable. Indicates that the Write0 Physical Channel FIFO is
                       enabled.
/W0AE            I     Write0 Almost Empty Flag. Indicates that the Write0 Physical Channel
                       FIFO has between 1 and X blocks of valid data.
/W0EF            I     Write0 Empty Flag. Indicates that the Write0 Physical Channel FIFO
                       is completely empty.
Table 20: DS Write0 Physical Channel Signal Descriptions
7.2.3 Input Channels - Read0 and Read1
7.2.3.1 Data Path
The main input data path for the Data Shuffler allows any four bytes of two consecutive 64-bit
words to be selected in any order. This type of flexibility is needed in order to implement the type
of patterns necessary to support stream-processing applications in the RP. Both physical channels
contain identical data paths. This is necessary in order to implement two independent physical
channels for the Data Shuffler. The data path itself contains four sets of 64-bit registers. Four sets
of registers were used because the control logic requires a four-clock cycle delay before it can
compute the select bits for the four large multiplexers correctly. Conveniently, this also supports
four-beat burst transactions into the Data Shuffler. Figure 16 below illustrates the main data path
for the two input physical channels.
[Figure 16 diagram labels: 64-bit register banks Reg0, Reg1, Reg2, and Reg3 fed from the data
bus; four byte-wide multiplexers (MUX) driving the FIFOs]
Figure 16: Read0/Read1 Datapath
As can be seen above, four large 128-to-8 multiplexers are used in order to perform the
appropriate selections on the input data stream. Each of these multiplexers takes as input the
sixteen bytes from the last two register banks, Reg2 and Reg3. Each multiplexer then has the
ability to select any one of the sixteen bytes. Used together, they create a 4-byte or 32-bit word
that is passed on to the external FIFO interface.
Each of these large 128-to-8 multiplexers is actually made up of eight smaller 16-to-1
multiplexers. Each of these 16-to-1 multiplexers selects one bit from the sixteen corresponding
bits (depending on which bit lane it occupies). Since these eight 16-to-1 multiplexers all use the
same select bits, together they select one of the 16 bytes from the Reg2 and Reg3 64-bit register
banks. Figure 17 shows how the eight smaller 16-to-1 multiplexers are wired together to
implement the larger 128-to-8 multiplexer, as well as the details of each 16-to-1 multiplexer at
a bit level.
[Figure 17 diagram labels: MuxN with select<3..0>, built from 16MUXN_0 through 16MUXN_7; the
16-to-1 multiplexer for bit n takes inputs Reg2_n, Reg2_n+8, ..., Reg2_n+56 and Reg3_n,
Reg3_n+8, ..., Reg3_n+56; data flows from Reg2 and Reg3 to the external FIFOs]
Figure 17: Read0/Read1 Multiplexer Configuration and 16-to-1 Multiplexer
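The multiplexer structure described above can be modeled bit lane by bit lane. The following Python sketch builds a 128-to-8 selection out of eight 16-to-1 selections that share one select value, then uses four of them to form a 32-bit word. The function names are illustrative, and the assumption that select values 0-7 address Reg2 bytes and 8-15 address Reg3 bytes is made for the example only.

```python
def mux16_to_1(bits, select):
    """Select one bit out of sixteen."""
    return bits[select]


def mux128_to_8(reg2, reg3, select):
    """Select one byte out of the sixteen bytes held in Reg2 and Reg3.

    Built from eight 16-to-1 multiplexers, one per bit lane, all sharing
    the same 4-bit select value.
    """
    candidates = bytes(reg2) + bytes(reg3)  # assumed order: Reg2 bytes, then Reg3 bytes
    result = 0
    for lane in range(8):
        lane_bits = [(byte >> lane) & 1 for byte in candidates]
        result |= mux16_to_1(lane_bits, select) << lane
    return result


def select_word(reg2, reg3, selects):
    """Use four 128-to-8 multiplexers to assemble a 4-byte output word."""
    return bytes(mux128_to_8(reg2, reg3, s) for s in selects)


if __name__ == "__main__":
    reg2 = bytes(range(0x10, 0x18))
    reg3 = bytes(range(0x20, 0x28))
    print(select_word(reg2, reg3, [0, 9, 7, 15]).hex())
```

Because each output byte has its own select value, any four of the sixteen bytes can be emitted in any order, which is the flexibility the patterns rely on.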
7.2.3.2 Control Logic
The control logic for the Read datapaths can be divided into register and multiplexer control.
Figure 18 below gives a simple block diagram that illustrates the different parts of the Read
channel control logic.
[Figure 18 diagram labels: inputs from the SAG/DS Interface, PPC Bus Interface, and internal
registers; Address Generator FSMs; RAM; Register Control Logic; Assertion Control Logic; outputs
to the datapath registers and to the datapath multiplexers]
Figure 18: Read Control Logic
7.2.3.2.1 Register Control Logic
The Register Control Logic is responsible for maintaining the state of the four 64-bit register
banks for the given physical channel. It generates the register enables for the four banks and
maintains the corresponding valid bits.
The Register Control Logic takes in three input signals: /RESET, ALL_CLEAR, and
LATCH_DATA. When /RESET is asserted, all of the register banks are invalidated. The
ALL_CLEAR signal comes from the multiplexer control logic and indicates whether valid data in
Reg3 needs to be held for an extra cycle (when asserted, no hold is needed). If the data must be
held, then new data cannot be clocked into Reg3, but all other banks can still be clocked if their
contents are invalid. In this way, bubbles in the datapath can be eliminated, making the pipeline
more efficient. Finally, the LATCH_DATA signal indicates that
there is new data to be latched. Table 21 below summarizes the Register Control Logic signals.
Signal Name     I/O   Description
/RESET          I     Indicates that all data is to be invalidated
ALL_CLEAR       I     Indicates that data in Reg3 does not need to be held for an extra cycle
LATCH_DATA      I     Indicates that there is data for the corresponding channel
REGENA<3:0>     O     Register enables for the four 64-bit register banks
VALID<3:0>      O     Indicates which register banks contain valid data
Table 21: Register Control Logic Signal Descriptions
7.2.3.2.2 Multiplexer Control Logic
The Multiplexer Control Logic lies at the heart of the Data Shuffler. By controlling how and when
data is selected, it implements the different modes, patterns, and offsets used to manipulate
data streams for the RP.
One way of designing this control logic would be to create a FSM that decodes for all the
different modes, patterns, and offsets and asserts the appropriate control signals. However, one
might imagine that this type of FSM would become extremely large and complex very quickly.
As more and more patterns are implemented, this FSM would require more and more states. As
the number of state bits required to implement such a large FSM increased, the number of LEs
needed to compute the next state and the output signals would also increase. The performance of
this FSM would diminish as computations for the next state and output signals became more
complex, since the combinational path between these registers would lengthen. Therefore, such a
design would increase in complexity, perform poorly, and scale poorly.
One alternative would be to take advantage of the embedded RAM blocks in the FLEX10K
device. As illustrated in Figure 18, the design consists of three parts: address generator FSMs,
RAM, and some assertion logic.
In this type of design, the RAM contains the values of the control signals for each of the different
patterns. In addition, these values can be loaded to support different patterns depending on the
application. The top three bits of the PATTERN register determine where the pattern to be used is
stored in RAM. Since the RAM consists of 256 locations, this segments the memory into eight
32-location blocks. The lower five bits determine how a particular block is to be accessed. In this
way, flexibility is gained by decoupling how RAM is accessed and where the values are actually
stored. This gives more freedom to the low-level software in determining how and which pattern
values will be loaded.
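The segmentation of the 256-location RAM into eight 32-location blocks amounts to forming an 8-bit address from a 3-bit block number and a 5-bit index within the block. The Python sketch below models this; the function names, the use of a Python list for the RAM, and the example control values are illustrative assumptions, and in the real design the index is produced by the Address Generator FSMs.

```python
BLOCKS = 8        # selected by the top three bits of the PATTERN register
BLOCK_SIZE = 32   # locations per block; 8 * 32 = 256 RAM locations


def ram_address(block, index):
    """Form the 8-bit RAM address from a block number and an index within it."""
    if not 0 <= block < BLOCKS:
        raise ValueError("block must be in 0..7")
    if not 0 <= index < BLOCK_SIZE:
        raise ValueError("index must be in 0..31")
    return (block << 5) | index


def load_pattern(ram, block, control_words):
    """Write a pattern's control values into one block of the RAM."""
    if len(control_words) > BLOCK_SIZE:
        raise ValueError("a pattern occupies at most one 32-location block")
    base = ram_address(block, 0)
    ram[base:base + len(control_words)] = control_words


if __name__ == "__main__":
    ram = [0] * (BLOCKS * BLOCK_SIZE)
    load_pattern(ram, 3, [0xA, 0xB, 0xC])  # hypothetical control values
    print(ram_address(3, 0), ram[ram_address(3, 2)])
```

Reloading a block changes which pattern that block implements without touching the logic that walks through it.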
In order to access the RAM correctly, Address Generator FSMs are required. These FSMs decode
the mode, pattern, and offset information, monitor the state of the datapath registers, and generate
the appropriate address and control signals. These Address Generator FSMs are less complex than
the FSMs mentioned previously, making them easier to design and implement. In addition, some
patterns that have different values for control signals stored in RAM have the same RAM access
pattern. This property allows one FSM to decode for multiple patterns, allowing this design to be
more scalable.
Finally, some assertion control logic is required. This logic selects between the outputs of RAM
depending on what mode the Read physical channels are in. In addition, it controls when the
enable signals are actually sent to the multiplexers as deemed by the Address Generator FSMs.
By partitioning the design into these three sub-blocks, the multiplexer control logic gains
flexibility and scalability over a more traditional design approach. The table below gives a signal
description for the Address Generator FSM.
Signal Name      I/O   Signal Description
/RESET           I     Indicates that all control logic/FSMs are to be reset
MODE             I     Indicates the current mode: 1=64-bit mode, 0=32-bit mode
PATTERN<7:0>     I     Indicates the current pattern, 0-13
OFFSET<2:0>      I     Indicates the current offset, 0-7
VALID<3:0>       I     Indicates the validity of the data in each register bank
LATCH_DATA       I     Indicates that data will be latched by the Register Control Logic on the
                       next clock cycle
DAV              O     Indicates to the FIFO Interface that there will be data available two
                       clock cycles later
MUXENA<3:0>      O     Multiplexer enables for Mux3-0
MUXSEL0<3:0>     O     Mux0 select bits
MUXSEL1<3:0>     O     Mux1 select bits
MUXSEL2<3:0>     O     Mux2 select bits
MUXSEL3<3:0>     O     Mux3 select bits
ALL_CLEAR        O     Indicates that data in Reg3 does not need to be held for an extra cycle
Table 22: Multiplexer Control Logic Signal Descriptions
7.2.4 Output Channel - Write0
Since the Write0 Physical Channel does not have to support any type of data manipulation, it is
much less complex than its Read channel counterparts. It supports four 64-bit register banks for
burst transactions on the PPC bus. In essence, the Write0 Physical Channel has the same register
banks and control logic as the Read channels. The table below summarizes the relevant signals
for the Write0 channel.
Signal Name          I/O   Description
W0DATA_IN<63:0>      I     64-bit data bus from the Write0 FIFO
ALL_CLEAR            I     Indicates that the PPC Bus Interface has latched the data in the
                           Reg3 register bank
DAV                  I     Indicates that there is data in the Write0 FIFO to be read
W0DATA_OUT<63:0>     O     64-bit data bus to the PPC Bus Interface
VALID<3:0>           O     Indicates the state of the register banks
Table 23: Write0 Physical Channel Signal Descriptions
7.3 Implementation
7.3.1 Functionality and Performance
The first phase of implementation for the Data Shuffler has been completed. The following
functionality has been implemented and optimized to run at 66.66 MHz with the appropriate pin
constraints:
1) Read0 and Read1 datapaths (register banks and multiplexers)
2) Read0 and Read1 register control logic
3) Read0 and Read1 multiplexer control logic (for pattern only)
4) DS/SAG Interface
In addition, simplified versions of the Read0 and Read1 FIFO interface and Register Interface
have been implemented. The sections that follow detail some of the issues encountered during
implementation and optimization.
7.3.2 Implementation Details
7.3.2.1 DS/SAG Interface
To understand why the DS/SAG Interface is implemented the way it is, a brief summary of DS
evolution is in order. During schematic entry, early estimates targeted the DS for a FLEX10K30
device. It was recognized at that time that since a 10K30 and 10K50 device both had the same
package and pin out, upgrading later to a larger device, if necessary, would not be difficult.
Therefore, the schematics were created based on a 10K30 and the board went to layout.
As DS implementation progressed, it was realized that due to the use of EABs and the
system-timing constraint, more LE and EAB resources were necessary. Therefore, during the assembly
process a 10K50 was used to house the DS. By upgrading to a 10K50, the LE (from 1728 to 2880),
EAB (from 6 to 10), and user I/O (from 246 to 274) resources increased. However, the increase in
user I/O could not be taken advantage of due to the fact that pin constraints were made during
schematic entry based on a 10K30, not a 10K50. Since layout had already been completed by the
time the decision to upgrade devices had been made, the board was already routed and the extra
user I/Os could not be used.
Therefore, given the number of signals that must be routed in and out of the Data Shuffler, the
FLEX10K30 device was extremely pin constrained. The three 64-bit input, Read, and Write0 data
busses, in addition to the corresponding control signals accounted for a large part of the 246 user
I/Os available on a 10K30 device. After accounting for all other signals as well (Register
Interface, clock signals, global reset, etc), only nine pins were available to implement the
DS/SAG Interface. During schematic entry, this was deemed enough to implement the desired
functionality. However, as both the DS and SAG designs matured, it became apparent that more
signals were required.
Recognizing that Register Interface and stream-processing transactions are not allowed to occur
on the same clock cycle (register transactions take precedence), the Register Interface address bus
provides fourteen more signals between the DS and SAG. Signals REGADDR<1:0> are not used
because they are shared by the hexadecimal displays on the Chidi board. The write cycle for these
displays is extremely long. Using these signals would force DS/SAG transactions to wait until
writes to the displays had completed ("normal" Register Interface transactions complete in just a
few clock cycles), an unnecessary delay.
7.3.2.2 Read Address Generator FSMs
In order to achieve the type of performance necessary, the idea of using one large FSM for address
generation had to be abandoned. Instead, the design must be partitioned into smaller pieces that
implement the same functionality in order to meet the system timing requirement. Figure 19
below gives a block diagram of the design used for the address generation FSMs.
[Figure 19 diagram labels: inputs from the Register Ctrl Logic, from the SAG/DS IF Registers, and
from the Internal Registers and Assertion Logic; a series of small FSMs; outputs to the Register
Ctrl Logic and Assertion Logic; address values to RAM]
Figure 19: Read Address Generator FSMs
As can be seen above, a series of small FSMs that decode for a particular mode and pattern are
used, making them mutually exclusive. Although all the FSMs generate the same output signals,
only one FSM drives these signals at any one time.
There are two main advantages of using such a design. First, it allows functionality to be added in
a conceptually easy manner. Since the modes and patterns are mutually exclusive, adding new
access patterns does not affect the functionality of those that have already been implemented.
Second, achieving the desired performance is also easier. Since each FSM is only decoding for
one mode and one pattern, the FSMs are compact in size, making them much easier to optimize.
Both of these advantages speed up the design, implementation and debugging processes for the
Address Generator FSMs.
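The partitioning described above can be illustrated with a small behavioral sketch. It is not the thesis design: the class name, the mode strings, and the step functions are hypothetical, and the only facts taken from the text are that each small FSM decodes for exactly one mode and one pattern and that only one FSM drives the outputs at a time.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, int]           # (mode, pattern), e.g. ("64-bit", 1)
StepFn = Callable[[int], int]   # current RAM address -> next RAM address


class AddressGenerator:
    """Behavioral model: one small step function per (mode, pattern)."""

    def __init__(self) -> None:
        self._fsms: Dict[Key, StepFn] = {}

    def add_fsm(self, mode: str, pattern: int, step: StepFn) -> None:
        key = (mode, pattern)
        if key in self._fsms:
            # Keys are mutually exclusive, so a duplicate is a design error.
            raise ValueError(f"FSM for {key} already registered")
        self._fsms[key] = step

    def next_address(self, mode: str, pattern: int, addr: int) -> int:
        # Exactly one FSM matches the decoded mode and pattern.
        return self._fsms[(mode, pattern)](addr)


gen = AddressGenerator()
# Hypothetical step function: walk 16 pattern-map locations and wrap.
gen.add_fsm("64-bit", 1, lambda addr: (addr + 1) % 16)
before = gen.next_address("64-bit", 1, 15)

# Adding a second, unrelated FSM leaves the first one untouched.
gen.add_fsm("32-bit", 2, lambda addr: (addr + 2) % 16)
after = gen.next_address("64-bit", 1, 15)
```

Because the lookup key is the full (mode, pattern) pair, registering a new access pattern cannot alter the behavior of one that already exists, which mirrors the first advantage listed above.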
7.3.3 Optimization
7.3.3.1 16-to-1 Multiplexer
One of the time-constrained paths in the Data Shuffler design involved the 16-to-1 multiplexers
found in the Read0 and Read1 physical channels. On its own, a 16-to-1 multiplexer can run at up
to 81.96 MHz (12.2 ns clock period) and occupies only 10 LEs. In this case, the inputs, outputs,
select bits, and the logic for the multiplexer all reside on the same row. Therefore, the row delays
for LEs are small, allowing the combinational path through the multiplexer to be fast. When the
LEs of a 16-to-1 multiplexer are not placed on the same row, the performance drops below the
system-imposed timing constraint of 66.66 MHz (15.0 ns).
When the entire datapath is considered, however, the multiplexer can no longer
be clocked at such a high rate. There are two principal constraints that come into play. First, as
described in section 7.2.3.1, each register in the Reg2 and Reg3 blocks actually fans out to four
different multiplexers, muxN_c, where 0<=N<=3, 0<=c<=7, and c is constant. This increased fan
out contributes to a slower path through the multiplexers. However, more important is the fact
that this interdependency on input registers among multiplexers also complicates the layout
process. In order for all the components of a multiplexer to be placed on the same row, sixteen
input registers, four output registers, sixteen select bits (four for each multiplexer), and forty logic
elements (to implement the multiplexers) now must all be placed on the same row. Based on this
interdependency a total of 76 logic elements (out of a possible 288 per row) must be placed on the
same row.
The second constraint makes the number of LEs that must be placed on the same row impossibly
high. It arises because the select bits are also shared among
multiplexers. However, unlike the input registers, which are shared between muxN_c
multiplexers, the select bits are shared among muxC_n multiplexers, where 0<=C<=3, 0<=n<=7,
and C is constant. This constraint requires that the eight multiplexers that form each resulting
output byte from the Read0 or Read1 datapath all be placed in the same row. Together, these two
orthogonal constraints require all 16-to-1 multiplexers, the select bits, and the input and output
registers to be placed on the same row, a physical impossibility, in order to obtain performance
at or above 66.66 MHz (15.0 ns).
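The argument above can be checked with a short sketch. It first re-adds the logic element count for the first constraint, then groups the 32 multiplexers of one physical channel by the two sharing rules to show that they collapse into a single same-row group. The union-find helper and all names are illustrative assumptions; only the index ranges and the per-item counts come from the text.

```python
from itertools import product

# First constraint: LEs that must share a row for one group of four muxes.
input_regs, output_regs, select_bits, mux_les = 16, 4, 4 * 4, 4 * 10
les_first_constraint = input_regs + output_regs + select_bits + mux_les

# Each multiplexer is identified as (N, c): output byte N, bit position c.
muxes = list(product(range(4), range(8)))
parent = {m: m for m in muxes}


def find(m):
    """Return the representative of the same-row group containing m."""
    while parent[m] != m:
        parent[m] = parent[parent[m]]   # path halving
        m = parent[m]
    return m


def union(a, b):
    parent[find(a)] = find(b)


# Constraint 1: muxes with the same c share input registers.
for c in range(8):
    for n in range(1, 4):
        union((0, c), (n, c))

# Constraint 2: muxes with the same N share select bits.
for n in range(4):
    for c in range(1, 8):
        union((n, 0), (n, c))

same_row_groups = {find(m) for m in muxes}
```

With both rules applied, `same_row_groups` contains a single group holding all 32 multiplexers, which is the "physical impossibility" the text describes.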
In order to meet the system-imposed timing constraint, the 16-to-1 multiplexers had to be
pipelined, as seen in the figure below.
[Figure 20 schematic. Signals: sel<1:0>, sel<3:2>, data_in<7:0>, data_in<15:8>, data_out.]
Figure 20: Optimized 16-to-1 Multiplexer
By inserting pipeline registers so that the 16-to-1 multiplexer is divided into two stages of
4-to-1 multiplexers, the combinational path is reduced by a factor of two. As a result, the pipelined
version of the 16-to-1 multiplexer can be clocked at 125 MHz (8.0 ns), the fastest frequency at
which a -3 speed grade FLEX10K device can be clocked. In addition, the optimized version can
be placed across different rows, which greatly eases placement constraints while allowing the
66.66 MHz timing constraint to be met when the entire datapath is placed and routed.
Of course, nothing is gained without a price. As with many digital design problems, optimizing
the 16-to-1 multiplexer was a speed vs. area trade-off. The optimized multiplexer is implemented
with 18 LEs (versus 10 LEs for the unoptimized version). In addition, the data incurs an additional
cycle of latency. However, given that the 15.0 ns clock period is a hard constraint, these costs are
necessary and therefore justified.
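A behavioral sketch of the two-stage structure may help make the extra cycle of latency concrete. It assumes that the first stage is four 4-to-1 selections driven by sel<1:0> and that the second stage is one 4-to-1 selection driven by a registered copy of sel<3:2>; the extracted figure labels do not show which select bits drive which stage, so that assignment, the class name, and the one-value-per-input data model are all assumptions.

```python
class PipelinedMux16:
    """Two-stage 16-to-1 multiplexer with one pipeline register stage."""

    def __init__(self) -> None:
        self._stage1 = [0, 0, 0, 0]   # registered outputs of the first stage
        self._sel_hi = 0              # registered copy of sel<3:2>

    def clock(self, data_in, sel):
        """Advance one clock edge and return the second-stage output.

        The value returned corresponds to the inputs presented on the
        previous edge, which models the added cycle of latency.
        """
        out = self._stage1[self._sel_hi]
        sel_lo = sel & 0b11
        # First stage: four 4-to-1 selections, one per group of four inputs.
        self._stage1 = [data_in[4 * group + sel_lo] for group in range(4)]
        self._sel_hi = (sel >> 2) & 0b11
        return out


mux = PipelinedMux16()
data = list(range(100, 116))          # 16 distinguishable input values
first = mux.clock(data, sel=9)        # pipeline still holds reset values
second = mux.clock(data, sel=0)       # now reflects the sel=9 request
```

Here `second` is `data[9]`: the selection requested on the first edge appears one edge later, while each stage only has to resolve a 4-to-1 choice within a clock period.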
7.3.3.2 Register and Logic Replication
In order to meet the system timing constraint of 66.66 MHz, replication of both registers and
logic was required in some cases. There are two main reasons for replication: routing delays and
fan-out considerations.
First, routing delays can make up the bulk of any combinational path, while setup, hold and
propagation delay times for LEs are small in comparison. As mentioned previously, same-row
and same-column delays on a FLEX10K device are relatively small, while diagonal routing is
expensive. However, as more and more chip resources are used, it becomes increasingly difficult,
if not impossible, to place all the LEs associated with a particular function on the same row. One
reason is that multiple modules often fan in on shared input registers. In such
cases, it becomes necessary to replicate the input register so that different copies of the
same signal may be placed on separate rows.
Second, fan-out can also slow down time-constrained paths. In the FLEX 10K data sheet [10],
worst case timing values are given for routing between the different types of resources found on
the device. However, only in a separate application note, "Understanding FLEX10K Timing"
[21], are these numbers qualified with the statement that they are only valid for resources with a
fan-out of four loads. Therefore, for high fan-out paths such as those found in the Register Control
and Address Generation modules, logic and registers were replicated in order to decrease fan-out
and increase performance.
7.3.3.3 EAB Pipelining
In order to meet the system timing constraint, the RAM used to store the assertion values for the
multiplexer and control logic has to be fully synchronous. This means that both the inputs and
outputs of the RAM have to be registered. This is due to the fact that both tEABAA (EAB Address
Access Delay) and tEABRCCOM (EAB Asynchronous Read Cycle) are rated at 13.7 ns. Adding even
a tSAMEROW delay of 3.3 ns for either inputs or outputs already violates the 15.0 ns timing
constraint. Therefore, it is necessary to utilize the input and output registers located on the EAB
unit to meet the system timing requirement. One result of this pipelining is that the results of a
RAM read are not available until two clock cycles later.
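The timing argument above is a simple budget check, sketched below with the figures quoted in the text. The variable names are illustrative, and the sketch considers only the one EAB delay and one same-row routing delay mentioned; a real path would include further terms.

```python
CLOCK_PERIOD_NS = 15.0    # system timing constraint (66.66 MHz)
T_EAB_ACCESS_NS = 13.7    # tEABAA and tEABRCCOM, as rated in the text
T_SAMEROW_NS = 3.3        # tSAMEROW routing delay

# Unregistered RAM: the EAB delay and one row delay share a clock period.
unregistered_path_ns = T_EAB_ACCESS_NS + T_SAMEROW_NS
slack_ns = CLOCK_PERIOD_NS - unregistered_path_ns

meets_timing = slack_ns >= 0.0
print(f"path = {unregistered_path_ns:.1f} ns, slack = {slack_ns:.1f} ns")
```

The negative slack is why both the EAB input and output registers are used, at the cost of the two-cycle read latency noted above.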
7.3.4 Device Configuration
The Data Shuffler is configured serially using a configuration EPROM from Altera Corporation.
Below is the circuit diagram showing how the EPROM is wired to the FLEX 10K device.
[Figure 21 circuit diagram. Labels: Vcc, FLEX 10K Device, GND.]
Figure 21: Configuration EPROM Scheme Circuit Diagram [6]
Upon power-up, the Data Shuffler senses the low-to-high transition on the nCONFIG signal,
which initiates the configuration process. Then, the DS drives CONF_DONE low. Next,
nSTATUS is released by the FLEX 10K device, which is then pulled high to enable the
configuration EPROM. The configuration EPROM then uses its internal oscillator to serially
clock data into the DS. To summarize the configuration process, the timing diagram and a
corresponding timing parameters table are given below.
[Figure 22 waveform. Signals shown include nCONFIG, OE/nSTATUS, and CONF_DONE; annotated timing parameters include tOEZX, tCH, tCL, and tDH.]
Figure 22: Configuration EPROM Scheme Timing Waveform [6]
Symbol  Parameter                                       Min  Max  Units
tOEZX   OE high to DATA output enabled                       160  ns
tCH     DCLK high time                                  50   250  ns
tCL     DCLK low time                                   50   250  ns
tDSU    Data setup time before rising edge on DCLK      30        ns
tDH     Data hold time after rising edge on DCLK        0         ns
tCO     DCLK to DATA out                                     30   ns
tOEW    OE low pulse width to guarantee counter reset   100       ns
tCSH    nCS low hold time after DCLK rising edge        0         ns
fMAX    DCLK frequency                                  2    10   MHz
Table 24: Passive Serial Configuration Scheme Timing Parameters [6]
8 Future Work
8.1 DS Development
Although development for the DS is well underway and a solid foundation has been established,
there are still many portions of the design that need to be implemented. These additions can be
categorized into three main areas.
The first area of continued development is increasing the number of patterns the DS can support.
Currently, the RAM values for all thirteen patterns have been generated and can be referenced in
Appendix 10.2. However, only the Address Generator FSM for Pattern1 has been
implemented. Functionally, adding the remaining patterns is not an extremely difficult task. However,
managing area and speed considerations will be somewhat challenging.
The second area of continued development is the implementation of the various interfaces. Most
of the development completed so far has involved the Data Shuffler internals. The
internals were designed and implemented first because they lie at the core of the DS. However,
the interfaces are also important in that the DS needs to be able to interact with the other modules
on the Chidi board correctly.
The final area of future work involves debugging the DS on the Chidi board. Debugging for the
physical channel FIFOs has already commenced and should be followed with the inclusion of
Data Shuffler internals.
8.2 RP Development
Since RP Configuration has been completed, RP development can commence. This includes
implementing the interfaces for the SAG, SRAM, and High-Speed I/O Port.
RP Configuration itself can be improved from a performance standpoint. Currently, configuration
is done with a clock with a 120 ns period. Configuration should be tried with a clock period
closer to the theoretical limit of 100 ns. In this way, the overhead for configuring the RP can be
cut by up to almost 24 ms, or over 1.5 million system clock cycles. By reducing the configuration
overhead, applications targeted for the RP will more likely achieve a substantial performance
increase over general computing solutions.
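The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch assumes a 15.0 ns system clock period (the 66.66 MHz constraint from section 7.3.3) and that configuration time scales linearly with the configuration clock period; the implied number of configuration clocks is an inference from the quoted numbers, not a value stated in the text.

```python
SAVING_S = 24e-3                 # "up to almost 24 ms" saved
SYSTEM_CLOCK_PERIOD_S = 15.0e-9  # assumed system clock period
CURRENT_PERIOD_S = 120e-9        # configuration clock period in use
LIMIT_PERIOD_S = 100e-9          # theoretical limit

# System clock cycles recovered by the faster configuration clock.
system_cycles_saved = SAVING_S / SYSTEM_CLOCK_PERIOD_S

# Inferred upper bound on configuration clocks, given 20 ns saved per clock.
implied_config_clocks = SAVING_S / (CURRENT_PERIOD_S - LIMIT_PERIOD_S)

print(f"{system_cycles_saved:.2e} system cycles saved")
print(f"{implied_config_clocks:.2e} configuration clocks implied (upper bound)")
```

Under these assumptions the saving corresponds to about 1.6 million system clock cycles, consistent with the "over 1.5 million" stated above.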
8.3 Application Development
As the underlying interfaces for the RP are designed, implemented, and debugged, RP application
development can commence. RP applications are the means to discover how feasible an
alternative Reconfigurable Computing is to general purpose computing systems.
9 Works Cited
[1] Acosta, Edward K., V. Michael Bove, Jr., John A. Watlington, and Ross A. Yu,
"Reconfigurable Processor for a Data-Flow Video Processing System," Proc. SPIE FPGAs for
Fast Board Development and Reconfigurable Computing, 2607, October 1995, pp. 83-91.
[2] Bove, V. Michael, Jr. and John A. Watlington, "Cheops: A Reconfigurable Data-Flow System
for Video Processing," IEEE Transactions on Circuits and Systems for Video Technology, 5,
April 1995, pp. 140-149.
[3] "Chidi: The Flexible Media Processor," MIT Media Lab, Information and Entertainment
Group, http://chidi.www.media.mit.edu/projects/chidi/index.html, 1997.
[4] "Chidi 1394 Interface," Yuan-Min Liu, MIT Media Lab, Information and Entertainment
Group, http://chidi.www.media.mit.edu/projects/chidi/ 394/1394.html, 1998.
[5] "CMOS SyncFIFO 64 X 36: IDT723611," Integrated Device Technology, Inc., 1997.
[6] "Configuring FLEX 10K Devices," Altera Corporation, Application Note 59, Ver.1,
December 1995.
[7] Dally, William J., "Virtual Channel Flow Control," IEEE Transactions on Parallel and
Distributed Systems, 1992, pp. 194-205.
[8] "DS90C363/DS90CF364: +3.3V Programmable LVDS Transmitter/Receiver 18-Bit Flat
Panel Display (FPD) Link," National Semiconductor, July 1997.
[9] "FLEX 10K Device Family," Altera Corporation,
http://www.altera.com/html/products/fl0k.html, 1998.
[10] "FLEX 10K: Embedded Programmable Logic Family," Altera Data Book 1996, Altera
Corporation, Version 2, June 1996.
[11] Hauser, John R. and John Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable
Coprocessor," Proc. Symposium on Field-Programmable Custom Computing Machines (FCCM),
April 16-18, 1997, Napa Valley, CA.
[12] Huq, Syed B., "An Overview of LVDS Technology," National Semiconductor, Application
Note 971, November 1994.
[13] Lewis, D., D. Galloway, M. van Ierssel, J. Rose, and P. Chow, "The Transmogrifier-2: A 1
Million Gate Rapid Prototyping System," in FPGA '97, ACM Symp. On FPGAs, Feb. 1997,
pp.53-61.
[14] "LVDS Owner's Manual and Design Guide," National Semiconductor, Spring 1997.
[15] "MPC106 PCI Bridge/Memory Controller Technical Summary," Motorola, Rev. 1, August
1996.
[16] "PowerPC Microprocessor Family: The Bus Interface for 32-bit Microprocessors," IBM and
Motorola, Rev. 0, March 1997.
[17] "PowerPC 604e RISC Microprocessor Family: PID9q-604e Hardware Specifications," IBM
Microelectronics and Motorola, August 1997.
[18] "PowerPC 604 RISC Microprocessor Technical Summary," IBM Microelectronics and
Motorola, Rev. 1, May 1994.
[19] Singh, Satnam and Pierre Bellec, "Virtual Hardware for Graphics Applications Using
FPGAs," The University of Glasgow.
[20] Trainor, D.W., J.P. Heron, and R.F. Woods, "Implementation of the 2D DCT Using a Xilinx
XC6264 FPGA," The Queen's University of Belfast.
[21] "Understanding FLEX10K Timing," Altera Corporation, Application Note 91, Ver. 1,
January 1998.
[22] Watlington, John A. and V. Michael Bove, Jr., "Stream-Based Computing and Future
Television," Proc. 137th SMPTE Technical Conference, September 1995, pp. 69-79.
[23] Yu, Ross A., "A Field Programmable Gate Array Based Stream Processor for the Cheops
Imaging System", Master's Thesis, Massachusetts Institute of Technology, 1996.
10 Appendix
10.1 Acronyms
ASIC - Application Specific Integrated Circuit
BGA - Ball Grid Array
CHRP - Common Hardware Reference Platform
CPLD - Complex Programmable Logic Devices
DRAM - Dynamic Random Access Memory
DS - Data Shuffler
DSP - Digital Signal Processing
EAB - Embedded Array Block
EDA - Engineering Design Automation
EDO - Extended Data Out
EMI - Electromagnetic Interference
EPROM - Electrically Programmable Read Only Memory
FPGA - Field Programmable Gate Array
GPP - General Purpose Processor
HDL - Hardware Description Language
LAB - Logic Array Block
LE - Logic Element
LUT - Look Up Table
LVDS - Low Voltage Differential Signaling
PAL - Programmable Array Logic
PGA - Pin Grid Array
PLD - Programmable Logic Devices
RAM - Random Access Memory
RC - Reconfigurable Computing
ROM - Read Only Memory
RP - Reconfigurable Processor
SAG - Stream Address Generator
SRAM - Static Random Access Memory
VHDL - VHSIC Hardware Description Language
VHSIC - Very High Speed Integrated Circuit
10.2 Data Shuffler Patterns
This section lists the values to be loaded into the pattern map for each type of
data manipulation. Each pattern occupies 32 locations in the pattern map. Values for both the
Read0 and Read1 Physical Channels are given.
Addr
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
Dav
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Valid
3only
1
1
1
0
1
0
1
0
1
0
0
0
0
0
0
0
Muxena
(3:0)
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
Both modes
32-bit mode
64-bit mode
Dav
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Valid
3only
Muxena
(3:0)
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
Mux
sel0
Mux
sel1
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
(3:0)
0000
0100
0001
0101
0010
0110
0011
0111
0100
1000
0101
1001
0110
1010
0111
1011
0001
0101
0010
0110
0011
0111
0100
1000
0101
1001
0110
1010
0111
1011
1000
1100
0010
0110
0011
0111
0100
1000
0101
1001
0110
1010
0111
1011
1000
1100
1001
1101
0011
0111
0100
1000
0101
1001
0110
1010
0111
1011
1000
1100
1001
1101
1010
1110
Table 25: Read0 LUT for Pattern1 (Straight Through), offsets 0-7
The values for the Read1 LUT for Pattern1 are exactly the same as those for Read0. In addition,
Pattern1 requires only 16 locations to implement the Straight Through pattern. The other 16 locations
(10000-11111) contain all zeros.
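The 64 Mux sel values listed for Table 25 appear to follow a simple rule, sketched below. The sketch assumes the values are read as four columns of 16 (one column per multiplexer select, sel0 through sel3) and that each offset occupies two consecutive locations; both are readings of the flattened listing rather than statements from the text, and the function name is illustrative.

```python
def straight_through_selects(num_offsets=8, num_muxes=4):
    """Return {mux index j: list of 4-bit select strings}, 16 per column.

    Assumed rule: for byte offset k, multiplexer j selects input k + j in
    the first location and input k + j + 4 in the second location.
    """
    columns = {}
    for j in range(num_muxes):
        column = []
        for offset in range(num_offsets):
            for second_location in (0, 1):
                value = offset + j + 4 * second_location
                column.append(format(value, "04b"))
        columns[j] = column
    return columns


selects = straight_through_selects()
sel0_head = selects[0][:4]
sel3_tail = selects[3][-2:]
```

Under that reading, the start of the sel0 column and the end of the sel3 column generated here match the first and last values in the listing above.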
The Decimate Bytes by 2 pattern, given in the next two tables, also only requires 16 locations to
implement. Similarly, the other 16 locations (10000-11111) contain all zeros.
64-bit mode
32-bit mode
Both modes
Addr
Dav
Valid3
only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
Mux
sel0
Mux
sel1
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
0000
0000
0001
0001
0010
0010
0011
0011
0100
0100
0101
0101
0110
0110
0111
0111
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
0010
0010
0011
0011
0100
0100
0101
0101
0110
0110
0111
0111
1000
1000
1001
1001
0100
0100
0101
0101
0110
0110
0111
0111
1000
1000
1001
1001
1010
1010
1011
1011
0110
0110
0111
0111
1000
0001
1001
1001
1010
1010
1011
1011
1100
1100
1101
1101
Table 26: Read0 LUT for Pattern2 (Decimate bytes by 2), offsets 0-7
64-bit mode
32-bit mode
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3 only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
Both modes
Mux
sel0
(3:0)
0000
0000
0001
0001
0010
0010
0011
0011
0100
0100
0101
0101
0110
0110
0111
0111
Mux
sel1
(3:0)
0010
0010
0011
0011
0100
0100
0101
0101
0110
0110
0111
0111
1000
1000
1001
1001
Mux
sel2
(3:0)
0100
0100
0101
0101
0110
0110
0111
0111
1000
1000
1001
1001
1010
1010
1011
1011
Table 27: Read1 LUT for Pattern2 (Decimate bytes by 2), offsets 0-7
Mux
sel3
(3:0)
0110
0110
0111
0111
1000
0001
1001
1001
1010
1010
1011
1011
1100
1100
1101
1101
Addr
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
Dav
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
0
1
0
Valid
3only
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Muxena
(3:0)
1111
0000
0000
1111
0000
0000
1111
0000
0000
1111
0000
0000
1111
0000
0000
1111
0000
0000
1111
0000
0000
1110
0001
0000
Both modes
32-bit mode
64-bit mode
Dav
Valid
3only
Muxena
(3:0)
1
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
1
1
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1111
1111
0000
1111
1111
0000
1111
1111
0000
1111
1110
0001
1111
1111
0000
1111
1111
0000
1111
1111
0000
1110
0001
1111
Mux
sel1
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
(3:0)
0000
0100
xxxx
0001
0101
xxxx
0010
0110
xxxx
0011
0111
xxxx
0100
0000
xxxx
0101
0001
xxxx
0110
0010
xxxx
0111
xxxx
0011
0011
0111
xxxx
0100
1000
xxxx
0101
1001
xxxx
0110
1010
xxxx
0111
0010
xxxx
1000
0100
xxxx
1001
0101
xxxx
1010
xxxx
0110
0110
1010
xxxx
0111
1011
xxxx
1000
1100
xxxx
1001
1101
xxxx
1010
0110
xxxx
1011
0111
xxxx
1100
1000
xxxx
1101
xxxx
1001
1001
1101
xxxx
1010
1110
xxxx
1011
1111
xxxx
1100
xxxx
1000
1101
1001
xxxx
1110
1010
xxxx
1111
1011
xxxx
xxxx
1000
1100
Mux
sel0
Table 28: Read0 LUT for Pattern3 (Decimate bytes by 3/Extract one channel), offsets 0-7
Pattern 3 only requires 24 locations. The other 8 locations (11000-11111) are never accessed.
This applies to both the Read0 and Read1 LUTs for the Decimate bytes by 3/Extract one channel
pattern.
64-bit mode
32-bit mode
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
0
1
0
0
1
0
0
1
0
0
0
1
0
1
0
0
1
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0000
1111
0000
0000
1111
0000
0000
1111
0000
0000
1110
0001
0000
1111
0000
0000
1111
0000
0000
1111
0000
0000
0000
1111
1
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
1
1
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1111
1111
0000
1111
1111
0000
1111
1111
0000
1111
1110
0001
1111
1111
0000
1111
1111
0000
1111
1111
0000
1110
0001
1111
Both modes
Mux
sel0
Mux
sel1
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
(3:0)
0000
0100
xxxx
0001
0101
xxxx
0010
0110
xxxx
0011
0111
xxxx
0100
0000
xxxx
0101
0001
xxxx
0110
0010
xxxx
0111
xxxx
0011
0011
0111
xxxx
0100
1000
xxxx
0101
1001
xxxx
0110
1010
xxxx
0111
0010
xxxx
1000
0100
xxxx
1001
0101
xxxx
1010
xxxx
0110
0110
1010
xxxx
0111
1011
xxxx
1000
1100
xxxx
1001
1101
xxxx
1010
0110
xxxx
1011
0111
xxxx
1100
1000
xxxx
1101
xxxx
1001
1001
1101
xxxx
1010
1110
xxxx
1011
1111
xxxx
1100
xxxx
1000
1101
1001
xxxx
1110
1010
xxxx
1111
1011
xxxx
xxxx
1000
1100
Table 29: Read1 LUT for Pattern3 (Decimate bytes by 3/Extract one channel), offsets 0-7
Addr
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
Dav
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
Both modes
32-bit mode
64-bit mode
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
Mux
sel0
Mux
sel1
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
(3:0)
0000
xxxx
0000
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
1000
xxxx
1000
xxxx
1001
xxxx
1001
xxxx
1010
xxxx
1010
xxxx
1011
xxxx
1011
xxxx
xxxx
0000
xxxx
0000
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
1000
xxxx
1000
xxxx
1001
xxxx
1001
xxxx
1010
xxxx
1010
xxxx
1011
xxxx
1011
Table 30: Read0 LUT for Pattern4 (Decimate bytes by 4), offsets 0-7
64-bit mode
32-bit mode
Addr
Dav
Valid
3 only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
Both modes
Mux
sel0
Mux
sel1
Mux
sel2
(3:0)
(3:0)
(3:0)
(3:0)
0000
xxxx
0000
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
1000
xxxx
1000
xxxx
1001
xxxx
1001
xxxx
1010
xxxx
1010
xxxx
1011
xxxx
1011
xxxx
xxxx
0000
xxxx
0000
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
1000
xxxx
1000
xxxx
1001
xxxx
1001
xxxx
1010
xxxx
1010
xxxx
1011
xxxx
1011
Table 31: Read1 LUT for Pattern4 (Decimate bytes by 4), offsets 0-7
Mux
sel3
Addr
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
Dav
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
Valid
3only
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Muxena
(3:0)
1100
0010
0001
0000
0000
0000
1100
0010
0001
0000
0000
0000
1000
0110
0001
0000
0000
0000
1000
0110
0001
0000
0000
0000
Both modes
32-bit mode
64-bit mode
Dav
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
Valid
3only
Muxena
(3:0)
Mux
sel0
Mux
sel1
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
(3:0)
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1100
0010
0001
1100
0010
0001
1100
0010
0001
1100
0010
0001
1000
0110
0001
1000
0110
0001
1000
0110
0001
1000
0110
0001
0000
xxxx
xxxx
0000
xxxx
xxxx
0001
xxxx
xxxx
0001
xxxx
xxxx
0010
xxxx
xxxx
0010
xxxx
xxxx
0011
xxxx
xxxx
0011
xxxx
xxxx
0110
xxxx
xxxx
0110
xxxx
xxxx
0111
xxxx
xxxx
0111
xxxx
xxxx
xxxx
0000
xxxx
xxxx
0000
xxxx
xxxx
0001
xxxx
xxxx
0001
xxxx
xxxx
0100
xxxx
xxxx
0100
xxxx
xxxx
0101
xxxx
xxxx
0101
xxxx
xxxx
0110
xxxx
xxxx
0110
xxxx
xxxx
0111
xxxx
xxxx
0111
xxxx
xxxx
xxxx
0010
xxxx
xxxx
0010
xxxx
xxxx
0011
xxxx
xxxx
0011
xxxx
xxxx
0100
xxxx
xxxx
0100
xxxx
xxxx
0101
xxxx
xxxx
0101
Table 32: Read0 LUT for Pattern5 (Decimate bytes by 6/channels by 2), offsets 0-3
64-bit mode
32-bit mode
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
0
1
1
1
1000
0100
0011
0000
0000
0000
1000
0100
0011
0000
0000
0000
1000
0100
0011
0000
0000
0000
1000
0100
0011
0000
0000
0000
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
0
1
1
0
1
1
0
1000
0100
0011
1000
0100
0011
1000
0100
0011
1000
0100
0011
1000
0100
0011
1000
0100
0011
1000
0100
0011
1000
0100
0011
Both modes
Mux
sel0
Mux
sel1
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
(3:0)
0100
xxxx
xxxx
0100
xxxx
xxxx
0101
xxxx
xxxx
0101
xxxx
xxxx
0110
xxxx
xxxx
0110
xxxx
xxxx
0111
xxxx
xxxx
0111
xxxx
xxxx
xxxx
0010
xxxx
xxxx
0010
xxxx
xxxx
0011
xxxx
xxxx
0011
xxxx
xxxx
0100
xxxx
xxxx
0100
xxxx
xxxx
0101
xxxx
xxxx
0101
xxxx
xxxx
xxxx
0000
xxxx
xxxx
0000
xxxx
xxxx
0001
xxxx
xxxx
0001
xxxx
xxxx
0010
xxxx
xxxx
0010
xxxx
xxxx
0011
xxxx
xxxx
0011
xxxx
xxxx
0110
xxxx
xxxx
0110
xxxx
xxxx
0111
xxxx
xxxx
0111
xxxx
xxxx
1000
xxxx
xxxx
1000
xxxx
xxxx
1001
xxxx
xxxx
1001
Table 33: Read0 LUT for Pattern5 (Decimate bytes by 6/channels by 2), offsets 4-7
Both modes
32-bit mode
64-bit mode
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0000
0000
0000
1100
0010
0001
0000
0000
0000
1100
0010
0001
0000
0000
0000
1000
0110
0001
0000
0000
0000
1000
0110
0001
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1100
0010
0001
1100
0010
0001
1100
0010
0001
1100
0010
0001
1000
0110
0001
1000
0110
0001
1000
0110
0001
1000
0110
0001
Mux
sel3
Mux
sel0
Mux
sel1
Mux
sel2
(3:0)
(3:0)
(3:0)
(3:0)
0000
xxxx
xxxx
0000
xxxx
xxxx
0001
xxxx
xxxx
0001
xxxx
xxxx
0010
xxxx
xxxx
0010
xxxx
xxxx
0011
xxxx
xxxx
0011
xxxx
xxxx
0110
xxxx
xxxx
0110
xxxx
xxxx
0111
xxxx
xxxx
0111
xxxx
xxxx
xxxx
0000
xxxx
xxxx
0000
xxxx
xxxx
0001
xxxx
xxxx
0001
xxxx
xxxx
0100
xxxx
xxxx
0100
xxxx
xxxx
0101
xxxx
xxxx
0101
xxxx
xxxx
0110
xxxx
xxxx
0110
xxxx
xxxx
0111
xxxx
xxxx
0111
xxxx
xxxx
xxxx
0010
xxxx
xxxx
0010
xxxx
xxxx
0011
xxxx
xxxx
0011
xxxx
xxxx
0100
xxxx
xxxx
0100
xxxx
xxxx
0101
xxxx
xxxx
0101
Table 34: Read1 LUT for Pattern5 (Decimate bytes by 6/channels by 2), offsets 0-3
64-bit mode
32-bit mode
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
0
0000
0000
0000
1000
0100
0011
0000
0000
0000
1000
0100
0011
0000
0000
0000
1000
0100
0011
0000
0000
0000
1000
0100
0011
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
0
1
1
0
1
1
0
1000
0100
0011
1000
0100
0011
1000
0100
0011
1000
0100
0011
1000
0100
0011
1000
0100
0011
1000
0100
0011
1000
0100
0011
Both modes
Mux
sel0
Mux
sel1
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
(3:0)
0100
xxxx
xxxx
0100
xxxx
xxxx
0101
xxxx
xxxx
0101
xxxx
xxxx
0110
xxxx
xxxx
0110
xxxx
xxxx
0111
xxxx
xxxx
0111
xxxx
xxxx
xxxx
0010
xxxx
xxxx
0010
xxxx
xxxx
0011
xxxx
xxxx
0011
xxxx
xxxx
0100
xxxx
xxxx
0100
xxxx
xxxx
0101
xxxx
xxxx
0101
xxxx
xxxx
xxxx
0000
xxxx
xxxx
0000
xxxx
xxxx
0001
xxxx
xxxx
0001
xxxx
xxxx
0010
xxxx
xxxx
0010
xxxx
xxxx
0011
xxxx
xxxx
0011
xxxx
xxxx
0110
xxxx
xxxx
0110
xxxx
xxxx
0111
xxxx
xxxx
0111
xxxx
xxxx
1000
xxxx
xxxx
1000
xxxx
xxxx
1001
xxxx
xxxx
1001
Table 35: Read1 LUT for Pattern5 (Decimate bytes by 6/channels by 2), offsets 4-7
Both modes
32-bit mode
64-bit mode
Mux
sel2
Mux
sel3
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
Mux
sel0
Mux
sel1
(3:0)
(3:0)
(3:0)
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
0000
0000
0001
0001
0010
0010
0011
0011
0100
0100
0101
0101
0110
0110
0111
0111
0001
0001
0010
0010
0011
0011
0100
0100
0101
0101
0110
0110
0111
0111
1000
1000
0100
0100
0101
0101
0110
0110
0111
0111
1000
1000
1001
1001
1010
1010
1011
1011
0101
0101
0110
0110
0111
0111
1000
1000
1001
1001
1010
1010
1011
1011
1100
1100
Table 36: Read0 LUT for Pattern6 (Decimate shorts by 2), offsets 0-7
Both modes
32-bit mode
64-bit mode
Mux
sel3
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
Mux
sel0
Mux
sel1
Mux
sel2
(3:0)
(3:0)
(3:0)
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
0000
0000
0001
0001
0010
0010
0011
0011
0100
0100
0101
0101
0110
0110
0111
0111
0001
0001
0010
0010
0011
0011
0100
0100
0101
0101
0110
0110
0111
0111
1000
1000
0100
0100
0101
0101
0110
0110
0111
0111
1000
1000
1001
1001
1010
1010
1011
1011
0101
0101
0110
0110
0111
0111
1000
1000
1001
1001
1010
1010
1011
1011
1100
1100
Table 37: Read1 LUT for Pattern6 (Decimate shorts by 2), offsets 0-7
64-bit mode
32-bit mode
Both modes
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
Mux
sel0
Mux
sel1
Mux
sel2
(3:0)
(3:0)
(3:0)
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
0000
xxxx
0000
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
1000
xxxx
1000
xxxx
xxxx
0000
xxxx
0000
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
1000
xxxx
1000
Table 38: Read0 LUT for Pattern7 (Decimate shorts by 4), offsets 0-7
Mux
sel3
Addr
Dav
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
1
Valid
3only
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
Both modes
32-bit mode
64-bit mode
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0000
0000
1100
0011
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
1100
0011
Mux
sel3
Mux
sel0
Mux
sel1
Mux
sel2
(3:0)
(3:0)
(3:0)
(3:0)
0000
xxxx
0000
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
1000
xxxx
1000
xxxx
xxxx
0000
xxxx
0000
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
0001
xxxx
0001
xxxx
0010
xxxx
0010
xxxx
0011
xxxx
0011
xxxx
0100
xxxx
0100
xxxx
0101
xxxx
0101
xxxx
0110
xxxx
0110
xxxx
0111
xxxx
0111
xxxx
1000
xxxx
1000
Table 39: Read1 LUT for Pattern7 (Decimate shorts by 4), offsets 0-7
64-bit mode
32-bit mode
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
1
0
1
1
1
0
1
1
1
0
1
1
1
0
1
0
1
1
1
0
1
1
1
0
1
1
1
0
1
1
1
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
1
1
0
0
1
1
1
0
0
1
1
0
0
0
1
1
0
0
1
1
0
0
0
1
1
0
0
1
1
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
Both modes
Mux
sel0
Mux
sel1
(3:0)
0000
0110
0100
0010
0001
0111
0101
0011
0010
0000
0110
0100
0011
0001
0111
0101
0100
0010
0000
0110
0101
0011
0001
0111
0110
0100
0010
0000
0111
0101
0011
0001
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
0001
0111
0101
0011
0010
1000
0110
0100
0011
0001
0111
0101
0100
0010
1000
0110
0101
0011
0001
0111
0110
0100
0010
1000
0111
0101
0011
0001
1000
0110
0100
0010
0011
1001
0111
0101
0100
1010
1000
0110
0101
0011
1001
0111
0110
0100
1010
1000
0111
0101
0011
1001
1000
0110
0100
1010
1001
0111
0101
0011
1010
1000
0110
0100
0100
1010
1000
0110
0101
1011
1001
0111
0110
0100
1010
1000
0111
0101
1011
1001
1000
0110
0100
1010
1001
0111
0101
1011
1010
1000
0110
0100
1011
1001
0111
0101
Table 40: Read0 LUT for Pattern8 (Extract two channels), offsets 0-7
Both modes
32-bit mode
64-bit mode
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
1
0
1
1
1
0
1
1
1
1
1
0
1
1
1
0
1
1
1
0
1
1
1
0
1
0
1
1
1
0
1
1
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
0000
1111
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
1
1
0
0
1
1
1
0
0
1
1
0
0
0
1
1
0
0
1
1
0
0
0
1
1
0
0
1
1
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
1111
Mux
sel0
(3:0)
0000
0110
0100
0010
0001
0111
0101
0011
0010
0000
0110
0100
0011
0001
0111
0101
0100
0010
0000
0110
0101
0011
0001
0111
0110
0100
0010
0000
0111
0101
0011
0001
Mux
sel1
(3:0)
0001
0111
0101
0011
0010
1000
0110
0100
0011
0001
0111
0101
0100
0010
1000
0110
0101
0011
0001
0111
0110
0100
0010
1000
0111
0101
0011
0001
1000
0110
0100
0010
Mux
sel2
(3:0)
0011
1001
0111
0101
0100
1010
1000
0110
0101
0011
1001
0111
0110
0100
1010
1000
0111
0101
0011
1001
1000
0110
0100
1010
1001
0111
0101
0011
1010
1000
0110
0100
Table 41: Read1 LUT for Pattern8 (Extract two channels), offsets 0-7
Mux
sel3
(3:0)
0100
1010
1000
0110
0101
1011
1001
0111
0110
0100
1010
1000
0111
0101
1011
1001
1000
0110
0100
1010
1001
0111
0101
1011
1010
1000
0110
0100
1011
1001
0111
0101
64-bit mode
32-bit mode
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
0
x
1
x
1
x
1
x
0
x
1
x
1
x
1
x
1
x
0
x
1
x
1
x
1
x
0
x
1
x
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
0
1
1
1
1
0
1
1
1
1
0
1
1
1
1
0
1
1
0
1
0
1
1
0
1
1
1
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
Both modes
Mux
sel0
Mux
sel1
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
(3:0)
0000
0011
0110
0001
0100
0111
0010
0101
0001
0100
0111
0010
0101
0000
0011
0110
0010
0101
0000
0011
0110
0001
0100
0111
0011
0110
0001
0100
0111
0010
0101
0000
0001
0100
0111
0010
0101
1000
0011
0110
0010
0101
1000
0011
0110
0001
0100
0111
0011
0110
0001
0100
0111
0010
0101
1000
0100
0111
0010
0101
1000
0011
0110
0001
0010
0101
1000
0011
0110
1001
0100
0111
0011
0110
1001
0100
0111
0010
0101
1000
0100
0111
0010
0101
1000
0011
0110
1001
0101
1000
0011
0110
1001
0100
0111
0010
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
Table 42: Read0 LUT for Pattern9 (Select Every Pixel), offsets 0-3
Addr
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
Dav
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
Valid
3only
1
x
1
x
1
x
0
x
1
x
1
x
1
x
0
x
0
x
1
x
1
x
1
x
0
x
1
x
1
x
1
x
Muxena
(3:0)
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
Both modes
32-bit mode
64-bit mode
Dav
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Valid
3only
1
0
1
1
1
1
0
1
1
1
1
0
0
1
1
0
1
0
1
1
0
1
1
1
1
0
1
1
1
0
1
1
Muxena
(3:0)
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
0101
1000
0011
0110
0001
0100
0111
0010
0110
0001
0100
0111
0010
0101
1000
0011
0111
0010
0101
1000
0011
0110
0001
0100
1000
0011
0110
0001
0100
0111
0010
0101
0110
1001
0100
0111
0010
0101
1000
0011
0111
0010
0101
1000
0011
0110
1001
0100
1000
0011
0110
1001
0100
0111
0010
0101
1001
0100
0111
0010
0101
1000
0011
0110
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
Mux
sel0
Mux
sel1
(3:0)
0100
0111
0010
0101
0000
0011
0110
0001
0101
0000
0011
0110
0001
0100
0111
0010
0110
0001
0100
0111
0010
0101
0000
0011
0111
0010
0101
0000
0011
0110
0001
0100
Table 43: Read0 LUT for Pattern9 (Select every pixel), offsets 4-7
64-bit mode
32-bit mode
Both modes
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
Mux
sel0
Mux
sel1
Mux
sel2
(3:0)
(3:0)
(3:0)
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
0
x
1
x
1
x
1
x
1
x
0
x
1
x
1
x
1
x
0
x
0
x
1
x
1
x
1
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
0
1
1
1
1
0
1
1
1
1
0
1
1
1
1
0
1
1
0
1
0
1
1
0
1
1
1
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
0000
0011
0110
0001
0100
0111
0010
0101
0001
0100
0111
0010
0101
0000
0011
0110
0010
0101
0000
0011
0110
0001
0100
0111
0011
0110
0001
0100
0111
0010
0101
0000
0001
0100
0111
0010
0101
1000
0011
0110
0010
0101
1000
0011
0110
0001
0100
0111
0011
0110
0001
0100
0111
0010
0101
1000
0100
0111
0010
0101
1000
0011
0110
0001
0010
0101
1000
0011
0110
1001
0100
0111
0011
0110
1001
0100
0111
0010
0101
1000
0100
0111
0010
0101
1000
0011
0110
1001
0101
1000
0011
0110
1001
0100
0111
0010
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
Table 44: Read1 LUT for Pattern9 (Select Every Pixel), offsets 0-3
Mux
sel3
Addr
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
Dav
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
Valid
3only
x
0
x
1
x
1
x
1
x
1
x
0
x
1
x
0
x
0
x
1
x
1
x
1
x
0
x
1
x
0
x
1
Muxena
(3:0)
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
Both modes
32-bit mode
64-bit mode
Dav
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Valid
3only
1
0
1
1
1
1
0
1
1
1
1
0
0
1
1
0
1
0
1
1
0
1
1
1
1
0
1
1
1
0
1
1
Muxena
(3:0)
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
Mux
sel1
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
(3:0)
0100
0111
0010
0101
0000
0011
0110
0001
0101
0000
0011
0110
0001
0100
0111
0010
0110
0001
0100
0111
0010
0101
0000
0011
0111
0010
0101
0000
0011
0110
0001
0100
0101
1000
0011
0110
0001
0100
0111
0010
0110
0001
0100
0111
0010
0101
1000
0011
0111
0010
0101
1000
0011
0110
0001
0100
1000
0011
0110
0001
0100
0111
0010
0101
0110
1001
0100
0111
0010
0101
1000
0011
0111
0010
0101
1000
0011
0110
1001
0100
1000
0011
0110
1001
0100
0111
0010
0101
1001
0100
0111
0010
0101
1000
0011
0110
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
Mux
sel0
Table 45: Read1 LUT for Pattern9 (Select every pixel), offsets 4-7
64-bit mode
32-bit mode
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
0
x
1
x
0
x
1
x
1
x
1
x
0
x
1
x
0
x
1
x
1
x
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
0
1
1
1
1
0
1
1
1
0
1
1
1
1
0
1
1
0
1
1
1
0
1
1
1
1
1
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
Both modes
Mux
sel0
Mux
sel1
Mux
sel2
(3:0)
(3:0)
(3:0)
(3:0)
0000
0110
0100
0010
0001
0111
0101
0011
0010
0000
0110
0100
0011
0001
0111
0101
0100
0010
0000
0110
0101
0011
0001
0111
0110
0100
0010
0000
0111
0101
0011
0001
0001
0111
0101
0011
0010
1000
0110
0100
0011
0001
0111
0101
0100
0010
1000
0110
0101
0011
0001
0111
0110
0100
0010
1000
0111
0101
0011
0001
1000
0110
0100
0010
0010
1001
0110
0100
0011
1001
0111
0101
0100
0010
1000
0110
0101
0011
1001
0111
0110
0100
0010
1000
0111
0101
0011
1001
1000
0110
0100
0010
1001
0111
0101
0011
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
Table 46: Read0 LUT for Pattern10 (Decimate pixels by 2), offsets 0-7
Mux
sel3
Addr
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111
Dav
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
x
1
Both modes
32-bit mode
64-bit mode
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
x
0
x
1
x
0
x
1
x
1
x
1
x
1
x
1
x
1
x
0
x
1
x
1
x
1
x
1
x
1
x
1
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
xxxx
1110
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
0
1
1
1
1
0
1
1
1
0
1
1
1
1
0
1
1
0
1
1
1
0
1
1
1
1
1
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
1110
Mux
sel0
Mux
sel1
Mux
sel2
(3:0)
(3:0)
(3:0)
(3:0)
0000
0110
0100
0010
0001
0111
0101
0011
0010
0000
0110
0100
0011
0001
0111
0101
0100
0010
0000
0110
0101
0011
0001
0111
0110
0100
0010
0000
0111
0101
0011
0001
0001
0111
0101
0011
0010
1000
0110
0100
0011
0001
0111
0101
0100
0010
1000
0110
0101
0011
0001
0111
0110
0100
0010
1000
0111
0101
0011
0001
1000
0110
0100
0010
0010
1001
0110
0100
0011
1001
0111
0101
0100
0010
1000
0110
0101
0011
1001
0111
0110
0100
0010
1000
0111
0101
0011
1001
1000
0110
0100
0010
1001
0111
0101
0011
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
Table 47: Read1 LUT for Pattern10 (Decimate pixels by 2), offsets 0-7
Mux
sel3
64-bit mode
32-bit mode
Both modes
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
Mux
sel0
(3:0)
(3:0)
(3:0)
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
1
1
0
0
0
0
1
1
0
0
0
0
1
1
0
0
0
0
1
1
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
0
1
1
1
1
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1
1
0
x
x
x
1
1
0
x
x
x
1
1
0
x
x
x
1
1
0
x
x
x
1
1
1
x
x
x
1
1
1
x
x
x
1
0
1
x
x
x
1
0
1
x
x
x
1110
1110
0000
xxxx
xxxx
xxxx
1110
1110
0000
xxxx
xxxx
xxxx
1110
1110
0000
xxxx
xxxx
xxxx
1110
1110
0000
xxxx
xxxx
xxxx
0000
0100
xxxx
0000
0100
xxxx
0001
0101
xxxx
0001
0101
xxxx
0010
0110
xxxx
0010
0110
xxxx
0011
0111
xxxx
0011
0111
xxxx
0001
0101
xxxx
0001
0101
xxxx
0010
0110
xxxx
0010
0110
xxxx
0011
0111
xxxx
0011
0111
xxxx
0100
1000
xxxx
0100
1000
xxxx
0010
0110
xxxx
0010
0110
xxxx
0011
0111
xxxx
0011
0111
xxxx
0100
1000
xxxx
0100
1000
xxxx
0101
1001
xxxx
0101
1001
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
Mux
sel1
Mux
sel2
Table 48: Read0 LUT for Pattern11 (Decimate Pixels by 4), offsets 0-3
Mux
sel3
Addr
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
Both modes
32-bit mode
64-bit mode
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
1
1
0
0
0
0
1
1
0
0
0
0
1
1
0
0
0
0
1
1
0
0
0
0
1
0
1
1
0
1
1
0
1
1
0
1
0
0
1
0
0
1
0
0
1
0
0
1
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1
1
0
x
x
x
1
1
0
x
x
x
1
1
0
x
x
x
1
1
0
x
x
x
1
0
1
x
x
x
1
0
1
x
x
x
0
0
1
x
x
x
0
0
1
x
x
x
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
0101
1001
xxxx
0101
1001
xxxx
0110
1010
xxxx
0110
1010
xxxx
0111
1011
xxxx
0111
1011
xxxx
1000
1100
xxxx
1000
1100
xxxx
0110
1010
xxxx
0110
1010
xxxx
0111
1011
xxxx
0111
1011
xxxx
1000
1100
xxxx
1000
1100
xxxx
1001
1101
xxxx
1001
1101
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
Mux
sel0
Mux
sel1
(3:0)
0100
1000
xxxx
0100
1000
xxxx
0101
1001
xxxx
0101
1001
xxxx
0110
1010
xxxx
0110
1010
xxxx
0111
1011
xxxx
0111
1011
xxxx
Table 49: Read0 LUT for Pattern11 (Decimate pixels by 4), offsets 4-7
Addr
Dav
64-bit mode
Valid
3only
Muxena
(3:0)
Dav
32-bit mode
Valid
3only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
0
0
0
1
1
0
0
0
0
1
1
0
0
0
0
1
1
0
0
0
0
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
0
1
1
0
1
1
0
1
1
1
0
x
x
x
1
1
0
x
x
x
1
1
0
x
x
x
1
1
0
x
x
x
1
1
1
x
x
x
1
1
1
x
x
x
1
0
1
x
x
x
1
0
1
x
x
x
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
1110
1110
0000
xxxx
xxxx
xxxx
1110
1110
0000
xxxx
xxxx
xxxx
1110
1110
0000
xxxx
xxxx
xxxx
1110
1110
0000
xxxx
xxxx
xxxx
Mux
sel0
Both modes
Mux
sel1
Mux
sel2
(3:0)
(3:0)
(3:0)
(3:0)
0000
0100
xxxx
0000
0100
xxxx
0001
0101
xxxx
0001
0101
xxxx
0010
0110
xxxx
0010
0110
xxxx
0011
0111
xxxx
0011
0111
xxxx
0001
0101
xxxx
0001
0101
xxxx
0010
0110
xxxx
0010
0110
xxxx
0011
0111
xxxx
0011
0111
xxxx
0100
1000
xxxx
0100
1000
xxxx
0010
0110
xxxx
0010
0110
xxxx
0011
0111
xxxx
0011
0111
xxxx
0100
1000
xxxx
0100
1000
xxxx
0101
1001
xxxx
0101
1001
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
Table 50: Read1 LUT for Pattern11 (Decimate pixels by 4), offsets 0-3
Mux
sel3
Addr
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
Dav
0
0
0
1
1
0
0
0
0
1
1
0
0
0
0
1
1
0
0
0
0
1
1
0
64-bit mode
Valid
3only
Muxena
(3:0)
1
0
1
1
0
1
1
0
1
1
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
Dav
32-bit mode
Valid
3only
Muxena
(3:0)
1
1
0
x
x
x
1
1
0
x
x
x
1
1
0
x
x
x
1
1
0
x
x
x
1
0
1
x
x
x
1
0
1
x
x
x
0
0
1
x
x
x
0
0
1
x
x
x
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
1110
1110
0000
0000
0000
0000
Mux
sel0
Both modes
Mux
sel1
Mux
sel2
(3:0)
(3:0)
(3:0)
(3:0)
0100
1000
xxxx
0100
1000
xxxx
0101
1001
xxxx
0101
1001
xxxx
0110
1010
xxxx
0110
1010
xxxx
0111
1011
xxxx
0111
1011
xxxx
0101
1001
xxxx
0101
1001
xxxx
0110
1010
xxxx
0110
1010
xxxx
0111
1011
xxxx
0111
1011
xxxx
1000
1100
xxxx
1000
1100
xxxx
0110
1010
xxxx
0110
1010
xxxx
0111
1011
xxxx
0111
1011
xxxx
1000
1100
xxxx
1000
1100
xxxx
1001
1101
xxxx
1001
1101
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
Table 51: Read1 LUT for Pattern11 (Decimate pixels by 4), offsets 4-7
Mux
sel3
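| Addr | 64-bit Dav | 64-bit Valid 3only | 64-bit Muxena (3:0) | 32-bit Dav | 32-bit Valid 3only | 32-bit Muxena (3:0) | Both: Mux sel0 (3:0) | Both: Mux sel1 (3:0) | Both: Mux sel2 (3:0) | Both: Mux sel3 (3:0) |
|---|---|---|---|---|---|---|---|---|---|---|
| 00000 | 1 | 1 | 1111 | 1 | 1 | 1111 | 0000 | 0001 | 0010 | 0011 |
| 00001 | 0 | 1 | 0000 | x | x | xxxx | 0000 | 0001 | 0010 | 0011 |
| 00010 | 1 | 1 | 1111 | 1 | 1 | 1111 | 0001 | 0010 | 0011 | 0100 |
| 00011 | 0 | 1 | 0000 | x | x | xxxx | 0001 | 0010 | 0011 | 0100 |
| 00100 | 1 | 1 | 1111 | 1 | 1 | 1111 | 0010 | 0011 | 0100 | 0101 |
| 00101 | 0 | 1 | 0000 | x | x | xxxx | 0010 | 0011 | 0100 | 0101 |
| 00110 | 1 | 1 | 1111 | 1 | 1 | 1111 | 0011 | 0100 | 0101 | 0110 |
| 00111 | 0 | 1 | 0000 | x | x | xxxx | 0011 | 0100 | 0101 | 0110 |
| 01000 | 1 | 1 | 1111 | 1 | 1 | 1111 | 0100 | 0101 | 0110 | 0111 |
| 01001 | 0 | 1 | 0000 | x | x | xxxx | 0100 | 0101 | 0110 | 0111 |
| 01010 | 1 | 0 | 1111 | 1 | 0 | 1111 | 0101 | 0110 | 0111 | 1000 |
| 01011 | 0 | 0 | 0000 | x | x | xxxx | 0101 | 0110 | 0111 | 1000 |
| 01100 | 1 | 0 | 1111 | 1 | 0 | 1111 | 0110 | 0111 | 1000 | 1001 |
| 01101 | 0 | 0 | 0000 | x | x | xxxx | 0110 | 0111 | 1000 | 1001 |
| 01110 | 1 | 0 | 1111 | 1 | 0 | 1111 | 0111 | 1000 | 1001 | 1010 |
| 01111 | 0 | 0 | 0000 | x | x | xxxx | 0111 | 1000 | 1001 | 1010 |

Table 52: Read0 LUT for Pattern12 (Decimate words by 2), offsets 0-7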
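The Mux sel columns tabulated for Pattern6 (Table 37), Pattern12 (Table 52) and Pattern13 (Table 54) follow a regular pattern in the listed values: each offset 0-7 occupies a fixed number of consecutive addresses, and every Mux sel field equals the offset plus a constant byte-lane increment, or is a don't-care. The sketch below regenerates those columns from that observation. It is only a reading of the tabulated values, not the method used to produce the LUTs; the function name `mux_sel_rows` and the constants `PATTERN6`, `PATTERN12` and `PATTERN13` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# One entry per address within an offset: the four byte-lane increments
# for Mux sel0..sel3, or None for a row whose selects are all don't-care.
Phase = Optional[Tuple[int, int, int, int]]


def mux_sel_rows(phases: Sequence[Phase], offsets: int = 8) -> List[Tuple[str, ...]]:
    """Return the Mux sel0..sel3 fields (4-bit strings) for each LUT address."""
    rows: List[Tuple[str, ...]] = []
    for offset in range(offsets):
        for lanes in phases:
            if lanes is None:
                rows.append(("xxxx",) * 4)
            else:
                rows.append(tuple(format(offset + lane, "04b") for lane in lanes))
    return rows


# Increments inferred from the tabulated values (assumption, see lead-in).
PATTERN6 = [(0, 1, 4, 5)] * 2                      # decimate shorts by 2
PATTERN12 = [(0, 1, 2, 3)] * 2                     # decimate words by 2
PATTERN13 = [(0, 1, 2, 3), (4, 5, 6, 7), None]     # decimate double words by 2

if __name__ == "__main__":
    for addr, row in enumerate(mux_sel_rows(PATTERN12)):
        print(format(addr, "05b"), *row)
```

Running the module prints the sixteen Pattern12 rows in address order, which can be compared line by line with the Mux sel columns of the table above.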
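| Addr | 64-bit Dav | 64-bit Valid 3only | 64-bit Muxena (3:0) | 32-bit Dav | 32-bit Valid 3only | 32-bit Muxena (3:0) | Both: Mux sel0 (3:0) | Both: Mux sel1 (3:0) | Both: Mux sel2 (3:0) | Both: Mux sel3 (3:0) |
|---|---|---|---|---|---|---|---|---|---|---|
| 00000 | 0 | 1 | 0000 | 1 | 1 | 1111 | 0000 | 0001 | 0010 | 0011 |
| 00001 | 1 | 1 | 1111 | x | x | xxxx | 0000 | 0001 | 0010 | 0011 |
| 00010 | 0 | 1 | 0000 | 1 | 1 | 1111 | 0001 | 0010 | 0011 | 0100 |
| 00011 | 1 | 1 | 1111 | x | x | xxxx | 0001 | 0010 | 0011 | 0100 |
| 00100 | 0 | 1 | 0000 | 1 | 1 | 1111 | 0010 | 0011 | 0100 | 0101 |
| 00101 | 1 | 1 | 1111 | x | x | xxxx | 0010 | 0011 | 0100 | 0101 |
| 00110 | 0 | 1 | 0000 | 1 | 1 | 1111 | 0011 | 0100 | 0101 | 0110 |
| 00111 | 1 | 1 | 1111 | x | x | xxxx | 0011 | 0100 | 0101 | 0110 |
| 01000 | 0 | 1 | 0000 | 1 | 1 | 1111 | 0100 | 0101 | 0110 | 0111 |
| 01001 | 1 | 1 | 1111 | x | x | xxxx | 0100 | 0101 | 0110 | 0111 |
| 01010 | 0 | 0 | 0000 | 1 | 0 | 1111 | 0101 | 0110 | 0111 | 1000 |
| 01011 | 1 | 0 | 1111 | x | x | xxxx | 0101 | 0110 | 0111 | 1000 |
| 01100 | 0 | 0 | 0000 | 1 | 0 | 1111 | 0110 | 0111 | 1000 | 1001 |
| 01101 | 1 | 0 | 1111 | x | x | xxxx | 0110 | 0111 | 1000 | 1001 |
| 01110 | 0 | 0 | 0000 | 1 | 0 | 1111 | 0111 | 1000 | 1001 | 1010 |
| 01111 | 1 | 0 | 1111 | x | x | xxxx | 0111 | 1000 | 1001 | 1010 |

Table 53: Read1 LUT for Pattern12 (Decimate words by 2), offsets 0-7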
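| Addr | 64-bit Dav | 64-bit Valid 3only | 64-bit Muxena (3:0) | 32-bit Dav | 32-bit Valid 3only | 32-bit Muxena (3:0) | Both: Mux sel0 (3:0) | Both: Mux sel1 (3:0) | Both: Mux sel2 (3:0) | Both: Mux sel3 (3:0) |
|---|---|---|---|---|---|---|---|---|---|---|
| 00000 | 1 | 1 | 1111 | 1 | 1 | 1111 | 0000 | 0001 | 0010 | 0011 |
| 00001 | x | x | xxxx | 1 | 1 | 1111 | 0100 | 0101 | 0110 | 0111 |
| 00010 | 0 | 1 | 0000 | 0 | 1 | 0000 | xxxx | xxxx | xxxx | xxxx |
| 00011 | 1 | 1 | 1111 | 1 | 1 | 1111 | 0001 | 0010 | 0011 | 0100 |
| 00100 | x | x | xxxx | 1 | 1 | 1111 | 0101 | 0110 | 0111 | 1000 |
| 00101 | 0 | 1 | 0000 | 0 | 1 | 0000 | xxxx | xxxx | xxxx | xxxx |
| 00110 | 1 | 1 | 1111 | 1 | 1 | 1111 | 0010 | 0011 | 0100 | 0101 |
| 00111 | x | x | xxxx | 1 | 1 | 1111 | 0110 | 0111 | 1000 | 1001 |
| 01000 | 0 | 1 | 0000 | 0 | 1 | 0000 | xxxx | xxxx | xxxx | xxxx |
| 01001 | 1 | 1 | 1111 | 1 | 1 | 1111 | 0011 | 0100 | 0101 | 0110 |
| 01010 | x | x | xxxx | 1 | 1 | 1111 | 0111 | 1000 | 1001 | 1010 |
| 01011 | 0 | 1 | 0000 | 0 | 1 | 0000 | xxxx | xxxx | xxxx | xxxx |
| 01100 | 1 | 1 | 1111 | 1 | 1 | 1111 | 0100 | 0101 | 0110 | 0111 |
| 01101 | x | x | xxxx | 1 | 1 | 1111 | 1000 | 1001 | 1010 | 1011 |
| 01110 | 0 | 1 | 0000 | 0 | 1 | 0000 | xxxx | xxxx | xxxx | xxxx |
| 01111 | 1 | 0 | 1111 | 1 | 0 | 1111 | 0101 | 0110 | 0111 | 1000 |
| 10000 | x | x | xxxx | 1 | 1 | 1111 | 1001 | 1010 | 1011 | 1100 |
| 10001 | 0 | 1 | 0000 | 0 | 1 | 0000 | xxxx | xxxx | xxxx | xxxx |
| 10010 | 1 | 0 | 1111 | 1 | 0 | 1111 | 0110 | 0111 | 1000 | 1001 |
| 10011 | x | x | xxxx | 1 | 1 | 1111 | 1010 | 1011 | 1100 | 1101 |
| 10100 | 0 | 1 | 0000 | 0 | 1 | 0000 | xxxx | xxxx | xxxx | xxxx |
| 10101 | 1 | 0 | 1111 | 1 | 0 | 1111 | 0111 | 1000 | 1001 | 1010 |
| 10110 | x | x | xxxx | 1 | 1 | 1111 | 1011 | 1100 | 1101 | 1110 |
| 10111 | 0 | 1 | 0000 | 0 | 1 | 0000 | xxxx | xxxx | xxxx | xxxx |

Table 54: Read0 LUT for Pattern13 (Decimate double words by 2), offsets 0-7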
64-bit mode
32-bit mode
Addr
Dav
Valid
3only
Muxena
(3:0)
Dav
Valid
3only
Muxena
(3:0)
00000
00001
00010
00011
00100
00101
x
1
0
x
1
0
x
1
1
x
1
1
xxxx
1111
0000
xxxx
1111
0000
1
1
0
1
1
0
1
1
1
1
1
1
1111
1111
0000
1111
1111
0000
00110
x
x
xxxx
1
1
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10100
10101
10110
10111
1
0
x
1
0
x
1
0
x
1
0
x
1
0
x
1
0
1
1
x
1
1
x
1
1
x
1
1
x
1
1
x
1
1
1111
0000
xxxx
1111
0000
xxxx
1111
0000
xxxx
1111
0000
xxxx
1111
0000
xxxx
1111
0000
1
0
1
1
0
1
1
0
1
1
0
1
1
0
1
1
0
1
1
1
1
1
1
1
1
0
1
1
0
1
1
0
1
1
Both modes
Mux
sel0
Mux
sel1
Mux
sel2
Mux
sel3
(3:0)
(3:0)
(3:0)
(3:0)
0000
0100
xxxx
0001
0101
xxxx
0001
0101
xxxx
0010
0110
xxxx
0010
0110
xxxx
0011
0111
xxxx
0011
0111
xxxx
0100
1000
xxxx
1111
0010
0011
0100
0101
1111
0000
1111
1111
0000
1111
1111
0000
1111
1111
0000
1111
1111
0000
1111
1111
0000
0110
xxxx
0011
0111
xxxx
0100
1000
xxxx
0101
1001
xxxx
0110
1010
xxxx
0111
1011
xxxx
0111
xxxx
0100
1000
xxxx
0101
1001
xxxx
0110
1010
xxxx
0111
1011
xxxx
1000
1100
xxxx
1000
xxxx
0101
1001
xxxx
0110
1010
xxxx
0111
1011
xxxx
1000
1100
xxxx
1001
1101
xxxx
1001
xxxx
0110
1010
xxxx
0111
1011
xxxx
1000
1100
xxxx
1001
1101
xxxx
1010
1110
xxxx
Table 55: Read1 LUT for Pattern13 (Decimate double words by 2), offsets 0-7
10.3 RP Subsystem Parts List

| Name | Manufacturer | Part Number | Quantity | Package Type |
|---|---|---|---|---|
| RP | Altera | EPF10K100GC503-3 | 1 | PGA |
| SAG | Altera | EPF10K50BC356-3 | 1 | BGA |
| DS | Altera | EPF10K50BC356-3 | 1 | BGA |
| SRAM | Micron | MT58LC256K16/18B3 | 4 | TQFP |
| FIFOs | IDT | IDT723611 | 4 | PQFP |
| LVDS Transmitter | National | DS90C363 | 2 | TSSOP |
| LVDS Receiver | National | DS90CF364 | 2 | TSSOP |
| LVDS Sync/Receiver | National | DS90C402 | 1 | SOIC |

Table 56: RP Subsystem Parts List