Application Note: Virtex-II Families R XAPP688 (v1.2) May 3, 2004 Creating High-Speed Memory Interfaces with Virtex-II and Virtex-II Pro FPGAs Author: Nagesh Gupta, Maria George Summary Designing high-speed memory interfaces is a challenging task. Xilinx has invested time and effort to make it simple to design such interfaces using the Virtex-II™ and Virtex-II Pro™ FPGAs. This application note discusses the challenges presented by this task together with various techniques that can be used to overcome them, while illustrating the key concepts in implementing any memory interface. All examples used in this application note assume a DDR-1 interface on an XC2VP20FF1152-6 Virtex-II Pro FPGA. The interface speed is 200 MHz. Introduction High-speed memory interfaces are typically source-synchronous and double-data rate. In a source-synchronous design, the clocks are generated and transmitted along with the data, as shown in Figure 1. The data is typically received using the received clock and then transferred into the receivers' clock domain. Read Data Read Clock Read Data Read Clock Xilinx FPGA Read Data Read Clock Data Sources Data Sink x688_01_090403 Figure 1: Source Synchronous Memory Interface A memory interface can be modularly represented as shown in Figure 2. Creating a modular interface has many advantages. It allows designs to be ported easily. It also makes it possible to share parts of the design across different types of memory interfaces. © 2004 Xilinx, Inc. All rights reserved. All Xilinx trademarks, registered trademarks, patents, and further disclaimers are as listed at http://www.xilinx.com/legal.htm. All other trademarks and registered trademarks are the property of their respective owners. All specifications are subject to change without notice. NOTICE OF DISCLAIMER: Xilinx is providing this design, code, or information "as is." By providing the design, code, or information as one possible implementation of this feature, application, or standard, Xilinx makes no representation that this implementation is free from any claims of infringement. You are responsible for obtaining any rights you may require for your implementation. Xilinx expressly disclaims any warranty whatsoever with respect to the adequacy of the implementation, including but not limited to any warranties or representations that this implementation is free from claims of infringement and any implied warranties of merchantability or fitness for a particular purpose. XAPP688 (v1.2) May 3, 2004 www.xilinx.com 1-800-255-7778 1 R Key Challenges Xilinx FPGA Arbitration Layer Application Interface Layer Control Layer Physical Layer Memories x688_02_090403 Figure 2: Modular Memory Interface Representation Key Challenges High-speed controllers and interfaces are challenging to design. Designing high-speed memory interfaces are particularly challenging due to various factors. Some of the key challenges are • Source-synchronous data transmit (data write function). • Source-synchronous data receive (data read function). While these functions are common in all high-speed interfaces, memory interfaces make it particularly challenging due to the following factors: • Single-ended standards. Memories typically use HSTL or SSTL type input/output standard. • Non-free-running clocks. Some memories such as DDR SDRAM provide a non freerunning strobe clock for data reads. • Timing parameters. The input and output timing specifications for memories take away significant amount of the available data valid window. We have solved these key problems with patent-pending innovations. These innovations have been proven in hardware and are available for use in the form of reference designs. Physical Layer The Physical layer is responsible for transmitting and receiving all the signals to and from the memories. The major functions of the physical layer include: • Write data into the memory. • Read data from the memory. • Provide all the necessary control signals • Transfer the read data clock domain from memory domain to FPGA domain. The physical layer constitutes a major challenge in memory interface design. This section elaborates on the different aspects of the physical layer. 2 www.xilinx.com 1-800-255-7778 XAPP688 (v1.2) May 3, 2004 R Physical Layer Clocks, Address and Control Signals The FPGA generates all the clocks and control signals for reads and writes to memory. The memory clocks are typically generated using a Double Data Rate (DDR) register. Controller & Transmit Data Path IOB VCC BUFG Clock0 Clock to Memory Clock180 DCM BUFG x688_03_090803 Figure 3: Clock Generation for Memory Device As shown in Figure 3, a Digital Clock Manger (DCM) generates the clock and its inverted version. Generating the clock this way has a couple advantages: • The data, control and clock signals all go through similar delay elements while exiting the FPGA. • Clock duty cycle distortion is minimal when global clock nets are used for the clock and the 180° phase-shifted clock. All of the address and control signals are registered and output at the IOB. The address and control signals are registered using a clock that is 180° shifted from the clock signal to the memory. Such an approach enables the address and control signals to have additional time margin before they are registered. The address and control signals meet the required timing with ease. An example timing analysis for a DDR-1 interface implemented using an XC2VP20FF1152 FPGA, -6 speed grade, is shown inTable 1. Table 1: Address and Control Signal Margins Parameter TCLOCK Value Leading-Edge Uncertainties Trailing-Edge Uncertainties 5000 Meaning Clock period TCLOCK_SKEW 50 50 50 Minimal skew, since right/left sides are being used and the bits are close together TPACKAGE_SKEW 65 65 65 Using same bank reduces package skew TSETUP 750 750 0 Setup time from memory data sheet THOLD 750 0 750 Hold time from memory data sheet TPCB_LAYOUT_SKEW 50 50 50 Skew between layout lines on the board TPHASE_OFFSET_ERROR 140 140 140 Offset between different phases of the clock. Taken from the parameter CLKOUT_PHASE TDUTY_CYCLE_DISTORTION 0 0 0 Duty-cycle distortion does not apply TJITTER 0 0 0 Since the clock and address are generated using the same clock, the same jitter exists in both. Therefore, it does not need to be included 1055 1055 1445 3945 Total Uncertainties Command Window XAPP688 (v1.2) May 3, 2004 2500 www.xilinx.com 1-800-255-7778 Worst-case window of 2500 ps 3 R Physical Layer Figure 4 illustrates the address and control signal margin. Clock 180 Address/Control Memory Clock Leading Edge Uncertainties Trailing Edge Uncertainties Leading Edge Margin Trailing Edge Margin 0 1445 3945 5000 time, ps x688_04_011404 Figure 4: Address and Control Signal Margins Data Write The write data and strobe are clocked out of the FPGA. The strobe is center-aligned with respect to the data. For DDR-1, the strobe is required to be non-free-running. A strobe for DDR-1 is required only when valid data is being written into the memory. In addition, the DDR-1 strobe is bi-directional. The same strobe is used for reads and writes. = 0˚ Memory Clock Generated using clock180 Command/Address Generated using clock90 & clock270 Write Data In phase with memory clock DQS x688_05_092403 Figure 5: Write Data, Command, Address, Strobe, and Clock Signals To meet the requirements specified above, the write data is clocked out using a clock that is 90 degrees and 270 degrees shifted from the primary clock going to the memory. The DDR registers in the IOB are used to clock the write data out. This is shown in Figure 6. d0 clock270 dn clock90 x688_06_090803 Figure 6: Using IOB DDR Registers to Drive the Write Data 4 www.xilinx.com 1-800-255-7778 XAPP688 (v1.2) May 3, 2004 R Physical Layer Table 2 shows a calculation of the extra margin available for write data capture using the techniques just described. An illustration of the calculated time margin window is shown in Figure 7. Table 2: Write Data Timing Margin Calculation Parameter Value Leading-Edge Uncertainties Trailing-Edge Uncertainties Meaning TCLOCK 5000 Clock period TCLOCK_PHASE 2500 Clock phase TDCD 250 Duty cycle distortion of clock to memory TDATA_PERIOD 2250 Total data period, TCLOCK_PHASE – TDCD TCLOCK_SKEW 50 50 50 Minimal skew, since right/left sides are being used and the bits are close together TPACKAGE_SKEW 65 65 65 Skew due to package pins and board layout. This can be reduced further with tighter layout TSETUP 450 450 0 Setup time from memory data sheet THOLD 450 0 450 Hold time from memory data sheet TPCB_LAYOUT_SKEW 50 50 50 Skew between layout lines on the board TPHASE_OFFSET_ERROR 140 140 140 Offset error between different clocks from the same DCM. Taken from the parameter CLKOUT_PHASE 0 0 0 Total Uncertainties 1205 755 755 Worst case for leading and trailing can never happen simultaneously Window 740 755 1495 Total worst-case window is 740 ps TJITTER The same DCM is used to generate the clock and data. Hence they jitter together Leading Edge Margin Trailing Edge Margin Write Data DQS Leading Edge Uncertainties Trailing Edge Uncertainties 0 755 1545 2250 time, ps x688_07_011404 Figure 7: Write Data Margin Strobe Delay Circuit The strobe is delayed such that it falls within the data valid window. There are several mechanisms that can be used to delay the strobe. 1. Introduce an external board trace delay. This mechanism has been explained in earlier Xilinx Application Notes such as XAPP253. 2. Use a delay line on the board as opposed to a trace delay. This mechanism is similar to the external trace delay, but will use a delay line to create the required delay. XAPP688 (v1.2) May 3, 2004 www.xilinx.com 1-800-255-7778 5 R Physical Layer 3. Delay a free-running strobe using the DCM. This mechanism can be used in memories such as QDR-2, FCRAM-2 and RLDRAM-2. The disadvantage of this mechanism is it requires additional DCM and clocking resources for each strobe. A QDR-2 interface using this mechanism is explained in XAPP 262. 4. Use an internal delay element with precisely controlled delays. An internal delay element has been built of LUTs(Look Up Tables) and other resources in the Xilinx FPGA fabric. Selecting the elements and all routing resources used within the elements defines the delay values precisely. Methods (3) and (4) above are the preferred methods. Method (3) cannot be used if the FPGA is short on clocking resources or if a free-running strobe is not available. Method (4) has been demonstrated successfully for DDR-1 interface at 200 MHz. Input Output LUT LUT LUT LUT LUT LUT x688_08_090803 Figure 8: Building Delay Elements Using LUTs in a Xilinx FPGA Figure 8 shows the way in which the LUTs are configured to create selectable delay elements. The delay value of each MUX element is between 200 and 300 picoseconds. The extra delay of the strobe with respect to data is calculated using a calibration circuit that ensures that the strobe is within the worst-case data valid window. The total extra delay in the strobe path (compared to the data path) should fall within the valid data window. Calculation of the valid data window is described in data read section. Data Read Receiving the source-synchronous data from memory is the most interesting part of the memory interface. There are several novel techniques we have developed in this area. Memory read data is edge aligned with a source-synchronous clock. In case of DDR-1, the clock is a non-free running strobe. The challenge is to receive the data using the strobe and transfer it to the FPGA clock domain. Since the memory clock is not free running, there can only be one stage of registers using this clock. The input side of the data bits uses exactly the same resources as used by the input side of the strobe. This ensures matched delays on the strobe and data signals until the strobe is delayed in the strobe delay circuit. That is one of the reason why this design captures the data directly into registers in the FPGA fabric. A separate circuit transfers the output of these registers into FIFOs clocked using the FPGA clock. Data Capture Data is captured using the delayed strobe mentioned previously. Data is directly captured in the FPGA fabric slices. By using novel techniqies, it is possible to keep the captured data valid for two entire clock cycles. The timing calculation for data capture is shown in Table 3. An illustration of the calculated data valid window is shown in Figure 9. 6 www.xilinx.com 1-800-255-7778 XAPP688 (v1.2) May 3, 2004 R Physical Layer Table 3: Data Valid Window Calculation Parameter Value Leading-Edge Uncertainties Trailing-Edge Uncertainties Meaning TCLOCK 5000 Clock period TPHASE 2500 Clock phase TMEM_DCD 250 Duty cycle distortion from memory DLL TDATA_PERIOD 2250 Total data period, TPHASE – TMEM_DCD TDQSQ 500 500 0 Strobe-to-data distortion TPACKAGE_SKEW 65 65 65 This parameter depends on the exact package. Since the 8 data bits are close together, skew is less than this TSETUP 240 240 0 Setup time from Virtex-II Pro data sheet for -6 part. Taken from TDICK parameter. THOLD –50 0 –50 Hold time from Virtex-II Pro data sheet. Taken from TCKDI parameter. TJITTER 100 0 0 Data and strobe jitter together, since they are generated off the same clock TLOCAL_CLOCK_LINE 25 25 25 Observed skew is lower than this value, since loading is light and all bits are close together TPCB_LAYOUT_SKEW 50 50 50 Skew between data lines on the board TQHS 450 0 450 Hold skew factor for DQ 880 540 880 1710 Uncertainties Window 830 Worst-case window of 830 ps Leading Edge Margin Trailing Edge Margin Read Data Read DQS Leading Edge Uncertainties 0 880 1710 2250 time, ps Trailing Edge Uncertainties x688_10_050304 Figure 9: Data Valid Window XAPP688 (v1.2) May 3, 2004 www.xilinx.com 1-800-255-7778 7 R Physical Layer A timing diagram of data capture is shown in Figure 10. Delayed Strobe Data word0 Register 0 word1 word2 word3 word4 word5 word0 Register 1 word6 word7 word8 word9 word10 word4 word1 Register 2 word5 word2 Register 3 word6 word3 word7 x688_18a_011504 Figure 10: Data Capture Timing Data Recapture Data recapture is the process of transferring the data into the FPGA clock domain. The recapture circuit works for any system-level timing. No system-level calculations are required for the circuit to work. The extended period for which each piece of data is valid makes the task of recapture easier. The data is directly transferred from the data capture registers into separate FIFOs. Delay Calibration Circuit The delay calibration circuit selects the number of delay elements to be used to create extra delay through the strobe line. The delay calibration circuit calculates the delay of a circuit that is identical in all respects to the strobe delay circuit. All aspects of delay are considered for calibration. This includes all the component delays, as well as the route delays. The calibration circuit selects the number of delay elements to be used at any given time. After the required number of delay elements is determined, the calibration circuit asserts the appropriate select lines to the strobe delay circuit. The select lines of the strobe delay circuit can be changed only when no read data is being received. Selecting the required number of delay elements as shown avoids any uncertainty due to process, voltage, and temperature (PVT) variance. The positive and negative hysterisys for transition from selecting one set of delay elements to another set is configurable. The number of contiguous selection transitions required to make the transition is also configurable. During operation, the temperature of the die can cause the delay elements to get slower. If this happens, the number of selected delay elements changes automatically. 8 www.xilinx.com 1-800-255-7778 XAPP688 (v1.2) May 3, 2004 R Control Layer and Application Interface Layer Control Layer and Application Interface Layer The control and application layers are simple to design in the Xilinx FPGA fabric. The control layer is responsible to generate all control signals according to a memory protocol. The application interface layer is used to transfer data and commands between the user application and the memory interface blocks. A simple user application interface is as shown in Figure 11. User Application Read Data FIFO Write Data FIFO Command FIFO (Could Be Separate for Read and Write Commands) x688_16_090803 Figure 11: User Application Interface Tools Implementation of the interface described in this application note is quite intricate. There are several instances where very precise routing delays and placements are required. To enable customers to design effectively using the methods described, a tool is provided that works with various Virtex-II and Virtex-II Pro devices and generates all the required constraints for the design. The tool can also generate the recommended pinouts for the data bus. The constraints generated by the tool must be consistent with the RTL and the RTL hierarchy. Implementation This section describes various implementation considerations. Some differences exist between different memory types. The techniques described in this application are applicable in general to all of the different memory types. This section talks about the differences between different memories and how the techniques can be applied to different memories. Board Design Considerations Board design considerations for a memory interface are not very different from designing other high-speed interfaces. Several Xilinx application notes describe the intricacies of building highspeed boards. XAPP623 describes advanced techniques in power system design. Xilinx user guide UG060 describes a memory board designed for Virtex-II Pro using most of the techniques described in this application note. DDR-1 SDRAM Interface DDR-1 is arguably the most complex memory interface. All of the examples in this application note were for DDR-1. DDR-1 controller has been illustrated in other Xilinx application notes such as XAPP253. XAPP688 (v1.2) May 3, 2004 www.xilinx.com 1-800-255-7778 9 R References QDR-2 SRAM Interface QDR-2/DDR-2 SRAM is different in many aspects compared to DDR-1 SDRAM. This interface is described in detail in XAPP262. The controller is much simpler, since there is no need for separate row and column addresses. The physical layer is simpler since these memories have a contiguous or free-running strobe. A contiguous strobe can be easily used to capture the data in multiple stages and finally into an asynchronous FIFO. One complexity is the lack of a data valid signal. The absence of a data valid signal leaves the state machine to predict when valid data will be available to register. The state machine predicts the return of valid data by counting the number of cycles after the read command is issued. However, the input and output buffer speeds of different FPGAs or different speed grades of FPGAs can potentially cause the data to return at different cycle boundaries. This reduces the maximum frequency of operation for this kind of interface. Another novel innovation solves this problem, and makes it possible to determine the exact time when valid data will be received. RLDRAM-2/FCRAM-2 Interfaces Both RLDRAM-2 and FCRAM-2 interfaces are relatively simple. The controller design for either of these memories is comparable in complexity to QDR-2 controller. There are no separate row and column addresses to deal with. In addition, the strobes are contiguous and there is a separate valid signal that indicates the arrival of valid data. Xilinx will publish an RLDRAM-2 interface application note shortly. References 1. Xilinx Application Notes XAPP253, XAPP609, XAPP262, XAPP678C, and XAPP688C. 2. Micron data sheet for DDR-1 DIMMs, MT4VDDT1664 A. 3. Micron data sheets for DDR-1 MT46V16M16. Conclusion The innovative techniques described in this application note make it easy to design any memory interface with Xilinx Virtex-II and Virtex-II Pro FPGAs. Revision History The following table shows the revision history for this document. 10 Date Version Revision 01/15/04 1.0 Initial Xilinx release. 02/02/04 1.1 Removed Figure 12 (screenshot of unreleased tool). 03/01/04 1.1.1 Table 2: Corrected description for TSETUP and THOLD. 05/03/04 1.2 • Table 1 and Table 2: Added derivation of parameter to "Meaning" column for TPHASE_OFFSET_ERROR . • Table 3: Added derivation of parameters to "Meaning" column and revised parameter values for TSETUP and THOLD. Recalculated derived values (Uncertainties and Window). • Figure 9: Modified data window parameters in accord with changes in Table 3. www.xilinx.com 1-800-255-7778 XAPP688 (v1.2) May 3, 2004