Tandem Configuration for UltraScale and UltraScale+
Vivado 2017.3, October 2017
© Copyright 2017 Xilinx

Tandem Configuration Addresses Two PCIe & Configuration Requirements
1. PCIe Specification Requirement
  – Need 120 ms PCIe response time for enumeration in open system
2. Cost Reduction
  – Want small & inexpensive flash (QSPI, BPI) for configuration
  - or -
  – Use flash or memory already present in the system
PCIe Solution:
  – Split configuration into two stages (Tandem)
    • 1st: Configure the PCIe interface and clocking
    • 2nd: Configure the remainder of FPGA
[Figure: Tandem Bitstream and User App]

PCI Express Configuration Solution Summary
Solution | Tandem PROM | Tandem PCIe | Tandem with Field Updates | Tandem with Partial Reconfig
Device Support | 7 Series, Zynq, UltraScale, US+ | 7 Series, UltraScale, US+ | UltraScale, US+ | UltraScale, US+
Delivery | IP Catalog | IP Catalog | IP Catalog | IP Catalog
Complexity | Moderate | Moderate | Advanced | Advanced
System Solutions | 120ms PCIe config | 120ms PCIe config; PROM size reduced | 120ms PCIe config; PROM size reduced; Field Updates | 120ms PCIe config; PROM size reduced; Flexible reconfiguration
PR license required? | No | No | No | Yes
Xilinx PCIe IP cores supported:
  – UltraScale PCI Express Gen3 Integrated Block (streaming) for UltraScale – PG156
  – AXI Bridge for PCI Express for UltraScale – PG194
  – DMA Subsystem for PCI Express for UltraScale and UltraScale+ – PG195
  – PCI Express Gen4 Integrated Block for UltraScale+ – PG213
* Always use the most recent software version available

Tandem PROM vs. Tandem PCIe in UltraScale/+
Solution | Tandem PROM | Tandem PCIe
Helps meet 120ms spec | Yes | Yes
Reduces PROM size | No | Yes
Stage 1 bitstream size | 1-2 MB | 1-2 MB
IP Core modifications | Minor | Minor
Field Update Support | UltraScale only | UltraScale and UltraScale+
UltraScale supports additional features beyond 7 Series
  – Field Updates of the user application while PCIe stays up
    • Reconfigurable region is vast majority of stage 2 region
  – Tandem followed by PR
    • Combination of Tandem Configuration and Partial Reconfiguration is possible
  – Dedicated links to configuration engine and PERST (US only) reduce frame count
    • Tandem PROM and Tandem PCIe utilize the same IP (US only)
    • Absolute maximum bitstream size will be similar across all devices
  – Required PCIe location is X0Y0, with associated transceivers, on monolithic devices
    • For SSI devices, location is bottom right corner of master SLR; XY value varies

UltraScale+ Support Rollout
Most devices are in production in 2017.3
  – AXI Stream core supports all devices (with PCIe hard blocks)
    • Except VU+ HBM and ZU+ RFSoC parts – those will arrive as beta in 2018
  – DMA core supports six devices; remainder arriving in 2018.1
Improvements compared to UltraScale
  – Support of all configurations up to Gen4x8, Gen3x16
  – No clearing bitstream required for Field Updates (PR)
  – Plan to support “multiple stage 2 images” for a fixed stage 1
    • Not yet supported in 2017.3
Must select MCAP-enabled PCIe hard block
  – Required PCIe location varies by device, see PG213 or PG195 for list
  – No dedicated reset pin, but bank 65 is recommended for greatest efficiency
Customer Application
BittWare creates FPGA Platforms for HPC, Network Packet Processing and Signal Processing Applications
  – COTS PCIe platforms built with Kintex UltraScale or Virtex UltraScale devices
  – Customizable to meet customer needs
    • Up to 4 PCIe Gen3 x8 interfaces
    • Variety of interfaces for high-speed serial I/O
    • Wide range of memory interfaces, optional HMC module
Tandem PROM is available as an optional configuration scheme
  – Allows board to seamlessly plug into open systems
  – Exploring expanded functionality with Field Updates soon
[Figure: BittWare XUSP3R]

Customer Application
ZDS builds signal acquisition/generation cards
  – Multiple cards in the product family use Kintex UltraScale devices
  – End user adds their own application details and compiles through Vivado
Tandem PCIe allows them to comply within any environment
  – Cards designed to fit into off-the-shelf server and desktop computers
  – Supports PCIe Gen2 x8 or Gen3 x4 interfaces

Configuration Flow Options

UltraScale Configuration Options for PCIe
Option | Tandem | Tandem with Field Updates | PR Over PCIe
120ms Guarantee | Yes | Yes | None
Vivado Flow | Project, Non-project | Project*, Non-project | Project*, Non-project
Initial Configuration | Tandem PROM, Tandem PCIe | Tandem PROM, Tandem PCIe | Standard configuration
Updates after Initial Configuration | Not Available | Via PCIe | Via PCIe
Note: Tandem first stage bitstreams are not compatible
* See UG909 for Project mode details

Tandem PROM in UltraScale and UltraScale+
[Figure: system diagram with CPU, root complex, memory, other IO (system dependent), PCIe links, optional PCIe switch, and FPGA endpoints; the FPGA loads Design #1 from PROM, including the 2nd stage]
Tandem PROM
  • 120ms – Compliance
  • 1st Stage Loads from PROM/Flash
  • PCIe Activates
  • 2nd Stage Loads from PROM/Flash
  • FPGA starts to operate

Tandem PCIe in UltraScale and UltraScale+
[Figure: same system diagram; the FPGA endpoint loads Design #1 with PROM for the 1st stage only]
Tandem PCIe
  • 120ms – Compliance
  • Remote bitstream – Security, BOM Cost
  • 1st Stage from PROM/Flash
  • 2nd Stage loaded over PCIe link

Tandem with Field Updates in UltraScale
[Figure: same system diagram; the FPGA endpoint moves from Design #1 through Design #1 Clear to Design #2]
Tandem with Field Updates
  • 120ms – Compliance
  • Initial load via Tandem PROM or Tandem PCIe
  • FPGA updates over PCIe
  • Must load “clear” bitstreams
  • PCIe bus stays up

PR over PCIe in UltraScale and UltraScale+
[Figure: same system diagram; the FPGA endpoint loads Design #1 from PROM, then Region #1 Clear and Region #1 Design over PCIe]
PR over PCIe
  • Allows for PR customers to load PR regions over PCIe
  • Not guaranteed for 120ms (standard configuration used)
  • Customers responsible for isolating PCIe core during PR
  • PCIe Isolation mux controlled by system software

Tandem with Field Updates in UltraScale
Pre-defined use case of Tandem + PR
Both technologies permitted in the same design in UltraScale
  – Field Updates is Partial Reconfiguration for a specific use case and pre-defined floorplan
Configuration events should be considered independent
1. Two stage Tandem Configuration occurs (via PROM or PCIe)
2. Partial Reconfiguration is done (via PCIe or any config port)
    • Clearing bitstream precedes new partial bitstream
[Figure: at FPGA startup the PROM delivers Stage 1 (PCIe) and Stage 2 (User App 0) through the CFG port of the UltraScale FPGA; Clear 0/1/2 and Partial User App 0/1/2 bitstreams are then delivered over the PCIe link while the PCIe static frames stay in place]

Tandem with Field Updates in UltraScale+
Planned for release in Vivado 2018.1
New features in silicon improve the solution
Events are independent, but bitstreams are consolidated
1. Two stage Tandem Configuration occurs (via Tandem PCIe only)
    • Any compatible Stage 2 bitstream can be used
2. Partial Reconfiguration is done (via PCIe or any config port)
    • Using the same set of Stage 2 bitstreams – these ARE partial bitstreams (and no clearing!)
[Figure: at FPGA startup the PROM delivers Stage 1 (PCIe) through the CFG port of the UltraScale+ FPGA; Stage 2 / Partial User App 0/1/2 bitstreams are delivered over the PCIe link]

Tandem with Field Updates UltraScale+ Status
Vivado 2017.3 does not include Multiple Stage Two bitstreams
  – Field Updates have unique stage 2 and partial bitstreams, just like UltraScale
    • But no clearing bitstream requirement
  – Bitstream generation is gated by a parameter so users understand the format change in 2018
When supported, users can pick any compatible stage 2 bitstream to complete the initial configuration, then reload with a different stage 2 bitstream to update the application
  – Minimizes the number of bitstreams to manage
  – Tandem PCIe is required
  – For the DMA version of the IP, the DMA itself will be reset as it is part of stage 2

Software Flow Details
Vivado UltraScale Solution Overview
IP defines physical fastboot region
  – Pblocks for floorplan generated as part of IP creation
  – Satellite Pblocks used for other first stage resources
  – Clocking and IO added for full functionality
Implementation determines total frameset for first stage
  – Routing-only frames inferred by tools
Two-pass configuration
  – Each frame of the device configured exactly once
    • Routing-only frames are configured in 1st stage, logic within reset in 2nd stage
  – All logic in design initialized immediately before it is active
Configuration IO banks must be active for first stage
  – PCIe reset pin must use standard input pin (can be in config bank or other)
  – Users can insert IO controls for second stage IO in first stage banks
    • Control signals connect to IP status pins for synchronized release

Vivado Tandem Floorplan
Green & yellow show stage 1, blue & yellow show stage 2
  – Partition pins (red) established within stage 2 region
  – Yellow frames configured with stage 1, reconfigured with stage 2
  – IP creates floorplan for both stages, implementation determines framesets
[Figure: device floorplan showing User Application, IO, PCIe, and CLK regions]

Tandem IP Core Modifications
How does the Tandem core differ from the standard PCIe IP?
Handshaking event used to identify stage 2 completion
  – Use to coordinate internal completion events
  – Once user app begins, core function releases internal “done” response
  – Flag is EOS pin on STARTUP module in UltraScale, from host in UltraScale+
Muxes placed on critical IP core inputs
  – Internal signals from user app are undriven after stage 1
  – Muxes ensure these inputs do not float, which could disrupt the PCIe design
  – mcap_design_switch enables connections from user app to IP when ready
Reduced functionality of PCIe core until stage 2 is configured
  – Holds off read/write requests until user app is ready for them
PCIe IP Core Generation
  • Set Advanced mode
  • Set MCAP-enabled PCIe instance
  • All configurations supported
  • 3 Xilinx IP supported:
    – PCIe Gen3
    – AXI Bridge for PCIe
    – DMA Subsystem
Simple user interface within the IP Catalog

Vivado Implementation Flow
IP core created with XDC constraints for Tandem

    set stage1Pblock [create_pblock pcie3_ultrascale_0_Stage1_main]
    add_cells_to_pblock $stage1Pblock [get_cells]
    resize_pblock $stage1Pblock -add {SLICE_X84Y0:SLICE_X100Y119 \
        ... (repeats for BRAM, DSP, GT_COMMON, etc.)
        PCIE_3_1_X0Y0}
    set_property HD.TANDEM 1 [get_cells]

IP constraints create “satellite” pblocks
  – Pulls critical elements into first stage definition
    • IO frames, clock resources, etc.
Follow the normal implementation flow
  – Integrate PCIe core RTL and constraints into User Application
  – Or implement the PIO Example Design
[Figure: flow from User Application RTL and Constraints plus PCIe Core RTL and Constraints, through Integrate User Design and PCIe, to Normal Synthesize and Implement]

Bitstream Generation
write_bitstream reports the number of bits in each stage
  – Calculate 1st stage configuration time and storage requirements

    INFO: [Vivado 12-2358] Enabled Tandem boot bitstream.
    Creating bitstream...
    Tandem stage1 bitstream contains 9840960 bits.
    Tandem stage2 bitstream contains 376792288 bits.

Tandem PCIe flow creates two explicit bit files
  – First .bit file stored in PROM for initial boot
  – Second .bin file stored in filesystem for PCIe load
User can control bitstream generation
  – set_property HD.TANDEM_BITSTREAMS separate|combined|none [current_design]
  – For UltraScale devices only, there is one Tandem IP, so there is no differentiation in PROM vs. PCIe until bitstream generation
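The storage side of the calculation mentioned under Bitstream Generation can be sketched in a few lines of Python. This is only an illustration, assuming the write_bitstream messages have been captured as text in exactly the form shown on the slide; the helper names stage_sizes and bits_to_mib are hypothetical and are not part of Vivado.

```python
import re

# Sample write_bitstream output, copied from the Bitstream Generation slide.
LOG = """\
INFO: [Vivado 12-2358] Enabled Tandem boot bitstream.
Creating bitstream...
Tandem stage1 bitstream contains 9840960 bits.
Tandem stage2 bitstream contains 376792288 bits.
"""

# Matches the per-stage size lines; the message format is assumed from the slide.
STAGE_RE = re.compile(r"Tandem (stage\d) bitstream contains (\d+) bits")


def stage_sizes(log_text):
    """Return {stage_name: size_in_bits} for every stage reported in the log."""
    return {name: int(bits) for name, bits in STAGE_RE.findall(log_text)}


def bits_to_mib(bits):
    """Convert a bit count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return bits / 8 / (1024 * 1024)


sizes = stage_sizes(LOG)
for stage, bits in sorted(sizes.items()):
    print(f"{stage}: {bits} bits = {bits // 8} bytes = {bits_to_mib(bits):.2f} MiB")
```

For Tandem PCIe, the stage 1 figure is what the PROM has to hold; for Tandem PROM, both stages are stored in the PROM.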
Tandem with Field Updates for UltraScale
UltraScale devices can update user application on the fly
  – The 2016.1 release provides full access for this feature
Field Updates for UltraScale IS NOT multiple stage 2 bitstreams for a fixed stage 1 bitstream
  – Field Updates IS partial reconfiguration of the majority of stage 2 region
  – Flash will not need to be updated as long as PCIe IP is not changed
  – Supports both Tandem PROM and Tandem PCIe flows
No PR license is necessary for Field Updates use case
  – Project and non-project flows supported, but example design is non-project (for now)
  – Because it is PR, use of “clearing” bit files is required (UltraScale only)
General PR after Tandem load is also supported
  – Either use Field Updates OR Tandem + PR, but not both in the same design
  – Reconfigure a smaller region within stage 2, or multiple independent regions

Improved Tandem with Field Updates Supported for UltraScale+
UltraScale+ devices can update user application on the fly
  – This release (2017.3) does NOT yet support multiple stage 2s – solution planned for 2018.1
Field Updates for UltraScale+ IS* multiple stage 2 bitstreams for a fixed stage 1 bitstream
  – Field Updates IS partial reconfiguration of the stage 2 region
  – Flash will not need to be updated as long as PCIe IP is not changed
  – Supports ONLY the Tandem PCIe flow for VU+, KU+; both Tandem flows on ZU+
No PR license is necessary for Field Updates use case
  – Project and non-project flows supported, but example design is non-project (for now)
  – No clearing bitstreams required, as it is PR of UltraScale+
General PR after Tandem load is also supported
  – Either use Field Updates OR Tandem + PR, but not both in the same design
  – Reconfigure a smaller region within stage 2, or multiple independent regions
* Will be, after it is released
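The difference in update sequencing between the two families can be sketched as a small Python helper. This is a hedged illustration, not Xilinx software: the function name field_update_sequence and the file naming scheme (<app>_clear.bin, <app>_partial.bin) are hypothetical, and the UltraScale+ branch reflects the 2017.3 behavior described in this deck (separate partial bitstreams, no clearing bitstream).

```python
def field_update_sequence(family, current_app, new_app):
    """Return the ordered list of bitstreams to load for one Field Update.

    family      -- "ultrascale" or "ultrascale+"
    current_app -- name of the user application currently loaded
    new_app     -- name of the user application to load next
    File names are hypothetical placeholders for illustration only.
    """
    if family == "ultrascale":
        # UltraScale: the clearing bitstream for the currently loaded
        # application precedes the new partial bitstream.
        return [f"{current_app}_clear.bin", f"{new_app}_partial.bin"]
    if family == "ultrascale+":
        # UltraScale+: no clearing bitstream is required.
        return [f"{new_app}_partial.bin"]
    raise ValueError(f"unsupported family: {family!r}")


print(field_update_sequence("ultrascale", "app0", "app1"))
print(field_update_sequence("ultrascale+", "app0", "app1"))
```

In both cases the PCIe link stays up during the update, so host software can deliver each file in order over PCIe.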
Tandem with Field Updates – Hierarchy
Top contains only two instantiations plus Bank 65 IO
User design is placed in the Update Region
  – Including all IO instantiated
Design structure supplied as IP example design
[Figure: KU040 hierarchy. Top (xilinx_pcie3_uscale_ep) contains the reconfigurable User Application Update Region (pcie_app_uscale), the PCIe IP (pcie3_ultrascale_0), and IO Bank 65]

Design Flow Summary
PCIe IP
  • IP generation options for Tandem Configuration flows
  • XDC contains Pblock constraints to floorplan PCIe core
  • User sets first stage IO bank details
Synthesis and P&R
  • Standard Vivado implementation flow
  • Implementation segments design automatically
  • No Partial Reconfiguration license necessary*
Bitstream
  • Bitstream programs two stages separately
  • write_bitstream creates single Tandem PROM bitstream or two bitstreams for Tandem PCIe
* PR license not required for Field Updates as long as delivered floorplan is not modified. General PR over PCIe or Tandem + PR (for smaller or multiple Reconfigurable Partitions) will require a PR license.

Additional Details

1st Stage Bitstream Size
First stage bitstream size depends on the:
  – IP: Number of frames included in the Tandem Area
    • Remember, x16 modes require 4 GT quads
  – Device: Global clock frames, width of device
  – Compression: Set by default to reduce bitstream size
  – Design: Has a minor impact in UltraScale
Size of 1st stage is about 1-2 MB, depending on device
  – UltraScale stage 1 bitstreams are much smaller than 7 series (percentage wise)
  – No difference between Tandem PROM and Tandem PCIe
    • Starting in 2015.1, there is just a single Tandem IP
  – Little variability in absolute sizes between devices
    • Ranges from 1 to 2 MB prior to compression
Timing Examples for Tandem PROM 1st Stage
Device: Virtex UltraScale VU095
Configuration Solution | Clock Frequency | Config Time for 1st stage bitstream | Config Time for standard bitstream
SPI | 100 MHz | 87.4 ms | 2734.5 ms
QSPI | 66 MHz | 33.1 ms | 1035.8 ms
BPI x16 Sync Mode | 50 MHz | 10.9 ms | 341.8 ms
BPI x16 Sync Mode | 80 MHz | 6.8 ms | 213.6 ms
Estimates based on 8.8 Mb 1st stage bitstream size
  – Exactly the same for Tandem PROM vs. Tandem PCIe
  – These numbers are without compression, and are therefore worst case
User must also add Tpor to timing budget
  – Tpor = 50 ms, or 35 ms for fast ramp rate

Tandem PCIe Software Details
Tandem PCIe Bitstream high level flow
  – User mode application and kernel mode driver required to send bitstream over PCIe to configure 2nd stage
  – Bitstream transmission is via 1DW Configuration Writes (PIO)
    • Configuration rate depends on many factors, including PCIe configuration, system latency, and response time for packet write completion
Kernel SW driver and User application targeting the VCU107 or KCU105 is available as an example
  – Delivery of software and documentation is via Answer Record 64761
  – Target Vendor ID 16’h10EE and Device ID 16’h8038
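The first stage timing budget from the Timing Examples slide can be worked through with a short Python sketch. It assumes the simplest model, load time = bits / (clock frequency × bus width), with SPI taken as 1 bit wide and QSPI as 4 bits wide (assumptions; the slide only states the width for BPI x16), and it checks only Tpor plus load time against the 120 ms requirement, ignoring any other contributions. The 9,840,960-bit stage 1 size is the value from the write_bitstream example in this deck, not the VU095 figure behind the table, and the function names are hypothetical.

```python
def config_time_ms(bitstream_bits, clock_hz, bus_width_bits):
    """Time to shift a bitstream in, in milliseconds (no protocol overhead)."""
    return bitstream_bits / (clock_hz * bus_width_bits) * 1000.0


def fits_budget(load_ms, t_por_ms=50.0, budget_ms=120.0):
    """Simplified check: power-on reset time plus stage 1 load time vs. budget."""
    return t_por_ms + load_ms <= budget_ms


# Stage 1 size from the write_bitstream example (uncompressed).
STAGE1_BITS = 9_840_960

# (label, clock in Hz, bus width in bits); SPI/QSPI widths are assumptions.
INTERFACES = [
    ("SPI x1, 100 MHz", 100e6, 1),
    ("QSPI x4, 66 MHz", 66e6, 4),
    ("BPI x16, 50 MHz", 50e6, 16),
    ("BPI x16, 80 MHz", 80e6, 16),
]

for name, clock_hz, width in INTERFACES:
    load_ms = config_time_ms(STAGE1_BITS, clock_hz, width)
    status = "fits" if fits_budget(load_ms) else "exceeds"
    print(f"{name}: {load_ms:.1f} ms load + 50 ms Tpor {status} the 120 ms budget")
```

Passing t_por_ms=35.0 to fits_budget models the fast ramp rate case; compression, which is enabled by default, would shorten the load times further.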
Silicon & Design Considerations
Hardware Considerations
  – When first stage IO banks become active, all IO in those banks are alive
    • Second stage IO in those banks are active and outputs float until second stage completes
      – Users can insert OBUFT or mux to drive Z or constant until second stage is done
    • Second stage IO in unconfigured banks will pull high until second stage is done
      – Use PUDC_B to remove these pullups
  – All GTs in quad are consumed even when x2 or x1 selected in UltraScale
    • Initialization granularity is per quad in UltraScale
  – For Tandem PROM, persist is required for all architectures
    • Dual-mode configuration pins cannot be used as user IO
Changing floorplan or constraints
  – Work with Xilinx support if the cores do not meet your needs out of the box
Additional considerations for Field Updates
  – Consult PG156 for complete details

Future Enhancements
After PCIe work is complete in Vivado, the Tandem solution may be opened up for more general use
  – Third party PCIe cores could take advantage of this approach
  – Configuration over different protocols/interfaces could be supported
    • Third party IP
    • CAN FD
    • SRIO
    • Ethernet
  – Software approach is the same, key is testing and documentation
    • Allow users to apply this approach, but guide them to safe practices
  – Timetable beyond Xilinx PCIe IP has not yet been established
    • Send requests and customer details to the Tandem Configuration team

Documentation
Tandem Configuration documented in PCIe IP Product Guides
  – PG054 for 7 series Gen2 PCIe IP
  – PG023 for Virtex-7 Gen3 PCIe IP
  – PG156 for UltraScale Gen3 PCIe IP
  – PG213 for UltraScale+ Gen4 PCIe IP
  – PG194 and PG195 send users back to PG156 and PG213 for complete details
QuickTake Videos review overall solution
  – UltraScale and UltraScale with Field Updates
Summary
Tandem PROM
  – Single bitstream divided into two stages with intermediate FPGA Startup
  – Load from single PROM device
Tandem PCIe
  – Two bitstreams for the two configuration stages
  – Load first from PROM, second over PCIe link
Tandem with Field Updates
  – Load first stage via Tandem PROM or Tandem PCIe
  – Use Partial Reconfiguration to dynamically swap vast majority of stage 2 design
Vivado Design Flow
  – Vivado solution handles intersection of silicon and design requirements
  – Automated scripts manage unique IP requirements

Appendix

FAQ Page 1 of 2
Should I be concerned about the 120 ms requirement for PCI Express?
  – If your design is an add-in card endpoint, intended to interoperate with systems available on the open market, then you will likely need to comply with the requirement. If your design is an embedded system and you have full control of the reset, then you likely do not need to comply.
Is Partial Reconfiguration required? Will my customer need a PR license?
  – Most Tandem flows do not require a Partial Reconfiguration license. The first and second stages of a Tandem bitstream pair are two parts of a single whole and do not use PR. The Tandem with Field Updates capability bypasses the license check even though it processes the design and creates partial bitstreams using PR. Only a more general Tandem + PR solution, where the user can modify the hierarchy and floorplan, will require a PR license.
Are any configuration options or strategies prohibited?
  – Thus far in testing, one strict requirement is that Persist is needed for Tandem PROM. Bitstream features such as compression fallback look good in testing. Encryption is supported for both Tandem PROM and Tandem PCIe.
Why is the first stage bitstream size not fixed?
  – The frames required for the first stage will vary depending on the area group range for the IP as well as other logic included, and must also include clock frames and others, as determined by software. The first stage bitstream composition will vary from design to design.

FAQ Page 2 of 2
What about soft-IP PCIe cores from 3rd parties?
  – The technology developed for Tandem PROM/PCIe is applicable to soft-IP PCI Express cores. Xilinx will engage with 3rd party partners to enable support for this feature in the future.
What PROMs can I use?
  – Tandem PROM/PCIe puts no restrictions on PROM types. As long as the PROM is supported in general, it will support Tandem PROM/PCIe. However, users must ensure that the selected PROM device will meet configuration time specifications, if that is a goal. For larger devices, BPI flash running at 50MHz+ will still be needed to configure in less than 120ms.
When will <my_device> be supported?
  – As of 2016.1, all UltraScale devices are supported
  – As of 2017.3, all UltraScale+ devices (except for VU+ HBM and Zynq RFSoC) are supported for the base AXI stream core
    • Remaining devices and remaining combinations (Field Updates, DMA) are planned for 2018.1
Can I use the Tandem approach for other protocols such as Ethernet?
  – Eventually, yes. We are starting with PCIe to meet specific customer demand, but a longer term goal is to open this approach for more use cases. Software DRCs and IP-specific enhancements will be needed for a safe working environment. We are engaging with a few key users and market segments right now.