Tandem Configuration (PROM PCIe) for UltraScale and UltraScale+

Tandem Configuration
for UltraScale and UltraScale+
Vivado 2017.3, October 2017
© Copyright 2017 Xilinx
.
Tandem Configuration Addresses Two PCIe & Configuration Requirements
1. PCIe Specification Requirement
– Need 120 ms PCIe response time for enumeration in open system
2. Cost Reduction
– Want small & inexpensive flash (QSPI, BPI) for configuration
- or -
– Use flash or Memory already present in the system
Solution:
– Split configuration into two stages (Tandem)
• 1st: Configure the PCIe interface and clocking
• 2nd: Configure the remainder of FPGA
[Figure labels: Tandem Bitstream, PCIe, User App]
PCI Express Configuration Solution Summary
Solution | Tandem PROM | Tandem PCIe | Tandem with Field Updates | Tandem with Partial Reconfig
Device Support | 7 Series, Zynq, UltraScale, US+ | 7 Series, UltraScale, US+ | UltraScale, US+ | UltraScale, US+
Delivery | IP Catalog | IP Catalog | IP Catalog | IP Catalog
Complexity | Moderate | Moderate | Advanced | Advanced
System Solutions | 120ms PCIe config | 120ms PCIe config; PROM size reduced | 120ms PCIe config; PROM size reduced; Field Updates | 120ms PCIe config; PROM size reduced; Flexible reconfiguration
PR license required? | No | No | No | Yes
Xilinx PCIe IP cores supported:
– UltraScale PCI Express Gen3 Integrated Block (streaming) for UltraScale – PG156
– AXI Bridge for PCI Express for UltraScale – PG194
– DMA Subsystem for PCI Express for UltraScale and UltraScale+ – PG195
– PCI Express Gen4 Integrated Block for UltraScale+ – PG213
* Always use the most recent software version available
Tandem PROM vs. Tandem PCIe in UltraScale/+
Solution | Tandem PROM | Tandem PCIe
Helps meet 120ms spec | Yes | Yes
Reduces PROM size | No | Yes
Stage 1 bitstream size | 1-2 MB | 1-2 MB
IP Core modifications | Minor | Minor
Field Update Support | UltraScale only | UltraScale and UltraScale+
UltraScale supports additional features beyond 7 Series
– Field Updates of the user application while PCIe stays up
• Reconfigurable region is vast majority of stage 2 region – Tandem followed by PR
• Combination of Tandem Configuration and Partial Reconfiguration is possible
– Dedicated links to configuration engine and PERST (US only) reduce frame count
• Tandem PROM and Tandem PCIe utilize the same IP (US only)
• Absolute maximum bitstream size will be similar across all devices
– Required PCIe location is X0Y0, with associated transceivers, on monolithic devices
• For SSI devices, location is bottom right corner of master SLR; XY value varies
UltraScale+ Support Rollout
Most devices are in production in 2017.3
– AXI Stream core supports all devices (with PCIe hard blocks)
• Except VU+ HBM and ZU+ RFSoC parts – those will arrive as beta in 2018
– DMA core supports six devices; remainder arriving in 2018.1
Improvements compared to UltraScale
– Support of all configurations up to Gen4x8, Gen3x16
– No clearing bitstream required for Field Updates (PR)
– Plan to support “multiple stage 2 images” for a fixed stage 1
• Not yet supported in 2017.3
Must select MCAP-enabled PCIe hard block
– Required PCIe location varies by device, see PG213 or PG195 for list
– No dedicated reset pin, but bank 65 is recommended for greatest efficiency
Customer Application
BittWare creates FPGA Platforms for HPC, Network Packet Processing and Signal Processing Applications
– COTS PCIe platforms built with Kintex UltraScale or Virtex UltraScale devices
– Customizable to meet customer needs
• Up to 4 PCIe Gen3 x8 interfaces
• Variety of interfaces for high-speed serial I/O
• Wide range of memory interfaces, optional HMC module
Tandem PROM is available as an optional configuration scheme
– Allows board to seamlessly plug into open systems
– Exploring expanded functionality with Field Updates soon
[Figure: BittWare XUSP3R]
Customer Application
ZDS builds signal acquisition/generation cards
– Multiple cards in the product family use Kintex UltraScale devices
– End user adds their own application details and compiles through Vivado
Tandem PCIe allows the cards to comply with the 120 ms requirement in any environment
– Cards designed to fit into off-the-shelf server and desktop computers
– Supports PCIe Gen2 x8 or Gen3 x4 interfaces
Configuration Flow Options
UltraScale Configuration Options for PCIe
Option | Tandem | Tandem with Field Updates | PR Over PCIe
120ms Guarantee | Yes | Yes | None (standard configuration)
Vivado Flow | Project; Non-project | Project*; Non-project | Project*; Non-project
Initial Configuration | Tandem PROM; Tandem PCIe | Tandem PROM; Tandem PCIe | Standard configuration
Updates after Initial Configuration | Not Available | Via PCIe | Via PCIe
First stage bitstreams are not compatible
* See UG909 for Project mode details
Tandem PROM in UltraScale and UltraScale+
[System diagram labels: CPU, Other IO (system dependent), Root Complex, Memory, PCIe Links, PCIe Switch (optional), Endpoint (FPGA) x3, PROM, PROM (2nd), Design #1]
Tandem PROM
• 120ms – Compliance
• 1st Stage Loads from PROM/Flash
• PCIe Activates
• 2nd Stage Loads from PROM/Flash
• FPGA starts to operate
Tandem PCIe in UltraScale and UltraScale+
[System diagram labels: CPU, Other IO (system dependent), Root Complex, Memory, PCIe Links, PCIe Switch (optional), Endpoint (FPGA) x3, PROM, Design #1]
Tandem PCIe
• 120ms – Compliance
• Remote bitstream – Security, BOM Cost
• 1st Stage from PROM/Flash
• 2nd Stage loaded over PCIe link
Tandem with Field Updates in UltraScale
[System diagram labels: CPU, Other IO (system dependent), Root Complex, Memory, PCIe Links, PCIe Switch (optional), Endpoint (FPGA) x3, PROM, Design #1, Design #1 Clear, Design #2]
Tandem with Field Updates
• 120ms – Compliance
• Initial load via Tandem PROM or Tandem PCIe
• FPGA updates over PCIe
• Must load “clear” bitstreams
• PCIe bus stays up
PR over PCIe in UltraScale and UltraScale+
[System diagram labels: CPU, Other IO (system dependent), Root Complex, Memory, PCIe Links, PCIe Switch (optional), Endpoint (FPGA) x3, PROM, Design #1, Region #1 Clear, Region #1 Design]
PR over PCIe
• Allows for PR customers to load PR regions over PCIe
• Not guaranteed for 120ms (standard configuration used)
• Customers responsible for isolating PCIe core during PR
• PCIe Isolation mux controlled by system software
Tandem with Field Updates in UltraScale
Pre-defined use case of Tandem + PR
Both technologies permitted in the same design in UltraScale
– Field Updates is Partial Reconfiguration for a specific use case and pre-defined floorplan
Configuration events should be considered independent
1. Two stage Tandem Configuration occurs (via PROM or PCIe)
2. Partial Reconfiguration is done (via PCIe or any config port)
• Clearing bitstream precedes new partial bitstream
[Figure: UltraScale FPGA startup and update sequence. Labels: PROM, CFG PORT, FPGA Startup, Stage 1 (PCIe), Stage 2 User App 0, PCIe static frames, PCIe link, Clear 0 / Clear 1 / Clear 2, Partial User App 0 / 1 / 2]
Tandem with Field Updates in UltraScale+
Planned for release in Vivado 2018.1
New features in silicon improve the solution
Events are independent, but bitstreams are consolidated
1. Two stage Tandem Configuration occurs (via Tandem PCIe only)
• Any compatible Stage 2 bitstream can be used
2. Partial Reconfiguration is done (via PCIe or any config port)
• Using the same set of Stage 2 bitstreams – these ARE partial bitstreams (and no clearing!)
[Figure: UltraScale+ FPGA startup and update sequence. Labels: PROM, CFG PORT, FPGA Startup, Stage 1 (PCIe), PCIe link, Stage 2 / Partial User App 0 / 1 / 2]
Tandem with Field Updates UltraScale+ Status
Vivado 2017.3 does not include Multiple Stage Two bitstreams
– Field Updates have unique stage 2 and partial bitstreams, just like UltraScale
• But no clearing bitstream requirement
– Bitstream generation is gated by a parameter so users understand the format change in 2018
When supported, users can pick any compatible stage 2 bitstream to complete the initial configuration, then reload with a different stage 2 bitstream to update the application
– Minimizes the number of bitstreams to manage
– Tandem PCIe is required
– For the DMA version of the IP, the DMA itself will be reset as it is part of stage 2
Software Flow Details
Vivado UltraScale Solution Overview
IP defines physical fastboot region
– Pblocks for floorplan generated as part of IP creation
– Satellite Pblocks used for other first stage resources – Clocking and IO added for full functionality
Implementation determines total frameset for first stage
– Routing-only frames inferred by tools
Two-pass configuration
– Each frame of the device configured exactly once
• Routing-only frames are configured in 1st stage, logic within reset in 2nd stage
– All logic in design initialized immediately before it is active
Configuration IO banks must be active for first stage
– PCIe reset pin must use standard input pin (can be in config bank or other)
– Users can insert IO controls for second stage IO in first stage banks
• Control signals connect to IP status pins for synchronized release
Vivado Tandem Floorplan
Green & yellow show stage 1, blue & yellow show stage 2
– Partition pins (red) established within stage 2 region
– Yellow frames configured with stage 1, reconfigured with stage 2
– IP creates floorplan for both stages, implementation determines framesets
[Floorplan figure labels: User Application, IO, PCIe, CLK]
Tandem IP Core Modifications
How does the Tandem core differ from the standard PCIe IP?
Handshaking event used to identify stage 2 completion
– Use to coordinate internal completion events
– Once user app begins, core function releases internal “done” response
– Flag is EOS pin on STARTUP module in UltraScale, from host in UltraScale+
Muxes placed on critical IP core inputs
– Internal signals from user app are undriven after stage 1 – muxes ensure these inputs do not float, which could disrupt the PCIe design
– mcap_design_switch enables connections from user app to IP when ready
Reduced functionality of PCIe core until stage 2 is configured
– Holds off read/write requests until user app is ready for them
PCIe IP Core Generation
Set Advanced mode
Set MCAP-enabled PCIe instance
All configurations supported
3 Xilinx IP supported:
• PCIe Gen3
• AXI Bridge for PCIe
• DMA Subsystem
Simple user interface within the IP Catalog
Vivado Implementation Flow
IP core created with XDC constraints for Tandem
set stage1Pblock [create_pblock pcie3_ultrascale_0_Stage1_main]
add_cells_to_pblock $stage1Pblock [get_cells]
resize_pblock $stage1Pblock -add {SLICE_X84Y0:SLICE_X100Y119 \
... (repeats for BRAM, DSP, GT_COMMON, etc.)
PCIE_3_1_X0Y0}
set_property HD.TANDEM 1 [get_cells]
IP constraints create “satellite” pblocks
– Pulls critical elements into first stage definition
• IO frames, clock resources, etc.
Follow the normal implementation flow
– Integrate PCIe core RTL and constraints into User Application
– Or implement the PIO Example Design
[Flow diagram: User Application RTL and Constraints + PCIe Core RTL and Constraints -> Integrate User Design and PCIe -> Normal Synthesize and Implement]
Bitstream Generation
write_bitstream reports the number of bits in each stage
– Calculate 1st stage configuration time and storage requirements
INFO: [Vivado 12-2358] Enabled Tandem boot bitstream.
Creating bitstream...
Tandem stage1 bitstream contains 9840960 bits.
Tandem stage2 bitstream contains 376792288 bits.
Tandem PCIe flow creates two explicit bit files
– First .bit file stored in PROM for initial boot
– Second .bin file stored in filesystem for PCIe load
User can control bitstream generation
– set_property HD.TANDEM_BITSTREAMS separate|combined|none [current_design]
– For UltraScale devices only, there is one Tandem IP, so there is no differentiation in PROM vs. PCIe until bitstream generation
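The per-stage bit counts reported by write_bitstream can be turned into storage figures. The sketch below is illustrative only: it parses the two sample report lines quoted on this slide and estimates how much flash each flow needs, assuming Tandem PROM stores both stages in flash and Tandem PCIe stores only stage 1. The regular expression is inferred from the sample text alone, the function names are hypothetical, and file headers and compression are ignored.

```python
import re

# Sample write_bitstream output quoted on this slide.
LOG = """\
Tandem stage1 bitstream contains 9840960 bits.
Tandem stage2 bitstream contains 376792288 bits.
"""

# Pattern inferred from the sample lines only; other Vivado versions may differ.
STAGE_RE = re.compile(r"Tandem stage(\d) bitstream contains (\d+) bits")


def stage_bits(log_text):
    """Return {stage_number: bit_count} for every stage line found in the log."""
    return {int(stage): int(bits) for stage, bits in STAGE_RE.findall(log_text)}


def flash_bytes(bits_by_stage, flow):
    """Estimate flash bytes: 'prom' stores both stages, anything else stage 1 only."""
    stages = (1, 2) if flow == "prom" else (1,)
    total_bits = sum(bits_by_stage[s] for s in stages)
    return (total_bits + 7) // 8  # round up to whole bytes


bits = stage_bits(LOG)
print("Tandem PCIe flash bytes:", flash_bytes(bits, "pcie"))
print("Tandem PROM flash bytes:", flash_bytes(bits, "prom"))
```

For the sample log this gives roughly 1.2 MB of flash for Tandem PCIe, in line with the 1-2 MB stage 1 size quoted elsewhere in this deck.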
Tandem with Field Updates for UltraScale
UltraScale devices can update user application on the fly
– The 2016.1 release provides full access for this feature
Field Updates for UltraScale IS NOT multiple stage 2 bitstreams for a fixed stage 1 bitstream
– Field Updates IS partial reconfiguration of the majority of stage 2 region
– Flash will not need to be updated as long as PCIe IP is not changed
– Supports both Tandem PROM and Tandem PCIe flows
No PR license is necessary for Field Updates use case
– Project and non-project flows supported, but example design is non-project (for now)
– Because it is PR, use of “clearing” bit files is required (UltraScale only)
General PR after Tandem load is also supported
– Either use Field Updates OR Tandem + PR, but not both in the same design
– Reconfigure a smaller region within stage 2, or multiple independent regions
Improved Tandem with Field Updates
Supported for UltraScale+
UltraScale+ devices can update user application on the fly
– This release (2017.3) does NOT yet support multiple stage 2s – solution planned for 2018.1
Field Updates for UltraScale+ IS* multiple stage 2 bitstreams for a fixed stage 1 bitstream
– Field Updates IS partial reconfiguration of the stage 2 region
– Flash will not need to be updated as long as PCIe IP is not changed
– Supports ONLY the Tandem PCIe flow for VU+, KU+; both Tandem flows on ZU+
No PR license is necessary for Field Updates use case
– Project and non-project flows supported, but example design is non-project (for now)
– No clearing bitstreams required, as it is PR of UltraScale+
General PR after Tandem load is also supported
– Either use Field Updates OR Tandem + PR, but not both in the same design
– Reconfigure a smaller region within stage 2, or multiple independent regions
* Will be, after it is released
Tandem with Field Updates – Hierarchy
Top contains only two instantiations plus Bank 65 IO
User design is placed in the Update Region
– Including all IO instantiated
Design structure supplied as IP example design
[Hierarchy diagram (KU040): Top = xilinx_pcie3_uscale_ep; Reconfigurable User Application (Update Region) = pcie_app_uscale; PCIe IP = pcie3_ultrascale_0; IO Bank 65]
Design Flow Summary
PCIe IP
• IP generation options for Tandem Configuration flows
• XDC contains Pblock constraints to floorplan PCIe core
• User sets first stage IO bank details
Synthesis and P&R
• Standard Vivado implementation flow
• Implementation segments design automatically
• No Partial Reconfiguration license necessary*
Bitstream
• Bitstream programs two stages separately
• write_bitstream creates single Tandem PROM bitstream or two bitstreams for Tandem PCIe
* PR license not required for Field Updates as long as delivered floorplan is not modified.
General PR over PCIe or Tandem + PR (for smaller or multiple Reconfigurable Partitions) will require a PR license.
Additional Details
1st Stage Bitstream Size
First stage bitstream size depends on the:
– IP: Number of frames included in the Tandem Area
• Remember, x16 modes require 4 GT quads
– Device: Global clock frames, width of device
– Compression: Set by default to reduce bitstream size
– Design: Has a minor impact in UltraScale
Size of 1st stage is about 1-2 MB, depending on device
– UltraScale stage 1 bitstreams are much smaller than 7 series (percentage wise)
– No difference between Tandem PROM and Tandem PCIe
• Starting in 2015.1, there is just a single Tandem IP
– Little variability in absolute sizes between devices
• Ranges from 1 to 2 MB prior to compression
Timing Examples for Tandem PROM 1st Stage
Device: Virtex UltraScale VU095
Configuration Solution | Clock Frequency | Config Time for 1st stage bitstream | Config Time for standard bitstream
SPI | 100 MHz | 87.4 ms | 2734.5 ms
QSPI | 66 MHz | 33.1 ms | 1035.8 ms
BPI x16 Sync Mode | 50 MHz | 10.9 ms | 341.8 ms
BPI x16 Sync Mode | 80 MHz | 6.8 ms | 213.6 ms
Estimates based on 8.8 Mb 1st stage bitstream size
– Exactly the same for Tandem PROM vs. Tandem PCIe
– These numbers are without compression, and are therefore worst case
User must also add Tpor to timing budget
– Tpor = 50 ms, or 35 ms for fast ramp rate
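The timing entries above follow from dividing the bitstream size by the interface throughput, then adding Tpor. The sketch below is a rough worked example under stated assumptions: one bit is transferred per data line per clock cycle (x1 for SPI, x4 for QSPI, x16 for BPI), and the stage 1 size is taken as 8.74 million bits because that value reproduces the 87.4 ms SPI entry (the slide itself quotes 8.8 Mb). The function names and the 120 ms default budget are illustrative choices, not part of any Xilinx tool.

```python
def config_time_ms(bitstream_bits, clock_hz, bus_width_bits):
    """Time to shift a bitstream in, assuming bus_width_bits bits move per clock."""
    return bitstream_bits / (clock_hz * bus_width_bits) * 1e3


def meets_budget(config_ms, tpor_ms=50.0, budget_ms=120.0):
    """True if power-on reset time plus configuration time fits within the budget."""
    return tpor_ms + config_ms <= budget_ms


STAGE1_BITS = 8_740_000  # assumption chosen to match the SPI row of the table

# (name, clock in Hz, data bus width in bits)
INTERFACES = [
    ("SPI", 100e6, 1),
    ("QSPI", 66e6, 4),
    ("BPI x16", 50e6, 16),
    ("BPI x16", 80e6, 16),
]

for name, clock_hz, width in INTERFACES:
    t = config_time_ms(STAGE1_BITS, clock_hz, width)
    print(f"{name} at {clock_hz / 1e6:.0f} MHz: {t:.1f} ms, "
          f"fits 120 ms with Tpor = 50 ms: {meets_budget(t)}")
```

Under these assumptions the x1 SPI case exceeds 120 ms once the 50 ms Tpor is added, while the QSPI and BPI cases fit, which is why the choice of flash interface matters even for the small first stage.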
Tandem PCIe Software Details
Tandem PCIe Bitstream high level flow
– User mode application and kernel mode driver required to send bitstream over PCIe to
configure 2nd stage
– Bitstream transmission is via 1DW Configuration Writes (PIO)
• Configuration rate depends on many factors, including PCIe configuration, system latency, and
response time for packet write completion
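Because the second stage is sent as 1DW (32-bit) writes, the number of writes scales directly with bitstream size. The sketch below counts the writes for the stage 2 size reported in the write_bitstream example earlier in this deck and scales that by a per-write latency. The latency value is purely hypothetical; as noted above, the real rate depends on the PCIe configuration, system latency, and write completion response time.

```python
def dword_write_count(bitstream_bits):
    """Number of 32-bit (1DW) writes needed, rounding up a partial final word."""
    return (bitstream_bits + 31) // 32


def estimated_load_seconds(bitstream_bits, seconds_per_write):
    """Rough load time if every write takes the same, fixed amount of time."""
    return dword_write_count(bitstream_bits) * seconds_per_write


STAGE2_BITS = 376792288           # stage 2 size from the write_bitstream sample
ASSUMED_SECONDS_PER_WRITE = 1e-6  # hypothetical figure; measure on the target system

print("1DW writes needed:", dword_write_count(STAGE2_BITS))
print("Estimated load time (s):",
      estimated_load_seconds(STAGE2_BITS, ASSUMED_SECONDS_PER_WRITE))
```

Measuring the actual per-write time on the target host and substituting it for the assumed value gives a first-order estimate of stage 2 load time.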
A kernel SW driver and user application targeting the VCU107 or KCU105 are available as an example
– Delivery of software and documentation is via Answer Record 64761
– Target Vendor ID 16’h10EE and Device ID 16’h8038
Silicon & Design Considerations
Hardware Considerations
– When first stage IO banks become active, all IO in those banks are alive
• Second stage IO in those banks are active and outputs float until second stage completes
– Users can insert OBUFT or mux to drive Z or constant until second stage is done
• Second stage IO in unconfigured banks will pull high until second stage is done
– Use PUDC_B to remove these pullups
– All GTs in quad are consumed even when x2 or x1 selected in UltraScale
• Initialization granularity is per quad in UltraScale
– For Tandem PROM, persist is required for all architectures
• Dual-mode configuration pins cannot be used as user IO
Changing floorplan or constraints
– Work with Xilinx support if the cores do not meet your needs out of the box
Additional considerations for Field Updates
– Consult PG156 for complete details
Future Enhancements
After PCIe work is complete in Vivado, the Tandem solution may be
opened up for more general use
– Third party PCIe cores could take advantage of this approach
– Configuration over different protocols/interfaces could be supported
• Third party IP
• CAN FD
• SRIO
• Ethernet
– Software approach is the same, key is testing and documentation
• Allow users to apply this approach, but guide them to safe practices
– Timetable beyond Xilinx PCIe IP has not yet been established
• Send requests and customer details to the Tandem Configuration team
Documentation
Tandem Configuration documented in PCIe IP Product Guides
– PG054 for 7 series Gen2 PCIe IP
– PG023 for Virtex-7 Gen3 PCIe IP
– PG156 for UltraScale Gen3 PCIe IP
– PG213 for UltraScale+ Gen4 PCIe IP
– PG194 and PG195 send users back to PG156 and PG213 for complete details
QuickTake Videos review overall solution
– UltraScale and UltraScale with Field Updates
Summary
Tandem PROM
– Single bitstream divided into two stages with intermediate FPGA Startup
– Load from single PROM device
Tandem PCIe
– Two bitstreams for the two configuration stages
– Load first from PROM, second over PCIe link
Tandem with Field Updates
– Load first stage via Tandem PROM or Tandem PCIe
– Use Partial Reconfiguration to dynamically swap vast majority of stage 2 design
Vivado Design Flow
– Vivado solution handles intersection of silicon and design requirements
– Automated scripts manage unique IP requirements
Appendix
FAQ
Page 1 of 2
Should I be concerned about the 120 ms requirement for PCI Express?
– If your design is an add-in card endpoint intended to interoperate with systems available on the open market, then you will likely need to comply with the requirement. If your design is an embedded system and you have full control of the reset, then you likely do not need to comply.
Is Partial Reconfiguration required? Will my customer need a PR license?
– Most Tandem flows do not require a Partial Reconfiguration license. The first and second stages of a Tandem bitstream
pair are two parts to a single whole and do not use PR. The Tandem with Field Updates capability bypasses the license
check even though it processes the design and creates partial bitstreams using PR. Only a more general Tandem + PR
solution, where the user can modify the hierarchy and floorplan, will require a PR license.
Are any configuration options or strategies prohibited?
– Thus far in testing, one strict requirement is that Persist is needed for Tandem PROM. Bitstream features such as
compression fallback look good in testing. Encryption is supported for both Tandem PROM and Tandem PCIe.
Why is the first stage bitstream size not fixed?
– The frames required for the first stage will vary depending on the area group range for the IP as well as other logic
included, and must also include clock frames and others, as determined by software. The first stage bitstream
composition will vary from design to design.
FAQ
Page 2 of 2
What about soft-IP PCIe cores from 3rd parties?
– The technology developed for Tandem PROM/PCIe is applicable to soft-IP PCI Express cores. Xilinx will engage with 3rd
party partners to enable support for this feature in the future.
What PROMs can I use?
– Tandem PROM/PCIe puts no restrictions on PROM types. As long as the PROM is supported in general, it will support
Tandem PROM/PCIe. However, users must ensure that the selected PROM device will meet configuration time
specifications, if that is a goal. For larger devices, BPI flash running at 50MHz+ will still be needed to configure in less
than 120ms.
When will <my_device> be supported?
– As of 2016.1, all UltraScale devices are supported
– As of 2017.3, all UltraScale+ devices (except for VU+ HBM and Zynq RFSoC) are supported for the base AXI stream core
• Remaining devices and remaining combinations (Field Updates, DMA) are planned for 2018.1
Can I use the Tandem approach for other protocols such as Ethernet?
– Eventually, yes. We are starting with PCIe to meet specific customer demand, but a longer term goal is to open this
approach for more use cases. Software DRCs and IP-specific enhancements will be needed for a safe working
environment. We are engaging with a few key users and market segments right now.