QorIQ T4240 Communications Processor Deep Dive

FTF-NET-F0031
Sam Siu & Feras Hamdan
APR.2014
TM
External Use
Agenda
• QorIQ T4240 Communications Processor Overview
• e6500 Core Enhancement
• Memory Subsystem and MMU Enhancement
• QorIQ Power Management Features
• HiGig Interface
• Interlaken Interface
• PCI Express® Gen 3 Interfaces (SR-IOV)
• Serial RapidIO® Manager (RMan)
• Data Path Acceleration Architecture Enhancements
  − mEMAC
  − Offline Ports and Use Case
  − Storage Profiling
  − Data Center Bridging (FMan and QMan)
  − Accelerators: SEC, DCE, PME
• Debug
QorIQ T4240 Communications Processor
[Block diagram: 12 dual-threaded e6500 Power Architecture cores in 3 clusters of 4, each core with 32 KB I-cache and 32 KB D-cache, each cluster with a 2 MB banked L2; 3× 512 KB CoreNet platform caches; 3× 64-bit DDR3/3L memory controllers; CoreNet™ coherency fabric; Security Fuse Processor and Security Monitor; 2 Frame Managers (parse, classify, distribute) with 1G and 1/10G MACs, HiGig, and DCB; Buffer and Queue Managers; SEC, Pattern Match Engine, DCE, and RMan accelerators; PAMUs (Peripheral Access Management Units); 2× 16-lane 10 GHz SerDes; real-time debug (watchpoint cross trigger, perf monitor, CoreNet trace, Aurora); power management; peripherals: 2× sRIO, 4× PCIe, Interlaken LA, 2× SATA 2.0, 2× USB 2.0 w/PHY, 3× DMA, 2× I2C, 2× DUART, SD/MMC, IFC, SPI, GPIO]

• Data Path Acceleration
  − SEC: crypto acceleration, 40 Gbps
  − PME: Reg-ex Pattern Matcher, 10 Gbps
  − DCE: Data Compression Engine, 20 Gbps
• Power targets
  − ~54 W thermal max at 1.8 GHz
  − ~42 W thermal max at 1.5 GHz
• Device
  − TSMC 28 HPM process
  − 1932-pin BGA package, 42.5 x 42.5 mm, 1.0 mm pitch
Processor
• 12x e6500, 64-bit, up to 1.8 GHz
• Dual-threaded, with 128-bit AltiVec engine
• Arranged as 3 clusters of 4 CPUs, with 2 MB L2 per cluster; 256 KB per thread
Memory Subsystem
• 1.5 MB CoreNet platform cache w/ECC
• 3x DDR3 controllers up to 1.87 GHz
• Each with up to 1 TB addressability (40-bit physical addressing)
CoreNet Switch Fabric
High-speed Serial IO
• 4 PCIe controllers, with Gen3
  − SR-IOV support
• 2 sRIO controllers
  − Type 9 and 11 messaging
  − Interworking to DPAA via RMan
• 1 Interlaken Look-Aside at up to 10 GHz
• 2 SATA 2.0 3Gb/s
• 2 USB 2.0 with PHY
Network IO
• 2 Frame Managers, each with:
  − Up to 25 Gbps parse/classify/distribute
  − 2x10GE, 6x1GE
  − HiGig, Data Center Bridging support
  − SGMII, QSGMII, XAUI, XFI
e6500 Core Enhancement
e6500 Core Complex
[Diagram: cluster of four dual-threaded e6500 cores, each with an AltiVec unit, PMC, and 32 KB L1 caches, sharing a 2 MB banked L2 and a CoreNet interface]

High Performance
• 64-bit Power Architecture® technology
• Up to 1.8 GHz operation
• Two threads per core
• Dual load/store units, one per thread
• 40-bit Real Address — 1 Terabyte physical address space
• 2 MB 16-way shared L2 cache, 4 banks
  − Hardware table walk
  − L2 in cluster of 4 cores
  − Supports sharing across the cluster
  − Supports L2 memory allocation to core or thread
• CoreNet interface
  − 40-bit address bus, 256-bit read & write data busses
  − CoreNet double data processor port
• Energy-efficient power management
CoreMark           P4080 (1.5 GHz)   T4240 (1.8 GHz)   from P4080
Single Thread      4708              7828              1.7x
Core (dual T)      4708              15,656            3.3x
SoC                37,654            187,873           5.0x
DMIPS/Watt (typ)   2.4               5.1               2.1x
Energy-efficient power management (improvement over traditional modes)
• Drowsy: Core, Cluster, AltiVec engine
• Wait-on-reservation instruction
AltiVec SIMD Unit (128b)
• 8-, 16-, 32-bit signed/unsigned integer
• 32-bit floating-point
• 8-, 16-, 32-bit Boolean
• 173 GFLOPS (1.8 GHz)
Improve Productivity with Core Virtualization
• Hypervisor
• Logical to Real Address (LRAT) translation mechanism for improved hypervisor performance
General Core Enhancements
• Improved branch prediction and additional link stack entries
• Pipeline improvements:
  − LR, CTR, mfocrf optimization (LR and CTR are renamed)
  − 16-entry rename/completion buffer
• New debug features:
  − Ability to allocate individual debug events between the internal and external debuggers
  − More IAC events
• Performance monitor
  − Many more events, six counters per thread
  − Guest performance monitor interrupt
• Private vs. shared state registers and other architected state
  − Shared between threads:
    · There is only one copy of the register or architected state
    · A change in one thread affects the other thread if the other thread reads it
  − Private to the thread and replicated per thread:
    · There is one copy per thread of the register or architected state
    · A change in one thread does not affect the other thread if that thread reads its private copy
CoreNet Enhancements in QorIQ T4240
• CoreNet Coherency Fabric
  − 40-bit Real Address
  − Higher address bandwidth and more active transactions
  − 2X bandwidth increase for core, MMU, and peripherals
  − Improved configuration architecture
• Platform Cache
  − 1.2 Tbps read, 0.6 Tbps write
  − Increased write bandwidth (>600 Gbps)
  − Increased buffering for improved throughput
  − Improved data ownership tracking for performance enhancement
• Data Prefetch
  − Tracks CPC misses
  − Prefetches from multiple memory regions with configurable sizes
  − Selective tracking based on requesting device, transaction type, data/instruction access
  − Conservative prefetch requests to avoid overloading the system with prefetches
  − "Confidence"-based algorithm with feedback mechanism
  − Performance monitor events to evaluate the performance of prefetch in the system

[Chart: IP Mark and TCP Mark percentages (0–100%) over time]
Cache and Memory Subsystem
Enhancements
Shared L2 Cache
• Clusters of cores share a 2 MB, 4-bank, 16-way set-associative shared L2 cache.
• In addition, there is also support for a 1.5 MB CoreNet platform cache.
• Advantages
  − L2 cache is shared among 4 cores, allowing lines to be allocated among the 4 cores as required
    · Some cores will need more lines and some will need fewer, depending on workloads
  − Faster sharing among cores in the cluster (sharing a line between cores in the cluster does not require the data to travel on CoreNet)
  − Flexible partitioning of the L2 cache based on application cluster group
• Trade-offs
  − Longer latency to DRAM and other parts of the system outside the cluster
  − Longer latency to L2 cache due to increased cache size and eLink overhead
[Block diagram (repeated from the overview): 3 clusters of 4 dual-threaded e6500 cores with 32 KB L1 I/D caches and a 2 MB banked L2 per cluster, 3× 512 KB CoreNet platform caches, 3× 64-bit DDR3/3L memory controllers, PAMUs (Peripheral Access Management Units), Security Fuse Processor, Security Monitor, and 2× USB 2.0 w/PHY on the CoreNet™ coherency fabric]
Memory Subsystem Enhancements
• The e6500 core has a larger store queue than the e5500 core
• Additional registers are provided for L2 cache partitioning controls, similar to how partitioning is done in the CPC
• Cache locking is supported; however, if a line is unable to be locked, that status is not posted. Cache lock query instructions are provided for determining whether a line is locked
• The load/store unit contains store gather buffers to collect stores to cache lines before sending them on eLink to the L2 cache
• There are no longer Line Fill Buffers (LFBs) associated with the L1 data cache
  − These are replaced with Load Miss Queue (LMQ) entries for each thread
  − They function in a manner very similar to LFBs
• Note there are still LFBs for the L1 instruction cache
MMU Enhancements
MMU – TLB Enhancements
• The e6500 core implements MMU architecture version 2 (V2)
  − MMU architecture V2 is denoted by bits in the MMUCFG register
• Translation Look-aside Buffer TLB1
  − Variable-size pages; supports power-of-two page sizes (previous cores used power-of-4 page sizes)
  − 4 KB to 1 TB page sizes
• Translation Look-aside Buffer TLB0 increased to 1024 entries
  − 8-way associativity (from 512 entries, 4-way)
  − Supports HES (hardware entry select) when written to with tlbwe
• PID register is increased to 14 bits (from 8 bits)
  − Now the operating system can have 16K simultaneous contexts
• Real address increased to 40 bits (from 36 bits)
• In general, it is backward compatible with MMU operations from the e5500 core, except:
  − Some of the configuration registers have a different organization (TLBnCFG, for example)
  − There are new config registers for TLB page size (TLBnPS) and LRAT page size (LRATPS)
  − tlbwe can be executed by the guest supervisor (but can be turned off with an EPCR bit)
[Diagram: translation context — MSR.GS (0 = hypervisor access, 1 = guest), LPID, AS, and the 14-bit PID — plus the 64-bit Effective Address (EA), split into an effective page number (bits 0–52) and byte address (12–32 bits), map through the TLB to the 40-bit Real Address: real page number (0–28 bits) plus byte address (12–40 bits)]
MMU – Virtualization Enhancements (LRAT)
• The e6500 core contains an LRAT (logical to real address translation)
  − The LRAT takes logical addresses (addresses the guest operating system thinks are real) and converts them to true real addresses
  − Translation occurs when the guest executes tlbwe and tries to write TLB0, or during a hardware tablewalk for a guest translation
  − Does not require the hypervisor to intervene unless the LRAT incurs a miss (the hypervisor writes entries into the LRAT)
  − 8-entry, fully associative, supporting variable-size pages from 4 KB to 1 TB (in powers of two)
• Prior to the LRAT, the hypervisor had to intervene each time the guest tried to write a TLB entry
[Diagram: an application page fault (between Instr2 and Instr3) traps to the guest OS, which writes a TLB entry (VA -> Guest RA); without an LRAT this traps again to the hypervisor, which performs Guest RA -> RA and writes the TLB; with the LRAT, that second step is implemented in HW]
QorIQ Power Management
Features
[Diagram: today's energy strategy (always on at full power) vs. the T4 advanced power management strategy, which tracks a cyclical valued workload from standby through light and mid to full. T4 family energy/power total cost of ownership savings scale from dynamic clock gating, through cluster drowsy, dual cluster drowsy (+ Tj), and cascaded core power management, up to SoC sleep]
Cascaded Power Management
Today: all CPUs in a pool channel dequeue until all FQs are empty.
DPAA uses task queue thresholds to inform CPUs they are not needed. CPUs are selectively awakened as needed, with a broadcast notification when work arrives.

[Diagram: QMan task queue holding tasks T1–T5 with Threshold 1 and Threshold 2; active CPUs C0–C1 on one shared-L2 cluster, drowsy CPUs C2–C3 on another]
Power/Performance
[Chart: number of active cores (1–12) over burst, day, and night workloads, starting from core C0]

• CPUs run software that drops into a polling loop when the DPAA is not sending them work.
• The polling loop should include a wait-with-drowsy instruction that puts the core into the drowsy state.
e6500 Core Intelligent Power Management
[Diagram: e6500 cluster — four dual-threaded cores with AltiVec units and PMCs, 32 KB L1 caches, and a 2048 KB banked L2]

Core state: Run, Doze, Nap, Wait
• AltiVec drowsy (NEW) — auto and SW controlled, state maintained
• Core drowsy (NEW) — auto and SW controlled, state maintained
• Dynamic clock gating
Cluster state: Run, Nap
• Cores and L2
• Dynamic Frequency Scaling (DFS) of the cluster (NEW)
• Drowsy cluster (cores)
• Dynamic clock gating

[Table: power states — core states Run (PH00), Doze (PH10/PW10), Nap (PH15), core global clock stop, and power-gated Nap (PW20/PH20); cluster states Full On (PCL00) down to Nap (PCL10); cluster and core voltages stay on while clocks are progressively gated (core clock off from Nap down, cluster clock off only in cluster Nap); L2 cache is SW-flushed and L1 cache SW- or HW-invalidated in the deeper states; wakeup time grows as power drops: active/immediate, < 30 ns, < 200 ns, < 600 ns, < 1 us]

• SoC sleep with state retention
• SoC sleep with RST
• Cascade power management
• Energy Efficient Ethernet (EEE)
HiGig Interface Support
HiGig™/HiGig+/HiGig2 Interface Support
• The 10 Gigabit HiGig™/HiGig+™/HiGig2™ MAC interface interconnects standard Ethernet devices to switch HiGig ports.
• Networking customers can add features like quality of service (QoS), port trunking, mirroring across devices, and link aggregation at the MAC layer.
• The physical signaling across the interface is XAUI: four differential pairs each for receive and transmit (SerDes), each operating at 3.125 Gbit/s. HiGig+ is a higher-rate version of HiGig.
[Frame formats (byte offsets along the top of each diagram):
Regular Ethernet frame: Preamble | MAC_DA | MAC_SA | Type | Packet Data | FCS
Ethernet frame with HiGig+ header: Preamble | HiGig+ Module Hdr | MAC_DA | MAC_SA | Type | Packet Data | FCS*
Ethernet frame with HiGig2 header: Preamble | HiGig2 Module Hdr | MAC_DA | MAC_SA | Type | Packet Data | FCS*]
QorIQ T4240 Processor HiGig Interface
• The T4240 FMan supports the HiGig/HiGig+/HiGig2 protocols
• In the T4240 processor, the 10G mEMACs can be configured as a HiGig interface. In this configuration, two of the 1G mEMACs are used as the HiGig message interface
SERDES Configuration for HiGig Interface
• Networking protocols (SerDes 1 and SerDes 2)
• HiGig notation: HiGig[2]m.n means HiGig[2] (4 lanes @ 3.125 or 3.75 Gbps)
  − "m" indicates which Frame Manager (FM1 or FM2)
  − "n" indicates which MAC on the Frame Manager
  − E.g. "HiGig[2]1.10" indicates HiGig[2] using FM1's MAC 10
• When a SerDes protocol is selected with dual HiGigs in one SerDes, both HiGigs must be configured with the same protocol (for example, both with 12-byte headers or both with 16-byte headers)
HiGig/HiGig2 Control and Configuration
HiGig/HiGig2 Control and Configuration Register (HG_CONFIG), fields across bits 1–32: TCM, IGNIM, FIMT, FER, MCRC, NPPR, LLF, LLI, LLM, IMG

Name         Description
LLM_MODE     Toggle between HiGig2 link level messages on the physical link OR HiGig2 link level messages on the logical link (SAFC)
LLM_IGNORE   Ignore HiGig2 link level message quanta
LLM_FWD      Terminate/forward received HiGig2 link level messages
IMG[0:7]     Inter-Message Gap — spacing between HiGig2 messages
NOPRMP       Toggle preemptive transmission of HiGig2 messages
MCRC_FWD     Strip/forward the HiGig2 message CRC of received messages
FER          Discard/forward HiGig2 receive messages with CRC errors
FIMT         Forward or discard messages with an illegal MSG_TYP
IGNIMG       Ignore IMG on the receive path
TCM          TC (traffic classes) mapping
Interlaken Interface
Interlaken Look-Aside Interface
[Diagram: T4240 connected over a 4-lane Interlaken Look-Aside link (lane striping across 10G SerDes lanes) to a TCAM]

• Use case: the T4240 processor as a data path processor requiring millions of look-ups per second; an expected requirement in edge routers.
• Interlaken Look-Aside is a new high-speed serial standard for connecting TCAMs ("network search engines", "knowledge-based processors") to host CPUs and NPUs. It replaces the Quad Data Rate (QDR) SRAM interface.
• Like Interlaken streaming interfaces (channelized SerDes link, replacing SPI 4.2), Interlaken Look-Aside supports a configurable number of SerDes lanes (1–32, granularity of a single lane) with linearly increasing bandwidth. Freescale supports x4 and x8, up to 10 GHz.
• For lowest latency, each vCPU (thread) in the T4240 processor has a portal into the Interlaken controller, allowing multiple search requests and results to be returned concurrently.
• Interlaken Look-Aside is expected to gain traction as an interface to other low-latency/minimal-data-exchange co-processors, such as traffic managers. PCIe and sRIO are better for higher-latency/high-bandwidth applications.
T4240 (LAC) Features:
• Supports Interlaken Look-Aside Protocol definition, rev. 1.1
• Supports 24 partitioned software portals
• Supports in-band per-channel flow control options, with simple xon/xoff semantics
• Supports a wide range of SerDes speeds (6.25 and 10.3125 Gbps)
• Ability to disable the connection to individual SerDes lanes
• A continuous meta frame of programmable frequency to guarantee lane alignment, synchronize the scrambler, perform clock compensation, and indicate lane health
• 64B/67B data encoding and scrambling
• Programmable BURSTSHORT parameter of 8 or 16 bytes
• Error detection for illegal burst sizes, bad 64/67 word types, and CRC-24 errors
• Error detection on transmit command programming errors
• Built-in statistics counters and error counters
• Dynamic power-down of each software portal
Look-Aside Controller Block Diagram
Modes of Operation
• The T4240 LA controller can be in either stashing or non-stashing mode.
• The LAC programming model is big-endian, meaning byte 0 is the most significant byte.
• In non-stashing mode, software has to issue dcbf each time it reads SWPnRSR and the RDY bit is not set.
Interlaken LA Controller Configuration Registers
• 4 KBytes hypervisor space: 0x0000–0x0FFF
• 4 KBytes managing core space: 0x1000–0x1FFF
• In compliance with the trust architecture, LSRER, LBARE, LBAR, and LLIODNRn are accessed exclusively in hypervisor mode and are reserved in managing core mode
• Statistics, lane mapping, interrupt, rate, metaframe, burst, FIFO, calendar, debug, pattern, error, and capture registers
• LAC software portal memory, n = 0, 1, 2, 3, …, 23:
  − SWPnTCR/SWPnRCR — software portal n transmit/receive command register
  − SWPnTER/SWPnRER — software portal n transmit/receive error register
  − SWPnTDR/SWPnRDR0,1,2,3 — software portal n transmit/receive data registers 0–3
  − SWPnRSR — software portal n receive status register
TCAM Usage in Routing Example
Interlaken Look-Aside TCAM Board
[Board diagram: a Renesas Interlaken LA 5 Mb TCAM linked to the T4240 over a x4 IL-LA link; 125 MHz SYSCLK and 156.25 MHz REFCLK; supplies VDDC 0.85 V @ 6 A, VDDA 0.85 V @ 2 A (filtered), VDDHA 1.80 V 0.5 A, VDDO 1.80 V 1.0 A, VPLL 1.80 V 0.25 A (0-ohm), and VCC_1.8V 1.8 V @ 2 A derived from 3.3 V/12 V; I2C config EEPROM, SMBus, and misc reset/JTAG]
PCI Express® Gen 3 Interfaces
PCI Express® Gen 3 Interfaces
• Two PCIe Gen 3 controllers can be run at the same time with the same SerDes reference clock source
• PCIe Gen 3 bit rates are supported
  − When running more than one PCIe controller at Gen 3 rates, the associated SerDes reference clocks must be driven by the same source on the board

[Diagram: four PCIe controllers on the on-chip network, 16 SerDes lanes total — PCIe1: x8 Gen2 or x4 Gen3 RC/EP, SR-IOV EP (2 PF/64 VF, 8x MSI-X per VF/PF); PCIe2: x4 Gen2/3 RC/EP; PCIe3: x8 Gen2 or x4 Gen3; PCIe4: x4 Gen2/3 RC/EP]

[Table: supported 16-lane SerDes PCIe configurations — combinations of x4 Gen2, x4 Gen3, and x8 Gen2 across PCIe1–PCIe4]
Single Root I/O Virtualization (SR-IOV) End Point
• With SR-IOV supported in the EP, different devices or different software tasks can share IO resources, such as Gigabit Ethernet controllers.
  − The T4240 supports the SR-IOV 1.1 spec version with 2 PFs and 64 VFs per PF
  − SR-IOV supports native IOV in existing single-root-complex PCI Express topologies
  − Address Translation Services (ATS) supports native IOV across PCI Express via address translation
  − A single management physical or virtual machine on the host handles end-point configuration
• E.g. the T4240 processor as a Converged Network Adapter: each virtual machine running on the host thinks it has a private version of the services card

[Diagram: VM 1 … VM N on a host with a translation agent, attached to the T4240 featuring a single controller (up to x4 Gen 3), 1 PF, 64 VFs]
PCI Express Configuration Address Register
• The PCI Express configuration address register contains address information for accesses to PCI Express internal and external configuration registers for an End Point (EP) with SR-IOV

PCI Express Address Offset Register, fields across bits 1–32: EN | Type | EXTREGN | VFN | PFN | REGN

Name      Description
Enable    Allows a PCI Express configuration access when PEX_CONFIG_DATA is accessed
TYPE      01: configuration register accesses to PF registers for EP with SR-IOV; 11: configuration register accesses to VF registers for EP with SR-IOV
EXTREGN   Extended register number. This field allows access to extended PCI Express configuration space
VFN       Virtual Function number minus 1; 64–255 is reserved
PFN       Physical Function number minus 1; 2–15 is reserved
REGN      Register number. 32-bit register to access within the specified device
Message Signaled Interrupts (MSI-X) Support
• MSI-X allows the EP device to send message interrupts to the RC device independently for different physical or virtual functions, as supported by EP SR-IOV.
• Each PF or VF has eight MSI-X vectors allocated, with a total of 256 MSI-X vectors supported
  − Supports MSI-X for PF/VF with 8 MSI-X vectors per PF or VF
  − Supports MSI-X trap operation
  − To access an MSI-X PBA structure, the PF, VF, IDX, and EIDX are concatenated to form the 4-byte-aligned address of the register within the MSI-X PBA structure. That is, the register address is:
    PF || VF || IDX || EIDX || 0b00

PCI Express Address Offset Register, fields across bits 1–32: Type | PF | VF | IDX | EIDX | M

Name   Description
TYPE   Access to PF or VF MSI-X vector table for EP with SR-IOV
PF     Physical Function
VF     Virtual Function
IDX    MSI-X entry index in each VF
EIDX   Extended index. This field selects which 4-byte entity within the MSI-X PBA structure to access
M      Mode = 11
Serial RapidIO® Manager (RMAN)
RapidIO Message Manager (RMan)
RMan supports both inline switching and look-aside forwarding operation.
[Diagram: RapidIO inbound traffic — a RapidIO PDU (…, Ftype, Target ID, Src ID, Address, Packet Data Unit, CRC) passes through inbound rule matching, classification units, and reassembly units with reassembly contexts into QMan work queues (WQ0–WQ7, on HW and pool channels), which feed e6500 cores (via SW portals), SEC, PME, and the Frame Manager (1GE/10GE ports); RapidIO outbound traffic — segmentation units with disassembly contexts and an arbiter feed the DCP back onto the link]
• RMan: greater performance and functionality
• Many queues allow multiple inbound/outbound queues per core
  − Hardware queue management via the QorIQ Data Path Acceleration Architecture (DPAA)
• Supports all messaging-style transaction types
  − Type 11 Messaging
  − Type 10 Doorbells
  − Type 9 Data Streaming
• Enables low-overhead direct core-to-core communication

[Diagram: two QorIQ or DSP devices, each with four cores, connected by 10G sRIO — device-to-device transport and channelized CPU-to-CPU transport carrying user PDUs in Type 9 messages]
Data Path Acceleration
Architecture (DPAA)
Data Path Acceleration Architecture (DPAA) Philosophy
• DPAA is designed to balance the performance of multiple CPUs and accelerators with seamless integration
  − ANY packet to ANY core to ANY accelerator or network interface, efficiently, WITHOUT locks or semaphores
• "Infrastructure" components
  − Queue Manager (QMan)
  − Buffer Manager (BMan)
• "Accelerator" components
  − Frame Manager (FMan)
  − RapidIO Message Manager (RMan)
  − Cryptographic accelerator (SEC)
  − Pattern matching engine (PME)
  − Decompression/Compression Engine (DCE)
  − DCB (Data Center Bridging)
  − RAID Engine (RE)
• The CoreNet™ coherency fabric provides the interconnect between the cores and the DPAA infrastructure, as well as access to memory

[Diagram: P Series e500mc cores and T Series e6500 cores with L1/L2 caches on the CoreNet™ coherency fabric, alongside the Queue Manager, Buffer Manager, RMan, RE, SEC 4.x, PME 2, DCE, and two Frame Managers (parse, classify, distribute; PCD; DCB) with 1GE and 1/10G MACs]
DPAA Building Block: Frame Descriptor (FD)
[Diagram: a Frame Descriptor either points directly at a single buffer (simple frame, Fmt=000, with an offset into the buffer) or at an S/G list (multi-buffer scatter/gather frame, Fmt=100), whose entries each carry an address, length, BPID, and offset. FD fields (bits 0–31 per word): DD, LIODN offset, BPID, ELIODN offset, addr (continued), Fmt, Offset, Length, Status/Cmd]
Frame Descriptor Status/Command Word (FMAN Status)
Fields across bits 1–32 (reserved bits omitted): L4CV, BLE, FRDR, PHE, ISP, PTE, FLM, IPP, FCL, KSO, NSS, EOF, DIS, FSE, FPE, MS, DME, DCL4C

Name    Description
DCL4C   L4 (IP/TCP/UDP) checksum validation enable/disable
DME     DMA error
MS      MACsec frame. This bit is valid on P1023
FPE     Frame physical error
FSE     Frame size error
DIS     Discard. This bit is set only for frames that are supposed to be discarded but are enqueued in an error queue for debug purposes
EOF     Extract out of frame error
NSS     No scheme selection for KeyGen
KSO     Key size overflow error
FCL     Frame color as determined by the Policer: 00=green, 01=yellow, 10=red, 11=no reject
IPP     Illegal policer profile error
FLM     Frame length mismatch
PTE     Parser time-out
ISP     Invalid soft parser instruction error
PHE     Header error
FRDR    Frame drop
BLE     Block limit exceeded
L4CV    L4 checksum validation
DPAA: mEMAC Controller
Multirate Ethernet MAC (mEMAC) Controller
• A multirate Ethernet MAC (mEMAC) controller supporting 100 Mbps/1G/2.5G/10G:
  − Supports HiGig/HiGig+/HiGig2 protocols
  − Dynamic configuration for NIC (Network Interface Card) applications or switching/bridging applications at 10 Gbps or below
  − Designed to comply with IEEE Std 802.3®, IEEE 802.3u, IEEE 802.3x, IEEE 802.3z, IEEE 802.3ac, IEEE 802.3ab, IEEE 1588 v2 (clock synchronization over Ethernet), IEEE 802.3az, and IEEE 802.1Qbb
  − RMON statistics
  − CRC-32 generation and append on transmit, or forwarding of a user-application-provided FCS, selectable on a per-frame basis
  − 8 MAC address comparisons on receive and one MAC address overwrite on transmit for NIC applications
  − Selectable promiscuous frame receive mode and transparent MAC address forwarding on transmit
  − Multicast address filtering with a 64-bin hash code lookup table on receive, reducing processing load on higher layers
  − Support for VLAN tagged frames and double VLAN tags (stacked VLANs)
  − Dynamic inter-packet gap (IPG) calculation for WAN applications

[Diagram: the QorIQ P series used separate 10GMAC and dTSEC blocks; in the QorIQ T4240, the mEMAC integrates the Frame Manager interface, 1588 time stamping, Tx/Rx FIFOs and Tx/Rx control, config/control/stats, flow control, MDIO master PHY management, and reconciliation with the Tx/Rx interfaces]
DPAA: FMAN
FMan Enhancements
• Storage profile selection (up to 32 profiles per port) based on classification
  − Up to four buffer pools per storage profile
• Customer edge egress traffic management (egress shaping)
• Data Center Bridging
  − PFC and ETS
• IEEE 802.3az (Energy Efficient Ethernet)
• IEEE 802.3bf (time sync)
• IP fragmentation & reassembly offload
• HiGig, HiGig2
• TX confirmation/error queue enhancements
  − Ability to configure a separate FQID for normal confirmations vs. errors
  − Separate FD status for overflow and physical errors
• Option to disable S/G on ingress

[Diagram: FMan with muRAM and parse/classify/distribute logic, serving 2× 1/10G and 6× 1G ports]
Offline Ports
FMan Port Types
• Ethernet receive (Rx) and transmit (Tx)
  − 1 Gbps / 2.5 Gbps / 10 Gbps
  − In FMan_v3, some ports can be configured as HiGig
  − Jumbo frames of up to 9.6 KB (add the u-boot bootarg "fsl_fm_max_frm=9600")
  − FMan_v3: 3.75 Mpps (vs. 1.5 Mpps in the P series)
• Offline (O/H)
  − Supports the parse/classify/distribute (PCD) function on frames whose frame descriptors (FDs) are extracted from the QMan
  − Supports frame copy or move from one storage profile to another
  − Able to dequeue from and enqueue to a QMan queue: the FMan applies a Parse Classify Distribute (PCD) flow and (if configured to do so) enqueues the frame back into a QMan queue. In FMan_v3, the FMan is able to copy the frame into new buffers and enqueue it back to the QMan
  − Use case: IP fragmentation and reassembly
• Host command
  − Able to dequeue host commands from a QMan queue: the FMan executes the host command (such as a table update) and enqueues a response to the QMan. Host commands require a dedicated PortID (one of the O/H ports)
  − The registers for offline and host commands are named O/H port registers
IP Reassembly T4240 Processor Flow
[Flow diagram:
1. BMI: write the internal context (IC)
2. Parser: parse the frame and identify fragments (for fragments, buffer allocation is done according to the fragment header only)
3. KeyGen: calculate hash
4. Non-fragments: BMI allocates a buffer and writes the frame and IC
5. Fragments: the FMan controller links the fragment to the right reassembly context; if reassembly is not complete, the BMI terminates; once complete, the FMan controller starts reassembly and the KeyGen recalculates the hash for the reassembled frame
6. FMan controller: coarse classification, then enqueue the frame
Storage profile selection: for a regular frame it is chosen according to frame header classification; for a reassembled frame, according to MAC and IP header classification only]
IP Reassembly FMAN Memory Usage
• FMan memory: 386 KBytes
• Assumption: MTU = 1500 bytes
• Port FMan memory consumption:
  − Each 10G port = 40 KBytes
  − Each 1G port = 25 KBytes
  − Each offline port = 10 KBytes
• Coarse classification tables memory consumption:
  − 100 KBytes for all ports
• IP reassembly:
  − IP reassembly overhead: 8 KBytes
  − Each flow: 10 bytes
• Example:
  − Use case: 2x10G ports + 2x1G ports + 1 offline port
  − Port configuration: 2x40 + 2x25 + 10 = 140 KBytes
  − Coarse classification: 100 KBytes
  − IP reassembly, 10K flows: 10K x 10 B + 8 KB = 108 KBytes
  − Total = 140 KB + 108 KB + 100 KB = 348 KBytes
Storage Profile
Virtual Storage Profiling For Rx and Offline Ports
• Storage profiles let each partition and virtual interface enjoy dedicated buffer pools.
• Storage profile selection occurs after distribution function evaluation or after the custom classifier.
• The same Storage Profile ID (SPID) values from classification on different physical ports may yield different storage profile selections.
• Up to 64 storage profiles per port are supported.
  − 32 storage profiles for FMan_v3L
• A storage profile contains:
  − LIODN offset
  − Up to four buffer pools per storage profile
  − Buffer start margin/end margin configuration
  − S/G disable
  − Flow control configuration
Data Center Bridging
Policing and Shaping
• Policing puts a cap on network usage and guarantees bandwidth
• Shaping smooths out the egress traffic
  − May require extra memory to store the shaped traffic
• DCB can be used in:
  − Data center network node interconnect
  − LAN/network traffic
  − Storage Area Network (SAN)
  − IPC traffic (e.g. InfiniBand (low latency))

[Diagram: traffic rate over time — unshaped bursts vs. policed and shaped traffic]
Support Priority-based Flow Control (802.1Qbb)
(Figure: transmit queues mapped to eight virtual lanes, zero through seven, over the Ethernet link; a PAUSE on one lane stops only that lane's receive buffer.)
• Enables lossless behavior for each class of service.
• PAUSE is sent per virtual lane when buffer limits are exceeded. Triggers:
  − FQ congestion group state (on/off) from QMan
    · A priority vector (8 bits) is assigned to each FQ congestion group
    · FQ congestion group(s) are assigned to each port
    · Upon receipt of a congestion group state "on" message, for each Rx port associated with this congestion group, a PFC Pause frame is transmitted with the priority level(s) configured for that group
  − Buffer pool depletion
    · Priority level is configured per port (shared by all buffer pools used on that port)
  − Near-full FMan Rx FIFO
    · There is a single Rx FIFO per port for all priorities, so the PFC Pause frame is sent on all priorities
• PFC Pause frame reception
  − QMan provides the ability to flow control 8 different traffic classes; in CEETM, each of the 16 class queues within a class queue channel can be mapped to one of the 8 traffic classes, and this mapping applies to all channels assigned to the link
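The 8-bit priority vector assigned to a congestion group is one bit per 802.1p priority. A minimal sketch of assembling such a vector (the function name and interface are mine, not QMan API):

```python
# Hypothetical sketch: build the 8-bit PFC priority vector for a congestion
# group, one bit per 802.1p priority (bit i set means pause traffic class i).
def pfc_priority_vector(priorities):
    vec = 0
    for p in priorities:
        if not 0 <= p <= 7:
            raise ValueError("802.1p priority must be 0..7")
        vec |= 1 << p
    return vec

# A congestion group carrying priorities 1 and 3:
print(f"{pfc_priority_vector([1, 3]):08b}")  # 00001010
```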
Support Bandwidth Management (802.1Qaz)
• Hierarchical port scheduling defines the class-of-service (CoS) properties of output queues, mapped to IEEE 802.1p priorities.
• QMan CEETM enables Enhanced Transmission Selection (ETS, 802.1Qaz) with intelligent sharing of bandwidth between traffic classes:
  − Strict priority scheduling of the 8 independent classes; weighted bandwidth fairness within the 8 grouped classes
  − Priority of the class group can be independently configured to be immediately below any of the independent classes
• Meets the ETS performance requirement: bandwidth granularity of 1% and +/-10% accuracy.
(Figure: offered traffic vs. realized 10 GE utilization over time t1..t3 for HPC, storage, and LAN traffic classes.)
QMan CEETM
• Supports 32 channels available for allocation across a single FMan
  − e.g. for two 10G links, 16 channels (virtual links) could be allocated per link
  − Supports weighted bandwidth fairness amongst channels
  − Shaping is supported on a per-channel basis
CEETM Scheduling Hierarchy (QMan 1.2)
• Color coding in the hierarchy diagram:
  − Green denotes logic units and signal paths that relate to the request and fulfillment of Committed Rate (CR) packet transmission opportunities
  − Yellow denotes the same for Excess Rate (ER)
  − Black denotes logic units and signal paths that are used for unshaped opportunities or that operate consistently whether used for CR or ER opportunities
• Schedulers:
  − Channel scheduler: channels are selected to send frames from class queues
  − Class scheduler: frames are selected from class queues; class 0 has the highest priority
• Algorithms:
  − Strict Priority (SP)
  − Weighted Scheduling
  − Shape Aware Fair Scheduling (SAFS)
  − Weighted Bandwidth Fair Scheduling (WBFS)
(Figure: CEETM hierarchy for LNI #9: network interface logic feeding a channel scheduler; class schedulers per channel, e.g. Ch6 unshaped with 8 independent/8 grouped classes, Ch7 shaped with 3 independent/7 grouped, Ch8 shaped with 2 independent/8 grouped; class queues CQ0..CQ15 selected via strict priority and WBFS; token bucket shapers for Committed Rate and Excess Rate feed Shape Aware Fair Scheduling.)
Weighted Bandwidth Fair Scheduling (WBFS)
• WBFS is used to schedule packets from queues within a priority group such that each gets a "fair" amount of the bandwidth made available to that priority group.
• The premise of fairness for the algorithm is:
  − available bandwidth is divided and offered equally to all classes
  − offered bandwidth in excess of a class's demand is re-offered equally to classes with unmet demand
• Example (10G available):

  Round                 BW available   Classes with unmet demand   BW offered to each class
  Initial distribution  10G            5                           2G
  First redistribution  1.5G           3                           .5G
  Second redistribution .2G            2                           .1G

           Demand   Retained   Unmet   Retained   Unmet   Retained   Total BW
                    (round 1)          (round 2)          (round 3)  attained
  Class 0  .5G      .5G        0       -          -       -          .5G
  Class 1  2G       2G         0       -          -       -          2G
  Class 2  2.3G     2G         .3G     .3G        0       -          2.3G
  Class 3  3G       2G         1G      .5G        .5G     .1G        2.6G
  Class 4  4G       2G         2G      .5G        1.5G    .1G        2.6G
  Total    11.8G    8.5G               1.3G               .2G        10G
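With equal weights, the equal-offer/re-offer premise is classic max-min fair allocation. The sketch below reproduces the slide's 10G example in software; it is an illustration of the fairness premise, not QMan's exact hardware algorithm, and uses integer Mbps to avoid rounding issues.

```python
# Max-min fair allocation: divide remaining bandwidth equally among classes
# with unmet demand, let each retain up to its demand, repeat with the excess.
def max_min_fair(capacity, demands):
    alloc = [0] * len(demands)
    unmet = set(range(len(demands)))
    while capacity and unmet:
        offer = capacity // len(unmet)   # equal offer to every unmet class
        if offer == 0:
            break
        for i in list(unmet):
            take = min(offer, demands[i] - alloc[i])  # retain up to demand
            alloc[i] += take
            capacity -= take             # excess stays for the next round
            if alloc[i] == demands[i]:
                unmet.discard(i)         # demand met: stop offering to it
    return alloc

# Slide example: 10G link, class demands .5 / 2 / 2.3 / 3 / 4 Gbps
print(max_min_fair(10_000, [500, 2000, 2300, 3000, 4000]))
# [500, 2000, 2300, 2600, 2600]
```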
DPAA: SEC Engine
Security Engine
• Black Keys
  − In addition to protecting against external bus snooping, Black Keys cryptographically protect against key snooping between security domains
• Blobs
  − Blobs protect data confidentiality and integrity across power cycles, but do not protect against unauthorized decapsulation or substitution of another user's blobs
  − In addition to protecting data confidentiality and integrity across power cycles, Blobs cryptographically protect against blob snooping/substitution between security domains
• Trusted Descriptors
  − Trusted Descriptors protect descriptor integrity, but do not distinguish between Trusted Descriptors created by different users
  − In addition to protecting Trusted Descriptor integrity, Trusted Descriptors now cryptographically distinguish between Trusted Descriptors created in different security domains
• DECO Request Source Register
  − Register added
QorIQ T4240 Processor SEC 5.0 Features
(Figure: SEC block diagram: queue interface, DMA, job queue controller, job ring I/F, RTIC, descriptor controllers, and CHAs: PKHA, STHA, AFHA, KFHA, RNG4, MDHA, AESA, ZHA, DESA.)
• Header & trailer off-load for the following security protocols:
  − IPsec, SSL/TLS, 3G RLC, PDCP, SRTP, 802.11i, 802.16e, 802.1AE
• (3) Public Key Hardware Accelerators (PKHA)
  − RSA and Diffie-Hellman (to 4096b)
  − Elliptic curve cryptography (1024b)
  − Supports run-time equalization
• (1) Random Number Generator (RNG4)
  − NIST certified
• (4) Snow 3G Hardware Accelerators (STHA)
  − Implements Snow 3.0
  − Two for encryption (F8), two for integrity (F9)
• (4) ZUC Hardware Accelerators (ZHA)
  − Two for encryption, two for integrity
• (2) ARC Four Hardware Accelerators (AFHA)
  − Compatible with the RC4 algorithm
• (8) Kasumi F8/F9 Hardware Accelerators (KFHA)
  − F8, F9 as required for 3GPP
  − A5/3 for GSM and EDGE
  − GEA-3 for GPRS
• (8) Message Digest Hardware Accelerators (MDHA)
  − SHA-1, SHA-2 with 256-, 384-, 512-bit digests
  − MD5 128-bit digest
  − HMAC with all algorithms
• (8) Advanced Encryption Standard Accelerators (AESA)
  − Key lengths of 128, 192, and 256 bits
  − ECB, CBC, CTR, CCM, GCM, CMAC, OFB, CFB, and XTS
• (8) Data Encryption Standard Accelerators (DESA)
  − DES, 3DES (2K, 3K)
  − ECB, CBC, OFB modes
• (8) CRC Units
  − CRC32, CRC32C, 802.16e OFDMA CRC
Life of a Job Descriptor
(Figure: SEC internals: queue interface with job prep logic, job queues and job rings JR0..JR3, holding tank pool, DECO pool with descriptor buffers and CCBs, arbiters, and CHAs such as AFHA, STHA F8/F9, MDHA, CRCA, KFHA, DESA, AESA, PKHA, RNG4, ZUC; frames arrive from DDR/CoreNet via the buffer manager as shared descriptors and FDs.)
• QI has room for more work and issues a dequeue request for 1 or 3 FDs
• QMan selects an FQ and provides 1 FD along with the FQ information
• QI creates an [internal] Job Descriptor and, if necessary, obtains output buffers
• QI transfers the completed Job Descriptor into one of the Holding Tanks
• The Job Queue Controller finds an available DECO and transfers JD1 to it
• The DECO initiates DMA of the Shared Descriptor from system memory and places it in the Descriptor Buffer with the JD from the Holding Tank
• The DECO executes descriptor commands, loading registers and FIFOs in its CCB
• The CCB obtains and controls CHA(s) to process the data per DECO commands
• The DECO commands DMA to store results and any updated context to system memory
• As input buffers are emptied, the DECO tells QI, which may release them back to BMan
• Upon completion of all processing through the CCB, the DECO resets the CCB
• The DECO informs QI that JD1 has completed with status code X and that data of length Y has been written to address Z
• QI creates the outbound FD and enqueues it to QMan using the FQID from the Ctx B field
DPAA: DCE
DPAA Interaction: Frame Descriptor Status/CMD
• The Status/Command word in the dequeued FD allows software to modify the processing of individual frames while retaining the performance advantages of enqueuing to an FQ for flow-based processing.
• Frame Descriptor layout (32-bit words, bits 0..31):
  − DD | LIODN offset | BPID | ELIODN offset | addr
  − addr (cont)
  − Format | Offset | Length
  − Status/CMD (output frame: Status); CMD flag fields include SCUS, USDC, USPC, UHC, CE, CF, B64, RB, I, R, SCRF, Z, Flush, and Token
• Token: pass-through data that is echoed with the returned frame.
• The three most significant bits of the Command/Status field of the Frame Descriptor have the following meaning:

  CMD (3 MSB)   Description
  000           Process Command
  001           Reserved
  010           Reserved
  011           Reserved
  100           Context Invalidate Command
  101           Reserved
  110           Reserved
  111           NOP Command
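Extracting the command from the Status/CMD word can be sketched as follows. The encoding table is from the slide; the helper and its name are illustrative, not a DPAA API.

```python
# Hypothetical decoder for the 3 MSB of the DCE FD Status/Command field.
# Power convention numbers the MSB as bit 0, so the "3 MSB" are bits 0..2
# of the 32-bit word, i.e. bits 31..29 in LSB-0 terms.
DCE_FD_COMMANDS = {0b000: "PROCESS", 0b100: "CONTEXT_INVALIDATE", 0b111: "NOP"}

def dce_fd_command(status_cmd_word):
    msb3 = (status_cmd_word >> 29) & 0b111
    return DCE_FD_COMMANDS.get(msb3, "RESERVED")

print(dce_fd_command(0x0000_0000))  # PROCESS
print(dce_fd_command(0xE000_0000))  # NOP
```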
DCE Inputs
(Figure: frame descriptors FD1..FD3, each with PID, BPID, addr, offset, length, and Status/Cmd pointing at buffer data, flow through FQs and work queues WQ0..WQ7 of a channel into the DCE's compress and decompress paths via the DCP portal; Context_A points to the flow stream context.)
• SW enqueues work to the DCE via Frame Queues; FQs define the flow for stateful processing
• FQ initialization creates a location for the DCE to use when storing flow stream context
• Each work item within the flow is defined by a Frame Descriptor, which includes length, pointer, offsets, and commands
• The DCE has separate channels for compress and decompress
DCE Outputs
(Figure: output frame descriptors FD1..FD3, each with PID, BPID, addr, offset, length, and a Status/Cmd field, returned through the DCP portal, work queues, and FQs; Context_A points to the flow stream context for the compress and decompress paths.)
• The DCE enqueues results to SW via Frame Queues, as defined by the FQ Context_B field; when buffers are obtained from BMan, the buffer pool ID is defined by the input FQ
• Each result is defined by a Frame Descriptor, which includes a Status field
• The DCE updates the flow stream context located at Context_A as needed
PME
Frame Descriptor: STATUS/CMD Treatment
• PME Frame Descriptor commands (3 MSB of Status/CMD):
  − b000  SCAN   Scan Command
  − b001  PMTCC  Table Configuration Command
  − b100  FCW    Flow Context Write Command
  − b101  FCR    Flow Context Read Command
  − b111  NOP    NOP Command
• Frame Descriptor layout: DD | LIODN offset | BPID | ELIODN offset | addr; addr (cont); Format | Offset | Length | Status/CMD
• Scan command (b000) fields include: SRV, F, S/E, M, R, SET, Subset
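Assembling a Status/CMD word for a given PME command can be sketched as below. The 3-bit encodings are from the slide; the helper itself is illustrative, not a PME driver API, and the low-order bits are left opaque here.

```python
# Hypothetical encoder for the PME FD Status/CMD word: command in the
# 3 MSB (bits 31..29 in LSB-0 terms), remaining 29 bits passed through.
PME_CMD = {"SCAN": 0b000, "PMTCC": 0b001, "FCW": 0b100, "FCR": 0b101, "NOP": 0b111}

def pme_status_cmd(command, low_bits=0):
    return (PME_CMD[command] << 29) | (low_bits & 0x1FFF_FFFF)

print(hex(pme_status_cmd("NOP")))  # 0xe0000000
print(hex(pme_status_cmd("FCW")))  # 0x80000000
```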
Life of a Packet inside Pattern Matching Engine
(Figure: TCP flows from 192.168.1.1 ports 80/25/1863 to 10.10.10.100; FD1 of flow A is scanned character by character, "I W A N T T O S E A R C H F R E E"; pattern descriptors such as Patt1 /free/ tag=0x0001 reside in DDR memory; the Pattern Matcher Frame Agent (PMFA) connects BMan/QMan and the CoreNet on-chip system bus interface to the Key Element Scanning Engine (KES), Data Examination Engine (DXE), and Stateful Rule Engine (SRE), with hash table caches.)
• Frame Queue A:
  − flowA:FD1: 192.168.1.1:80 -> 10.10.10.100:16734 "I want to search free "
  − flowA:FD2: 192.168.1.1:80 -> 10.10.10.100:16734 "scale FTF 2014 event schedule"
• Combined hash/NFA technology
• 9.6 Gbps raw performance
• Max 32K patterns of up to 128B length
• Patterns (user-definable reports):
  − Patt1 /free/ tag=0x0001
  − Patt2 /freescale/ tag=0x0002
• KES
  − Compares hash values of incoming data (frames) against all patterns
• DXE
  − Retrieves the pattern with a matched hash value for a final comparison
• SRE
  − Optionally post-processes the match result before sending the report to the CPU
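The point of the example above is that /freescale/ spans FD1 and FD2, so per-frame scanning alone would miss it. A plain-substring software analogue of that stateful scan (entirely illustrative: names are mine, and simple string search stands in for the hash/NFA engines):

```python
import re

# Patterns and tags from the slide; MAX_PAT bounds the carried-over tail.
PATTERNS = {"free": 0x0001, "freescale": 0x0002}
MAX_PAT = max(len(p) for p in PATTERNS)

def scan_flow(frames):
    reports, residue = [], ""
    for frame in frames:
        data = residue + frame
        for pat, tag in PATTERNS.items():
            for m in re.finditer(re.escape(pat), data):
                if m.end() > len(residue):  # only matches ending in the new frame
                    reports.append((pat, tag))
        residue = data[-(MAX_PAT - 1):]     # keep a tail so spanning matches survive
    return reports

print(scan_flow(["I want to search free", "scale FTF 2014 event schedule"]))
# [('free', 1), ('freescale', 2)]
```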
Debug
Core Debug in Multi-Thread Environment
• Almost all resources are private; internal debug works as if the threads were separate cores.
• External debug is private per thread. An option exists to halt both threads when one thread halts.
  − While threads can be debug-halted individually, this is generally not very useful if the debug session cares about the contents of the MMU and caches
  − Halting both threads prevents the other thread from continuing to compute and essentially cleaning the L1 caches and the MMU of the state of the thread which initiated the debug halt
DPAA Debug Trace
• During packet processing, FMan can trace the packet processing flow through each of the FMan modules and trap a packet.
• The trace refers to the Frame Descriptor layout: DD | LIODN offset | BPID | ELIODN offset | addr; addr (cont); Fmt | Offset | Length | STATUS/CMD
Summary
QorIQ T4 Series Advanced Features Summary

High perf/watt
• 188k CoreMark in 55W = 3.4 CM/W
• Compare to Intel E5-2650: 146k CM in 95W = 1.5 CM/W; or Intel E5-2687W: 200k CM in 150W = 1.3 CM/W
• T4 is more than 2x better than E5
• 2x perf/watt compared to P4080, FSL's previous flagship

Highly integrated SoC
• Integration of 4x 10GE interfaces, local bus, Interlaken, and SRIO means fewer chips (it takes at least four chips with Intel) and higher performance density

Sophisticated PCIe capability
• SR-IOV for showing VMs a virtual NIC, 128 VFs (Virtual Functions)
• Four ports with the ability to be root complex or endpoint for flexible configurations

Advanced Ethernet
• Data Center Bridging for lossless Ethernet and QoS
• 10GBase-KR for backplane connections

Secure Boot
• Prevents code theft, system hacking, and reverse engineering

AltiVec
• On-board SIMD engine for sonar/radar and imaging

Power Management
• Thread, core, and cluster deep sleep modes
• Automatic deep sleep of unused resources

Advanced virtualization
• Hypervisor privilege level enables safe guest OS at high performance
• IOMMU ensures memory accesses are restricted to the correct area
• Virtualization of I/O blocks

Hardware offload
• Packet handling to 50Gb/s
• Security engine to 40Gb/s
• Data compression and decompression to 20Gb/s
• Pattern matching to 10Gb/s

3x Scalability
• 1-, 2-, and 3-cluster solutions give a 3x performance range across T4080 to T4240
• Enables customers to develop multiple SKUs from one PCB
Other Sessions And Useful Information
• FTF2014 Sessions for QorIQ T4 Devices
  − FTF-NET-F0070_QorIQ Platforms Trust Arch Overview
  − FTF-NET-F0139_AltiVec_Programming
  − FTF-NET-F0146_Introduction_to_DPAA
  − FTF-NET-F0147-DPAAusage
  − FTF-NET-F0148_DPAA_Debug
  − FTF-NET-F0157_QorIQ Platforms Trust Arch Demo & Deep Dive
• T4240 Product Website
  − http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240
• Online Training
  − http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240&tab=Design_Support_Tab
Introducing The QorIQ LS2 Family
Breakthrough, software-defined approach to advance the world's new virtualized networks.
• New, high-performance architecture built with ease-of-use in mind: a groundbreaking, flexible architecture that abstracts hardware complexity and enables customers to focus their resources on innovation at the application level
• Optimized for software-defined networking applications: balanced integration of CPU performance with network I/O and C-programmable datapath acceleration that is right-sized (power/performance/cost) to deliver advanced SoC technology for the SDN era
• Extending the industry's broadest portfolio of 64-bit multicore SoCs: built on the ARM® Cortex®-A57 architecture with an integrated L2 switch, interconnect, and peripherals to provide a complete system-on-chip solution
QorIQ LS2 Family Key Features
• High performance cores with leading interconnect and memory bandwidth:
  − 8x ARM Cortex-A57 cores, 2.0GHz, 4MB L2 cache, with Neon SIMD
  − 1MB L3 platform cache with ECC
  − 2x 64b DDR4 up to 2.4GT/s
• A high performance datapath designed with software developers in mind:
  − New datapath hardware and abstracted acceleration that is called via standard Linux objects
  − 40 Gbps packet processing performance with 20Gbps acceleration (crypto, pattern match/RegEx, data compression)
  − Management complex provides all init/setup/teardown tasks
• Leading network I/O integration:
  − 8x 1/10GbE + 8x 1GbE, MACsec on up to 4x 1/10GbE
  − Integrated L2 switching capability for cost savings
  − 4 PCIe Gen3 controllers, 1 with SR-IOV support
  − 2x SATA 3.0, 2x USB 3.0 with PHY
• Target applications: SDN/NFV, switching, data center, wireless access
• Unprecedented performance and ease of use for smarter, more capable networks

See the LS2 Family First in the Tech Lab! 4 new demos built on QorIQ LS2 processors:
• Performance Analysis Made Easy
• Leave the Packet Processing To Us
• Combining Ease of Use with Performance
• Tools for Every Step of Your Design
TM
www.Freescale.com
© 2014 Freescale Semiconductor, Inc. | External Use
QorIQ T4240 SerDes Options
Total of four x8 banks.
Ethernet options:
• 10Gbps Ethernet MACs with XAUI or XFI
• 1Gbps Ethernet MACs with SGMII (1 lane at 1.25 GHz, with a 3.125 GHz option for 2.5Gbps Ethernet)
• 2 MACs can be used with RGMII
• 4x 1Gbps Ethernet MACs can be supported using a single lane at 5 GHz (QSGMII)
• HiGig is supported with 4 lanes at 3.125 GHz or 3.75 GHz (HiGig+)
High speed serial:
• 2.5, 5, 8 GHz for PCIe
• 2.5, 3.125, and 5 GHz for sRIO
• 3.125, 6.25, and 10.3125 GHz for Interlaken
• 1.5, 3.0 GHz for SATA
• 1.25, 2.5, 3.125, and 5 GHz for debug
Decompression Compression Engine
• Zlib: as specified in RFC 1950
• Deflate: as specified in RFC 1951
• GZIP: as specified in RFC 1952
• Encoding:
  − Supports Base64 encoding and decoding (RFC 4648)
• ZLIB, GZIP, and DEFLATE header insertion
• ZLIB and GZIP CRC computation and insertion
• 4 modes of compression:
  − No compression (just add DEFLATE header)
  − Encode only using static/dynamic Huffman codes
  − Compress and encode using static OR dynamic Huffman codes
  − At least a 2.5:1 compression ratio on the Calgary Corpus
• All standard modes of decompression:
  − No compression
  − Static Huffman codes
  − Dynamic Huffman codes
• Provides an option to return the original compressed frame along with the uncompressed frame, or release the buffers to BMan
(Figure: DCE block diagram: frame agent, bus I/F to CoreNet, QMan I/F/portal, BMan I/F/portal, and compressor/decompressor blocks with 32KB and 4KB history buffers.)
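The three container formats the DCE handles can be seen from software with Python's stdlib standing in for the engine; this sketch only illustrates the RFC 1950/1951/1952 framing differences, not the DCE interface.

```python
# RFC 1950 (zlib), RFC 1951 (raw deflate), RFC 1952 (gzip) framing.
import gzip
import zlib

data = b"FTF 2014 " * 100

z = zlib.compress(data)                # zlib: 2-byte header + Adler-32 trailer
co = zlib.compressobj(wbits=-15)       # negative wbits selects raw deflate
raw = co.compress(data) + co.flush()   # raw deflate: no header or trailer
g = gzip.compress(data)                # gzip: 10-byte header + CRC-32 trailer

assert z[0] == 0x78                    # zlib CMF byte: deflate, 32KB window
assert g[:2] == b"\x1f\x8b"            # gzip magic number
assert zlib.decompress(z) == zlib.decompress(raw, wbits=-15) == data
```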