QorIQ T4240 Communications Processor Deep Dive
FTF-NET-F0031
Sam Siu & Feras Hamdan
APR.2014

Agenda
• QorIQ T4240 Communications Processor Overview
• e6500 Core Enhancements
• Memory Subsystem and MMU Enhancements
• QorIQ Power Management Features
• HiGig Interface
• Interlaken Interface
• PCI Express® Gen 3 Interfaces (SR-IOV)
• Serial RapidIO® Manager (RMan)
• Data Path Acceleration Architecture Enhancements
− mEMAC
− Offline Ports and Use Case
− Storage Profiling
− Data Center Bridging (FMan and QMan)
− Accelerators: SEC, DCE, PME
• Debug

QorIQ T4240 Communications Processor
[Block diagram: 12 dual-threaded e6500 cores in three clusters, each core with 32 KB L1 I- and D-caches and each cluster with a 2 MB banked L2; three 512 KB CoreNet platform caches; three 64-bit DDR3/3L memory controllers; CoreNet coherency fabric with PAMUs; DPAA (FMan parse/classify/distribute, QMan, BMan, SEC, PME, DCE, RMan, DCB); two 16-lane 10 GHz SerDes blocks; PCIe, sRIO, Interlaken LA, 2x SATA 2.0, 2x USB 2.0 w/PHY, IFC, SD/MMC, SPI, GPIO, 2x I2C, 2x DUART; security fuse processor and security monitor; real-time debug (Aurora trace, watchpoint cross trigger, CoreNet trace, performance monitor).]
• Data Path Acceleration
− SEC: crypto acceleration, 40 Gbps
− PME: regex pattern matcher, 10 Gbps
− DCE: data compression engine, 20 Gbps
• Power targets
− ~54 W thermal max at 1.8 GHz
− ~42 W thermal max at 1.5 GHz
• TSMC 28HPM process
• 1932-pin BGA package, 42.5 x 42.5 mm, 1.0 mm pitch
Processor
• 12x e6500 cores, 64-bit, up to 1.8 GHz
• Dual threaded, with 128-bit AltiVec engine
• Arranged as 3 clusters of 4 CPUs, with 2 MB L2 per cluster (256 KB per thread)

Memory Subsystem
• 1.5 MB CoreNet platform cache w/ECC
• 3x DDR3 controllers up to 1.87 GT/s
• Each with up to 1 TB addressability (40-bit physical addressing)
• CoreNet switch fabric

High-speed Serial IO
• 4 PCIe controllers, with Gen3
− SR-IOV support
• 2 sRIO controllers
− Type 9 and 11 messaging
− Interworking to DPAA via RMan
• 1 Interlaken Look-Aside at up to 10.3125 Gbps
• 2x SATA 2.0 at 3 Gb/s
• 2x USB 2.0 with PHY

Network IO
• 2 Frame Managers, each with:
− Up to 25 Gbps parse/classify/distribute
− 2x 10GE, 6x 1GE
− HiGig, Data Center Bridging support
− SGMII, QSGMII, XAUI, XFI

e6500 Core Enhancements

e6500 Core Complex
High Performance
• 64-bit Power Architecture® technology
• Up to 1.8 GHz operation
• Two threads per core
• Dual load/store units, one per thread
• 40-bit real address: 1 TB physical address space
• Hardware table walk
• 2 MB, 16-way shared L2 cache, 4 banks, per cluster of 4 cores
− Supports sharing across the cluster
− Supports L2 memory allocation to a core or thread
• CoreNet interface
− 40-bit address bus
− 256-bit read and write data buses
− Double data rate processor port

CoreMark comparison:

  Metric            P4080 (1.5 GHz)  T4240 (1.8 GHz)  Improvement from P4080
  Single thread     4,708            7,828            1.7x
  Core (dual T)     4,708            15,656           3.3x
  SoC               37,654           187,873          5.0x
  DMIPS/Watt (typ)  2.4              5.1              2.1x

Energy Efficient Power Management
− Drowsy modes: core, cluster, AltiVec engine
− Wait-on-reservation instruction
− Traditional power-down modes

• AltiVec SIMD unit (128-bit)
− 8-, 16-, 32-bit signed/unsigned integer
− 32-bit floating point: 173 GFLOPS at 1.8 GHz
− 8-, 16-, 32-bit Boolean

• Improved productivity with core virtualization
− Hypervisor support
− Logical-to-Real Address Translation (LRAT) mechanism for improved hypervisor performance

General Core Enhancements
• Improved branch prediction and additional link stack entries
• Pipeline improvements:
− LR, CTR, mfocrf optimization (LR and CTR are renamed)
− 16-entry rename/completion buffer
• New debug features:
− Ability to allocate individual debug events between the internal and external debuggers
− More IAC events
• Performance monitor:
− Many more events, six counters per thread
− Guest performance monitor interrupt
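To make the 128-bit AltiVec unit described above concrete, here is a minimal sketch using the standard altivec.h intrinsics; a four-lane fused multiply-add is the operation behind the peak GFLOPS figure. Compiler flags (e.g. -maltivec) are the only assumption beyond the standard intrinsics.

```c
/* Minimal AltiVec sketch: four-lane single-precision multiply-add.
 * Build with a Power compiler supporting AltiVec, e.g. gcc -maltivec. */
#include <altivec.h>
#include <stdio.h>

int main(void)
{
    vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
    vector float b = {0.5f, 0.5f, 0.5f, 0.5f};
    vector float c = {1.0f, 1.0f, 1.0f, 1.0f};

    /* Fused multiply-add on four lanes at once: d = a*b + c */
    vector float d = vec_madd(a, b, c);

    float out[4] __attribute__((aligned(16)));
    vec_st(d, 0, out);  /* store the vector to aligned memory */
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```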
Private vs. Shared State
• Registers and other architected state
− Shared between threads: there is only one copy of the register or architected state. A change in one thread affects the other thread if the other thread reads it.
− Private to the thread, replicated per thread: there is one copy per thread of the register or architected state. A change in one thread does not affect the other thread when it reads its private copy.

CoreNet Enhancements in QorIQ T4240
• CoreNet Coherency Fabric
− 40-bit real address
− Higher address bandwidth and more active transactions
− 2x bandwidth increase for core, MMU, and peripherals
− Improved configuration architecture
− 1.2 Tbps read, 0.6 Tbps write
• Platform cache
− Increased write bandwidth (>600 Gbps)
− Increased buffering for improved throughput
− Improved data ownership tracking for performance enhancement
[Chart: IPMark and TCPMark throughput scaling versus number of threads.]
• Data prefetch
− Tracks CPC misses
− Prefetches from multiple memory regions with configurable sizes
− Selective tracking based on requesting device, transaction type, and data/instruction access
− Conservative prefetch requests to avoid overloading the system with prefetches
− "Confidence"-based algorithm with feedback mechanism
− Performance monitor events to evaluate the performance of prefetch in the system

Cache and Memory Subsystem Enhancements

Shared L2 Cache
• Clusters of cores share a 2 MB, 4-bank, 16-way set-associative L2 cache. In addition, there is support for a 1.5 MB CoreNet platform cache.
• Advantages
− The L2 cache is shared among 4 cores, allowing lines to be allocated among the cores as required; some cores will need more lines and some fewer, depending on workload
− Faster sharing among cores in the cluster (sharing a line between cores in the cluster does not require the data to travel on CoreNet)
− Flexible partitioning of the L2 cache based on application cluster group
• Trade-offs
− Longer latency to DRAM and other parts of the system outside the cluster
− Longer latency to the L2 cache due to increased cache size and eLink overhead

Memory Subsystem Enhancements
• The e6500 core has a larger store queue than the e5500 core
• Additional registers are provided for L2 cache partitioning controls, similar to how partitioning is done in the CPC
• Cache locking is supported; however, if a line cannot be locked, that status is not posted. Cache lock query instructions are provided for determining whether a line is locked (see the sketch after this list)
• The load/store unit contains store gather buffers to collect stores to cache lines before sending them over eLink to the L2 cache
• There are no longer Line Fill Buffers (LFBs) associated with the L1 data cache
− These are replaced with Load Miss Queue (LMQ) entries for each thread
− They function in a manner very similar to LFBs
• Note: there are still LFBs for the L1 instruction cache
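Since a failed lock is not posted, software must query lock status explicitly. A hedged sketch of such a query from C follows; dcblq. is the Power ISA embedded cache-locking query instruction, and the exact CR0 bit that reports lock status is an assumption to verify against the e6500 reference manual.

```c
/* Hedged sketch: querying whether a data cache line is locked using the
 * embedded cache-locking query instruction (dcblq.). The CR0 bit that
 * reports lock status is an assumption; check the e6500 core manual. */
static inline int dcache_line_locked(const void *addr)
{
    unsigned int cr;

    __asm__ volatile(
        "dcblq. 0, 0, %1\n\t"   /* query lock state of the line at addr */
        "mfcr   %0"             /* read the condition register */
        : "=r"(cr)
        : "r"(addr)
        : "cr0");

    /* Assumed: CR0[EQ] (bit 2 of CR field 0, i.e. bit 29 counting from
     * the LSB of the 32-bit CR image) is set when the line is locked. */
    return (cr >> 29) & 1;
}
```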
MMU Enhancements

MMU – TLB Enhancements
• The e6500 core implements MMU architecture version 2 (V2)
− MMU architecture V2 is denoted by bits in the MMUCFG register
• Translation look-aside buffer TLB1
− Variable-size pages; supports power-of-two page sizes (previous cores used power-of-four page sizes)
− 4 KB to 1 TB page sizes
• Translation look-aside buffer TLB0 increased to 1024 entries
− 8-way associativity (from 512 entries, 4-way)
− Supports HES (hardware entry select) when written with tlbwe
• PID register increased to 14 bits (from 8 bits)
− The operating system can now have 16K simultaneous contexts
• Real address increased to 40 bits (from 36 bits)
• In general, backward compatible with MMU operations from the e5500 core, except:
− Some configuration registers have a different organization (TLBnCFG, for example)
− There are new configuration registers for TLB page size (TLBnPS) and LRAT page size (LRATPS)
− tlbwe can be executed by the guest supervisor (but this can be turned off with an EPCR bit)
[Translation flow: the 64-bit effective address (effective page number plus byte address) is qualified by the LPID, the 14-bit PID, the AS bit, and MSR[GS] (0 = hypervisor, 1 = guest) to select a TLB entry; the real page number plus byte address form the 40-bit real address.]

MMU – Virtualization Enhancements (LRAT)
• The e6500 core contains an LRAT (logical to real address translation)
− The LRAT takes logical addresses (addresses the guest operating system believes are real) and converts them to true real addresses
− Translation occurs when the guest executes tlbwe and tries to write TLB0, or during hardware table walk for a guest translation
− Does not require the hypervisor to intervene unless the LRAT incurs a miss (the hypervisor writes entries into the LRAT)
− 8-entry, fully associative, supporting variable page sizes from 4 KB to 1 TB (in powers of two)
• Prior to the LRAT, the hypervisor had to intervene each time the guest tried to write a TLB entry: the guest's tlbwe (VA -> guest RA) trapped to the hypervisor, which translated guest RA -> RA and wrote the TLB. With the LRAT, this is implemented in hardware.

QorIQ Power Management Features

T4 Advanced Power Management
• Power modes from full-on through light/mid activity to standby, matched to cyclical workloads (burst, day, night) to reduce energy use and total cost of ownership
• Energy-saving techniques: dynamic clock gating, cluster drowsy, dual cluster drowsy + Tj, core cascaded SoC sleep

Cascaded Power Management
• Today: all CPUs in a pool channel dequeue until all FQs are empty
• DPAA uses task queue thresholds to inform CPUs that they are not needed; CPUs are selectively awakened as needed
• Broadcast notification when work arrives
• CPUs run software that drops into a polling loop when the DPAA is not sending them work
• The polling loop should include a wait-with-drowsy instruction that puts the core into the drowsy state (see the sketch after this list)
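A minimal sketch of that idle strategy, assuming illustrative placeholder names (dpaa_work_pending, dpaa_process_one) for the portal-polling calls; only the Power ISA wait instruction itself is taken from the architecture.

```c
/* Hedged sketch of the idle strategy described above: poll for DPAA
 * work, and drop into the low-power wait state when nothing is pending.
 * dpaa_work_pending() and dpaa_process_one() are illustrative
 * placeholders, not actual SDK symbols. */
#include <stdbool.h>

extern bool dpaa_work_pending(void);   /* e.g. check the portal's DQRR */
extern void dpaa_process_one(void);

static inline void core_wait(void)
{
    /* The wait instruction stops instruction fetch until an interrupt
     * or other wakeup event; with drowsy mode enabled, the core can
     * enter a low-power state while retaining its state. */
    __asm__ volatile("wait" ::: "memory");
}

void worker_idle_loop(void)
{
    for (;;) {
        while (dpaa_work_pending())
            dpaa_process_one();
        core_wait();   /* sleep until the portal interrupt signals work */
    }
}
```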
e6500 Core Intelligent Power Management
• AltiVec drowsy: automatic and SW controlled, state maintained (new)
• Core drowsy: automatic and SW controlled, state maintained (new)
• Cluster (cores and L2) drowsy with dynamic frequency scaling (DFS) of the cluster (new)
• Dynamic clock gating at core and cluster level
• Traditional core states: run, doze, nap

Core state summary:

  Cluster state    Core state              PM state   Cluster clk  Core clk  Wakeup
  Full On (PCL00)  Run                     PH00       On           On        Active
  Full On (PCL00)  Doze                    PH10/PW10  On           On        Immediate
  Full On (PCL00)  Nap                     PH15       On           Off       < 30 ns
  Full On (PCL00)  Global clock stop       PW20       On           Off       < 200 ns
  Full On (PCL00)  Nap (power gated)       PH20       On           Off       < 600 ns
  Nap (PCL10)      Core global clock stop  PH20       Off          Off       < 1 us

• The L2 cache is SW flushed before cluster power-down; L1 caches are HW or SW invalidated depending on the state entered
• SoC sleep with state retention
• SoC sleep with reset
• Cascaded power management
• Energy Efficient Ethernet (EEE)

HiGig Interface Support

HiGig/HiGig+/HiGig2 Interface Support
• The 10 Gigabit HiGig / HiGig+ / HiGig2 MAC interface interconnects standard Ethernet devices to switch HiGig ports.
• Networking customers can add features like quality of service (QoS), port trunking, mirroring across devices, and link aggregation at the MAC layer.
• The physical signaling across the interface is XAUI: four differential pairs for receive and transmit (SerDes), each operating at 3.125 Gbit/s. HiGig+ is a higher-rate version of HiGig.
• Frame formats:
− Regular Ethernet frame: Preamble | MAC_DA | MAC_SA | Type | Packet Data | FCS
− HiGig+ frame: Preamble | HiGig+ Module Header | MAC_DA | MAC_SA | Type | Packet Data | FCS*
− HiGig2 frame: Preamble | HiGig2 Module Header | MAC_DA | MAC_SA | Type | Packet Data | FCS*

QorIQ T4240 Processor HiGig Interface
• The T4240 FMan supports the HiGig/HiGig+/HiGig2 protocols
• In the T4240 processor, the 10G mEMACs can be configured as HiGig interfaces. In this configuration, two of the 1G mEMACs are used as the HiGig message interface

SERDES Configuration for HiGig Interface
• Networking protocols (SerDes 1 and SerDes 2)
• HiGig notation: HiGig[2]m.n means HiGig[2] (4 lanes at 3.125 or 3.75 Gbps)
− "m" indicates which Frame Manager (FM1 or FM2)
− "n" indicates which MAC on the Frame Manager
− E.g. "HiGig[2]1.10" indicates HiGig[2] using FM1's MAC 10
• When a SerDes protocol is selected with dual HiGigs in one SerDes, both HiGigs must be configured with the same protocol (for example, both with 12-byte headers or both with 16-byte headers)

HiGig/HiGig2 Control and Configuration
HiGig/HiGig2 Control and Configuration Register (HG_CONFIG) fields:

  Field       Description
  LLM_MODE    Toggle between HiGig2 link-level messages on the physical link or on the logical link (SAFC)
  LLM_IGNORE  Ignore HiGig2 link-level message quanta
  LLM_FWD     Terminate/forward received HiGig2 link-level messages
  IMG[0:7]    Inter-Message Gap: spacing between HiGig2 messages
  NOPRMP      Toggle preemptive transmission of HiGig2 messages
  MCRC_FWD    Strip/forward the HiGig2 message CRC of received messages
  FER         Discard/forward HiGig2 receive messages with CRC errors
  FIMT        Forward or discard messages with an illegal MSG_TYP
  IGNIMG      Ignore IMG on the receive path
  TCM         TC (traffic class) mapping

Interlaken Interface

Interlaken Look-Aside Interface
[Figure: T4240 connected to a TCAM over a lane-striped x4 Interlaken LA link, alongside 4x 10G Ethernet ports.]
• Use case: the T4240 processor as a data path processor requiring millions of look-ups per second; an expected requirement in edge routers.
• Interlaken Look-Aside is a new high-speed serial standard for connecting TCAMs ("network search engines", "knowledge-based processors") to host CPUs and NPUs. It replaces the Quad Data Rate (QDR) SRAM interface.
• Like Interlaken streaming interfaces (channelized SerDes links, replacing SPI-4.2), Interlaken Look-Aside supports a configurable number of SerDes lanes (1-32, with single-lane granularity) with linearly increasing bandwidth. Freescale supports x4 and x8, at up to 10.3125 Gbps per lane.
• For lowest latency, each vCPU (thread) in the T4240 processor has a portal into the Interlaken controller, allowing multiple search requests and results to be issued and returned concurrently.
• Interlaken Look-Aside is expected to gain traction as an interface to other low-latency, minimal-data-exchange co-processors, such as traffic managers. PCIe and sRIO are better suited to higher-latency, high-bandwidth applications.

T4240 Look-Aside Controller (LAC) Features:
• Supports the Interlaken Look-Aside protocol definition, rev. 1.1
• Supports 24 partitioned software portals
• Supports in-band per-channel flow control options, with simple xon/xoff semantics
• Supports a wide range of SerDes speeds (6.25 and 10.3125 Gbps)
• Ability to disable the connection to individual SerDes lanes
• A continuous meta frame of programmable frequency to guarantee lane alignment, synchronize the scrambler, perform clock compensation, and indicate lane health
• 64B/67B data encoding and scrambling
• Programmable BURSTSHORT parameter of 8 or 16 bytes
• Error detection for illegal burst sizes, bad 64/67 word types, and CRC-24 errors
• Error detection on transmit command programming errors
• Built-in statistics counters and error counters
• Dynamic power-down of each software portal

Look-Aside Controller Block Diagram
[Figure: LAC block diagram.]

Modes of Operation
• The T4240 LA controller can be in either stashing or non-stashing mode.
• The LAC programming model is based on big-endian mode, meaning byte 0 is the most significant byte.
• In non-stashing mode, software has to issue dcbf each time it reads SWPnRSR and the RDY bit is not set.
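A minimal sketch of that non-stashing poll loop; the memory mapping of SWPnRSR and the RDY bit position are illustrative assumptions, while the dcbf flush after each not-ready read is the requirement stated above.

```c
/* Hedged sketch: polling a LAC software-portal receive status register
 * (SWPnRSR) in non-stashing mode. The register pointer and the RDY bit
 * position are illustrative; the registers are big-endian, so a raw
 * pointer read assumes a matching-endian core (as on e6500). */
#include <stdint.h>

#define LAC_SWP_RSR_RDY 0x80000000u  /* assumed RDY bit position */

static inline void dcbf(const volatile void *addr)
{
    __asm__ volatile("dcbf 0,%0" : : "r"(addr) : "memory");
}

uint32_t lac_wait_result(volatile uint32_t *swp_rsr)
{
    uint32_t status;

    for (;;) {
        status = *swp_rsr;          /* read the receive status register */
        if (status & LAC_SWP_RSR_RDY)
            break;                  /* result is ready */
        dcbf(swp_rsr);              /* non-stashing mode: flush the stale
                                       cached copy before re-reading */
    }
    return status;
}
```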
Interlaken LA Controller Configuration Registers
• 4 KB hypervisor space (0x0000-0x0FFF) and 4 KB managing-core space (0x1000-0x1FFF)
• In compliance with the trust architecture, LSRER, LBARE, LBAR, and LLIODNRn are accessed exclusively in hypervisor mode and are reserved in managing-core mode
• Statistics, lane mapping, interrupt, rate, metaframe, burst, FIFO, calendar, debug, pattern, error, and capture registers
• LAC software portal registers, n = 0, 1, 2, ..., 23:
− SWPnTCR/SWPnRCR: software portal n transmit/receive command register
− SWPnTER/SWPnRER: software portal n transmit/receive error register
− SWPnTDR0-3/SWPnRDR0-3: software portal n transmit/receive data registers 0-3
− SWPnRSR: software portal n receive status register

TCAM Usage in Routing Example
[Figure: TCAM look-ups in a routing data path.]

Interlaken Look-Aside TCAM Board
[Figure: evaluation board with a Renesas 5 Mb Interlaken LA TCAM; 125 MHz SYSCLK, 156.25 MHz REFCLK, x4 IL-LA link, I2C config EEPROM, SMBus, reset/JTAG; supplies: VDDC 0.85 V @ 6 A, VDDA 0.85 V @ 2 A, VDDHA 1.80 V @ 0.5 A, VDDO 1.80 V @ 1.0 A, VPLL 1.80 V @ 0.25 A, VCC_1.8V @ 2 A.]

PCI Express® Gen 3 Interfaces

PCI Express® Gen 3 Interfaces
• Two PCIe Gen 3 controllers can be run at the same time with the same SerDes reference clock source
• PCIe Gen 3 bit rates are supported
− When running more than one PCIe controller at Gen 3 rates, the associated SerDes reference clocks must be driven by the same source on the board
• Controller capabilities (16 SerDes lanes total):
− PCIe1: x8 Gen2 or x4 Gen3, RC/EP; SR-IOV EP with 2 PFs / 64 VFs, 8x MSI-X per VF/PF
− PCIe2: x4 Gen2/3, RC/EP
− PCIe3: x8 Gen2 or x4 Gen3, RC/EP
− PCIe4: x4 Gen2/3, RC/EP

Single Root I/O Virtualization (SR-IOV) End Point
• With SR-IOV supported in the EP, different devices or different software tasks can share IO resources, such as Gigabit Ethernet controllers.
− The T4240 supports the SR-IOV 1.1 spec with 2 PFs and 64 VFs per PF
− SR-IOV supports native IOV in existing single-root-complex PCI Express topologies
− Address translation services (ATS) support native IOV across PCI Express via address translation
− A single management physical or virtual machine on the host handles end-point configuration
• E.g. the T4240 processor as a converged network adapter: each virtual machine running on the host thinks it has a private version of the services card
• The T4240 features a single SR-IOV controller (up to x4 Gen 3), 1 PF, 64 VFs

PCI Express Configuration Address Register
• The PCI Express configuration address register contains address information for accesses to PCI Express internal and external configuration registers for an end point (EP) with SR-IOV

  Field    Description
  EN       Enable: allows a PCI Express configuration access when PEX_CONFIG_DATA is accessed
  TYPE     01 = configuration register accesses to PF registers for EP with SR-IOV;
           11 = configuration register accesses to VF registers for EP with SR-IOV
  EXTREGN  Extended register number; allows access to extended PCI Express configuration space
  VFN      Virtual function number minus 1 (64-255 reserved)
  PFN      Physical function number minus 1 (2-15 reserved)
  REGN     Register number: the 32-bit register to access within the specified function
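A hedged sketch of composing this register value for a VF access; every field shift and width below is an illustrative assumption (the ordering follows the table above, not verified offsets from the reference manual).

```c
/* Hedged sketch: composing a PEX_CONFIG_ADDR value for a VF register
 * access on an SR-IOV endpoint. All shifts/widths are assumptions for
 * illustration, not taken from the T4240 reference manual. */
#include <stdint.h>

#define PEX_CFG_EN        (1u << 31)                       /* assumed */
#define PEX_CFG_TYPE_PF   (0x1u << 29)                     /* assumed */
#define PEX_CFG_TYPE_VF   (0x3u << 29)                     /* assumed */
#define PEX_CFG_EXTREG(x) (((uint32_t)(x) & 0xFu)  << 24)  /* assumed */
#define PEX_CFG_VFN(x)    (((uint32_t)(x) & 0xFFu) << 16)  /* VF - 1  */
#define PEX_CFG_PFN(x)    (((uint32_t)(x) & 0xFu)  << 8)   /* PF - 1  */
#define PEX_CFG_REGN(x)   ((uint32_t)(x) & 0xFFu)

/* Address VF 'vf' of PF 'pf' (both numbered from 1, per the minus-1
 * encoding), config register 'reg'; the value is written to the
 * configuration address register, then PEX_CONFIG_DATA is accessed. */
static uint32_t vf_cfg_addr(unsigned pf, unsigned vf, unsigned reg)
{
    return PEX_CFG_EN | PEX_CFG_TYPE_VF |
           PEX_CFG_PFN(pf - 1) | PEX_CFG_VFN(vf - 1) |
           PEX_CFG_REGN(reg);
}
```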
Message Signaled Interrupts (MSI-X) Support
• MSI-X allows the EP device to send message interrupts to the RC device independently for different physical or virtual functions, as supported by EP SR-IOV.
• Each PF or VF has eight MSI-X vectors allocated, with a total of 256 MSI-X vectors supported
− Supports MSI-X for PF/VF with 8 MSI-X vectors per PF or VF
− Supports MSI-X trap operation
− To access an MSI-X PBA structure, the PF, VF, IDX, and EIDX fields are concatenated to form the 4-byte-aligned address of the register within the MSI-X PBA structure. That is, the register address is: PF || VF || IDX || EIDX || 0b00

  Field  Description
  TYPE   Access to the PF or VF MSI-X vector table for EP with SR-IOV
  PF     Physical function
  VF     Virtual function
  IDX    MSI-X entry index in each VF
  EIDX   Extended index: selects which 4-byte entity within the MSI-X PBA structure to access
  M      Mode = 11

Serial RapidIO® Manager (RMan)

RapidIO Message Manager (RMan)
• RMan supports both inline switching and look-aside forwarding operation.
[Figure: inbound RapidIO PDUs (Ftype, target ID, source ID, address, packet data unit, CRC) pass through inbound rule matching, classification units, and reassembly units/contexts into QMan work queues (WQ0-WQ7) on hardware and pool channels; outbound traffic flows from QMan through segmentation units and disassembly contexts back to RapidIO.]
• RMan: greater performance and functionality
• Many queues allow multiple inbound/outbound queues per core
− Hardware queue management via the QorIQ Data Path Acceleration Architecture (DPAA)
• Supports all messaging-style transaction types
− Type 11 messaging
− Type 10 doorbells
− Type 9 data streaming
• Enables low-overhead direct core-to-core communication: device-to-device transport and channelized CPU-to-CPU transport between QorIQ or DSP devices over sRIO

Data Path Acceleration Architecture (DPAA)

Data Path Acceleration Architecture (DPAA) Philosophy
• DPAA is designed to balance the performance of multiple CPUs and accelerators with seamless integration
− ANY packet to ANY core to ANY accelerator or network interface, efficiently, WITHOUT locks or semaphores
• "Infrastructure" components
− Queue Manager (QMan)
− Buffer Manager (BMan)
− Provide the interconnect between the cores and the DPAA infrastructure, as well as access to memory
• "Accelerator" components
− Frame Manager (FMan): parse, classify, distribute
− RapidIO Message Manager (RMan)
− Cryptographic accelerator (SEC)
− Pattern matching engine (PME)
− Decompression/Compression Engine (DCE)
− DCB (Data Center Bridging)
− RAID Engine (RE)
DPAA Building Block: Frame Descriptor (FD)
• The FD is a 16-byte descriptor containing DD, LIODN offset, BPID, ELIODN offset, a 40-bit buffer address, format (Fmt), offset, length, and a status/command word.
• Simple frame: Fmt = 000; the FD points directly at the data buffer, with the frame starting at the given offset.
• Multi-buffer frame (scatter/gather): Fmt = 100; the FD points at an S/G list whose entries (address, length, BPID, offset) chain the data buffers.

Frame Descriptor Status/Command Word (FMan status)

  Name   Description
  DCL4C  L4 (IP/TCP/UDP) checksum validation enable/disable
  DME    DMA error
  MS     MACSEC frame (this bit is valid on P1023)
  FPE    Frame physical error
  FSE    Frame size error
  DIS    Discard: set only for frames that are supposed to be discarded but are enqueued on an error queue for debug purposes
  EOF    Extract out of frame error
  NSS    No scheme selection for KeyGen
  KSO    Key size overflow error
  FCL    Frame color as determined by the policer: 00 = green, 01 = yellow, 10 = red, 11 = no reject
  IPP    Illegal policer profile error
  FLM    Frame length mismatch
  PTE    Parser time-out
  ISP    Invalid soft parser instruction error
  PHE    Header error
  FRDR   Frame drop
  BLE    Block limit exceeded
  L4CV   L4 checksum validation
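A hedged C rendering of the FD layout above, patterned after the structure used by Linux DPAA drivers; bitfield packing assumes a big-endian target (as on e6500), and field names follow the slide rather than any SDK header.

```c
/* Hedged sketch of the 16-byte frame descriptor described above.
 * Bitfield layout assumes a big-endian compiler target; packing and
 * alignment directives are omitted for brevity. */
#include <stdint.h>

struct dpaa_fd {
    /* bytes 0-7: identification and 40-bit buffer address */
    uint8_t  dd:2;            /* debug/discard control */
    uint8_t  liodn_offset:6;  /* partition LIODN offset */
    uint8_t  bpid;            /* buffer pool ID (for BMan release) */
    uint8_t  eliodn_offset:4;
    uint8_t  reserved:4;
    uint8_t  addr_hi;         /* bits 39:32 of the buffer address */
    uint32_t addr_lo;         /* bits 31:0 of the buffer address */

    /* bytes 8-11: format / offset / length */
    uint32_t format:3;        /* 0b000 = single buffer, 0b100 = S/G list */
    uint32_t offset:9;        /* start of frame within the buffer */
    uint32_t length:20;       /* frame length in bytes */

    /* bytes 12-15: status on dequeue, command on enqueue */
    uint32_t status_cmd;
};
```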
DPAA: mEMAC Controller

Multirate Ethernet MAC (mEMAC) Controller
• The multirate Ethernet MAC (mEMAC) controller supports 100 Mbps/1G/2.5G/10G:
− Supports HiGig/HiGig+/HiGig2 protocols
− Dynamic configuration for NIC (network interface card) applications or switching/bridging applications at 10 Gbps or below
− Designed to comply with IEEE Std 802.3®, 802.3u, 802.3x, 802.3z, 802.3ac, 802.3ab, IEEE 1588 v2 (clock synchronization over Ethernet), 802.3az, and 802.1Qbb
− RMON statistics
− CRC-32 generation and append on transmit, or forwarding of a user-application-provided FCS, selectable on a per-frame basis
− 8 MAC address comparisons on receive and one MAC address overwrite on transmit for NIC applications
− Selectable promiscuous frame receive mode and transparent MAC address forwarding on transmit
− Multicast address filtering with a 64-bin hash-code lookup table on receive, reducing processing load on higher layers
− Support for VLAN tagged frames and double VLAN tags (stacked VLANs)
− Dynamic inter-packet gap (IPG) calculation for WAN applications
[Figure: mEMAC block diagram: FMan interface, Tx/Rx FIFOs and control, flow control, 1588 time stamping, config/control/status, MDIO master PHY management, reconciliation, Tx/Rx interfaces.]

DPAA: FMan

FMan Enhancements
• Storage profile selection (up to 32 profiles per port) based on classification
− Up to four buffer pools per storage profile
• Customer-edge egress traffic management (egress shaping)
• Data Center Bridging
− PFC and ETS
• IEEE 802.3az (Energy Efficient Ethernet)
• IEEE 802.3bf (time sync)
• IP fragmentation and reassembly offload
• HiGig, HiGig2
• Tx confirmation/error queue enhancements
− Ability to configure a separate FQID for normal confirmations vs. errors
− Separate FD status for overflow and physical errors
• Option to disable S/G on ingress

Offline Ports

FMan Port Types
• Ethernet receive (Rx) and transmit (Tx)
− 1 Gbps / 2.5 Gbps / 10 Gbps; in FMan_v3 some ports can be configured as HiGig
− Jumbo frames of up to 9.6 KB (add the u-boot bootarg "fsl_fm_max_frm=9600")
− FMan_v3: 3.75 Mpps (vs. 1.5 Mpps for the P series)
• Offline (O/H)
− Supports the parse/classify/distribute (PCD) function on frame descriptors (FDs) dequeued from QMan
− Supports frame copy or move from one storage profile to another
− Able to dequeue from and enqueue to a QMan queue: the FMan applies a PCD flow and (if configured to do so) enqueues the frame back to a QMan queue
− In FMan_v3 the FMan is able to copy the frame into new buffers and enqueue it back to QMan
− Use case: IP fragmentation and reassembly
• Host command
− Able to dequeue host commands from a QMan queue; the FMan executes the host command (such as a table update) and enqueues a response to QMan
− Host commands require a dedicated port ID (one of the O/H ports)
− The registers for offline and host-command ports are named O/H port registers
IP Reassembly T4240 Processor Flow
• Regular frame: the storage profile is chosen according to frame header classification.
• Reassembled frame: the storage profile is chosen according to MAC and IP header classification only.
• Flow through the FMan blocks:
− Parser: parse the frame and identify fragments
− KeyGen: calculate the hash
− FMan controller: start reassembly and link each fragment to the right reassembly context (buffer allocation is done according to the fragment header only)
− Non-fragments and completed reassemblies proceed to coarse classification; incomplete reassemblies terminate in the BMI
− BMI: allocate the buffer, write the frame and internal context (IC), and enqueue the frame

IP Reassembly FMan Memory Usage
• FMan memory: 384 KB
• Assumption: MTU = 1500 bytes
• Per-port FMan memory consumption:
− Each 10G port = 40 KB
− Each 1G port = 25 KB
− Each offline port = 10 KB
• Coarse classification tables: 100 KB for all ports
• IP reassembly:
− IP reassembly overhead: 8 KB
− Each flow: 10 bytes
• Example use case with 2x 10G ports + 2x 1G ports + 1 offline port:
− Port configuration: 2x40 + 2x25 + 10 = 140 KB
− Coarse classification: 100 KB
− IP reassembly, 10K flows: 10K x 10 B + 8 KB = 108 KB
− Total = 140 KB + 108 KB + 100 KB = 348 KB

Storage Profile

Virtual Storage Profiling for Rx and Offline Ports
• Storage profiles let each partition and virtual interface have dedicated buffer pools.
• Storage profile selection occurs after distribution function evaluation or after the custom classifier.
• The same storage profile ID (SPID) value from classification on different physical ports may yield a different storage profile selection.
• Up to 64 storage profiles per port are supported
− 32 storage profiles for FMan_v3L
• A storage profile contains:
− LIODN offset
− Up to four buffer pools per storage profile
− Buffer start-margin/end-margin configuration
− S/G disable
− Flow control configuration

Data Center Bridging

Policing and Shaping
• Policing puts a cap on network usage and guarantees bandwidth
• Shaping smooths out the egress traffic (see the token-bucket sketch after this list)
− May require extra memory to store the shaped traffic
• DCB can be used for:
− Traffic between data center network nodes
− LAN/network traffic
− Storage Area Network (SAN) traffic
− IPC traffic (e.g. Infiniband (low latency))
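As a software analogy for the shaping described above (not the CEETM hardware implementation), a minimal token-bucket sketch: frames may only depart while tokens remain, which smooths bursts at the cost of queueing memory.

```c
/* Conceptual token-bucket shaper. A software analogy for egress
 * shaping, not the CEETM hardware. */
#include <stdbool.h>
#include <stdint.h>

struct token_bucket {
    uint64_t tokens;             /* current tokens, in bytes */
    uint64_t burst_limit;        /* bucket depth, in bytes */
    uint64_t rate_bytes_per_sec; /* fill rate */
    uint64_t last_ns;            /* timestamp of the last refill */
};

static void tb_refill(struct token_bucket *tb, uint64_t now_ns)
{
    uint64_t delta_ns = now_ns - tb->last_ns;

    tb->tokens += (tb->rate_bytes_per_sec * delta_ns) / 1000000000ull;
    if (tb->tokens > tb->burst_limit)
        tb->tokens = tb->burst_limit;  /* cap at the bucket depth */
    tb->last_ns = now_ns;
}

/* Returns true if the frame may be sent now; otherwise it stays queued
 * (this queueing is what consumes the extra memory noted above). */
static bool tb_try_send(struct token_bucket *tb, uint64_t frame_len,
                        uint64_t now_ns)
{
    tb_refill(tb, now_ns);
    if (tb->tokens < frame_len)
        return false;
    tb->tokens -= frame_len;
    return true;
}
```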
Priority-based Flow Control (802.1Qbb) Support
• Enables lossless behavior for each class of service
• PAUSE is sent per virtual lane (eight lanes, zero through seven, between transmit queues and receive buffers) when buffer limits are exceeded
• FQ congestion group state (on/off) from QMan
− A priority vector (8 bits) is assigned to each FQ congestion group
− FQ congestion group(s) are assigned to each port
− Upon receipt of a congestion group state "on" message, for each Rx port associated with this congestion group, a PFC pause frame is transmitted with the priority level(s) configured for that group
• Buffer pool depletion
− Priority level configured per port (shared by all buffer pools used on that port)
• Near FMan Rx FIFO full
− There is a single Rx FIFO per port for all priorities; the PFC pause frame is sent on all priorities
• PFC pause frame reception
− QMan provides the ability to flow control 8 different traffic classes; in CEETM, each of the 16 class queues within a class queue channel can be mapped to one of the 8 traffic classes, and this mapping applies to all channels assigned to the link

Bandwidth Management (802.1Qaz) Support
• Hierarchical port scheduling defines the class-of-service (CoS) properties of output queues, mapped to IEEE 802.1p priorities
• QMan CEETM enables Enhanced Transmission Selection (ETS, 802.1Qaz) with intelligent sharing of bandwidth between traffic classes
− Strict-priority scheduling of the 8 independent classes; weighted bandwidth fairness within the 8 grouped classes
− The priority of the class group can be independently configured to be immediately below any of the independent classes
• Meets the performance requirements for ETS: bandwidth granularity of 1% and +/-10% accuracy
[Chart: offered vs. realized traffic on a 10GE link across t1-t3; when offered HPC + storage + LAN traffic exceeds 10 G/s, ETS caps the LAN traffic while HPC and storage traffic retain their allocations.]
• Supports 32 channels available for allocation across a single FMan
− E.g. for two 10G links, 16 channels (virtual links) could be allocated per link
− Supports weighted bandwidth fairness among channels
− Shaping is supported on a per-channel basis

CEETM Scheduling Hierarchy (QMan 1.2)
• In the hierarchy diagram, green denotes logic units and signal paths that relate to the request and fulfillment of Committed Rate (CR) packet transmission opportunities; yellow denotes the same for Excess Rate (ER); black denotes logic units and signal paths used for unshaped opportunities or that operate consistently whether used for CR or ER
• Channel scheduler: channels are selected to send frames from their class queues, with token-bucket shapers for committed rate and excess rate
• Class scheduler: frames are selected from the 16 class queues (CQ0-CQ15) per channel; class 0 has the highest priority (a toy sketch of this selection follows the list below)
• Algorithms:
− Strict Priority (SP)
− Weighted Scheduling
− Shape-Aware Fair Scheduling (SAFS)
− Weighted Bandwidth Fair Scheduling (WBFS)
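A toy sketch of the strict-priority selection that the class scheduler performs across CQ0-CQ15; the backlog array is an illustrative stand-in for real class-queue state.

```c
/* Toy strict-priority class scheduler: scan CQ0..CQ15 and pick the
 * lowest-numbered (highest-priority) backlogged class queue. */
#define NUM_CLASS_QUEUES 16

int pick_class_queue(const unsigned backlog[NUM_CLASS_QUEUES])
{
    for (int cq = 0; cq < NUM_CLASS_QUEUES; cq++)
        if (backlog[cq] > 0)
            return cq;  /* CQ0 has the highest priority */
    return -1;          /* nothing to send */
}
```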
[Figure: per-LNI network interface logic with a channel scheduler and per-channel class schedulers over CQ0-CQ15 (e.g. 8 independent + 8 grouped classes unshaped, or 3 independent + 7 grouped / 2 independent + 8 grouped shaped), with token-bucket shapers for committed rate and excess rate.]

Weighted Bandwidth Fair Scheduling (WBFS)
• WBFS schedules packets from queues within a priority group such that each gets a "fair" amount of the bandwidth made available to that priority group
• The premises for fairness are:
− Available bandwidth is divided and offered equally to all classes
− Bandwidth offered in excess of a class's demand is re-offered equally to classes with unmet demand

Worked example with 10G available (see the code sketch below):
− Initial distribution: 5 classes with unmet demand, so 2 G is offered to each
− First redistribution: 1.5 G left, 3 classes with unmet demand, 0.5 G offered to each
− Second redistribution: 0.2 G left, 2 classes with unmet demand, 0.1 G offered to each

  Class    Demand  Round 1 retained  Round 2 retained  Round 3 retained  Total attained
  Class 0  0.5G    0.5G              -                 -                 0.5G
  Class 1  2G      2G                -                 -                 2G
  Class 2  2.3G    2G                0.3G              -                 2.3G
  Class 3  3G      2G                0.5G              0.1G              2.6G
  Class 4  4G      2G                0.5G              0.1G              2.6G
  Total    11.8G   8.5G              1.3G              0.2G              10G
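The redistribution arithmetic above can be reproduced with a short loop; this is a software illustration of the fairness premise, not the hardware scheduler. Running it yields the same 0.5 / 2 / 2.3 / 2.6 / 2.6 G split.

```c
/* Iterative fair-share redistribution: equal offers per round, with
 * excess re-offered to classes that still have unmet demand. */
#include <stdio.h>

#define NCLASSES 5

int main(void)
{
    double demand[NCLASSES] = {0.5, 2.0, 2.3, 3.0, 4.0}; /* Gbps */
    double granted[NCLASSES] = {0};
    double avail = 10.0;  /* bandwidth available to the group */

    while (avail > 1e-9) {
        int unmet = 0;
        for (int i = 0; i < NCLASSES; i++)
            if (granted[i] < demand[i])
                unmet++;
        if (!unmet)
            break;  /* all demand satisfied */

        double offer = avail / unmet;  /* equal offer per hungry class */
        for (int i = 0; i < NCLASSES; i++) {
            if (granted[i] < demand[i]) {
                double take = demand[i] - granted[i];
                if (take > offer)
                    take = offer;      /* retain at most the offer */
                granted[i] += take;
                avail -= take;         /* leftover feeds the next round */
            }
        }
    }
    for (int i = 0; i < NCLASSES; i++)
        printf("class %d: %.1f G\n", i, granted[i]);
    return 0;
}
```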
DPAA: SEC Engine

Security Engine
• Black keys
− In addition to protecting against external bus snooping, black keys cryptographically protect against key snooping between security domains
• Blobs
− Blobs protect data confidentiality and integrity across power cycles, but alone do not protect against unauthorized decapsulation or substitution of another user's blobs
− In addition to protecting data confidentiality and integrity across power cycles, blobs can cryptographically protect against blob snooping/substitution between security domains
• Trusted descriptors
− Trusted descriptors protect descriptor integrity, but do not by themselves distinguish between trusted descriptors created by different users
− In addition to protecting trusted descriptor integrity, trusted descriptors now cryptographically distinguish between trusted descriptors created in different security domains
• DECO request source register
− Register added

QorIQ T4240 Processor SEC 5.0 Features
• Header and trailer offload for the following security protocols:
− IPsec, SSL/TLS, 3G RLC, PDCP, SRTP, 802.11i, 802.16e, 802.1AE
• (3) Public Key Hardware Accelerators (PKHA)
− RSA and Diffie-Hellman (to 4096 b)
− Elliptic curve cryptography (1024 b)
− Supports run-time equalization
• (1) Random Number Generator (RNG4)
− NIST certified
• (4) SNOW 3G Hardware Accelerators (STHA)
− Implements SNOW 3G
− Two for encryption (F8), two for integrity (F9)
• (4) ZUC Hardware Accelerators (ZHA)
− Two for encryption, two for integrity
• (2) ARC Four Hardware Accelerators (AFHA)
− Compatible with the RC4 algorithm
• (8) Kasumi F8/F9 Hardware Accelerators (KFHA)
− F8, F9 as required for 3GPP
− A5/3 for GSM and EDGE
− GEA-3 for GPRS
• (8) Message Digest Hardware Accelerators (MDHA)
− SHA-1, SHA-2 with 256/384/512-bit digests
− MD5 128-bit digest
− HMAC with all algorithms
• (8) Advanced Encryption Standard Accelerators (AESA)
− Key lengths of 128, 192, and 256 bits
− ECB, CBC, CTR, CCM, GCM, CMAC, OFB, CFB, and XTS
• (8) Data Encryption Standard Accelerators (DESA)
− DES, 3DES (2K, 3K)
− ECB, CBC, OFB modes
• (8) CRC units
− CRC32, CRC32C, 802.16e OFDMA CRC

Life of a Job Descriptor
1. The queue interface (QI) has room for more work and issues a dequeue request for 1 FQ or 3 FDs
2. QMan selects an FQ and provides an FD along with the FQ information
3. The QI creates an [internal] job descriptor and, if necessary, obtains output buffers
4. The QI transfers the completed job descriptor into one of the holding tanks
5. The job queue controller finds an available DECO and transfers the job descriptor to it
6. The DECO initiates a DMA of the shared descriptor from system memory and places it in the descriptor buffer with the job descriptor from the holding tank
7. The DECO executes descriptor commands, loading registers and FIFOs in its CCB
8. The CCB obtains and controls CHA(s) to process the data per the DECO commands
9. The DECO commands the DMA to store results and any updated context to system memory
10. As input buffers are emptied, the DECO tells the QI, which may release them back to BMan
11. Upon completion of all processing through the CCB, the DECO resets the CCB
12. The DECO informs the QI that the job has completed with status code X and that data of length Y has been written to address Z
13. The QI creates the outbound FD and enqueues it to QMan using the FQID from the FQ's Context B field

DPAA: DCE

DPAA Interaction: Frame Descriptor Status/CMD
• The status/command word in the dequeued FD allows software to modify the processing of individual frames while retaining the performance advantages of enqueuing to an FQ for flow-based processing
• The three most significant bits of the command/status field of the frame descriptor select the command:

  CMD (3 MSB)  Description
  000          Process command
  001          Reserved
  010          Reserved
  011          Reserved
  100          Context invalidate command
  101          Reserved
  110          Reserved
  111          NOP command

• Token: pass-through data that is echoed with the returned frame

DCE Inputs
• Software enqueues work to the DCE via frame queues.
• The FQs define the flow for stateful processing
• FQ initialization creates a location for the DCE to use when storing flow stream context
• Each work item within the flow is defined by a frame descriptor, which includes length, pointer, offset, and commands
• The DCE has separate channels for compression and decompression
[Figure: input FDs flow through per-channel work queues (WQ0-WQ7) and the DCP portal into the DCE; the per-flow stream context is located via the FQ's Context A field.]

DCE Outputs
• The DCE enqueues results to software via frame queues, as defined by the FQ's Context B field
• When buffers are obtained from BMan, the buffer pool ID is defined by the input FQ
• Each result is defined by a frame descriptor, which includes a status field
• The DCE updates the flow stream context located at Context A as needed

PME

Frame Descriptor: STATUS/CMD Treatment
• PME frame descriptor commands:

  CMD   Command
  b111  NOP command
  b101  FCR: flow context read command
  b100  FCW: flow context write command
  b001  PMTCC: table configuration command
  b000  SCAN: scan command

Life of a Packet inside the Pattern Matching Engine
[Figure: frames from TCP flows arrive via QMan frame queues, e.g. flowA:FD1 "I want to search free" followed by flowA:FD2 "scale FTF 2014 event schedule"; patterns such as Patt1 /free/ (tag 0x0001) and Patt2 /freescale/ (tag 0x0002) are matched, including across frame boundaries, and reported.]
• Combined hash/NFA technology
• 9.6 Gbps raw performance
• Max 32K patterns of up to 128 B length
• Key Element Scanning Engine (KES)
− Compares hash values of incoming data (frames) against all patterns
• Data Examination Engine (DXE)
− Retrieves the pattern with the matched hash value for a final comparison
• Stateful Rule Engine (SRE)
− Optionally post-processes match results before sending the report to the CPU
• User-definable reports

Debug

Core Debug in a Multi-Thread Environment
• Almost all resources are private; internal debug works as if the threads were separate cores
• External debug is private per thread.
• An option exists to halt both threads when one thread halts
− While threads can be debug-halted individually, that is generally not very useful if the debug session cares about the contents of the MMU and caches
− Halting both threads prevents the other thread from continuing to compute and essentially cleaning the L1 caches and the MMU of the state of the thread that initiated the debug halt

DPAA Debug Trace
• During packet processing, the FMan can trace the packet processing flow through each of the FMan modules and trap a packet (described by the frame descriptor layout shown earlier).

Summary

QorIQ T4 Series Advanced Features Summary
• High perf/watt: 188k CoreMark in 55 W = 3.4 CM/W. Compare Intel E5-2650 (146k CM in 95 W = 1.5 CM/W) or E5-2687W (200k CM in 150 W = 1.3 CM/W): the T4 is more than 2x better than the E5, and 2x perf/watt compared to the P4080, FSL's previous flagship
• Highly integrated SoC: integration of 4x 10GE interfaces, local bus, Interlaken, and SRIO means fewer chips (it takes at least four chips with Intel) and higher performance density
• Sophisticated PCIe capability: SR-IOV for showing VMs a virtual NIC, 128 VFs (virtual functions); four ports with the ability to be root complex or endpoint for flexible configurations
• Advanced Ethernet: Data Center Bridging for lossless Ethernet and QoS; 10GBase-KR for backplane connections
• Secure boot: prevents code theft, system hacking, and reverse engineering
• AltiVec: on-board SIMD engine for sonar/radar and imaging
• Power management: thread, core, and cluster deep sleep modes; automatic deep sleep of unused resources
• Advanced virtualization: hypervisor privilege level enables safe guest OSes at high performance; the IOMMU ensures memory accesses are restricted to the correct area; virtualization of I/O blocks
• Hardware offload: packet handling to 50 Gb/s; security engine to 40 Gb/s; data compression and decompression to 20 Gb/s; pattern matching to 10 Gb/s
• 3x scalability: the 1-, 2-, and 3-cluster solutions span a 3x performance range over T4080 - T4240, enabling customers to develop multiple SKUs from one PCB

Other Sessions and Useful Information
• FTF2014 sessions for QorIQ T4 devices
− FTF-NET-F0070 QorIQ Platforms Trust Arch Overview
− FTF-NET-F0139 AltiVec Programming
− FTF-NET-F0146 Introduction to DPAA
− FTF-NET-F0147 DPAA Usage
− FTF-NET-F0148 DPAA Debug
− FTF-NET-F0157 QorIQ Platforms Trust Arch Demo & Deep Dive
• T4240 product website
− http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240
• Online training
− http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240&tab=Design_Support_Tab

Introducing the QorIQ LS2 Family
• Breakthrough, software-defined approach to advance the world's new virtualized networks
• New, high-performance architecture built with ease of use in mind
− Groundbreaking, flexible architecture that abstracts hardware complexity and enables customers to focus their resources on innovation at the application level
• Optimized for software-defined networking applications
− Balanced integration of CPU performance with network I/O and C-programmable datapath acceleration that is right-sized (power/performance/cost) to deliver advanced SoC technology for the SDN era
• Extending the industry's broadest portfolio of 64-bit multicore SoCs
− Built on the ARM® Cortex®-A57 architecture with an integrated L2 switch, enabling interconnect and
peripherals to provide a complete system-on-chip solution

QorIQ LS2 Family Key Features
• High-performance cores with leading interconnect and memory bandwidth
− 8x ARM Cortex-A57 cores, 2.0 GHz, 4 MB L2 cache, with Neon SIMD
− 1 MB L3 platform cache w/ECC
− 2x 64-bit DDR4 up to 2.4 GT/s
• A high-performance datapath designed with software developers in mind
− New datapath hardware and abstracted acceleration that is called via standard Linux objects
− 40 Gbps packet processing performance with 20 Gbps acceleration (crypto, pattern match/RegEx, data compression)
− Management complex provides all init/setup/teardown tasks
• Leading network I/O integration
− 8x 1/10GbE + 8x 1GbE, MACsec on up to 4x 1/10GbE
− Integrated L2 switching capability for cost savings
− 4 PCIe Gen3 controllers, 1 with SR-IOV support
− 2x SATA 3.0, 2x USB 3.0 with PHY
• Target applications: SDN/NFV switching, data center, wireless access
• Unprecedented performance and ease of use for smarter, more capable networks

See the LS2 Family First in the Tech Lab!
4 new demos built on QorIQ LS2 processors:
• Performance Analysis Made Easy
• Leave the Packet Processing to Us
• Combining Ease of Use with Performance
• Tools for Every Step of Your Design

www.Freescale.com
© 2014 Freescale Semiconductor, Inc. | External Use

QorIQ T4240 SerDes Options
• Total of four x8 banks
• Ethernet options:
− 10 Gbps Ethernet MACs with XAUI or XFI
− 1 Gbps Ethernet MACs with SGMII (1 lane at 1.25 GHz, with a 3.125 GHz option for 2.5 Gbps Ethernet)
− 2 MACs can be used with RGMII
− 4x 1 Gbps Ethernet MACs can be supported using a single lane at 5 GHz (QSGMII)
− HiGig is supported with 4 lanes at 3.125 GHz, or 3.75 GHz (HiGig+)
• High-speed serial:
− 2.5, 5, 8 GHz for PCIe
− 2.5, 3.125, and 5 GHz for sRIO
− 3.125, 6.25, and 10.3125 GHz for Interlaken
− 1.5, 3.0 GHz for SATA
− 1.25, 2.5, 3.125, and 5 GHz for debug

Decompression Compression Engine
• Zlib: as specified in RFC 1950
• Deflate: as specified in RFC 1951
• GZIP: as specified in RFC 1952
• Encoding
− Supports Base64 encoding and decoding (RFC 4648)
• ZLIB, GZIP, and DEFLATE header insertion
• ZLIB and GZIP CRC computation and insertion
• Compressor: 4 modes of compression, 32 KB history
− No compression (just add the DEFLATE header)
− Encode only using static/dynamic Huffman codes
− Compress and encode using static or dynamic Huffman codes
− At least a 2.5:1 compression ratio on the Calgary Corpus
• Decompressor: all standard modes of decompression, 4 KB history
− No compression
− Static Huffman codes
− Dynamic Huffman codes
• Provides the option to return the original compressed frame along with the uncompressed frame, or to release the buffers to BMan
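For reference, the three framings the DCE handles (RFC 1950/1951/1952) can be produced in software with zlib, where the windowBits argument of deflateInit2 selects the wrapper. A minimal one-shot sketch:

```c
/* Software analogue of the DCE's output framings using zlib.
 * windowBits selects the wrapper: 15 -> zlib (RFC 1950),
 * 15 + 16 -> gzip (RFC 1952), -15 -> raw deflate (RFC 1951). */
#include <string.h>
#include <zlib.h>

int compress_buf(const unsigned char *in, size_t in_len,
                 unsigned char *out, size_t *out_len, int window_bits)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));

    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     window_bits, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;

    strm.next_in   = (Bytef *)in;
    strm.avail_in  = (uInt)in_len;
    strm.next_out  = out;
    strm.avail_out = (uInt)*out_len;

    int rc = deflate(&strm, Z_FINISH);  /* one-shot compress */
    *out_len = strm.total_out;
    deflateEnd(&strm);
    return (rc == Z_STREAM_END) ? 0 : -1;
}
```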